AI Infrastructure (Operations) Engineer

BharatGen

Mumbai Metropolitan Region | Full-time | Mid Level | On-site

Job Description

About the Role

BharatGen is building sovereign AI technologies for India at national scale: foundation models across text, speech, vision, and documents, designed for Indian languages, culture, and context from the ground up. We are building: foundational models (text, speech, images, documents, and beyond); datasets and benchmarks for multilingual and India-centric evaluation; agentic technologies and tools for grounded reasoning and complex task orchestration; platforms for training and inference that are frugally scalable; and human–AI workflows and applications for various sectors including education, governance, agriculture, finance, health, law, and more.

BharatGen’s work depends heavily on infrastructure that is reliable, visible and usable by multiple technical teams. As the scale of experiments, data, GPU usage and directly managed, cloud-hosted or externally provisioned environments grows, the infrastructure function needs stronger day-to-day ownership, operational discipline and technical coordination.

We are looking for an AI Infrastructure Engineer to support this layer. This is a hands-on systems and infrastructure operations role focused on keeping shared AI infrastructure running well, reducing friction for model, data and application teams, improving visibility into usage and reliability, and ensuring that access, storage, compute, cloud and external infrastructure dependencies are handled in a structured manner.

Key Responsibilities

Shared AI Infrastructure Reliability: Own the day-to-day reliability, usability and operational health of BharatGen’s shared AI infrastructure across bare-metal, cloud-hosted, GPU and HPC-style environments.

This includes Linux systems, user access, compute environments, scheduler usage, troubleshooting, resource allocation and routine system hygiene. The person should also deploy, configure, maintain and troubleshoot Linux, GPU, storage and scheduler components across BharatGen’s shared AI infrastructure where applicable. The role is expected to ensure that model, data and application teams can use shared infrastructure with fewer interruptions, clearer processes and faster issue resolution.

The person should also help establish first-line incident handling for shared infrastructure, including triage, impact assessment, escalation, resolution tracking and recurring issue analysis. For recurring issues, the person should help drive root cause analysis and convert learnings into runbooks, automation, monitoring checks or process improvements.

Compute, Cloud and Capacity Operations: Manage infrastructure resources across directly managed, cloud-hosted and externally provisioned environments.

The role includes provisioning and deprovisioning compute resources, coordinating access, tracking operational capacity, following up on infrastructure issues, and ensuring that BharatGen retains visibility and control across all environments. The person should provide operational inputs for compute, GPU, storage and cloud capacity planning by tracking current utilization, idle resources, bottlenecks, provisioning timelines and constraints surfaced by track leads. The person should also help translate team requirements into practical infrastructure plans, acceptance criteria and operating expectations.

HPC and GPU Workflow Enablement: Support HPC and GPU-based workflows used by BharatGen’s technical teams.

This includes configuring and supporting Slurm or similar schedulers, helping users with job submission and environment issues, supporting GPU usage visibility, and debugging problems across nodes, jobs, containers, VMs, storage and networked systems. The responsibility is not only to fix individual issues, but to reduce repeated friction by improving how technical teams access and use the infrastructure. The person should also help coordinate troubleshooting of network and connectivity issues affecting compute, storage, data movement, cloud access and user workflows.

Storage, Backup and Data Movement: Support the storage and data movement layer that underpins BharatGen’s datasets, experiments, model checkpoints, logs, artifacts and shared workspaces. This includes working with file systems, NFS, object storage, cloud storage and other networked storage systems; monitoring utilization and capacity risks; supporting backup, recovery, retention, archival and cleanup practices; and enabling secure, reliable data movement across internal systems, cloud environments, external storage and approved third-party setups.

Observability, Metrics and Cost Discipline: Build and maintain the visibility layer for BharatGen’s infrastructure.

The role should help build and maintain monitoring, alerting and dashboarding systems that surface infrastructure health, utilization, bottlenecks, incidents, cost drivers and failure patterns. The role should own key dashboards, operational metrics and usage indicators across compute, GPU, storage, cloud and externally provisioned environments. These dashboards should support different levels of visibility for leadership, core engineering, application teams and operations.

The person should help track utilization, incidents, bottlenecks, idle or over-provisioned resources, cost drivers and capacity risks so that infrastructure and allocation decisions are based on reliable operational data.

Automation and Operational Maturity: Improve the repeatability and maturity of infrastructure operations. This includes automating recurring tasks, creating scripts and checks, improving environment consistency, maintaining practical runbooks and SOPs, documenting onboarding and escalation processes, and helping move recurring operational knowledge out of individual memory into shared systems.

The person should use scripting, configuration management and version-controlled infrastructure practices where appropriate to make provisioning, configuration and operational changes more repeatable. The role should help BharatGen reduce ad hoc infrastructure handling and build a more dependable operating rhythm around access, provisioning, incidents, maintenance, usage reporting, support intake, change communication and cloud or external infrastructure coordination.

Security Hygiene and Access Controls: Support security hygiene across BharatGen’s infrastructure layer.

This includes access control, least privilege practices, MFA, SSH key hygiene, credential handling, account lifecycle, patching, secure data movement, audit readiness and implementation of agreed infrastructure security controls. The person should maintain clear access and account lifecycle practices across shared systems, storage, cloud environments and externally provisioned infrastructure. Given BharatGen’s work with sensitive AI systems, enterprise clients and regulated sectors, the person should be able to identify obvious security gaps, implement practical controls, and escalate issues that require deeper security, compliance, legal, vendor or leadership input.

Required Qualifications and Experience

We are looking for candidates with strong hands-on infrastructure experience. Formal degrees and certifications are useful, but practical ability to operate, troubleshoot and improve real systems is more important for this role.

3 or more years of hands-on experience in Linux system administration, HPC systems, cloud infrastructure, GPU infrastructure, research computing, DevOps, infrastructure operations or a closely related area. B.Tech, M.Tech, MS or Ph.D. in Computer Science, Engineering or a related technical field is preferred.

Equivalent practical experience in infrastructure operations will also be considered. Prior experience supporting technical users, engineering teams, research teams, AI or ML teams, or multi-user compute environments is strongly valued. Strong Linux system administration skills, including users, groups, permissions, SSH, shell environments, package management, patching and troubleshooting.

Exposure to HPC or shared compute environments, including Slurm or similar schedulers, job queues, resource allocation and operational debugging. Practical awareness of GPU infrastructure for AI or ML workloads, including NVIDIA GPU systems, drivers, CUDA environments, GPU monitoring or related operational issues. Experience with directly managed, cloud-hosted or externally provisioned infrastructure, preferably including AWS or similar cloud platforms, GPU cloud environments, hyperscalers or managed infrastructure providers.

Working knowledge of storage, backups, data movement, file systems, NFS or networked storage, object storage, cloud storage, retention and archival practices. Familiarity with monitoring, logging, dashboards, Grafana, Prometheus or similar infrastructure visibility systems. Scripting and automation ability using Bash, Python or similar tools.

Familiarity with containers, virtual machines and basic environment reproducibility practices. Ability to troubleshoot under uncertainty, support technical users, coordinate with vendors, maintain documentation and follow issues through to closure.

Strongly Preferred

Experience with AI, ML, research computing or GPU-heavy environments will be especially useful.

Familiarity with Kubernetes or similar orchestration systems, infrastructure cost tracking, usage governance, capacity planning, large-scale storage, backup systems, archival workflows and regulated or security-sensitive environments will be considered an advantage. Relevant certifications such as RHCE, AWS, CCNA, NVIDIA, Linux Foundation, Kubernetes or cloud infrastructure certifications will be considered a plus. Strong hands-on infrastructure experience and practical problem-solving ability will be valued more than certifications alone.

Posted 1 week ago
