NVIDIA

Principal Site Reliability Engineer, AI Infrastructure

Posted 3 Days Ago
In-Office or Remote
6 Locations
Expert/Leader
Lead the architecture, implementation, and scaling of globally distributed systems for AI/ML and HPC. Drive reliability strategies and innovation at NVIDIA, mentoring global teams and contributing to cross-organizational efforts.
The summary above was generated by AI

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you! NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for over 30 years. It’s an outstanding legacy of innovation that’s fueled by phenomenal technology and exceptional people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and exceptional talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
 

What You Will Be Doing:

  • Architect, lead, and scale globally distributed production systems supporting AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments.

  • Design and lead implementation of automation frameworks that reduce manual tasks, promote resilience, and uphold standard methodologies for system health, change safety, and release velocity.

  • Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing approaches for sophisticated distributed systems.

  • Lead cross-organizational efforts to assess operational maturity, address systemic risks, and establish long-term reliability strategies in collaboration with engineering, infrastructure, and product teams.

  • Pioneer initiatives that influence NVIDIA’s AI platform roadmap, participating in co-development efforts with internal partners and external vendors, and staying ahead of academic and industry advances.

  • Publish technical insights (papers, patents, whitepapers) and drive innovation in production engineering and system design.

  • Lead and mentor global teams in a technical capacity, participating in recruitment, design reviews, and developing standard methodologies in incident response, observability, and system architecture.

What We Need to See:

  • 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.

  • Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).

  • Expert-level programming in Python and one or more languages such as C++, Go or Rust.

  • Demonstrated experience with Kubernetes at scale, CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.

  • Hands-on expertise in observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code (Terraform, CDK, Pulumi).

  • Proficiency in Site Reliability Engineering concepts like error budgets, SLOs, distributed tracing, and architectural fault tolerance.

  • Ability to influence multi-functional collaborators and drive technical decisions through effective written and verbal communication.

  • Proven track record of completing long-term, forward-looking platform strategies.

  • Degree in Computer Science or related field, or equivalent experience.

Ways to Stand Out from the Crowd:

  • Hands-on experience building platforms for large-scale AI training, inferencing, and data movement pipelines.

  • Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) and orchestration frameworks (e.g., Ray, Kubeflow).

  • Expertise in hardware fleet observability, predictive failure analysis, and power/resource-aware scheduling.

  • Experience leading operational readiness efforts and reliability engineering in GPU-heavy environments.

  • Track record of driving cultural improvements in incident management, root cause analysis, and postmortem processes across large teams.

Join us and build the infrastructure that powers the world’s most advanced AI. Apply now and make your mark at NVIDIA! Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package.

Top Skills

AWS
Azure
C++
CDK
ELK
GCP
Go
Grafana
JAX
Kubeflow
Kubernetes
Linux
Loki
OCI
Prometheus
Pulumi
Python
PyTorch
Ray
Rust
TensorFlow
Terraform
Unix

NVIDIA Pune, Mahārāshtra, IND Office

Survey No.144 145, Commerzone No.5, Off, Airport Rd, Yerawada, Pune, Maharashtra, India, 411006
