NVIDIA

Senior Site Reliability Engineer

Reposted 3 Days Ago

In-Office

2 Locations

Senior level

As a Senior Site Reliability Engineer, you will design and implement Kubernetes architecture, automate workflows, and ensure high availability of systems while collaborating with various teams on complex infrastructure needs.

The summary above was generated by AI

NVIDIA is looking for a world-class engineer to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization as a Senior DevOps and SRE Engineer. You will be part of a team that develops and maintains sophisticated build and test environments for a multitude of hardware platforms, both NVIDIA GPUs and Tegra processors, across various operating systems (Windows/Linux/Android). The team works with other business units within NVIDIA Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics and Driverless Cars, to meet their infrastructure and systems needs.

What you’ll be doing:

Implement the Kubernetes architecture end to end: design, deployment, hardening, networking, sizing, scaling, etc.
Implement high-availability clusters and disaster recovery solutions.
Apply strong system administration experience using configuration as code and infrastructure as code with tools such as Ansible, Puppet, Chef and Terraform.
Design and implement logging and monitoring solutions to gain more insight into application and system health. Implement critical metrics using various analytics methods and dashboards.
Craft and develop tools needed for automating workflows. Apply AI techniques to extract useful signals about machines and jobs from the data generated.
Take part in prototyping, crafting and developing cloud infrastructure for NVIDIA.
Participate in on-call support and critical issue coverage as an SRE.

What we need to see:

Solid programming background in Python, Go and/or similar scripting languages.
Excellent debugging, problem-solving and analytical skills.
Strong understanding of the architectural requirements and development processes involved in building reliable, robust, scalable data products and pipelines.
Proficiency in configuration management and IaC tools such as Ansible, Puppet, Chef and Terraform.
Strong background with GitLab, Jenkins, Flux, Argo CD and/or other tools to build secure CI/CD systems.
Strong expertise in Kubernetes architecture, networking, RBAC, and persistent storage solutions such as Trident, Ceph, EBS, Longhorn, etc.
Proficiency in secret management tools such as HashiCorp Vault, AWS Secrets Manager, etc.
Proficiency in data analytics/visualization and monitoring tools such as Kibana, Grafana, Splunk, Zabbix, Prometheus and/or similar systems.
5+ years of proven experience.
Bachelor’s or master’s degree in Computer Science, Software Engineering, or equivalent experience.

Ways to stand out from the crowd:

Ability to thrive in a multitasking environment with constantly evolving priorities.
Prior experience with a large-scale operations team. Experience using and improving data centers. Expertise with Windows Server infrastructure.
Outstanding interpersonal skills and communication with all levels of management.
Ability to break complex problems down into simple subproblems and reuse available solutions to solve most of them. Ability to design simple systems that work efficiently without needing much support.
Ability to leverage AI/ML to proactively detect and resolve incidents, automate alert triage and log analysis, and automate repetitive workflows.

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our exclusive engineering teams are rapidly growing. If you’re a creative and autonomous engineer with a real passion for technology, we want to hear from you.

Top Skills

Ansible, Argo CD, AWS, Chef, Flux, GitLab, Grafana, HashiCorp Vault, Jenkins, Kibana, Kubernetes, Prometheus, Puppet, Python, Splunk, Terraform, Zabbix

Survey No.144 145, Commerzone No.5, Off, Airport Rd, Yerawada, Pune, Maharashtra, India, 411006

Similar Jobs

CrowdStrike

Site Reliability Engineer

8 Days Ago

Remote or Hybrid

Senior level

Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity

Responsible for designing and maintaining monitoring solutions for IT infrastructure, focusing on reliability and automation, and participating in incident management.

Top Skills: Ansible, AWS CloudWatch, Bash, Datadog, Docker, ELK Stack, GCP, Kubernetes, LogicMonitor, LogScale, PowerShell, Prometheus, Python, Splunk, Terraform, ThousandEyes, Zscaler Digital Experience

Pattern Bioscience

Senior Site Reliability Engineer

8 Days Ago

In-Office

Pune, Maharashtra, IND

Senior level

Biotech

The Senior Site Reliability Engineer will enhance system reliability and scalability, manage AWS resources, automate processes, and establish monitoring systems while collaborating with development teams.

Top Skills: Ansible, AWS, CloudFormation, Datadog, Docker, Dynatrace, GitHub Actions, Grafana, Kinesis Streams, Linux, New Relic, Postgres, Prometheus, Puppet, Python, Redis, S3, Shell, Terraform

NVIDIA

Senior Site Reliability Engineer

13 Days Ago

In-Office or Remote

Senior level

Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse

The Senior Site Reliability Engineer will design, build and maintain large-scale systems for data analytics and machine learning, ensuring reliability and performance while applying SRE principles.

Top Skills: CI/CD, ELK, GitHub Actions, Go, Jenkins, Kafka, Kubernetes, OpenStack, Perl, Prometheus, Python, Ruby, Spark

What you need to know about the Pune Tech Scene

Once a far-out concept, AI is now a tangible force reshaping industries and economies worldwide. While its adoption will automate some roles, AI has created more jobs than it has displaced, with an expected 97 million new roles to be created in the coming years. This is especially true in cities like Pune, which is emerging as a hub for companies eager to leverage this technology to develop solutions that simplify and improve lives in sectors such as education, healthcare, finance, e-commerce and more.
