Hitachi Digital Services

SRE Lead

Posted 11 Days Ago

3 Locations

Senior level

Lead the SRE team to manage application and infrastructure reliability in cloud and on-premises environments. Responsibilities include overseeing incident response, defining and implementing observability solutions, driving automation, and mentoring junior team members, while promoting SRE best practices.

The summary above was generated by AI

Our Company

We’re Hitachi Digital Services, a global digital solutions and transformation business. Our expertise, innovation and technology unlock potential, whether that means taking theme park fans on magical rides, conserving natural resources, protecting rainforests, or saving lives. We automate, modernize, optimize, and accelerate. Our people are trusted transformers, with deep engineering expertise, focused on a sustainable future for all.

Imagine the sheer breadth of talent it takes to inspire the future. We don’t expect you to ‘fit’ every requirement – your life experience, character, perspective, and passion for achieving great things are equally important to us.

The team

We’re a leader in cutting-edge innovation, the transformative power of cloud technology, and converged and hyperconverged solutions. Our mission is to empower clients to securely store, manage, and modernize their digital core, unlocking valuable insights and driving data-driven value.

This strong, diverse, and collaborative group of technology professionals works with teams to support our customers as they store, enrich, activate, and monetise their data, bringing value to every line of their business.

The role:

SRE Lead (mandatory skills)
• Deep understanding of SRE principles and experience in anomaly detection, root cause analysis, and predictive maintenance.
• Working knowledge of an automation-first approach and of defining SLIs, SLOs, and error budgets.
• Experience leading an operations team in an application production environment.
• Experience with programming and scripting languages (Java, Python, PowerShell, VBScript).
• Working knowledge of Kubernetes and OpenTelemetry.
• Knowledge of Generative AI concepts, LLM fundamentals, and Responsible AI concepts.
• Knowledge of DevOps methodologies, tools, and automation: CI/CD pipelines, tools (GitHub, Terraform, ArgoCD, Helm, etc.), and infrastructure automation.
• Experience working with public and private cloud platforms (AWS, Azure, GCP, Rancher, etc.).
• Proficiency in incident response, change and release processes, application monitoring, and platform optimization.
• Knowledge of model fine-tuning, prompt engineering, RAG fundamentals, and cost optimisation is a plus.

What you’ll be doing
• Lead a team of platform, application, and incident SREs to debug and troubleshoot application and infrastructure issues in cloud-based and on-premises environments, and handle live production incidents.
• Lead efforts to improve the performance, availability, and reliability of applications.
• Define and implement effective observability solutions to proactively identify and resolve issues and drive optimisation.
• Define and manage the incident, change and release management, deployment, on-call, and escalation processes.
• Develop automation (IaC, alerts as code, dashboards as code, etc.) to increase efficiency and reduce toil.
• Conduct POCs to implement tools and solutions that support the Generative AI application platform.
• Analyse operational performance (incident, problem, and alert trends) and drive optimisation.
• Follow and implement SRE best practices and standards within the team.
• Document SOPs, processes, critical system information, KB articles, POCs, standards, and best practices for current and future reference.
• Provide technical guidance and mentorship to junior SRE team members.
• Stay up to date with the latest advancements in the Generative AI space.

What you bring to the team
• Experience applying SRE principles and best practices to manage on-premises and cloud applications.
• Working knowledge of Generative AI applications.
• Ability to lead the team in continuous improvement, estimate work, and escalate issues on time.
• Strong analytical skills to identify and resolve complex technical issues, ensure system reliability, and minimize downtime.
• Strong communication and interpersonal skills to collaborate effectively with cross-functional teams.

Experience and Education:
• A Bachelor’s degree in Computer Science or a related field, with 5+ years of experience leading a team of SREs.

About us

We’re a global team of innovators. Together, we harness engineering excellence and passion to co-create meaningful solutions to complex challenges. We turn organizations into data-driven leaders that can make a positive impact on their industries and society. If you believe that innovation can bring a better tomorrow closer to today, this is the place for you.

Championing diversity, equity, and inclusion

Diversity, equity, and inclusion (DEI) are integral to our culture and identity. Diverse thinking, a commitment to allyship, and a culture of empowerment help us achieve powerful results. We want you to be you, with all the ideas, lived experience, and fresh perspective that brings. We support your uniqueness and encourage people from all backgrounds to apply and realize their full potential as part of our team.

How we look after you

We help take care of your today and tomorrow with industry-leading benefits, support, and services that look after your holistic health and wellbeing. We’re also champions of life balance and offer flexible arrangements that work for you (role and location dependent). We’re always looking for new ways of working that bring out our best, which leads to unexpected ideas. So here, you’ll experience a sense of belonging, and discover autonomy, freedom, and ownership as you work alongside talented people you enjoy sharing knowledge with.

We’re proud to say we’re an equal opportunity employer and welcome all applicants for employment without attention to race, colour, religion, sex, sexual orientation, gender identity, national origin, veteran status, age, disability status, or any other protected characteristic. Should you need reasonable accommodations during the recruitment process, please let us know so that we can do our best to set you up for success.

Top Skills

Java

PowerShell

Python

VBScript

Tower VII, Magarpatta City SEZ, Hadapsar, Pune, India, 411001
