AppDirect

Senior Site Reliability Engineer

Posted 2 Days Ago
In-Office
Pune, Maharashtra
Senior level
Lead the SRE efforts for India's DevOps team, enhancing application reliability, managing incidents, and mentoring engineers while promoting SRE practices and automating workflows.
The summary above was generated by AI

About AppDirect

Become a digital, global citizen and enable the new generation of digital entrepreneurs around the world. AppDirect offers a subscription commerce platform to sell any product, through any channel, on any device - as a service. We power millions of subscriptions worldwide for organizations. We do this through our values-driven culture - one that enables you to Be Seen, Be Yourself, and Do Your Best Work.

About the DevOps Platform Team

Our mission is to provide a robust Internal Developer Platform to AppDirect’s engineering teams, which makes it easy, safe and fun to design, implement, release and maintain the world’s leading subscription commerce platform. We are proud to be core contributors and maintainers of AppDirect’s Software Development Lifecycle (SDLC), through close alignment with Reliability, Quality, Data, InfoSec, Cloud, and other technology leadership.

We enable DevOps culture through our self-service, automated CI/CD platform. Currently, teams use the platform to make more than 3,000 code deliveries every month, to 700 applications, on AWS, Azure, and on-premises environments, while remaining ISO27001, SOC2 and PCI compliant. Our Datadog instrumentation gives teams clear insights, monitoring, and alerting so they can maintain the availability of their experiences.

What you'll do and how you'll have an impact

  • Be the founding SRE for India within the DevOps Platform Team, establishing operating rhythms, guardrails, and best practices that raise reliability across hundreds of services and 30+ Kubernetes clusters.
  • Lead global incident management from India time zones: triage and drive resolution as Incident Commander, coordinate war rooms, manage stakeholder communications, and publish timely status page updates.
  • Maintain automations to enable on-call rotations, escalation policies, and incident workflows in PagerDuty, Datadog and Slack.
  • Create actionable runbooks to reduce MTTA/MTTR.
  • Define and operationalize SLIs/SLOs and error budgets with product and engineering teams; coach teams on using error budgets for release decisions and reliability trade-offs.
  • Create high-signal observability: instrument services, tune alerts to reduce noise, and build reliability dashboards in Datadog.
  • Own planned maintenance: plan and schedule maintenance windows, coordinate execution across teams and environments (AWS, Azure, on-prem), communicate broadly, and verify recovery with clear rollback plans.
  • Eliminate toil through automation: build ChatOps, status page automation, auto-remediation workflows, and runbooks-as-code; integrate incident and maintenance workflows into CI/CD (Jenkins, Argo).
  • Drive production readiness: define PRR checklists, bake reliability gates into pipelines, and improve deployment strategies (blue/green, progressive delivery).
  • Partner with DevOps Platform Engineers to harden the Internal Developer Platform and improve developer experience while maintaining compliance requirements (e.g., ISO27001, SOC2, PCI).
  • Lead blameless postmortems, track corrective actions, and maintain a reliability backlog that measurably improves availability, latency, and change success rate.
  • Mentor engineers and evangelize SRE principles through documentation, training, and a reliability guild/community of practice.

What we're looking for

  • 4+ years in SRE/Production Engineering/DevOps operating distributed systems and microservices at scale, including Kubernetes and containerized workloads.
  • Proven incident response leadership: incident triage and coordination, clear stakeholder/customer communications, status page management, and creation of robust runbooks.
  • Strong observability skills, ideally in Datadog (metrics, logs, traces, dashboards, monitors), or familiarity with Prometheus/Grafana, New Relic, Dynatrace, or similar tools.
  • Expertise designing actionable alerts tied to SLIs/SLOs and managing error budgets.
  • Hands-on with CI/CD and release engineering: GitHub Actions, Argo (or similar), progressive delivery, feature flags, and safe rollout/rollback patterns.
  • Proficiency in at least one programming language (Golang preferred) plus Bash.
  • Ability to automate incident workflows, status page updates, and remediation tasks via APIs and ChatOps.
  • Solid foundations in Linux, networking, web protocols, DNS/TLS, load balancers/CDNs, and performance/capacity analysis.
  • Experience with databases and messaging systems is a plus.
  • Cloud fluency in Kubernetes, AWS, and/or Azure, with an understanding of multi-tenant, multi-region, and hybrid/on-prem environments.
  • Security-minded and comfortable working within compliance frameworks.
  • Infrastructure as Code experience (Terraform, Ansible, Kubernetes or similar) and Git-centric workflows.
  • Excellent written and verbal communication skills, with the ability to translate technical detail into concise business updates under pressure.
  • Self-starter comfortable with ambiguity and a founding-role mindset: high ownership, bias for action, data-driven decision making, and a passion for eliminating toil.
  • Willingness to participate in on-call during India hours and collaborate with global teams for follow-the-sun coverage.

At AppDirect, we believe that innovation thrives in an environment that houses diversity of excellence, experience and thought. We respect each AppDirector as their own fingerprint; unique with no one alike. We foster an environment of inclusion without regard to race, religion, age, sexual orientation, or gender identity enabling AppDirectors to embrace their uniqueness to do their best work. As such, we strongly encourage applications from Indigenous peoples, racialized people, people with disabilities, people from gender and sexually diverse communities, and/or people with intersectional identities.

At AppDirect we take privacy very seriously. For more information about our use and handling of personal data from job applicants, please read our Candidate Privacy Policy. For more information about our general privacy practices, please see the AppDirect Privacy Notice: https://www.appdirect.com/about/privacy-notice

Top Skills

Ansible
AWS
Azure
Bash
Datadog
GitHub Actions
Go
Jenkins
Kubernetes
Terraform

AppDirect Pune, Mahārāshtra, IND Office

Magarpatta Road, Pune, Maharashtra, India, 411028
