SolarWinds

Sr. Staff Site Reliability Engineer

Posted 10 Days Ago

In-Office

Bangalore, Bengaluru Urban, Karnataka

Senior level

The role involves owning and operating ClickHouse infrastructure, leading SRE initiatives, architecting cloud-native infrastructure, driving automation, and mentoring team members, while ensuring service reliability and performance for the observability platform.

The summary above was generated by AI

At SolarWinds, we’re a people-first company. Our purpose is to enrich the lives of the people we serve—including our employees, customers, shareholders, partners, and communities. Join us in our mission to help customers accelerate business transformation with simple, powerful, and secure solutions.

The ideal candidate thrives in an innovative, fast-paced environment and is collaborative, accountable, ready, and empathetic. We’re looking for individuals who believe they can accomplish more as a team and create lasting growth for themselves and others. We hire based on attitude, competency, and commitment. Solarians are ready to advance our world-class solutions in a fast-paced environment and accept the challenge to lead with purpose. If you’re looking to build your career with an exceptional team, you’ve come to the right place. Join SolarWinds and grow with us!

Your Role

We are seeking a Sr. Staff Site Reliability Engineer with deep expertise in ClickHouse, Kubernetes, GitOps, AWS/Azure, and large-scale SaaS infrastructure. This role is critical to the reliability and performance of our Observability Platform—specifically the SaaS Logs and high-throughput data pipelines powered by ClickHouse.

You will lead reliability strategy, architecture, and execution across distributed systems, helping shape how SolarWinds ingests, stores, queries, and scales massive observability datasets. This includes owning ClickHouse production clusters, designing performance-optimized schemas, ensuring high availability, and driving automation around data platform operations.

Responsibilities

Own and operate ClickHouse infrastructure—including cluster provisioning, sharding, replication, performance tuning, storage optimization, and backup/restore automation.
Collaborate with engineering teams to shape data ingestion, storage, and query performance requirements for high-volume observability workloads.
Lead SRE initiatives around infrastructure reliability, SLAs/SLOs, observability, incident management, and post-incident learning.
Architect and implement scalable, cloud-native infrastructure using Kubernetes, Terraform, GitOps, and modern SRE practices.
Drive automation across provisioning, deployments, monitoring, and operational workflows.
Lead and mentor SRE team members, providing direction across distributed systems, reliability engineering, and data-platform operations.
Guide incident response for production issues, participate in on-call rotations, facilitate postmortems, and champion a culture of continuous improvement.
Establish and enforce best practices across monitoring, telemetry, capacity planning, security, and operational excellence.

Ideal Attributes

Deep customer orientation with a strong ownership mindset.
Experience influencing architecture and long-term technical direction.
Exceptional communication skills—able to translate complex technical topics to cross-functional stakeholders.
Bias for action, data-driven decision-making, and problem-solving under pressure.
Collaborative, empathetic, and committed to the growth of the team.

Qualifications

Required:

Expert-level experience operating ClickHouse at scale—including performance tuning, schema design, cluster operations, replication, partitioning, RBAC, and storage optimization.
10+ years designing, building, and maintaining large-scale SaaS infrastructure.
8+ years of hands-on experience with AWS and/or Azure using Terraform.
5+ years deploying, scaling, and operating Kubernetes clusters in production.
Strong experience with data platform infrastructure and distributed systems.
Proficiency in Python or Go; solid skills in shell scripting and SQL.
Strong background in observability (metrics, logs, tracing), system health, and proactive monitoring.
Experience with SQL/NoSQL database technologies.
Experience with GitOps (Flux/ArgoCD), CI/CD, and automated deployment workflows.
Understanding of security operations, encryption, key management, and cloud security principles.
Demonstrated experience mentoring engineers and driving team-wide engineering excellence.

Nice to Have:

Experience with large-scale observability platforms, monitoring pipelines, or log analytics systems.
Experience with ClickHouse Keeper, tiered storage, or multi-cluster architectures.

SolarWinds is an Equal Employment Opportunity Employer. SolarWinds will consider all qualified applicants for employment without regard to race, color, religion, sex, age, national origin, sexual orientation, gender identity, marital status, disability, veteran status or any other characteristic protected by law.

All applications are treated in accordance with the SolarWinds Privacy Notice: https://www.solarwinds.com/applicant-privacy-notice

Top Skills

AWS

Azure

ClickHouse

GitOps

Kubernetes

NoSQL

Python

SQL

Terraform
