SigNoz

Sr Site Reliability Engineer

Posted 9 Hours Ago
Remote
Hiring Remotely in India
Senior level
Own reliability, scalability, and operability of a petabyte-scale observability SaaS. Improve SLOs/SLIs, incident response, and on-call practices; scale and tune ingest pipelines and ClickHouse; manage Kubernetes clusters, autoscaling, multi-tenancy, and upgrades; build infra-as-code, CI/CD, capacity planning, and observability for the platform.
The summary above was generated by AI

About SigNoz

SigNoz is an open-source observability platform that helps modern engineering teams monitor, debug, and optimize their applications with deep visibility into metrics, traces, and logs — all in one place. We're built natively on OpenTelemetry and offer both self-hosted and cloud options, so teams can run observability the way they want, without vendor lock-in.

We are growing fast and building core developer infra products. And we are not fooling around:

  • 27,000+ GitHub stars

  • 800+ customers

  • 7,000+ members in our Slack community

Role: Sr Site Reliability Engineer (SRE)

We're looking for an SRE to own the reliability, scalability, and operability of the SigNoz cloud platform. You'll keep a petabyte-scale observability system fast and dependable — making sure the people who trust us to watch their systems can always trust ours. The platform team handles infra, scalability of SaaS, ingest pipelines, staging environments, automation, and the operational backbone of the product.

This is a deeply hands-on role for someone who understands what actually breaks in production at scale — and enjoys fixing it for good.

What we're looking for

  • Kubernetes at scale — not just "I've deployed to k8s," but real fluency with the nuances and gotchas: resource tuning, autoscaling behavior, networking, stateful workloads, upgrades, and the failure modes that only show up under load

  • Working knowledge of ClickHouse — operating it, tuning queries, and understanding its behavior at scale — is a strong plus

  • Knowledge of Golang is a plus (most of our stack and tooling is in Go)

  • Familiarity with OpenTelemetry and running large-scale data ingest pipelines is a plus

What you'll work on

You'll work with a high-caliber team across areas like:

  • Reliability of the SigNoz cloud platform: SLOs/SLIs, error budgets, incident response, and on-call practices that don't burn people out

  • Scaling the ingest path — making it robust to bursts while maintaining data freshness

  • SaaS auto-scalability and capacity planning across a petabyte-scale system

  • Operating and tuning ClickHouse and the data layer for performance and cost

  • Kubernetes infrastructure: cluster operations, upgrades, multi-tenancy, and the automation that keeps it boring

  • Observability of SigNoz itself — we dogfood our own product, so you'll help make it world-class

  • Infrastructure-as-code, CI/CD, and the tooling that lets a small team operate big systems

What will make you successful

  • 5–8 years in SRE, infrastructure, or platform/backend roles operating production systems at scale

  • Deep, practical Kubernetes experience — you know where the bodies are buried

  • Strong grasp of distributed systems failure modes, performance debugging, and capacity planning

  • Comfortable in code (Go preferred) — you automate and fix things, not just configure them

  • Love of open source, ideally with prior contributions to OSS projects (any size)

  • Comfortable in a high-ownership, fast-moving, remote-first environment

  • Strong communication — can write clear runbooks and tech docs and explain trade-offs

Nice-to-haves

  • Past experience on platform/infra/SRE teams of Series B+ startups

  • Hands-on experience operating ClickHouse, Kafka, or similar high-throughput data systems

  • Experience in observability (monitoring / logging / tracing) and with OpenTelemetry

Why you'll love working at SigNoz

  • Work on a globally used open-source project that engineers actually love

  • Huge scope and ownership — your work directly shapes how teams adopt SigNoz

  • Collaborate with a high-caliber team who just can't stop shipping

  • Remote-first, async-friendly culture

  • Opportunity to help define the future of open-source observability
