Who we are
We're Redis. We built the product that runs the fast apps our world runs on. (If you checked the weather, used your credit card, or looked at your flight status online today, you’re welcome.) At Redis, you’ll work with the fastest, simplest technology in the business—whether you’re building it, telling its story, or selling it to our 10,000+ worldwide customers. We’re creating a faster world with simpler experiences. You in?
We are seeking an analytical, data-driven and process-oriented Problem Manager to join our growing service operations team within CloudOps. In this role, you’ll be responsible for managing and driving resolution of recurring or complex technical problems, ensuring long-term solutions are implemented to improve the reliability and performance of our services. You’ll work closely with operations, support, R&D, developers, and product teams to prevent incidents and maintain service excellence.
Responsibilities
- Own the end-to-end problem management process, including identification, categorization, root cause analysis (RCA), and tracking of corrective actions.
- Work closely with DevOps, engineering, product and support teams to analyze and identify emerging trends, and drive resolution of systemic issues.
- Facilitate and moderate value-driven post-mortems for high-impact incidents, and maintain a database of known errors and workarounds.
- Collaborate with incident management teams to distinguish problems from incidents and prioritize accordingly.
- Ensure timely completion of preventive measures and continuous improvement initiatives.
- Create new programs and processes, including designing workflows, defining goals and KPIs, and implementing continuous improvements based on data-driven insights.
- Develop Tableau reporting dashboards to measure problem management effectiveness.
- Drive process improvements and automation opportunities for faster RCA and prevention.
Required Qualifications
- 3–5 years of experience in problem management, site reliability engineering, DevOps, or incident response roles in SaaS, cloud, or tech environments.
- Strong technical background with knowledge of distributed systems, networking, databases, and cloud infrastructure (Azure).
- Demonstrated experience facilitating post-mortems and driving RCA to completion.
- Familiarity with ITIL Problem Management or similar frameworks (certification is a plus but not required).
- Excellent communication and coordination skills across technical and non-technical stakeholders.
- Experience using observability tools (e.g., Zabbix, Prometheus, Grafana), ticketing systems (Jira, Zendesk), incident management tools (FireHydrant, Squadcast) and reporting tools (Tableau, PostgreSQL).
Preferred Qualifications
- Experience in fast-paced environments such as startups or hyper-growth tech companies.
- Exposure to Redis or similar in-memory data stores.
- Experience automating operational tasks with scripting (Python, Bash, etc.) and AI-driven tools.
#LI-BL1 #LI-Hybrid