Staff Engineer - Site Reliability Engineering
MontyCloud
Software Engineering
Bengaluru, Karnataka, India
Posted on May 11, 2026
Role Overview
MontyCloud is seeking a highly experienced Staff Site Reliability Engineer (SRE) to lead reliability, scalability, and operational excellence for our cloud-native, AI-driven SaaS platform. This role calls for strategic, organization-wide impact, combining deep expertise in distributed systems with modern practices in automation, observability, and AI-driven operations (AIOps). You will define reliability standards, influence system architecture, and build intelligent systems that enable engineering teams to operate efficiently and proactively.
As a Staff SRE, you will champion automation-first and AI-augmented reliability engineering, reducing operational toil, improving system resilience, and driving a culture of ownership and continuous improvement across teams.
Key Responsibilities
- Define and drive organization-wide reliability strategy, including SLIs, SLOs, SLAs, and error budgets.
- Influence system architecture to ensure high availability, scalability, fault tolerance, and operability.
- Design and build scalable automation frameworks and internal platforms to reduce operational toil and enable self-service capabilities.
- Leverage AI/ML-driven approaches to enhance observability, anomaly detection, and predictive incident prevention.
- Implement and optimize AI-assisted incident management, including alert triage, root cause analysis, and automated remediation workflows.
- Lead implementation of centralized observability (metrics, logs, traces) and define effective alerting and monitoring strategies.
- Drive proactive performance optimization, capacity planning, and system efficiency improvements using data-driven insights.
- Lead incident management, including critical incident response, resolution, and blameless postmortems with a focus on systemic fixes.
- Design and improve incident and change management workflows, integrating observability with ITSM tools (e.g., ServiceNow, Jira Service Management, PagerDuty).
- Automate incident detection, triage, escalation, and remediation workflows to minimize manual intervention.
- Champion resilience practices such as disaster recovery, chaos engineering, and failure testing.
- Partner with engineering teams to improve CI/CD reliability, release safety, and deployment strategies (e.g., canary, blue-green).
- Continuously reduce MTTR, change failure rate, and operational overhead through automation and engineering improvements.
- Drive cloud cost optimization and resource efficiency, including optimization of AI/ML workloads and inference costs.
- Collaborate with data and ML teams to ensure reliability, scalability, and observability of AI/ML systems, including monitoring for drift and performance degradation.
- Mentor engineers and act as a technical leader, influencing best practices and elevating reliability standards across teams.
- Foster a culture of ownership, an automation-first mindset, and AI-augmented operational excellence.
Desired Skills and Requirements
Must Have
- Problem-solving skills
- Cloud: AWS
- Programming/Scripting: Python, Go
- Containerization: Kubernetes, containers, microservices architectures
- Infrastructure as Code (IaC): Terraform, CloudFormation
- Automation/Configuration Management: Ansible, Puppet, Chef
- Monitoring/Observability: Datadog, Prometheus, Grafana, Splunk, AWS CloudWatch, AWS X-Ray
- Reliability Engineering: SLIs, SLOs, SLAs, error budgets
- Incident Management & Reliability Frameworks
- CI/CD and Release Engineering: Jenkins, GitLab CI, etc.
- ITSM & Incident Tools: ServiceNow, Jira Service Management, PagerDuty, Opsgenie
- AI/ML & AIOps for observability, alerting, incident analysis, and automation
- System Design, Scalability, Performance Engineering, and Reliability Trade-offs
- Distributed Systems expertise
Good-to-Have
- General Dev Experience: Internal Developer Platforms (IDP) & Platform Engineering
- Chaos Engineering Tools: e.g., Gremlin, Chaos Monkey
- Resilience Testing
- Security, Compliance, and Governance in Cloud Environments
- Application Development
- Agile Methodology
- FinOps & Cloud Cost Optimization
Experience
- 8+ years of experience in Site Reliability Engineering / DevOps / Platform Engineering in SaaS platform environments.
- 3 years of experience specifically in managing and optimizing SaaS platforms.
- 3 years of hands-on, expert-level experience with AWS.
- 4 years of experience using automation tools like Ansible, Puppet, or Chef.
- 4 years of experience with scripting in Python or similar languages.
- 3 years of experience using tools like Splunk, New Relic, Datadog, AWS CloudWatch, or AWS X-Ray.
- 3 years of experience leading disaster recovery efforts in current or previous roles.
- 3 years of experience implementing chaos engineering practices in live environments.
- 4 years of active involvement in on-call rotations and incident management.
- 4+ years of end-to-end application development experience, showcasing familiarity with the complete software development lifecycle and a strong ability to design, implement, and deploy functional, scalable applications.
- 3 years of experience leading post-mortem analysis sessions following major incidents.
Education
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Equivalent practical experience in large-scale SaaS or cloud-native environments is highly valued.