We are seeking a Tech Lead – Site Reliability Engineering with expertise in DevOps, QA, and Cloud to lead reliability, automation, and performance engineering efforts across cloud-based systems. This role involves leading teams, establishing SRE best practices, and implementing scalable cloud architectures to ensure high availability, security, and efficiency.
Responsibilities
SRE & Cloud Reliability Engineering:
- Design and implement highly available, scalable cloud architectures.
- Ensure uptime and system reliability through proactive monitoring and incident management.
- Automate infrastructure provisioning and scaling using Terraform, Ansible, Kubernetes, and Helm.
DevOps & Automation:
- Develop and maintain CI/CD pipelines for automated build, test, and deployment.
- Implement GitOps workflows to streamline deployment processes.
- Optimize performance and cost-efficiency of cloud environments.
QA & Test Automation:
- Lead automated testing strategies for API reliability, performance, and security.
- Implement Test-Driven Development (TDD) and Continuous Testing methodologies.
- Perform load testing, stress testing, and resilience testing to prevent failures.
Observability, Monitoring & Incident Response:
- Set up monitoring and alerting dashboards using Prometheus, Grafana, Splunk, and Datadog.
- Implement log aggregation and distributed tracing for deep observability.
- Lead incident response, root cause analysis, and post-mortem analysis.
Security & Compliance:
- Enforce cloud security best practices (IAM policies, Zero Trust, cloud encryption).
- Ensure compliance with regulatory standards (SOC 2, ISO 27001, GDPR, HIPAA).
- Implement threat detection and anomaly detection using AI-driven monitoring tools.
Leadership & SRE Strategy:
- Lead and mentor SRE engineers, ensuring adoption of best practices.
- Collaborate with DevOps, Security, and Cloud teams to implement scalable and secure cloud infrastructures.
- Establish SRE operational strategies, playbooks, and incident management frameworks.
Qualifications
- Bachelor’s/Master’s degree in Computer Science, IT, or a related field.
- 7+ years of experience in Site Reliability Engineering, DevOps, and Cloud infrastructure.
- Deep expertise in SRE methodologies, cloud reliability, and distributed systems.
- Proficiency in container orchestration, cloud automation, and infrastructure as code.
- Strong knowledge of observability, monitoring, and AI-powered performance tuning.
- Experience with security compliance, disaster recovery, and failure prevention.
- Proficiency in scripting and automation (Python, Bash, Go, YAML).
- Proven experience in leading SRE teams, defining strategies, and implementing best practices.
Must-Have Skills:
- Expertise in Site Reliability Engineering (SRE), DevOps, and Cloud Automation.
- Experience in cloud computing platforms (AWS, GCP, Azure, OpenStack).
- Proficiency in Infrastructure as Code (Terraform, Ansible, CloudFormation).
- Hands-on experience with Kubernetes, Docker, and OpenShift.
- Deep understanding of observability, monitoring, and logging (Prometheus, Grafana, ELK, Splunk, Datadog).
- Experience in CI/CD pipeline automation (Jenkins, GitHub Actions, ArgoCD, Tekton).
- Expertise in performance tuning, cloud scaling, and system optimization.
- Proficiency in QA automation, API reliability testing, and microservices validation.
- Strong background in networking, traffic routing, and load balancing.
- Experience in incident response, disaster recovery, and reliability planning.
- Leadership experience in mentoring, managing teams, and driving SRE best practices.
Preferred Skills:
- Experience with Chaos Engineering and fault injection frameworks (Gremlin, LitmusChaos).
- Knowledge of cloud cost optimization strategies and FinOps.
- Understanding of Zero Trust Security, IAM policies, and cloud-native security practices.
- Proficiency in scripting and automation (Python, Bash, Go).
- Familiarity with ML-based anomaly detection for reliability monitoring.