Site Reliability Engineer
Building resilient systems at scale

Ensuring 99.9% uptime through automated monitoring, incident response, and infrastructure-as-code. Passionate about eliminating toil and building self-healing systems.

kubectl get sre --status=ready --experience=3years

View Infrastructure Schedule On-Call

System Metrics

Uptime SLA 99.96%

MTTR < 11min

Error Budget 98.5%

Incidents/Month < 2

Status All Systems Operational
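
To make the uptime and error budget figures above concrete, the sketch below converts an availability percentage into allowed downtime and reports how much of an error budget remains. It assumes a 30-day window, and the function names `allowed_downtime_minutes` and `budget_remaining_pct` are illustrative, not part of any tooling named on this page.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window at the given availability."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


def budget_remaining_pct(slo_pct: float, downtime_minutes: float,
                         window_days: int = 30) -> float:
    """Share of the error budget still unspent, as a percentage (floored at 0)."""
    budget = allowed_downtime_minutes(slo_pct, window_days)
    if budget <= 0:
        raise ValueError("slo_pct must be below 100")
    return max(0.0, 100 * (1 - downtime_minutes / budget))


if __name__ == "__main__":
    # 30-day window assumed; the inputs are example values only.
    print(f"{allowed_downtime_minutes(99.96):.2f} min")  # 17.28 min
    print(f"{allowed_downtime_minutes(99.9):.2f} min")   # 43.20 min
```

Under these assumptions, 99.96% availability over 30 days corresponds to roughly 17 minutes of downtime.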

About

I'm a Site Reliability Engineer focused on building and maintaining highly available, scalable systems. My approach combines software engineering principles with operations expertise to create resilient infrastructure that scales with business needs.

Currently working with cloud-native technologies, Kubernetes, and observability stacks to ensure seamless user experiences and minimize downtime. I believe in treating operations as a software problem and automating everything that can be automated.

Technical Stack

Cloud Platforms

Azure (Expert)
AWS (Proficient)
Google Cloud Platform
Multi-cloud strategies

Container Orchestration

Kubernetes (CKA)
Docker & containerd
Helm Charts
Service Mesh (Istio)

Infrastructure as Code

Terraform
ARM Templates
Ansible
Pulumi

Observability

Prometheus & Grafana
ELK Stack
Azure Monitor
Jaeger Tracing

CI/CD & GitOps

Azure DevOps
GitHub Actions
ArgoCD
Jenkins

Programming & Scripting

Python (Automation)
Go (Tools & Services)
Bash/PowerShell
YAML/JSON

Experience

Site Reliability Engineer

TCS

2021 — Present

Maintained 99.95% uptime for critical healthcare applications serving 100K+ users
Implemented comprehensive monitoring with Prometheus, Grafana, and custom alerting, reducing MTTR by 60%
Migrated monolithic applications to microservices architecture on Kubernetes, improving scalability by 300%
Built CI/CD pipelines with automated testing, security scanning, and blue-green deployments
Developed infrastructure-as-code templates, reducing provisioning time from days to hours
Led incident response procedures and post-mortem reviews, establishing a blameless culture
Automated backup, disaster recovery, and compliance reporting processes

Infrastructure Projects

Multi-Cloud Kubernetes Platform

Built a production-ready Kubernetes platform spanning Azure and AWS with automated failover, centralized logging, and a comprehensive monitoring stack.

Kubernetes
Terraform
Prometheus
Istio

Observability Stack

Designed and deployed an end-to-end observability solution with metrics, logs, traces, and SLI/SLO monitoring for distributed systems.

Grafana
ELK
Jaeger
SLOs
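
As a rough illustration of the SLI/SLO monitoring mentioned above, the sketch below computes an error-budget burn rate and applies a two-window paging check. The function names and the 14.4 threshold are assumptions for the example (14.4 is a commonly cited value for spending 2% of a 30-day budget in one hour). They are not taken from this project.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Rate of error-budget consumption; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both the long and short windows burn faster than the threshold.

    Requiring both windows avoids paging on an incident that has already recovered.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical error ratios measured over a long and a short window.
    print(should_page(0.02, 0.02, slo_target=0.999))
    print(should_page(0.02, 0.001, slo_target=0.999))
```

In a Prometheus-based setup, the two error ratios would typically come from recording rules over different range windows, and the check would live in an alerting rule, not in application code.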

Automated Incident Response

Developed a ChatOps-driven incident management system with automated runbooks, escalation policies, and real-time status page updates.

Python
Slack API
PagerDuty
Automation
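
An escalation policy like the one described above can be thought of as a list of time thresholds. The sketch below is a minimal stand-alone version with hypothetical tiers and names. It does not call the Slack or PagerDuty APIs and is not the project's implementation.

```python
from typing import List, Tuple

# (minutes unacknowledged, who to notify), in ascending order. All values are hypothetical.
ESCALATION_POLICY: List[Tuple[int, str]] = [
    (0, "primary-on-call"),
    (10, "secondary-on-call"),
    (30, "engineering-manager"),
]


def current_escalation(minutes_unacked: float,
                       policy: List[Tuple[int, str]] = ESCALATION_POLICY) -> str:
    """Return the highest tier whose threshold has been reached."""
    target = policy[0][1]
    for threshold, who in policy:
        if minutes_unacked >= threshold:
            target = who
    return target


if __name__ == "__main__":
    for minutes in (0, 12, 45):
        print(minutes, current_escalation(minutes))
```

Keeping the policy as plain data means the tiers can be reviewed and changed without touching the lookup logic.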

Cost Optimization Engine

Built an intelligent cost optimization system using machine learning to right-size resources and identify waste, reducing cloud costs by 35%.

Python
Azure APIs
ML
FinOps

GitOps Pipeline

Implemented GitOps workflow with ArgoCD for declarative deployments, automated security scanning, and policy enforcement.

ArgoCD
Kustomize
OPA
Security

Disaster Recovery Automation

Automated disaster recovery procedures with cross-region replication, automated testing, and RTO/RPO compliance monitoring.

Terraform
Backup
DR
Compliance
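
RTO/RPO compliance monitoring comes down to two comparisons: how old the latest recoverable backup is, and how long a restore takes. The sketch below shows those checks with hypothetical targets and names (`DrTarget`, `rpo_met`, `rto_met`). The real values and data sources are not given on this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrTarget:
    rpo: timedelta  # maximum tolerable data loss, measured as backup age
    rto: timedelta  # maximum tolerable time to restore service


def rpo_met(last_backup: datetime, now: datetime, target: DrTarget) -> bool:
    """True if the newest backup is recent enough to satisfy the RPO."""
    return now - last_backup <= target.rpo


def rto_met(restore_duration: timedelta, target: DrTarget) -> bool:
    """True if a measured restore (for example from a DR drill) fits within the RTO."""
    return restore_duration <= target.rto


if __name__ == "__main__":
    # Hypothetical targets and measurements.
    target = DrTarget(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
    now = datetime(2024, 1, 1, 12, 0)
    print(rpo_met(datetime(2024, 1, 1, 11, 50), now, target))
    print(rto_met(timedelta(minutes=90), target))
```

Running checks like these after every automated DR test turns the RTO and RPO from documented goals into measured values.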

Connect

Available for SRE opportunities, consulting, or just talking about reliability engineering.

LinkedIn GitHub Email Resume Blog Certifications
