Benito J D — SRE Portfolio

Site Reliability Engineer
Building resilient systems at scale

Ensuring 99.9% uptime through automated monitoring, incident response, and infrastructure-as-code. Passionate about eliminating toil and building self-healing systems.

kubectl get sre --status=ready --experience=3years

System Metrics

Uptime SLA 99.96%
MTTR < 11 min
Error Budget 98.5%
Incidents/Month < 2
Status All Systems Operational
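To put the uptime figure in concrete terms, here is a minimal sketch that converts an availability percentage into a downtime allowance. The 30-day window and the function name are assumptions for illustration, not the measurement method behind the numbers above.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window.

    The 30-day default window is an assumption; real SLAs define their own period.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    # 99.96% over 30 days leaves roughly 17 minutes of downtime.
    print(round(allowed_downtime_minutes(99.96), 2))
```

Under these assumptions, 99.96% availability corresponds to about 17 minutes of downtime in a 30-day month.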

About

I'm a Site Reliability Engineer focused on building and maintaining highly available, scalable systems. My approach combines software engineering principles with operations expertise to create resilient infrastructure that scales with business needs.

I currently work with cloud-native technologies, Kubernetes, and observability stacks to keep user experiences seamless and downtime minimal. I treat operations as a software problem and automate everything that can be automated.

Technical Stack

Cloud Platforms

  • Azure (Expert)
  • AWS (Proficient)
  • Google Cloud Platform
  • Multi-cloud strategies

Container Orchestration

  • Kubernetes (CKA)
  • Docker & containerd
  • Helm Charts
  • Service Mesh (Istio)

Infrastructure as Code

  • Terraform
  • ARM Templates
  • Ansible
  • Pulumi

Observability

  • Prometheus & Grafana
  • ELK Stack
  • Azure Monitor
  • Jaeger Tracing

CI/CD & GitOps

  • Azure DevOps
  • GitHub Actions
  • ArgoCD
  • Jenkins

Programming & Scripting

  • Python (Automation)
  • Go (Tools & Services)
  • Bash/PowerShell
  • YAML/JSON

Experience

Site Reliability Engineer

TCS
2021 — Present
  • Maintained 99.95% uptime for critical healthcare applications serving 100K+ users
  • Implemented comprehensive monitoring with Prometheus, Grafana, and custom alerting, reducing MTTR by 60%
  • Migrated monolithic applications to microservices architecture on Kubernetes, improving scalability by 300%
  • Built CI/CD pipelines with automated testing, security scanning, and blue-green deployments
  • Developed infrastructure-as-code templates, reducing provisioning time from days to hours
  • Led incident response procedures and post-mortem reviews, establishing a blameless culture
  • Automated backup, disaster recovery, and compliance reporting processes

Infrastructure Projects

Multi-Cloud Kubernetes Platform

Built a production-ready Kubernetes platform spanning Azure and AWS with automated failover, centralized logging, and a comprehensive monitoring stack.

  • Kubernetes
  • Terraform
  • Prometheus
  • Istio

Observability Stack

Designed and deployed an end-to-end observability solution with metrics, logs, traces, and SLI/SLO monitoring for distributed systems.

  • Grafana
  • ELK
  • Jaeger
  • SLOs
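As a rough illustration of the SLI/SLO monitoring idea, the sketch below computes an error-budget burn rate from request counts. The function name, inputs, and sample numbers are hypothetical and are not taken from the project itself.

```python
def burn_rate(failed_requests: int, total_requests: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the SLO window ends.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    observed_error_rate = failed_requests / total_requests
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


if __name__ == "__main__":
    # Hypothetical sample: 5 failures in 10,000 requests against a 99.9% SLO.
    print(round(burn_rate(5, 10_000, 0.999), 2))
```

In practice an alert would typically fire when the burn rate stays above a chosen threshold over a given window; the threshold and window are policy choices not specified here.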

Automated Incident Response

Developed a ChatOps-driven incident management system with automated runbooks, escalation policies, and real-time status page updates.

  • Python
  • Slack API
  • PagerDuty
  • Automation

Cost Optimization Engine

Built a cost optimization system that uses machine learning to right-size resources and identify waste, reducing cloud costs by 35%.

  • Python
  • Azure APIs
  • ML
  • FinOps

GitOps Pipeline

Implemented a GitOps workflow with ArgoCD for declarative deployments, automated security scanning, and policy enforcement.

  • ArgoCD
  • Kustomize
  • OPA
  • Security

Disaster Recovery Automation

Automated disaster recovery procedures with cross-region replication, automated testing, and RTO/RPO compliance monitoring.

  • Terraform
  • Backup
  • DR
  • Compliance

Connect

Available for SRE opportunities, consulting, or just talking about reliability engineering.