Tech Stack
AWS, Cloud, Distributed Systems, Docker, Go, Grafana, Kubernetes, Linux, Microservices, Oracle, Prometheus, Python, Shell Scripting, Terraform
About the role
- Design, deploy, and operate large-scale distributed systems across compute, storage, networking, and AI/ML environments
- Lead projects end to end, from architecture through automation to intelligent monitoring
- Operate and optimize Kubernetes clusters, Istio service mesh, and Linux-based systems
- Automate workflows using Go, Python, and Shell scripting
- Build monitoring and observability solutions with Prometheus, Grafana, and Loki
- Troubleshoot complex networking, storage, and system performance issues
- Partner with AI/ML teams to ensure infrastructure readiness for model training and data pipelines
- Participate in on-call rotations and postmortem reviews to improve system resilience
- Collaborate with clients and teammates to build resilient, high-performing infrastructure
Requirements
- Experience with Google Cloud
- Experience with Infrastructure as Code tools (Terraform)
- Strong knowledge of microservices and containers (Kubernetes, Docker)
- Experience operating and optimizing Kubernetes clusters and Istio service mesh
- Hands-on experience with PKI and service mesh
- Linux systems administration experience
- Automation experience using Go, Python, and Shell scripting
- Experience building monitoring and observability solutions (Prometheus, Grafana, Loki)
- Experience troubleshooting complex networking, storage, and system performance issues
- SRE mindset with a focus on automation, scalability, and reliability
- Ability to partner with AI/ML teams to ensure infrastructure readiness
- Willingness to participate in on-call rotations and postmortem reviews