Pythian

Site Reliability Engineer

full-time

Origin: 🇮🇳 India

Job Level

Mid-Level, Senior

Tech Stack

AWS, Cloud, Distributed Systems, Docker, Go, Grafana, Kubernetes, Linux, Microservices, Oracle, Prometheus, Python, Shell Scripting, Terraform

About the role

  • Design, deploy, and operate large-scale distributed systems across compute, storage, networking, and AI/ML environments
  • Lead projects from architecture to automation to intelligent monitoring
  • Operate and optimize Kubernetes clusters, Istio service mesh, and Linux-based systems
  • Automate workflows using Go, Python, and Shell scripting
  • Build monitoring and observability solutions with Prometheus, Grafana, and Loki
  • Troubleshoot complex networking, storage, and system performance issues
  • Partner with AI/ML teams to ensure infrastructure readiness for model training and data pipelines
  • Participate in on-call rotations and postmortem reviews to improve system resilience
  • Collaborate with clients and teammates to build resilient, high-performing infrastructure

Requirements

  • Experience with Google Cloud
  • Experience with Infrastructure as Code tools (Terraform)
  • Strong knowledge of microservices and containers (Kubernetes, Docker)
  • Experience operating and optimizing Kubernetes clusters and Istio service mesh
  • Hands-on experience with PKI and service mesh
  • Linux systems administration experience
  • Automation experience using Go, Python, and Shell scripting
  • Experience building monitoring and observability solutions (Prometheus, Grafana, Loki)
  • Troubleshooting complex networking, storage, and system performance issues
  • SRE mindset with a focus on automation, scalability, and reliability
  • Ability to partner with AI/ML teams to ensure infrastructure readiness
  • Willingness to participate in on-call rotations and postmortem reviews