Salary
💰 $180,000 - $220,000 per year
Tech Stack
AWS, Azure, Cloud, DNS, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, TCP/IP, Terraform
About the role
- Ensure infrastructure is reliable, observable, and easy to operate, with an emphasis on automation and operational excellence.
- Build, manage, and optimize infrastructure using Terraform, GitHub CI/CD, and Kubernetes.
- Create visualizations and alerts that provide actionable insights using tools like Grafana, Prometheus/Mimir, OpenSearch, and Sentry.
- Identify manual or error-prone processes and replace them with automated, repeatable systems.
- Diagnose and resolve production issues across application and infrastructure layers.
- Capture knowledge in runbooks, setup guides, and architecture diagrams to support operational maturity.
- Partner with engineers across teams to drive adoption of DevOps and infrastructure best practices.
- Help scale infrastructure and monitoring systems to meet growing demands.
- Participate in an on-call rotation and support incident response processes as needed.
- Attend weekly co-working days in the South San Francisco office (expected on Wednesdays).
Requirements
- Experience with metrics, logs, and traces using tools such as Grafana, Prometheus/Mimir, OpenSearch, Sentry, or similar.
- Proficiency with Terraform, Kubernetes, and containerization tools.
- 5+ years of experience with Python.
- Comfortable working with Linux-based environments and writing shell scripts.
- Strong collaboration skills with a focus on asynchronous, written communication.
- Commitment to clear, comprehensive documentation and process standardization.
- Self-starter mindset with a proactive approach to solving operational challenges.
- Skilled in Git/GitHub-based workflows.
- Willingness to participate in an on-call rotation and support incident response processes.
- Nice-to-have: AWS (preferred), GCP, or Azure cloud infrastructure management.
- Nice-to-have: Familiarity with TCP/IP, DNS, routing, and load balancing concepts.
- Nice-to-have: Understanding of cloud and infrastructure security best practices.
- Nice-to-have: Experience tuning application or infrastructure performance in production environments.