Tech Stack
AWS, Cloud, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Linux, Open Source, Prometheus, Python, Terraform, Unix
About the role
- Comet builds a development and experiment management platform for ML teams, used by Netflix, Uber, and others.
- Design, implement, and manage scalable, secure, and reliable cloud-based infrastructure
- Build and maintain CI/CD pipelines for efficient and consistent application delivery
- Implement and manage Infrastructure as Code (IaC) to ensure consistency across environments
- Drive adoption of best practices in automation, observability, and system reliability
- Ensure security and compliance across infrastructure and deployments
- Optimize the cost of cloud infrastructure
- Collaborate with teams to improve processes and ensure operability
- Troubleshoot, investigate, and resolve production issues affecting customers
Requirements
- 5+ years of experience in a DevOps, SRE, or related role, including significant production experience
- Proven remote work experience and strong collaboration skills in distributed teams
- Deep understanding of DevOps practices, automation, CI/CD, and infrastructure-as-code
- Passion for troubleshooting and root cause analysis
- Strong experience with cloud platforms (AWS preferred, GCP a plus) and managing infrastructure with Terraform
- Solid understanding of networking, security, and infrastructure best practices
- Significant hands-on experience with containerization and orchestration (Docker, Kubernetes, Helm)
- Experience with observability tools such as Prometheus, Grafana, or New Relic
- Strong background in Linux/Unix system administration
- Proficiency in scripting (Bash, Python)
- Experience in software development (Java, Python, Go) - a plus
- Knowledge of database management and performance optimization - a plus