Tech Stack
Tools & technologies: Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Microservices, Prometheus, Python, Terraform
About the role
Key responsibilities & impact
- Monitor, maintain, and improve the reliability, availability, and performance of production systems and services.
- Build and maintain infrastructure as code (IaC), deployment pipelines, and automation to support continuous delivery, scalability, and disaster recovery.
- Respond to incidents, perform root-cause analysis, and drive postmortems to ensure lessons learned are applied.
- Implement and enforce operational best practices: observability, logging, metrics, alerting, capacity planning, failover strategies, and backups.
- Collaborate with Engineering, Product, Compliance, and Operations teams to ensure infrastructure meets reliability, compliance, and security standards.
- Support service scaling, database operations, cloud infrastructure (GCP preferred), networking, and microservices orchestration.
- Document operational runbooks, on-call procedures, and system architecture to support maintenance, knowledge sharing, and compliance.
Requirements
What you’ll need
- Strong programming or scripting skills (Go, Python, Bash, or similar) for automation, tooling, and operational tasks.
- Hands-on experience with cloud infrastructure, ideally Google Cloud Platform (GCP).
- Familiarity with containerization and orchestration (Docker, Kubernetes, or equivalent).
- Experience with infrastructure-as-code tools (Terraform, Cloud Deployment Manager, or similar).
- Experience with either FluxCD or ArgoCD for GitOps-based delivery.
- Solid understanding of distributed systems, microservices architecture, and reliability patterns.
- Experience setting up monitoring, logging, alerting, and observability (e.g., Prometheus, Grafana, ELK, distributed tracing).
- Strong troubleshooting skills and ability to respond to incidents under pressure.
- Knowledge of backup and disaster recovery strategies, database management, and secure operations.
Benefits
Comp & perks
- Competitive salary and meaningful equity with room for growth.
