Tech Stack
Tools & technologies
Ansible, AWS, Cloud, Go, Grafana, Jenkins, Kubernetes, Prometheus, Python, Terraform
About the role
Key responsibilities & impact
- You'll own the reliability, scalability, and security of the production application and/or platform.
- Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana).
- Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Leading Incident Response: Act as an incident responder and, when needed, as incident commander during critical incidents.
- Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments.
- Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation.
Requirements
What you’ll need
- An active Top Secret clearance.
- 5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
- Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly.
- A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
- Infrastructure as Code: Terraform (or CloudFormation), Ansible.
- Containers and orchestration: Kubernetes design, deployment, and operations.
- CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
- Scripting: proficiency with at least one of Python, Go, or Bash.
- Cloud: Familiarity with AWS or AWS GovCloud.
- Observability: Grafana stack, ELK stack, or Datadog.
- Networking fundamentals: core protocols and secure configurations.
Benefits
Comp & perks
- Equity: Share in the company's success.
- Flexible Work Environment: Remote-first organization* with flexible work hours and unlimited PTO. (*Note that some roles are in-person, on-site with customers.)
- Comprehensive Health Coverage: Health, dental, vision, and life insurance.
- Retirement Plan: 401(k) plan with company match to secure your future.
- Parental Leave: 8 weeks at 100% pay, regardless of state.
- Company Retreats: Annual company summit trips.
- Home Office Budget: $1,000 per year for home office improvements.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, Terraform, Ansible, Python, Go, Bash, GitLab CI/CD, Jenkins, GitHub Actions, AWS
Soft Skills
collaboration, incident response, root cause analysis, continuous improvement
Certifications
Top Secret clearance
