Reliability Engineering & Automation
Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation).
Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness.
Monitor system health using modern observability tools (e.g., Prometheus, Grafana, Datadog).
Lead incident response, root cause analysis, and postmortems for production issues.

Requirements

5-7 years of related experience
Bachelor's degree in a related field
Linux systems and networking fundamentals (DNS, TCP/IP, TLS)
Distributed systems debugging and failure analysis
Load, stress, and fault-injection testing
CI/CD tools and processes
Version control (e.g., Git)
Cloud platforms (e.g., AWS, Azure)
Containers and orchestration (Kubernetes)
Kafka (messaging/streaming)
Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)
Agile methodologies (e.g., Scrum, XP, SAFe)
Databases/SQL
Observability/monitoring tools (e.g., Datadog)

Benefits

Medical (HSA available)
Dental
Vision
Short-term & long-term disability (company-paid)
Life & AD&D (company-paid)
401(k) with company match
10 paid holidays, plus quarterly company closure dates and a holiday-week company closure
Flexible time off policy
Work from home
6 weeks paid parental leave

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

infrastructure as code
Terraform
AWS CloudFormation
Linux systems
networking fundamentals
distributed systems debugging
load testing
CI/CD
containers
scripting languages

Soft Skills

incident response
root cause analysis
postmortems
leadership

Certifications

Bachelor's Degree