
Senior Systems Operations Engineer
DistroKid
full-time
Location Type: Remote
Location: United States
Salary
💰 $155,000 - $170,000 per year
About the role
- Design, deploy, and manage scalable and highly available cloud infrastructure on AWS, with deep expertise in core services (EC2, EKS, S3, RDS, IAM, VPC, and beyond).
- Develop and maintain disaster recovery plans leveraging AWS capabilities for backup and replication to ensure business continuity.
- Collaborate with engineering and security teams to improve infrastructure health, security, and long-term scalability.
- Design reusable Terraform/OpenTofu modules following DRY principles and organizational standards; implement module versioning and lifecycle strategies.
- Direct the migration of manual infrastructure to code; establish patterns and best practices for IaC adoption across the team.
- Implement IaC testing strategies, including validation, linting, and integration testing, using tools such as Terraform-Compliance or Checkov.
- Architect and maintain complex Bitbucket pipeline configurations for multi-environment IaC deployments; implement pipeline security best practices.
- Implement AIOps practices, leveraging AI tools to enhance monitoring, incident response, and predictive alerting.
- Use AI-assisted development and operations tools (e.g., Cursor, Claude) to accelerate troubleshooting, code review, and documentation generation.
- Evaluate and implement AI-powered automation to reduce operational toil, improve repeatability, and scale platform capabilities.
- Define and implement SLOs for services; guide and/or participate in incident response and conduct blameless postmortems.
- Implement chaos engineering practices to proactively identify system weaknesses before they impact production.
- Build and maintain comprehensive monitoring solutions using tools such as CloudWatch and Datadog to track performance and drive optimization.
- Develop automation scripts and tools in Python, Bash, or similar languages to streamline operations and eliminate manual toil.
- Build self-service capabilities for development teams to reduce cognitive load and enable developer autonomy across the organization.
- Guide the solution architecture and end-to-end implementation of DistroKid’s first Internal Developer Portal (IDP).
- Define the IDP roadmap and success criteria in partnership with engineering leadership; establish golden paths, service catalogs, and self-service workflows that reduce deployment friction and accelerate developer productivity.
- Drive adoption of the IDP across engineering teams; gather feedback, iterate on the platform, and measure impact through developer experience metrics and reduced time-to-deploy.
- Guide cost optimization initiatives; implement rightsizing recommendations, reserved-capacity strategies, and tagging standards for cost allocation.
- Monitor and optimize AWS resource usage; select appropriate services and configurations to meet performance requirements cost-effectively.
- Direct planning, decision-making, and execution for infrastructure projects; own workstreams end-to-end.
- Partner cross-functionally with engineering, security, and product teams; communicate impact in terms of company strategy and OKRs.
- Provide technical mentorship to junior and mid-level engineers; invest in team growth and foster a culture of continuous learning.
- Maintain and contribute to infrastructure documentation, runbooks, and architectural decision records to ensure knowledge sharing and operational consistency.
Requirements
- Bachelor’s degree in Computer Science, Information Technology, a related field, or equivalent practical experience.
- 5+ years of experience in systems operations, platform engineering, or DevOps with a focus on cloud infrastructure and containerized environments.
- Proven production experience with AWS services (EC2, EKS, S3, RDS, IAM, VPC, API Gateway, EventBridge, etc.) and Kubernetes.
- 5+ years of hands-on experience with Infrastructure as Code tools, specifically Terraform and/or OpenTofu, including module design, state management, remote backends, and IaC testing.
- Strong knowledge of Linux/Unix administration, systems, and shell scripting.
- Proficiency in Python, Go, or similar programming languages.
- Experience with CI/CD pipelines for infrastructure deployments (Bitbucket Pipelines, Jenkins, or similar).
- Experience with monitoring and observability tools (Prometheus, Grafana, CloudWatch, or Datadog).
- Demonstrated experience implementing or working with AIOps tools, practices, or AI-assisted operations in a professional context.
- Experience using AI-assisted development tools (e.g., Cursor, Warp, Claude, or similar) to accelerate engineering work.
Benefits
- Retirement plans (401k, SIPP, etc.)
- Health insurance
- Generous paid time off
- Parental leave
- Home office allowance
- Flexible work schedules
- Paid and discounted subscriptions
- Regular engagement activities
Applicant Tracking System Keywords
Hard Skills & Tools
AWS, Terraform, OpenTofu, IaC, Python, Bash, Kubernetes, Linux/Unix administration, CI/CD, AIOps
Soft Skills
collaboration, communication, technical mentorship, problem-solving, continuous learning, project management, cost optimization, feedback gathering, team growth, developer autonomy
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Information Technology, related field degree