Senior Site Reliability Engineer

PlayOn! Sports

Senior Site Reliability Engineer focused on building tools and automation for system reliability at PlayOn, collaborating with DevOps and engineering teams to enhance performance and scalability.

Posted 5/11/2026 • Full-time • Remote • 🇺🇸 United States • Senior

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Linux, Prometheus, Python, Terraform

About the role

Key responsibilities & impact

Contribute to system observability by implementing and improving metrics, alerting, and dashboards for better insight and faster recovery.
Develop automation, tooling, and monitoring solutions to support high service availability.
Partner with application and quality engineering teams to implement best practices in reliability, release automation, and testing.
Drive operational excellence through proactive incident prevention, blameless postmortems, and capacity planning.
Participate in on-call rotations to support critical services and ensure rapid response to incidents.

Requirements

What you’ll need

Solid experience in Python, especially for automation, tooling, and data-driven operational tasks.
Proficiency in at least one of Java, C++, or Go.
Strong understanding of Linux systems, cloud infrastructure (AWS, GCP, or Azure), and modern deployment practices (Docker, Kubernetes, Terraform).
Experience with CI/CD pipelines, version control, and automated testing frameworks.
Experience with observability tools (e.g., Prometheus, Grafana, ELK, Datadog) and log/metric analysis for diagnosing issues.
Proven experience facilitating and documenting Critical User Journeys and translating them into actionable SLAs/SLOs for automation.
Demonstrated ability to collaborate with cross-functional teams and communicate clearly in high-impact situations.
A problem-solver who approaches reliability as a shared responsibility across engineering.
Familiarity with AI-augmented development tools (Claude, Codex) as part of a modern engineering workflow.

Nice to Have

Experience writing or maintaining end-to-end or integration tests for distributed systems.
Background in performance testing, capacity planning, or chaos engineering.
Contributions to internal developer tooling or reliability-focused frameworks.
Exposure to security, compliance, or change management processes in production environments.
Relevant certifications.

Benefits

Comp & perks

Multiple medical insurance plans to choose from
Dental, vision, life, and disability insurance
Employee Emergency Fund
Company equity (stock options)
Open PTO policy
401(k) plan with company match
Hybrid/flexible work environment

ATS Keywords

✓ Tailor your resume

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, Java, C++, Go, Linux systems, AWS, GCP, Azure, Docker, Kubernetes

Soft Skills

collaboration, communication, problem-solving, operational excellence, incident prevention, blameless postmortems, capacity planning, facilitating, documenting, translating