Senior Site Reliability Engineer

Senior Site Reliability Engineer at Zello, responsible for maintaining data-tier reliability across MySQL, MongoDB, and more.

Posted 5/13/2026 • Full-time • Austin • Texas • 🇺🇸 United States • Senior

Tools & technologies

AWS, Azure, Cassandra, Cloud, Docker, Elasticsearch, Go, Google Cloud Platform, Kubernetes, Linux, MongoDB, MySQL, Prometheus, Python

Key responsibilities & impact

Design, deploy, and operate highly available MySQL and MongoDB clusters across our cloud environments
Tune query performance, schema, and index strategy in partnership with application engineers, and push fixes upstream into the application when that's the right answer
Extend our observability stack — Prometheus, Loki, and Tempo
Participate in the Platform on-call rotation, lead incident response for data-tier issues, and write postmortems that drive durable change
Improve disaster recovery, security posture, and compliance for our database footprint — encryption, access control, audit logging, backup integrity
Evaluate and operate ScyllaDB/Cassandra and Elasticsearch where they fit the workload
Write the automation, tooling, and operators that take repetitive work off the team's plate
Use AI to compress incident response and root-cause analysis

What you’ll need

7+ years in SRE, DevOps, platform, infrastructure, or database reliability roles, with at least 3 years owning production databases
BSc in Computer Science or equivalent practical experience
You've operated highly available MySQL and MongoDB in production at scale: replication, sharding, backups, point-in-time recovery, and failover drills you've actually run, not just designed on paper
You diagnose database performance end-to-end: query plan, indexes, locking, OS, storage, network
You've shipped meaningful work on at least two of the following: bare-metal Linux, containerized workloads (Docker, Kubernetes, or similar), and a major cloud (GCP preferred; AWS or Azure equivalent is fine)
You instrument what you build. You've used Prometheus, OpenTelemetry, or comparable systems to close real incidents, and you've written the dashboard the next on-call engineer will actually open.
You write code that runs in production: Python, Go, Bash, or similar for automation, tooling, or operators. You don't hand off scripting to someone else.
You communicate clearly under pressure and after the fact. Your postmortems are blameless, specific, and lead to changes that stick.
You bring an opinion on managed vs. self-managed databases, and can defend the trade-off based on availability, cost, and operational burden.
ScyllaDB/Cassandra or Elasticsearch experience is a plus
You've used AI tooling (copilots, agents, or custom automation) to expedite incident response, root-cause analysis, or developer workflows.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

MySQL, MongoDB, ScyllaDB, Cassandra, Elasticsearch, Python, Go, Bash, Docker, Kubernetes

Soft Skills

communication under pressure, incident response, postmortem writing, blameless culture, diagnostic skills

Certifications

BSc in Computer Science