Site Reliability Engineer

OXIO

Site Reliability Engineer designing and implementing a cloud platform for OXIO's telecom services while maintaining production infrastructure.

Posted 5/27/2026 • Full-time • Remote • 🇺🇸 United States • Mid-Level, Senior

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Linux, Unix, Python, Go, Ruby, Bash, Perl, Terraform, CloudFormation, Ansible

Soft Skills

continuous improvement, incident management, on-call rotation, blameless postmortems

Tools & Technologies

Docker, Kubernetes, Prometheus, Grafana, Datadog, Jenkins, GitLab CI, CircleCI, AWS, Google Cloud

Industry Keywords

cloud-native architecture, high availability, failover mechanisms, distributed systems, database management, SQL, NoSQL, load balancing, zero trust principles, configuration management

Tech Stack

Tools & technologies

Ansible, AWS, Azure, Cassandra, Cloud, Distributed Systems, DNS, Docker, Elasticsearch, Firewalls, Go, Grafana, Jenkins, Kafka, Kubernetes, Linux, NoSQL, Perl, Prometheus, Python, Ruby, SaltStack, Splunk, SQL, TCP/IP, Terraform, Unix, VMware

About the role

Key responsibilities & impact

Design and implement a cloud platform to support OXIO backend services
Automate technical operations: deployments, scaling, recovery, etc.
Monitor and maintain mission-critical production infrastructure to ensure maximum uptime
Participate in an on-call rotation and contribute to a culture of continuous improvement through blameless postmortems
Enable the Engineering, Telecom, and Data Engineering teams by providing them with the tools to operate the services they build

Requirements

What you’ll need

Understanding of Linux/Unix systems (most systems are Linux-based).
Familiarity with Linux/Unix system internals like process management, filesystems, memory management, and networking.
Proficiency in at least one programming language (Python, Go, or Ruby) and strong skills in scripting (Bash, Perl).
Experience with infrastructure provisioning tools such as Terraform, CloudFormation, or Ansible.
Familiarity with containerization (Docker) and orchestration tools (Kubernetes).
Familiarity with monitoring tools like Prometheus, Grafana, or Datadog.
Knowledge of setting up alerts, analyzing logs, and creating dashboards for observability.
Familiarity with incident management practices (e.g., runbooks, postmortems).
Experience participating in an on-call rotation and handling incidents.
Experience setting up and maintaining Continuous Integration/Continuous Delivery pipelines (Jenkins, GitLab CI, CircleCI, etc.).
Hands-on experience with cloud providers (AWS, Google Cloud, Azure).
Knowledge of virtualization technologies (VMware, KVM) and cloud-native architecture.
Understanding of TCP/IP, DNS, HTTP/HTTPS, load balancing, and firewalls.
Strong understanding of deployment strategies (canary releases, blue-green deployments, etc.).
Familiarity with high availability and failover mechanisms.
Familiarity with IAM (Identity and Access Management) and zero trust principles.
Experience working with distributed systems (e.g., Kafka, Cassandra, Elasticsearch).
Experience building custom monitoring tools or writing complex automation scripts.
Functional knowledge of database management (SQL and NoSQL).
Familiarity with distributed tracing (Jaeger, OpenTelemetry) and advanced log aggregation strategies (ELK stack, Splunk).
Familiarity with performance profiling tools and with optimizing application performance under heavy load.
Familiarity with load testing and identifying bottlenecks.
Familiarity with configuration management using SaltStack to maintain server configurations.

Benefits

Comp & perks

N/A