Site Reliability Engineer

NOV

The Site Reliability Engineer is responsible for monitoring production systems and leading incident response. Join a high-impact team to optimize system performance and scalability in the oil and gas industry.

Posted 6/23/2026 • Full-time • Houston, Texas, 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies

Akka, AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Kubernetes, .NET, Postgres, Prometheus, Python

About the role

Key responsibilities & impact

Maintain and monitor production systems for availability, latency, and performance.
Lead incident response efforts, including communication, resolution, and postmortem documentation.
Design and implement health checks, alerting systems, and automated remediation workflows.
Drive root cause analysis and implement permanent resolutions for recurring issues.
Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.
Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.
Conduct post-incident reviews and use insights to inform future engineering investments.
Tune and optimize distributed systems, including Akka.NET actors, for performance and resource efficiency.
Work with developers to evolve architecture and improve system throughput, latency, and stability.
Optimize PostgreSQL performance, queries, and maintenance strategies.
Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.
Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.
Standardize infrastructure as code practices across environments.

Requirements

What you’ll need

5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
Expertise in Kubernetes and container orchestration at scale.
Strong experience with Akka.NET or similar actor-based frameworks.
Proficiency with scripting and automation (Bash, PowerShell, Python).
Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).
Hands-on experience with cloud platforms (AWS, Azure, or GCP).
Strong PostgreSQL knowledge, including performance tuning, query optimization, and maintenance.
Proven ability to lead incident management and drive postmortem processes.
A builder’s mindset with high standards for operational excellence and technical ownership.

Benefits

Comp & perks

Health insurance
Retirement plans
Paid time off
Flexible work arrangements
Professional development

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

SRE, DevOps, Infrastructure Engineering, Kubernetes, Akka.NET, Bash, PowerShell, Python, PostgreSQL, CI/CD

Soft Skills

incident management, communication, leadership, operational excellence, technical ownership