SoluStaff

Principal Site Reliability Engineer, SRE

Principal Site Reliability Engineer ensuring reliability, scalability, and performance of a healthcare SaaS platform for U.S. providers.

Posted 6/11/2026 • Full-time • Remote • 🇺🇸 United States • Lead

Tech Stack

Tools & technologies
AWS, Cloud, Django, Grafana, Kubernetes, Python, Terraform

About the role

Key responsibilities & impact
  • Serve as the primary technical owner for production reliability across U.S. customer environments.
  • Investigate and resolve complex issues spanning web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.
  • Lead production incident response efforts, coordinating cross-functional teams to restore service and minimize customer impact.
  • Perform root cause analysis and drive corrective actions that improve long-term system stability and resilience.
  • Partner with software engineering and platform teams to identify recurring reliability risks and implement sustainable solutions.
  • Design, configure, and validate secure customer connectivity solutions including Site-to-Site VPNs, Transit Gateway integrations, routing configurations, and secure network paths.
  • Support customer onboarding initiatives by troubleshooting connectivity challenges and ensuring consistent implementation processes.
  • Enhance platform observability through improvements in monitoring, logging, alerting, tracing, and operational dashboards.
  • Contribute to CI/CD, infrastructure automation, and deployment processes that improve release safety and operational consistency.
  • Develop operational tooling that supports incident response, troubleshooting, onboarding, and system monitoring activities.
  • Collaborate with engineering leadership to improve cloud architecture, scalability, security, and operational readiness.
  • Partner with customer-facing teams to communicate technical issues, remediation plans, and reliability improvements clearly and effectively.
  • Support compliance, security, and risk management initiatives within highly regulated healthcare environments.

Requirements

What you’ll need
  • 6+ years of hands-on experience supporting and managing AWS-based production environments.
  • 4+ years of experience supporting web applications and backend services (Python/Django experience strongly preferred).
  • Experience with AWS networking technologies including VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups.
  • Strong experience with Terraform and infrastructure-as-code deployment practices.
  • Experience with containerized environments including ECS, Fargate, Kubernetes, or similar technologies.
  • Experience building and supporting CI/CD pipelines and release automation processes.
  • Familiarity with monitoring and observability platforms such as Datadog, CloudWatch, Sentry, Grafana, or similar tools.
  • Experience leading production incident response, outage management, and root cause analysis.
  • Exposure to Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts is preferred.
  • Healthcare technology, healthcare SaaS, clinical software, or other regulated industry experience is highly preferred.
  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field preferred.

Benefits

Comp & perks
  • Health Care Plan (Medical, Dental & Vision)
  • Retirement Plan (401(k), IRA)
  • Paid Time Off (Vacation, Sick & Public Holidays)

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
AWS, Python, Django, Terraform, CI/CD, ECS, Fargate, Kubernetes, root cause analysis, infrastructure-as-code
Soft Skills
incident response, cross-functional collaboration, communication, troubleshooting, problem-solving, leadership, customer focus, analytical thinking, organizational skills, adaptability