SRE/DevOps Engineer

Versana

SRE/DevOps Engineer at Versana, improving cloud observability and efficiency for loan market technologies. The role collaborates with teams to enhance system reliability and monitoring practices.

Posted 5/23/2026 • Full-time • New York City, New York, 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Docker, Elasticsearch, Google Cloud Platform, Grafana, Jenkins, Kafka, Kubernetes, Linux, Terraform

About the role

Key responsibilities & impact

Design, implement, and enhance system observability and monitoring tools.
Monitor system performance, create incident response plans, and implement observability practices to gain insights into system behavior.
Implement and monitor service-level objectives (SLOs) and service-level indicators (SLIs).
Improve system reliability and resiliency.
Conduct post-incident reviews and implement necessary changes to prevent system failures.
Assist teams in implementing observability tools and leveraging available telemetry data to troubleshoot and resolve incidents and problems.
Leverage observability and event management to improve key incident management metrics, such as mean time to detect (MTTD) and mean time to restore service.
Continually optimize systems and workflows by improving architecture, infrastructure, automation, CI/CD, and observability.
Collaborate with developers to ensure applications are designed with DevOps best practices in mind.
Participate in a rotating on-call schedule that covers weekend releases, and be available to respond to production issues outside regular working hours, including weekends and holidays.

Requirements

What you’ll need

5+ years of experience as a Site Reliability Engineer or similar role.
3+ years of work experience with public cloud (Azure, AWS or GCP).
3+ years of direct experience with observability tools such as Datadog, Elasticsearch, and Grafana.
3+ years of experience with containerization and orchestration technologies like Docker and Kubernetes.
2+ years of experience developing and managing CI/CD pipelines (e.g., Azure DevOps, GitLab CI/CD, GitHub Actions, Jenkins).
2+ years of experience with Infrastructure-as-Code tools such as Terraform, Azure Bicep, or AWS CloudFormation.
1+ years of experience with site reliability and chaos engineering tools such as Gremlin, Chaos Mesh, or similar.
Proven track record of applying core observability concepts, end-user monitoring, and infrastructure monitoring with SaaS solutions.
Experience with messaging services like Kafka or Azure Event Hubs.
Good understanding of the Linux operating system.

Benefits

Comp & perks

Equal Opportunity Employer
Health insurance
Professional development opportunities

ATS Keywords


Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

system observability, monitoring tools, service-level objectives, system reliability, incident response plans, CI/CD pipelines, Infrastructure-as-Code, containerization, orchestration technologies, end-user monitoring

Soft Skills

collaboration, troubleshooting, incident management, optimization, communication