Innovaccer

Site Reliability Engineer II

full-time

Location Type: Remote

Location: United States

About the role

  • Take ownership of SRE pillars: Deployment, Reliability, Scalability, Service Availability (SLA/SLO/SLI), Performance, and Cost.
  • Lead production rollouts of new releases and emergency patches using CI/CD pipelines while continuously improving deployment processes.
  • Establish robust production promotion and change management processes with quality gates across Dev/QA teams.
  • Roll out a complete observability stack across systems to proactively detect and resolve outages or degradations.
  • Analyze production system metrics, optimize system utilization, and drive cost efficiency.
  • Manage autoscaling of the platform during peak usage scenarios.
  • Perform triage and RCA by leveraging observability toolchains across the platform architecture.
  • Reduce escalations to higher-level teams through proactive reliability improvements.
  • Participate in the 24x7 on-call production support team.
  • Lead monthly operational reviews with executives covering KPIs such as uptime, RCA, CAP (Corrective Action Plan), PAP (Preventive Action Plan), and security/audit reports.
  • Operate and manage production and staging cloud platforms, ensuring uptime and SLA adherence.
  • Collaborate with Dev, QA, DevOps, and Customer Success teams to drive RCA and product improvements.
  • Implement security guidelines (e.g., DDoS protection, vulnerability management, patch management, security agents).
  • Manage least-privilege RBAC for production services and toolchains.
  • Build and execute Disaster Recovery plans and actively participate in Incident Response.

Requirements

  • 4–7 years in production engineering, site reliability, or related roles.
  • Solid hands-on experience with at least one cloud provider (AWS, Azure, or GCP), with a focus on automation (certifications preferred).
  • Strong expertise in Kubernetes and Linux.
  • Proficiency in scripting/programming (Python required).
  • Strong observability skills: at the scale of our systems, observability is critical for surfacing insights into system behavior and for detecting problems and failures. We are looking for leads to drive this charter across logs, metrics, service mesh, tracing, and related areas.
  • Knowledge of CI/CD pipelines and toolchains (Jenkins, ArgoCD, GitOps).
  • Familiarity with persistence stores (Postgres, MongoDB), data warehousing (Snowflake, Databricks), and messaging (Kafka).
  • Exposure to monitoring/observability tools such as Elasticsearch, Prometheus, Jaeger, and New Relic.
  • Proven experience with the reliability, scalability, and performance of production systems.
  • Experience in 24x7 production environments, with a focus on process.
  • Familiarity with ticketing and incident management systems.
  • Security-first mindset with knowledge of vulnerability management and compliance.
  • Excellent judgment, analytical thinking, and problem-solving skills.
  • Strong sense of personal responsibility and accountability for delivering high quality work.

Benefits

  • Generous Paid Time Off: Recharge and relax with 22 days of fixed time off per year, in addition to company holidays—because we believe work-life balance fuels performance.
  • Best-in-Class Parental Leave: Spend quality time with your growing family. We offer one of the industry’s most generous parental leave policies to support you during life’s most important moments.
  • Recognition & Rewards: We celebrate wins—big and small. Get rewarded with monetary incentives and company-wide recognition for your impact and dedication. Your hard work won’t go unnoticed.
  • Comprehensive Insurance Coverage: Stay covered with medical, dental, and vision insurance, plus 100% company-paid short- and long-term disability and basic life insurance. Optional perks include discounted legal aid and pet insurance.
Applicant Tracking System Keywords

Hard Skills & Tools
Kubernetes, Linux, Python, CI/CD, observability, production reliability, scalability, performance optimization, vulnerability management, Disaster Recovery
Soft Skills
analytical thinking, problem-solving, personal responsibility, accountability, judgment