
Site Reliability Engineer II
Innovaccer
full-time
Location Type: Remote
Location: United States
About the role
- Take ownership of SRE pillars: Deployment, Reliability, Scalability, Service Availability (SLA/SLO/SLI), Performance, and Cost.
- Lead production rollouts of new releases and emergency patches using CI/CD pipelines while continuously improving deployment processes.
- Establish robust production promotion and change management processes with quality gates across Dev/QA teams.
- Roll out a complete observability stack across systems to proactively detect and resolve outages or degradations.
- Analyze production system metrics, optimize system utilization, and drive cost efficiency.
- Manage autoscaling of the platform during peak usage scenarios.
- Perform triage and RCA by leveraging observability toolchains across the platform architecture.
- Reduce escalations to higher-level teams through proactive reliability improvements.
- Participate in the 24x7 on-call production support team.
- Lead monthly operational reviews with executives covering KPIs such as uptime, RCA, CAP (Corrective Action Plan), PAP (Preventive Action Plan), and security/audit reports.
- Operate and manage production and staging cloud platforms, ensuring uptime and SLA adherence.
- Collaborate with Dev, QA, DevOps, and Customer Success teams to drive RCA and product improvements.
- Implement security guidelines (e.g., DDoS protection, vulnerability management, patch management, security agents).
- Manage least-privilege RBAC for production services and toolchains.
- Build and execute Disaster Recovery plans and actively participate in Incident Response.
Requirements
- 4–7 years in production engineering, site reliability, or related roles.
- Solid hands-on experience with at least one cloud provider (AWS, Azure, or GCP), with a focus on automation (certifications preferred).
- Strong expertise in Kubernetes and Linux.
- Proficiency in scripting/programming (Python required).
- Observability is critical at the scale of our systems, both for gaining insight into system behavior and for detecting problems and failures. We are looking for leads to drive this charter across logs, metrics, mesh, tracing, and related areas.
- Knowledge of CI/CD pipelines and toolchains (Jenkins, ArgoCD, GitOps).
- Familiarity with persistence stores (Postgres, MongoDB), data warehousing (Snowflake, Databricks), and messaging (Kafka).
- Exposure to monitoring/observability tools such as Elasticsearch, Prometheus, Jaeger, and New Relic.
- Proven experience in production reliability, scalability, and performance systems.
- Experience in 24x7 production environments with a process focus.
- Familiarity with ticketing and incident management systems.
- Security-first mindset with knowledge of vulnerability management and compliance.
- Excellent judgment, analytical thinking, and problem-solving skills.
- Strong sense of personal responsibility and accountability for delivering high quality work.
Benefits
- Generous Paid Time Off: Recharge and relax with 22 days of fixed time off per year, in addition to company holidays—because we believe work-life balance fuels performance.
- Best-in-Class Parental Leave: Spend quality time with your growing family. We offer one of the industry’s most generous parental leave policies to support you during life’s most important moments.
- Recognition & Rewards: We celebrate wins—big and small. Get rewarded with monetary incentives and company-wide recognition for your impact and dedication. Your hard work won’t go unnoticed.
- Comprehensive Insurance Coverage: Stay covered with medical, dental, and vision insurance, plus 100% company-paid short- and long-term disability and basic life insurance. Optional perks include discounted legal aid and pet insurance.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, Linux, Python, CI/CD, observability, production reliability, scalability, performance optimization, vulnerability management, Disaster Recovery
Soft Skills
analytical thinking, problem-solving, personal responsibility, accountability, judgment