
Principal DevOps, SRE Engineer
ELSA, Corp
full-time
Location Type: Remote
Location: India
About the role
- Own the SRE practice: define severity tiers (P1–P4), formalize on-call rotation, build SLA tracking dashboards, and establish incident management workflows across a team of 4 DevOps engineers.
- Build runbooks for the top recurring operational issues (pod scaling, deploy rollbacks, access management, EKS upgrades, CI/CD pipeline failures) and automate L1/L2 responses using tools like Shoreline.io, Rundeck, or PagerDuty automation.
- Introduce and operationalize AI-assisted DevOps tooling: AIOps for alert correlation, CastAI/Kubecost for cost optimization, GitHub Copilot for IaC acceleration. Train the existing team on these tools.
- Drive infrastructure modernization: EKS upgrades, Karpenter migration, observability (SigNoz/Prometheus), secrets management (ArgoCD/SOPS), and Terraform-based IaC maturity.
- Collaborate with AI Engineering, Mobile, and B2B teams to ensure infrastructure supports real-time speech processing, GPU workloads, and multi-region enterprise deployments.
- Design and plan a round-the-clock SRE coverage model as B2B enterprise SLA commitments grow, evaluating vendor partnerships or strategic hires for Americas timezone coverage.
Requirements
- 2+ years in DevOps/SRE, with at least 2 years in a principal or staff-level role owning reliability practices for a production SaaS product.
- Deep hands-on experience with AWS (EKS, EC2, DynamoDB, S3, IAM, Secrets Manager), Kubernetes (HPA, KEDA, Karpenter, pod scheduling, GPU workloads), and IaC (Terraform, Helm, ArgoCD).
- Track record of building runbooks, on-call rotations, and incident management frameworks — not just participating in them.
- Experience with observability stacks (Prometheus, Grafana, SigNoz or Datadog), CI/CD (GitLab CI, GitHub Actions), and alerting (PagerDuty, Opsgenie).
- Comfort working across timezones with distributed teams (India, Vietnam, Portugal).
- Strong written communication — you'll be writing runbooks, RCAs, and proposals as much as Terraform.
Benefits
- Flexible work setup: Remote-first for Singapore, India, Indonesia, Malaysia; hybrid model for Vietnam.
- Comprehensive employee well-being benefits.
- Free ELSA Premium courses to polish your language skills.
- Collaborative, international team culture.
- Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
DevOps, SRE, AWS, Kubernetes, IaC, Terraform, GitLab CI, GitHub Actions, observability, incident management
Soft Skills
written communication, collaboration, team training, time zone management