Tech Stack
Tools & technologies
Ansible, AWS, Cloud, DNS, Docker, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Splunk, TCP/IP, Terraform
About the role
Key responsibilities & impact
- Act as a primary or escalation responder in a 24x7 on-call rotation
- Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
- Coordinate across Engineering, Infrastructure, Security, and Product teams
- Execute and improve runbooks, playbooks, and escalation paths
- Drive blameless post-incident reviews (PIRs) and track corrective actions
- Own service health monitoring across infrastructure, applications, and dependencies
- Design and maintain alerting strategies that align with SLIs/SLOs
- Reduce alert fatigue through signal-to-noise improvements
- Build dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, CloudWatch
- Automate repetitive operational tasks to reduce manual toil
- Improve mean time to detect (MTTD) and mean time to resolve (MTTR)
- Develop scripts and tools (Python, Bash, Go, etc.) to support NOC/SRE workflows
- Implement self-healing and auto-remediation where possible
- Partner with engineering teams to improve system design for reliability
- Support and troubleshoot Linux-based systems, cloud platforms, Kubernetes/containerized environments
- Assist with capacity planning and availability reviews
- Ensure operational readiness for production releases
Requirements
What you’ll need
- Strong Linux systems administration
- Experience with incident management and production support
- Familiarity with cloud infrastructure (AWS preferred)
- Containers & orchestration (Docker, Kubernetes)
- Monitoring/alerting platforms
- Scripting or programming experience in Python, Bash, Go, or similar
- Understanding of networking fundamentals (DNS, TCP/IP, load balancing)
- Experience working in 24x7 NOC or production operations environments
- Ability to handle high-pressure incidents calmly and effectively
- Strong written and verbal communication for incident coordination
- Comfort working from runbooks, and improving them when they fall short
- Experience defining or operating to SLOs / SLIs
- Prior experience migrating from a traditional NOC to an SRE model
- Infrastructure as Code experience (Terraform, Ansible, etc.)
- Exposure to security, compliance, or regulated environments
Benefits
Comp & perks
- Professional development opportunities
- Flexible working hours
- Work from home
