Site Reliability Engineer
Hewlett Packard Enterprise. Engage in and improve the whole lifecycle of services - from inception and design, through to deployment, operation, and refinement.
Tech Stack
Tools & technologies: Airflow, Ansible, Apache, AWS, Cassandra, Cloud, Distributed Systems, Docker, Elasticsearch, Flux, Go, Kafka, Kubernetes, Linux, Packer, Postgres, Python, Redis, Ruby, Spark, Terraform, Unix
About the role
Key responsibilities & impact
- Engage in and improve the whole lifecycle of services - from inception and design, through to deployment, operation, and refinement.
- Support development of services from planning phase before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Provide technical leadership and guidance to other team members on managing availability and performance of mission critical services, on building automation to prevent problem recurrence, and building automated responses for non-exceptional service conditions.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Plan capacity for the growth of cloud infrastructure.
- Improve operational processes such as deployments and upgrades.
- Manage execution of project priorities, deadlines, and deliverables.
- Participate in an on-call rotation to respond to incidents that impact platform availability.
- Use your on-call shift to prevent incidents from happening.
- Lead incident response, including conducting post-mortems and implementing lessons learned, to enhance system reliability.
Requirements
What you’ll need
- 10+ years of engineering or systems experience.
- Experience building and running reliable and fault-tolerant production cloud systems at scale on AWS.
- Experience coding infrastructure automation with Terraform, Terragrunt, Packer, and CI/CD, and using configuration management systems like Ansible.
- Hands-on experience with Linux/Unix operating systems internals, file systems, system tuning, administration, and networking.
- Deep experience in microservice technologies, container orchestration, and continuous deployment (Kubernetes, Docker, Helm, GitOps with Flux).
- Experience designing, building, and maintaining production services, and troubleshooting large-scale distributed systems.
- Experience with technologies like Apache Kafka, Apache Storm, Apache Flink, Apache Airflow, Spark, Postgres, Redis, Elasticsearch, Arango, and Cassandra.
- Experience with observability tools and methodology (monitoring, logging, tracing, SLOs/SLIs) for detecting and diagnosing issues before they cause service impact or performance degradation.
- Strong programming skills in Shell, Python, Golang, and/or Ruby.
- Ability to deliver efficiently and effectively.
- Strong problem-solving and debugging skills with a high sense of ownership.
Benefits
Comp & perks
- Health & Wellbeing
- Personal & Professional Development
- Unconditional Inclusion
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AWS, Terraform, Terragrunt, Packer, CI/CD, Ansible, Linux, Unix, Kubernetes, Docker
Soft Skills
technical leadership, problem-solving, debugging, ownership, communication, project management, incident response, collaboration, process improvement, capacity planning