Tech Stack
Ansible, AWS, Chef, Cloud, Distributed Systems, Docker, EC2, Elasticsearch, Go, Grafana, Graphite, Linux, Perl, Prometheus, Puppet, Python, TCP/IP, Terraform, Unix, Vagrant
About the role
- Work as an integral member of product teams to build, deploy, and monitor cloud services reliably
- Contribute to complex software development projects to maintain essential, revenue-critical services
- Ensure the reliability, availability, and performance of Elasticsearch infrastructure
- Actively develop code and build frameworks to monitor services deployed in production
- Build systems and infrastructure to monitor complex, large-scale distributed systems
- Identify stability and performance issues and collaborate with developers to triage critical production issues
- Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
- Devise ways to actively monitor system throughput, capacity, and reliability
- Debug complex systems and evolve running environments without causing downtime
- Engage in service capacity planning, demand forecasting, software performance analysis, and system tuning
- Drive standardization efforts across multiple disciplines and services with embedded SREs throughout the organization
- Monitor and troubleshoot Elasticsearch performance issues and outages
- Collaborate effectively with Developers, Designers, Customer Support, and Engineering Leadership
Requirements
- Bachelor’s degree in Computer Science or equivalent work experience as a System Administrator with programming skills
- Fundamental knowledge of technologies across a broad range of disciplines, including virtualization, storage, networking, server, and security
- Understanding of systems and application design, including the operational trade-offs of various designs
- Experience with monitoring and logging solutions such as Prometheus, Grafana, and the ELK stack
- Proficiency in scripting languages such as Python
- Experience with infrastructure-as-code tools such as Terraform or CloudFormation
- Strong understanding of Linux system administration and networking concepts
- Excellent troubleshooting and problem-solving skills
- Ability to work independently and collaboratively in a fast-paced environment
- Strong communication and interpersonal skills
- Demonstrable knowledge of Unix, TCP/IP, HTTP, web application security, and experience supporting multi-tier web application architectures
- Experience in analyzing logs and troubleshooting large-scale distributed systems
- Nice to have: Experience instrumenting and monitoring production systems using tools such as the ELK stack, Zabbix, Nagios, StatsD/Graphite, and APM
- Nice to have: Experience with AWS infrastructure (including EC2, S3, VPC, Security Groups, and RDS)
- Nice to have: Working understanding of Docker, Vagrant, and configuration management tools like Ansible, Chef, or Puppet
- Nice to have: Experience with general-purpose programming/scripting languages including Bash, Perl, or Go
Benefits
- Medical Insurance
- Flexible PTO
- Flex Friday
- Hybrid Work Option Available
- Tuition Reimbursement
- And more!
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Elasticsearch, Python, Terraform, CloudFormation, Linux system administration, TCP/IP, HTTP, Bash, Perl, Go
Soft skills
troubleshooting, problem-solving, communication, interpersonal, collaboration, independence, fast-paced environment