Salary
💰 $75,000 - $125,000 per year
Tech Stack
AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Java, Jenkins, Kafka, Kubernetes, Maven, Prometheus, Python, RabbitMQ, SDLC, Splunk, Spring
About the role
- Demonstrate and advance SRE practices by collaborating with stakeholders to implement core SRE principles and objectives
- Partner with product and platform teams to define and track service level objectives (SLOs) and indicators (SLIs)
- Monitor and manage system reliability performance, ensuring systems meet SLOs
- Communicate reliability concerns and their potential impact with key stakeholders
- Promote the prioritization of reliability throughout the software development life cycle
- Design, code, test, and deliver solutions to automate manual operations
- Participate in on-call rotations, provide support for SRE systems, and lead or participate in post-mortem incident analysis
- Engage in system design, capacity planning, and architecture discussions to ensure operational requirements are met
- Share lessons learned and best practices regarding reliability and performance with stakeholders and team members
- Assist in training and mentoring fellow junior SREs to ensure best practices are followed and scaled within the organization
- Pursue continuous improvement opportunities, stay up to date on SRE methods and trends, and participate in organizational learning initiatives
- Support governance and ensure compliance with policies by collaborating with security, compliance, and other teams
- Respond promptly to requests for assistance from technical customers, providing engineering support and best-practice guidance
- Adhere to and suggest improvements to standard operating procedures, and advocate for automation and workflow optimization
Requirements
- Bachelor's degree in Computer Science, Information Systems, or a related technical field, or equivalent practical experience
- Experience coding in one or more programming languages or frameworks such as Java, Python, Go, C++, or Spring Framework
- Understanding of DevOps principles and practices
- Interest in building and operating large-scale, distributed systems
- Familiarity with cloud platforms like AWS, Azure, or GCP
- Experience with message queue (MQ) technologies like RabbitMQ, Kafka, or similar
- Experience with observability tools like Splunk, Dynatrace, Prometheus, or Datadog
- Knowledge of industry-standard CI/CD tools like Git/Bitbucket, Jenkins, Maven, and Artifactory
- Understanding of client-server relationships, network concepts, and operating system navigation
- Familiarity with Kubernetes and configuration management tools
- Ability to participate in on-call rotations and incident post-mortem analysis
- Strong verbal and written communication skills
- Critical thinking skills and a proactive approach to problem-solving
- Willingness to learn and take on challenging opportunities
- Desirable: One to two years of experience in a related role, with SRE experience preferred
- Desirable: Automation provider certifications
- Desirable: Experience with algorithms, data structures, scripting, pipeline management, and software design
Benefits
- Insurance (including medical, prescription drug, dental, vision, disability, and life insurance)
- Flexible spending account and health savings account
- Paid leaves (including 16 weeks of new parent leave and up to 20 days of bereavement leave)
- 80 hours of Paid Sick and Safe Time
- 25 days of vacation time and 5 personal days, pro-rated based on date of hire
- 10 paid U.S. observed holidays per year
- 401k with a best-in-class company match
- Deferred compensation for eligible roles
- Fitness reimbursement or on-site fitness facilities
- Eligibility for tuition reimbursement
- Competitive base salary, with possible eligibility for an annual bonus or commissions
- And many more
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Java, Python, Go, C++, Spring Framework, DevOps, Kubernetes, Message Queue, CI/CD, software design
Soft skills
communication, critical thinking, problem-solving, mentoring, collaboration, continuous improvement, training, proactive approach, stakeholder engagement, incident analysis
Certifications
Bachelor's degree, Automation provider certifications