Develop and enhance software applications and configuration to better align with operational needs. Collaborate closely with the development team to achieve the company’s overarching goals.
Deploy, maintain, and optimize our comprehensive observability stack, including metrics, logs, and traces. Design and refine alerting strategies to move from reactive to proactive monitoring.
Manage and provision cloud infrastructure using modern Infrastructure as Code tools.
Leverage innovative GenAI tools to boost SRE efficiency. This involves developing and maintaining systems that utilize AI for in-depth data analysis, automated incident diagnostics, and improved deployment reliability checks.
Participate in the on-call rotation to ensure production reliability.
Requirements
A Bachelor’s degree in Computer Science or Engineering, or 1+ years of experience in a relevant technical operations or platform role.
Possess a solid understanding of core SRE concepts and cloud computing principles.
Demonstrate skill in at least one modern programming or scripting language (e.g., Python, Java, Bash) for automation and tooling development.
Experience working within Windows, Linux, or Unix environments.
Proven ability to approach complex, ambiguous production issues with a systematic, data-driven methodology.
Benefits
Flexible time off
Comprehensive health coverage
Competitive salary
Paid parental leave
Wellness benefits, including access to mental health resources and virtual HIIT and yoga workouts
A bevy of other perks including Udemy access, childcare assistance, pet insurance discounts, legal assistance, and additional discounts.