- Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
- Act as first responder for incidents and contribute to problem management and root cause analysis.
- Support the development team's reliability efforts and help build a solid reliability culture within the development lifecycle.
- Develop troubleshooting documentation for production support resources.
- Collaborate with Engineering teams to develop effective runbooks and operational documentation, and to automate operational tasks.
- Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
- Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
- Participate in on-call rotations and continuously improve alert quality and response processes.
- Champion a culture of reliability, performance, and continuous improvement across teams.
Requirements
- Bachelor's Degree or MS in Engineering or equivalent.
- Experience operating at least one container orchestration platform (Kubernetes, Docker Swarm).
- Experience developing or maintaining software for production services at scale.
- Experience with ELK.
- Experience with AWS.
- Experience with Grafana/Prometheus stack.
- Strong scripting skills (Bash, Python or Go).
- Excellent communication skills.
- Thinking outside the box and anticipating challenges. We cannot afford to be merely reactive; we must expect challenges and question the technologies, procedures and thinking already in place. You will be expected to constantly review and challenge at all levels.
- Versatility. We work with agile/lean methods. We'd much rather iterate and learn than assume we know all the answers.
- Being a team player. You don't (always) work in isolation and are excited by the thought of working with your team while involving product, experience design, engineering, and more in the process.
**Will be considered as a plus:**
- Telephony knowledge (SIP, VoIP);
- Experience in Linux administration (RedHat, CentOS, AL);
- Working knowledge of infrastructure-as-code and configuration management tools (Terraform, Ansible);
- Experience with TCP/IP and general networking concepts;
- RDBMS knowledge (MySQL, Postgres);
- NoSQL knowledge (Redis).
Benefits
- Fixed compensation;
- Long-term employment with vacation counted in working days;
- Professional development (courses, training, etc.);
- Being part of successful cutting-edge technology products that are making a global impact in the service industry;
- Proficient and fun-to-work-with colleagues;
- Apple gear.