Ensure that infrastructure and applications have high-quality Service Level Agreements (SLAs) and Service Level Objectives (SLOs) that are measured and adhered to
Ensure KUBRA maintains well-documented standards and best practices so that existing and new services are built for high availability and security
Ensure appropriate automation and observability exist to achieve low and continuously improving mean time to recovery (MTTR) for service-impacting incidents
Ensure that all incidents are thoroughly investigated and appropriately documented, along with corresponding problem records that capture corrective actions
Participate in the Architectural Review Process for new and existing services being built for the KUBRA HQ platform, ensuring compliance with standards and best practices for high availability, observability, security, and cost efficiency
Work closely with Development, Infrastructure, and Operations teams to lead the root cause analysis related to any major incidents – leading senior stakeholder communication, driving problem-solving, and debugging with best practice techniques
Design and conduct fault injection experiments to identify potential weak points in high-availability architecture and work with Platform Engineering and Software Engineering teams to remediate any findings
Perform periodic audits of applications and infrastructure to ensure compliance with standards and identify necessary remediation
Requirements
Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent experience
AWS Certifications (Solutions Architect, SysOps Administrator, DevOps Engineer) are desirable
Kubernetes Certifications (CKA, CKS, CKAD, KCNA) are desirable
Experience with a systems programming language, such as Go or Python, and shell scripting
Proficient with Terraform and infrastructure as code principles
Demonstrated proficiency in public cloud environments, particularly AWS
Hands-on experience with Kubernetes management within AWS EKS
Experience with CI/CD automation tools such as CircleCI and ArgoCD
Experience with monitoring and logging in cloud environments, using tools such as Prometheus, Grafana, OpenTelemetry, CloudWatch, and Honeycomb
In-depth understanding of containerization, microservice architecture, and related technologies
Strong communication skills and ability to facilitate effective technical problem-solving