
Senior Site Reliability Engineer, AI Research
Algolia
full-time
Location Type: Remote
Location: Australia
About the role
- Support and evolve the reliability of platforms used by the AI Research team
- Ensure production services meet expectations for availability, latency, and operational readiness
- Design infrastructure and operational patterns that prioritize iteration speed while maintaining appropriate safeguards for production systems
- Work closely with researchers and engineers in a cross-functional setting
- Participate directly in team planning and execution, from early exploration through production rollout
- Help researchers self-serve infrastructure safely and effectively
- Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps
- Own and improve CI/CD pipelines for services written primarily in Go
- Design and operate observability systems using tools such as Datadog
- Participate in an on-call rotation (relatively light)
Requirements
- Strong experience operating cloud-first infrastructure
- Hands-on experience running production services on Kubernetes
- Proficiency with infrastructure-as-code (Terraform) and CI/CD systems
- Experience supporting production services written in Go (Python experience is a plus)
- Solid grounding in service reliability, incident response, and operational best practices
- Comfort working in ambiguous environments where problems are not always well defined up front
Benefits
- Flexible workplace strategy
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, GCP, infrastructure-as-code, GitOps, CI/CD, Go, Terraform, Python, service reliability, incident response
Soft Skills
cross-functional collaboration, problem-solving, adaptability, communication, team planning, execution, self-service support, operational readiness, iteration speed, safeguards