Senior Machine Learning Site Reliability Engineer

Prima Power

Full-time

Posted on: 1/13/2026

Location Type: Remote

Location: Italy

About the role

Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs
Work directly on production infrastructure
Collaborate closely with software engineers on system design and reliability improvements
Develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR
Participate in and lead incident response
Drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
Continuously analyze and optimize system performance and cost
Provide data, insights, and recommendations to inform capacity planning
Support security best practices through hands-on vulnerability remediation and threat mitigation

Hands-on experience with SRE practices in production
Strong AWS expertise
Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
Strong software engineering fundamentals with emphasis on code quality and maintainability
Solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging)
Hands-on experience with PySpark
Familiarity with MLOps practices such as model registries, model versioning, retraining workflows, and end-to-end deployment lifecycles
Stakeholder engagement and mentoring: leading incident response and RCAs, improving system reliability, engaging stakeholders to propose solutions, sharing learnings, and mentoring others

Benefits
