
Senior Machine Learning Site Reliability Engineer
Prima Power
full-time
Location Type: Remote
Location: Italy
About the role
- Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs
- Work directly on production infrastructure
- Collaborate closely with software engineers on system design and reliability improvements
- Develop automation for infrastructure and operational workflows to eliminate toil and reduce mean time to recovery (MTTR)
- Participate in and lead incident response
- Drive blameless post-incident reviews, with concrete follow-ups implemented in code and tooling
- Continuously analyze and optimize system performance and cost
- Provide data, insights, and recommendations to inform capacity planning
- Support security best practices through hands-on vulnerability remediation and threat mitigation
Requirements
- Hands-on experience with SRE practices in production
- Strong AWS expertise
- Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
- Strong software engineering fundamentals, with emphasis on code quality and maintainability
- Solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging)
- Hands-on experience with PySpark
- Familiarity with MLOps practices such as model registries, model versioning, retraining workflows, and end-to-end deployment lifecycles
- Stakeholder engagement and mentoring, e.g. leading incident response and root cause analyses (RCAs), improving system reliability, proposing solutions, sharing learnings, and mentoring others
Benefits
- Private healthcare
- Gym discounts
- Wellbeing programs
- Mental health support
- Learning resources
- Mentorship
- Tailored growth plan
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
SRE practices, AWS, Kubernetes, Infrastructure as Code, Pulumi, Terraform, Python, PySpark, MLOps, system reliability
Soft skills
stakeholder engagement, mentoring, incident response, collaboration, communication, leadership, problem-solving, analysis, recommendation, capacity planning