Prima Power

Senior Machine Learning Site Reliability Engineer

full-time

Location Type: Remote

Location: Italy

About the role

  • Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs
  • Work directly on production infrastructure
  • Collaborate closely with software engineers on system design and reliability improvements
  • Develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR
  • Participate in and lead incident response
  • Drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
  • Continuously analyze and optimize system performance and cost
  • Provide data, insights, and recommendations to inform capacity planning
  • Support security best practices through hands-on vulnerability remediation and threat mitigation

Requirements

  • Hands-on experience with SRE practices in production
  • Strong AWS expertise
  • Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
  • Strong software engineering fundamentals with emphasis on code quality and maintainability
  • Solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging)
  • Hands-on experience with PySpark
  • Familiarity with MLOps practices such as model registries, model versioning, retraining workflows, and end-to-end deployment lifecycles
  • Stakeholder engagement and mentoring, e.g. leading incident response and RCAs, improving system reliability, and engaging stakeholders to propose solutions, share learnings, and mentor others

Benefits

  • Private healthcare
  • Gym discounts
  • Wellbeing programs
  • Mental health support
  • Learning resources
  • Mentorship
  • Tailored growth plan

Applicant Tracking System Keywords

Hard skills
SRE practices, AWS, Kubernetes, Infrastructure as Code, Pulumi, Terraform, Python, PySpark, MLOps, system reliability

Soft skills
stakeholder engagement, mentoring, incident response, collaboration, communication, leadership, problem-solving, analysis, recommendation, capacity planning