Senior Machine Learning Engineer – Evaluations, Design Generation

Canva

full-time

Posted on: 12/9/2025

Location Type: Remote

Location: Remote • 🇦🇺 Australia

Job Level

Senior

Tech Stack

Distributed Systems, Python, SQL

About the role

Understand and optimize existing evaluation systems, including LLM-as-Judge frameworks, visual quality models, and multi-dimensional scoring approaches - analyzing their strengths, limitations, and trade-offs to identify gaps in coverage and opportunities for improvement across brand adherence, visual appeal, layout quality, and functional correctness
Design and implement automated evaluation pipelines that score generated designs at scale, balancing accuracy with computational cost
Define evaluation strategies for different scenarios: pre-deployment validation, continuous monitoring, A/B experiment analysis, and model comparison
Curate high-quality evaluation datasets and benchmark suites that represent diverse use cases, edge cases, and quality dimensions
Integrate evaluation systems into continuous deployment pipelines, creating automated quality gates that catch regressions before production
Reduce evaluation cycle time to enable teams to iterate faster on model improvements and launch experiments earlier
Partner with research teams to understand evaluation needs for new model architectures and capabilities
Define the evaluation ecosystem strategy: how different evaluation tools and methods compose together for Design Generation
Guide teams on evaluation best practices, appropriate methodologies for their use cases, and interpretation of results

Requirements

Strong ML engineering fundamentals with experience building and maintaining production ML systems at scale
Proven ability to build robust, scalable infrastructure (not just models) - you're a platform engineer who speaks ML
Deep understanding of distributed systems, observability patterns, and monitoring best practices
Python proficiency with production-quality coding standards, code reviews, and testing practices
Experience with data pipelines, time-series data, and statistical analysis for detecting anomalies
SQL fluency for querying and analyzing large datasets across data warehouse and analytics systems
Track record of building self-service platforms or developer tooling that teams actually adopt
Excellent collaboration skills - this role requires working across teams to understand needs and deliver solutions
Experience with evaluation of Gen AI systems at scale (even better if that’s evaluation of systems with creative outputs!)

Benefits

Equity packages - we want our success to be yours too
Inclusive parental leave policy that supports all parents & carers
An annual Vibe & Thrive allowance to support your wellbeing, social connection, office setup & more
Flexible leave options that empower you to be a force for good, take time to recharge, and support you personally

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

ML engineering, production ML systems, distributed systems, observability patterns, monitoring best practices, Python, data pipelines, time-series data, statistical analysis, SQL

Soft skills

collaboration, communication, problem-solving, teamwork, adaptability, guidance, interpretation of results, best practices