Multimodal ML Engineer
White Circle
Multimodal ML Engineer developing and fine-tuning multimodal models for an AI safety platform. Train and deploy vision, audio, and speech models at White Circle.
Tech Stack
Tools & technologies
- PyTorch
About the role
Key responsibilities & impact
- Train and fine-tune large-scale multimodal models (vision-language, audio, speech) from scratch and from pretrained checkpoints
- Extend models across modalities: image understanding, video temporal modeling, long-context processing, and streaming audio
- Design and run experiments: architecture changes, data mixes, training recipes
- Build and maintain multimodal data pipelines — from raw images, video, and audio recordings to training-ready datasets, including synthetic data generation
- Train and optimize MoE architectures for efficient multimodal inference
- Build alignment pipelines: SFT, DPO, GRPO, reward modeling — across modalities, not just text
- Optimize models for production: quantization, distillation, batching, streaming and low-latency serving
- Deploy models end-to-end: from research checkpoint to production serving
- Define evaluation metrics and benchmarks that actually matter for the product: visual QA, spatial reasoning, video comprehension, speech and audio understanding
Requirements
What you’ll need
- 3+ years training large-scale deep learning models in multimodal domains (vision-language, audio, speech, or acoustic)
- Strong PyTorch skills with hands-on distributed training experience (DeepSpeed, FSDP, or similar)
- Deep experience with multimodal architectures: you understand how vision/audio encoders, projectors, and LLMs fit together (LLaVA, Qwen-VL, InternVL, Audio Flamingo, Qwen-Omni, Qwen-Audio, Whisper, HuBERT, Conformer, or similar)
- Hands-on with RLHF/alignment for multimodal: GRPO, DPO, reward modeling — not just for text
- Experience with video and/or audio sequence modeling: temporal modeling, long-context processing, efficient attention, streaming inference
- Track record of shipping models to production: you've hit latency targets and optimized inference, not just reported benchmark scores
- Comfortable with large-scale multimodal dataset curation: image-text pairs, video-instruction data, audio preprocessing, augmentation, synthetic data generation
- Familiar with MoE architectures and their tradeoffs for multimodal workloads
- Strong engineering fundamentals: clean code, version control, testing, documentation
Benefits
Comp & perks
- Paid time off in line with your local regulations, no matter where you work from
- Work from Paris (hybrid) with a relocation package available, or work from London (note: we are unable to provide relocation support for London-based roles)
- Comprehensive medical insurance for our France-based team (please note that we are in the process of setting up our UK office and therefore cannot offer medical insurance for London-based roles yet)
- All the hardware, tools, and services you need
- Covered subscriptions for AI agents and IDEs
- Team off-sites twice a year: we’ve recently been to the Alps and to Saint-Tropez
ATS Keywords
Hard Skills & Tools
- Multimodal Model Training
- Data Pipeline Development
- Model Optimization
- Quantization
- Distillation
- Temporal Modeling
- Synthetic Data Generation
- MoE Architectures
- Evaluation Metrics Definition
- Clean Code Practices