
Member of Technical Staff, Training Engineer – Large Scale Foundation Models
FirstPrinciples Holding Company
Full-time
Location Type: Remote
Location: Remote • 🇨🇦 Canada
Job Level
Lead
Tech Stack
Node.js, PyTorch
About the role
- Develop and lead end-to-end pre-training of large language models on GPU clusters.
- Combine deep engineering expertise with research intuition.
- Build data pipelines and perform distributed training at scale.
- Make informed decisions about microbatch and global batch configurations.
- Provide the executive team with strategic insight into the financial implications of training runs.
- Design capital allocation frameworks that keep large-scale training sustainable.
- Operate distributed training infrastructure using modern techniques.
- Write production-grade PyTorch and Triton/CUDA kernels when required.
- Lead cross-functional efforts and mentor engineers.
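For candidates less familiar with the batch terminology in the responsibilities above, a minimal sketch of the arithmetic behind microbatch and global batch configuration (function name and numbers are illustrative, not taken from the posting):

```python
def global_batch_size(microbatch: int, grad_accum_steps: int, dp_ranks: int) -> int:
    """Effective samples per optimizer step: each data-parallel rank runs
    `microbatch` samples per forward/backward pass, accumulates gradients
    over `grad_accum_steps` passes, then all ranks synchronize."""
    return microbatch * grad_accum_steps * dp_ranks

# e.g. 4 samples per GPU, 8 accumulation steps, 64 data-parallel GPUs
print(global_batch_size(4, 8, 64))  # -> 2048
```

In practice the microbatch is capped by per-GPU memory, while the global batch is a training-dynamics choice, so gradient accumulation and the data-parallel degree are the knobs used to reconcile the two.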
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or related field.
- 7-12+ years of total experience, including 2+ years training large Transformers at scale.
- Hands-on experience with at least one frontier-style training run.
- Expert-level proficiency in PyTorch (including compiled mode/torch.compile).
- Deep facility with distributed frameworks (PyTorch FSDP or DeepSpeed ZeRO).
- Proven success operating multi-node GPU jobs.
- Demonstrated impact from data quality work.
- Strong applied mathematics background.
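As a taste of the sharded-training reasoning the requirements allude to, here is a back-of-envelope per-GPU memory estimate under full sharding (ZeRO-3 / FSDP style); the byte counts assume mixed precision with Adam and are a common rule of thumb, not a figure from the posting:

```python
def sharded_state_gib(n_params: float, n_shards: int, bytes_per_param: int = 16) -> float:
    """Per-GPU memory for parameters, gradients, and optimizer state when fully
    sharded: fp16 params (2 B) + fp16 grads (2 B) + fp32 master weights (4 B)
    + Adam first/second moments (4 B + 4 B) = 16 B per parameter, split across
    shards. Activations and buffers are ignored."""
    return n_params * bytes_per_param / n_shards / 2**30

# A 7B-parameter model sharded over 64 GPUs: roughly 1.6 GiB of state per GPU
print(round(sharded_state_gib(7e9, 64), 1))  # -> 1.6
```

The same arithmetic explains why unsharded data parallelism breaks down at this scale: with `n_shards=1`, the state alone exceeds 100 GiB per GPU.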
Benefits
- Health insurance
- Innovative research environment
- Collaboration with top experts
- Opportunity to work on groundbreaking technology
- Flexible remote work
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
large language models, GPU clusters, data pipelines, distributed training, microbatch configurations, global batch configurations, PyTorch, Triton, CUDA, applied mathematics
Soft skills
leadership, mentoring, strategic insights, cross-functional collaboration
Certifications
Bachelor's degree, Master's degree