Tech Stack
AWS, Azure, Google Cloud Platform, PyTorch, TensorFlow
About the role
- Architect, train, and optimize large-scale speech AI models, including speech-to-speech, speech restoration, and speech translation.
- Leverage self-supervised learning, contrastive learning, and transformer-based architectures (e.g., wav2vec, Whisper, GPT-style models) to improve model accuracy and adaptability.
- Develop efficient model distillation and quantization strategies to deploy large models with low-latency inference.
- Innovate on cross-lingual and multilingual speech processing using large-scale pretraining and fine-tuning.
- Curate and scale massive, diverse, multilingual, and multimodal datasets for robust model training.
- Apply active learning, domain adaptation, and synthetic data generation to overcome data limitations.
- Lead efforts in data quality assessment, augmentation, and curation for large-scale training pipelines.
- Develop distributed training strategies for large-scale models using cloud-based and on-prem GPU clusters.
- Design and implement scalable model evaluation frameworks, tracking WER, MOS, and latency across diverse scenarios.
- Optimize real-time inference pipelines to ensure high-throughput, low-latency speech processing.
- Collaborate with academia, open-source communities, and research partners to drive innovation.
- Work closely with MLOps, Data Engineering, and Product teams to deploy scalable AI systems.
- Ensure seamless integration of foundational models with edge devices, real-time applications, and cloud platforms.
- Translate cutting-edge research into production-grade models that power real-world communication.
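The evaluation responsibilities above mention tracking WER (word error rate), the standard ASR accuracy metric: the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. As a minimal, self-contained sketch (not any particular team's framework — production pipelines typically use a library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` is 1/3: one substitution against a three-word reference.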
Requirements
- Bachelor’s, Master’s, or Ph.D. in Computer Science, Electrical Engineering, or a related field with a focus on Machine Learning, Deep Learning, or Speech Processing.
- 5+ years of hands-on industry experience developing and implementing systems such as speech-to-text (ASR), text-to-speech (TTS), voice conversion and speech enhancement, and speech translation and multimodal learning.
- Strong proficiency in transformer-based architectures (e.g., wav2vec 2.0, Whisper, GPT, BERT).
- Expertise in deep learning frameworks such as PyTorch, TensorFlow, and large-scale training techniques.
- Experience with distributed training and optimization across multi-GPU clusters.
- Strong understanding of self-supervised learning, contrastive learning, and generative modeling for speech AI.
- Hands-on experience with cloud-based AI platforms (AWS, GCP, Azure) and model deployment.
- Experience curating and scaling massive, diverse, multilingual, and multimodal datasets for robust model training.
- Experience with active learning, domain adaptation, and synthetic data generation.
- Experience with data quality assessment, augmentation, and curation for large-scale training pipelines.
- Experience developing distributed training strategies for large-scale models using cloud-based and on-prem GPU clusters.
- Experience designing and implementing scalable model evaluation frameworks and tracking WER, MOS, and latency.
- Experience optimizing real-time inference pipelines for high-throughput, low-latency speech processing.
- Preferred: Experience in developing multimodal AI models integrating speech, text, and vision.
- Preferred: Track record of publishing in top-tier AI/ML conferences.
- Preferred: Experience optimizing large models for real-time inference on edge devices.
- Preferred: Proficiency with MLOps best practices for deploying and monitoring models in production.
- Preferred: Familiarity with open-source ASR/TTS toolkits.