Salary
💰 $148,540 - $245,050 per year
Tech Stack
AWS, Azure, Cloud, Go, Kubernetes, Linux, OpenShift, Open Source, Python, PyTorch, TensorFlow
About the role
- Lead team strategy and implementation for Kubernetes-native components in Model Serving, including Custom Resources, Controllers, and Operators.
- Be an influencer and leader in MLOps-related open source communities to help build an active MLOps open source ecosystem for Open Data Hub and OpenShift AI.
- Act as an MLOps subject matter expert (SME) within Red Hat by supporting customer-facing discussions, presenting at technical conferences, and evangelizing OpenShift AI within the internal communities of practice.
- Architect and design new features for open-source MLOps communities such as Kubeflow and KServe.
- Provide technical vision and leadership on critical and high-impact projects.
- Mentor, influence, and coach a distributed team of engineers.
- Ensure that non-functional requirements, including security, resiliency, and maintainability, are met.
- Write unit and integration tests and work with quality engineers to ensure product quality.
- Use CI/CD best practices to deliver solutions as productization efforts into Red Hat OpenShift AI (RHOAI).
- Contribute to a culture of continuous improvement by sharing recommendations and technical knowledge with team members.
- Collaborate with product management, other engineering teams, and cross-functional stakeholders to analyze and clarify business requirements.
- Communicate effectively with stakeholders and team members to ensure proper visibility of development efforts.
- Give thoughtful and prompt code reviews.
- Represent RHOAI in external engagements, including industry events, customer meetings, and open-source communities.
- Proactively use AI-assisted development tools (e.g., GitHub Copilot, Cursor, Claude Code) for code generation, auto-completion, and intelligent suggestions to accelerate development cycles and enhance code quality.
- Explore and experiment with emerging AI technologies relevant to software development, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.
Requirements
- Proven expertise with Kubernetes API development and testing (CRs, Operators, Controllers), including reconciliation logic.
- Strong background in model serving frameworks such as KServe and vLLM, and in distributed inference strategies for LLMs (tensor, pipeline, and data parallelism).
- Deep understanding of GPU optimization, autoscaling (KEDA/Knative), and low-latency networking (e.g., NVLink, peer-to-peer GPU communication).
- Experience architecting resilient, secure, and observable systems for model serving, including metrics and tracing.
- Advanced skills in Go and Python; ability to design APIs for high-performance inference and streaming.
- Excellent system troubleshooting skills in cloud environments and the ability to innovate in a fast-paced setting.
- Strong communication and leadership skills to mentor teams and represent projects in open-source communities.
- Autonomous work ethic and passion for staying at the forefront of AI and open source.
The following will be considered a plus:
- Existing contributions to one or more MLOps open source projects such as Kubeflow, KServe, Ray Serve, or vLLM.
- Familiarity with optimization techniques for LLMs (quantization, TensorRT, Hugging Face Accelerate).
- Knowledge of end-to-end MLOps workflows, including model registry, explainability, and drift detection.
- Bachelor's degree in statistics, mathematics, computer science, operations research, or a related quantitative field, or equivalent expertise; a Master's or PhD is a big plus.
- Understanding of how Open Source and Free Software communities work.
- Experience with development for public cloud services (AWS, GCE, Azure).
- Experience in engineering, consulting, or another field related to model serving and monitoring, model registries, explainable AI, or deep neural networks, whether in a customer environment or supporting a data science team.
- Extensive experience with OpenShift.
- Familiarity with popular Python machine learning libraries such as PyTorch, TensorFlow, and Hugging Face.