
Senior Software Reliability Engineer – AI
MixMode
full-time
Location Type: Remote
Location: California • United States
About the role
- Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
- Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
- Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
- Design and build monitoring, alerting, and debugging tools for high-availability services.
- Partner with researchers and ML engineers to productionize models at scale.
- Establish best practices for testing, deployment, capacity planning, and incident response.
- Serve as a technical leader during on-call rotations, driving incident response, postmortems, and continuous system improvements.
Requirements
- 7+ years of professional software engineering experience
- Strong proficiency in Python and at least one JVM language (Java, Scala, or Kotlin preferred)
- Proven experience designing, building, and operating distributed systems in production
- Strong understanding of service architecture, concurrency, resource management, and distributed failure modes
- Prior experience with streaming data pipelines (e.g., Spark Streaming, Flink, Kafka)
- Hands-on experience running production services on Kubernetes, including pod lifecycle management and fault tolerance
- Strong experience with relational databases (e.g., PostgreSQL, MySQL), including query performance analysis, indexing, and connection management
- Demonstrated ability to diagnose and resolve performance, scalability, and reliability issues across application, database, and infrastructure layers
- Experience implementing automated testing (unit, integration, end-to-end) and production observability (logging, metrics, tracing)
- Experience collaborating with ML or data science teams to productionize predictive systems (note: ML expertise is not required)
- Ability to improve system architecture and engineering practices over time through design, code review, and mentorship
Benefits
- Remote-First Work Culture
- Healthcare (Medical, Dental, Vision, Accident)
- Basic & Voluntary Life and AD&D
- Flexible Spending Account (FSA)
- 401(k) with Employer Match
- Paid Holidays & Flexible Paid Time Off (PTO)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python, Java, Scala, Kotlin, distributed systems, streaming data pipelines, Kubernetes, PostgreSQL, MySQL, automated testing
Soft skills
technical leadership, collaboration, mentorship, problem-solving, incident response, continuous improvement, communication, capacity planning, observability, resilience