MixMode

Senior Software Reliability Engineer – AI

full-time

Location Type: Remote

Location: California, United States

About the role

  • Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
  • Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
  • Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
  • Design and build monitoring, alerting, and debugging tools for high-availability services.
  • Partner with researchers and ML engineers to productionize models at scale.
  • Establish best practices for testing, deployment, capacity planning, and incident response.
  • Serve as a technical leader during on-call rotations, driving incident response, postmortems, and continuous system improvements.

Requirements

  • 7+ years of professional software engineering experience
  • Strong proficiency in Python and at least one JVM language (Java, Scala, or Kotlin preferred)
  • Proven experience designing, building, and operating distributed systems in production
  • Strong understanding of service architecture, concurrency, resource management, and distributed failure modes
  • Prior experience with streaming data pipelines (e.g., Spark Streaming, Flink, Kafka)
  • Hands-on experience running production services on Kubernetes, including pod lifecycle management and fault tolerance
  • Strong experience with relational databases (e.g., PostgreSQL, MySQL), including query performance analysis, indexing, and connection management
  • Demonstrated ability to diagnose and resolve performance, scalability, and reliability issues across application, database, and infrastructure layers
  • Experience implementing automated testing (unit, integration, end-to-end) and production observability (logging, metrics, tracing)
  • Experience collaborating with ML or data science teams to productionize predictive systems. (Note: ML expertise is not required.)
  • Ability to improve system architecture and engineering practices over time through design, code review, and mentorship

Benefits

  • Remote-First Work Culture
  • Healthcare (Medical, Dental, Vision, Accident)
  • Basic & Voluntary Life and AD&D
  • Flexible Spending Account (FSA)
  • 401(k) with Employer Match
  • Paid Holidays & Flexible Paid Time Off (PTO)

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Python, Java, Scala, Kotlin, distributed systems, streaming data pipelines, Kubernetes, PostgreSQL, MySQL, automated testing
Soft skills
technical leadership, collaboration, mentorship, problem-solving, incident response, continuous improvement, communication, capacity planning, observability, resilience