Salary
💰 $152,500 - $262,350 per year
Tech Stack
Android, Distributed Systems, Go, Gradle, iOS, Jenkins, Kotlin, Python, Swift
About the role
- Act as a project or system leader, coordinating the activities of other engineers on the project or within the system
- Determine the technical tasks that other engineers will follow
- Proactively improve existing structures and processes, and exercise judgment in reconciling diverse priorities
- Define mobile-specific SLIs and SLOs (e.g., crash-free sessions, ANRs, app startup time, network success rates, battery/memory usage)
- Establish best practices for observability, alerting, and incident response in Datadog
- Lead development of automation and tools for mobile reliability (automated regression detection, performance benchmarking, crash/ANR triage, release health dashboards, instrumentation libraries)
- Ensure tooling aligns with existing systems (Harness for CI/CD, Gradle/Bazel for builds)
- Act as primary liaison with backend/web SRE leadership for incident response and shared visibility
- Partner with Release Engineering, QA, and Product to ensure operational readiness of new features
- Influence architecture and design decisions to prioritize mobile reliability
- Lead cultural change: define and roll out an on-call model for mobile teams and champion a blameless postmortem culture
- Mentor and guide a distributed team of senior Mobile SREs and provide technical leadership in complex incidents
- Help recruit and onboard new SREs and set technical and cultural standards
- Partner with infrastructure and developer productivity teams to integrate Bazel and Gradle builds into a reliable CI/CD pipeline, and establish long-term roadmaps for mobile reliability
Requirements
- Minimum of 8 years of relevant work experience
- Bachelor's degree or equivalent experience
- 8+ years of experience in software engineering, SRE, or mobile systems roles
- Strong understanding of iOS and/or Android performance and reliability challenges
- Hands-on experience with Datadog (or equivalent observability platforms) for monitoring, alerting, and dashboards
- Proven ability to define and implement SLIs/SLOs across complex, distributed systems
- Experience leading on-call rotations, incident response, and postmortems
- Demonstrated experience building automation and internal tools for reliability
- Strong programming skills in Python, Go, or similar
- Working knowledge of Swift/Kotlin for client instrumentation
- Exceptional ability to influence and partner across engineering, product, and SRE orgs
- Track record of mentoring engineers and leading distributed teams
- Preferred: Experience with CI/CD for mobile (Harness, Fastlane, Jenkins)
- Preferred: Familiarity with Bazel and Gradle build systems
- Preferred: Prior experience introducing cultural changes (e.g., adopting on-call or reliability practices)
- Strong knowledge of backend service reliability concepts, in order to bridge client and server