Tech Stack
AWS, Azure, Cloud, Google Cloud Platform, Kubernetes, Python, Terraform, TypeScript
About the role
- Own strategy, design, and execution of platform integrations connecting Sully.ai’s AI medical employees with major EHRs, partner APIs, and healthcare data standards.
- Build and scale a high-performing platform team; collaborate closely with Product, Clinical, Security, and Customer Success.
- Own multi‑cloud provisioning, upgrading, and reliability of the core platform (AWS, GCP, Azure).
- Automate everything via infrastructure as code and GitOps (ArgoCD, Terraform/Pulumi).
- Ship fully automated deployments across regions and standardize developer experience (tooling, CI/CD).
- Optimize the stack for scalability and cost efficiency on AI workloads; enforce observability, SLOs/SLIs, and incident management, and drive MTTR improvements.
- Ensure integrations are robust, secure, and HIPAA‑compliant, and that customer‑facing quality is high.
- First month: define and deliver a platform solution; audit the platform, CI/CD, and developer tooling; baseline lead time, deployment frequency, MTTR, and cost per token; and draft a Platform Playbook.
- 90‑day success: 100% of infra changes made via IaC + GitOps, automated multi‑region deployments across ≥2 clouds, SLOs enforced, MTTR reduced by 50%, and significant gains in engineer efficiency alongside significant cost reductions.
Requirements
- 10+ years in Platform, DevOps, or SRE roles with production Kubernetes across multiple clouds (AWS, GCP, Azure), not single‑cloud only.
- Demonstrated experience designing multi‑cloud architecture and portability patterns.
- Proficiency with GitOps and Infrastructure as Code, including ArgoCD and either Terraform or Pulumi.
- Strong coding skills in Python or TypeScript for platform tooling and automation.
- Experience running CI/CD at scale, owning observability (metrics, logs, tracing), practicing SLOs and SLIs, managing incidents, and improving MTTR.
- Experience operating in regulated environments (HIPAA) with strong security, networking, and IAM fundamentals.
- Nice-to-Have: AI/LLM production experience; OpenAI-compatible APIs.
- Nice-to-Have: Cost optimization for AI/cloud.
- Nice-to-Have: Security depth (policy‑as‑code, supply chain).