Tech Stack
AWS, Azure, Cloud, Google Cloud Platform, Linux, PySpark, Spark, Terraform
About the role
- Own and optimize the cloud systems that power our clinical trial innovation platform
- Develop, maintain, optimize and harden our single-tenant cloud infrastructure
- Implement a secure, high-performance network topology that connects frontend services, databases, and ML processing clusters
- Design and implement disaster recovery strategies, including backup automation, fail-over procedures, and restore drills
- Coordinate organization-wide SRE practices, including cross-component tracing, incident management, alerting, and reliability metrics
- Work closely with engineering teams to understand their infrastructure requirements and enable them to achieve continuous and stable deployments
- Administer key systems (AWS, Databricks, DBs, etc.) including access controls, security hardening, monitoring, and compliance management
- Establish and manage an infrastructure request ticketing system with self-service capabilities that let engineers request changes, provision resources, and receive guidance
Requirements
- At least 6 years of experience in each of the following: managing core services on a major cloud provider (AWS, GCP, Azure), cloud networking, IaC tools (Terraform, CloudFormation, etc.), relational database management, and CI/CD pipelines
- At least 2 years of experience in each of the following: distributed architectures, infrastructure operations processes, Linux and Bash scripting, and SRE and observability platforms (New Relic, Datadog, etc.)
- Excellent written and verbal communication skills
- Ability to work independently and as part of a team
- Experience with Databricks: metastore management, access control, asset bundles, the Databricks Terraform provider, etc.
- Experience developing single-tenant solutions for large enterprise clients
- Working knowledge of PySpark, Spark, AWS ECR and the like
- Experience with container orchestration platforms (EKS, ECS, Podman, etc.)
- Experience with Vercel