Salary
💰 $139,840 - $192,280 per year
Tech Stack
Apache, AWS, Cassandra, Cloud, Distributed Systems, Go, Google Cloud Platform, Grafana, Jenkins, Kafka, Kubernetes, Microservices, MongoDB, Packer, Prometheus, Python, Redis, Spinnaker, Splunk, Terraform, ZooKeeper
About the role
- Lead a team of engineers for Splunk Cloud Observability in FedRAMP environments.
- Manage across the organization to deliver quality products.
- Mentor and grow engineering teams building cloud-based environments for massive-scale data processing.
- Partner with Talent Acquisition to recruit and hire SRE FedRAMP team members.
- Manage teams to exceed goals and drive success.
- Lead reliability projects: HA, BCP, disaster recovery, backup/restore, RTO, RPO, chaos engineering, uptime and performance.
- Own capacity management and planning, SLIs, SLOs, error budgets, and monitoring dashboards.
- Deploy and operate large-scale distributed data stores and streaming services.
- Establish design patterns for monitoring and benchmarking.
- Document production runbooks and developer guidelines.
- Implement tooling, toil reduction, runbooks & automation for production.
- Manage incidents and improve MTTD/MTTR.
- Optimize cloud costs.
Requirements
Must-Have:
- 8+ years of experience in handling large-scale cloud-native microservices platforms.
- 2+ years of hands-on management experience leading teams that deploy, operate, and monitor large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
- Experience with, and leading a team in, infrastructure automation and scripting using Python and/or Golang.
- Experience managing remote teams.
- Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
- Experience with deployment, operations, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, or Redis.
- Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
Preferred:
- Familiarity working with and/or managing in compliance environments such as HIPAA, GovCloud, State Government, Federal Government, SOC 2, or FedRAMP.
- AWS Solutions Architect certification.
- Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certification.
- Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
- Experience with CI/CD frameworks and Pipeline-as-Code tools such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
- Proven skills to effectively work across teams and functions to influence the design, operations, and deployment of highly available software.
- Bachelor's/Master's degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience.