Tech Stack
Airflow, Ansible, AWS, Cloud, Docker, Flux, Go, Google Cloud Platform, Grafana, Java, Jenkins, Kubernetes, MongoDB, MySQL, NGINX, Prometheus, Python, Redis, Rust, SaltStack, Spark, Terraform
About the role
- Build and maintain the internal Platform-as-a-Service running on Kubernetes
- Contribute to the maintenance and evolution of the CD platform based on FluxCD
- Ensure platform reliability and performance through testing, automation and monitoring
- Participate in the on-call program, incident management, and production troubleshooting
- Ensure optimal use of Kubernetes clusters in AWS (performance and FinOps/cost effectiveness)
- Ensure a high level of platform security
- Maintain and improve the CD platform and delivery workflows
- Share best practices and collaborate with developers to understand requirements
- Provide guidance on MLOps and data architectures (GCP, Airflow, GPUs)
- Maintain detailed documentation of engineering processes, infrastructure, and troubleshooting procedures
- Stay up-to-date with industry trends and drive continuous improvement
Requirements
- 5+ years of experience in DevOps
- Proficiency in containerization and orchestration (Docker, Kubernetes, Helm)
- Proficiency with Infrastructure as Code (Terraform)
- Strong expertise with cloud platforms (AWS, GCP)
- Strong expertise in CD pipelines (Flux)
- Strong expertise in monitoring and observability (Prometheus, Grafana, Datadog, Looker) and FinOps best practices
- Experience with data lifecycle management (storage policies, cost optimization, security, encryption)
- Experience with data pipelines for ML models (Airflow, Dataflow, Kestra)
- Experience with ML infrastructure optimization (GPUs, TPUs, Inferentia)
- Experience with CI pipelines (Tekton, Jenkins, Jx3)
- Strong programming skills in Python and experience in Bash scripting
- Knowledge of Go or Rust is a plus
- Strong collaboration and communication skills
- Ability to drive innovation and advocate for DevOps best practices
- Excellent English communication skills