Tech Stack
AWS, Cloud, Docker, Go, Grafana, Jenkins, Kubernetes, Prometheus, Python, Terraform
About the role
- Design, implement, and maintain secure, scalable cloud infrastructure for high-throughput, low-latency applications (mainly AWS, with flexibility for multi-cloud environments)
- Develop and enhance CI/CD pipelines for efficient, reliable, and consistent deployment processes (GitHub Actions and similar tools)
- Continuously identify and implement infrastructure improvements across the entire product ecosystem
- Architect and support serverless solutions using AWS Lambda, ECS Fargate, and event-driven architecture components
- Establish comprehensive monitoring, alerting, and logging frameworks to ensure system reliability and visibility
- Manage infrastructure-as-code implementations using CloudFormation, Terraform, and related tools
- Partner with development teams to determine infrastructure requirements, deployment approaches, and operational toolsets
- Oversee secrets management and identity systems using AWS IAM and similar platforms
- Maintain compliance with security and privacy standards, including access controls, encryption protocols, and audit mechanisms
- Resolve production incidents across all service layers with a focus on rapid response and thorough post-incident analysis
- Create and maintain automated backup, disaster recovery, and failover systems
- Research and adopt emerging DevOps methodologies and technologies to enhance platform performance and reliability
Requirements
- 10+ years of software engineering experience, including 6+ years focused on DevOps, Site Reliability Engineering, or Infrastructure Engineering
- Demonstrated experience managing production environments with comprehensive infrastructure responsibilities
- Extensive AWS expertise, including compute, storage, networking, and identity management services
- Practical experience with serverless technologies (AWS Lambda, Step Functions, EventBridge, API Gateway, ECS Fargate)
- Advanced skills in Docker and container orchestration platforms (Kubernetes, ECS, GKE)
- Proficiency with CI/CD platforms such as GitHub Actions, CircleCI, ArgoCD, or Jenkins
- Strong scripting capabilities in Bash, Python, or Go for automation and tooling development
- Experience with observability solutions (Datadog, Prometheus, Grafana, ELK stack)
- Solid understanding of network design, security frameworks, and zero-trust access architectures
- Knowledge of secrets management systems and infrastructure-level access policy enforcement
- Exceptional troubleshooting and root cause analysis abilities
- Strong collaborative and communication skills across diverse technical and business teams
- Continuous improvement mindset with a focus on automation, optimization, and security enhancement