Tech Stack
AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Rust
About the role
- Design, develop, and maintain advanced test automation frameworks that incorporate chaos engineering principles
- Create and execute chaos experiments that simulate various failure modes and edge cases in our distributed systems
- Implement monitoring solutions that effectively track system performance, resilience, and failure recovery
- Establish observability practices that provide deep insights into system behavior during chaos experiments
- Collaborate with development teams to build resilience into our applications from the ground up
- Develop metrics and dashboards to visualize system reliability and the impact of chaos experiments
- Lead post-mortem analyses to identify system weaknesses discovered through chaos testing
- Integrate chaos testing into CI/CD pipelines to validate system resilience continuously
- Mentor engineers through code reviews, technical sessions, and hands-on guidance in test automation, chaos engineering, and monitoring best practices
- Contribute to the company's overall testing strategy and quality assurance practices
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field
- 5+ years of experience in software testing and quality assurance, with at least 2 years focused on chaos engineering
- Strong programming skills in languages such as Python, Go, and/or Rust
- Experience with chaos engineering tools such as Chaos Monkey, Gremlin, or similar frameworks
- In-depth knowledge of monitoring systems like Prometheus, Grafana, ELK Stack, or similar tools
- Experience implementing observability practices (metrics, logging, tracing) in distributed systems
- Familiarity with container orchestration platforms like Kubernetes and related chaos tools
- Experience with SRE practices and principles
- Strong understanding of CI/CD pipelines and how to integrate testing workflows
- Experience with cloud platforms (AWS, GCP, Azure) and their monitoring capabilities
- Excellent communication skills with the ability to present technical findings to various stakeholders
- Master's degree in Computer Science, Engineering, or a related field (preferred)
- Knowledge of statistical analysis for evaluating test results and system performance (preferred)
- Experience with distributed systems and microservice architectures (preferred)
- Contributions to open-source testing or chaos engineering projects (preferred)
- Familiarity with AI/ML systems and their unique testing challenges (preferred)
- Relevant certifications in cloud platforms, testing methodologies, or chaos engineering (preferred)