Salary
💰 $152,500 - $224,200 per year
Tech Stack
Airflow, Apache, AWS, Cloud, Distributed Systems, EC2, Go, Grafana, Java, Kafka, Kubernetes, Prometheus, Python, Terraform
About the role
- Design, build, and maintain infrastructure and scalable frameworks to support data ingestion, processing, and analysis.
- Collaborate with stakeholders, analysts, and product teams to understand business requirements and translate them into technical solutions.
- Architect and implement data streaming solutions using modern data technologies such as Kafka, AWS MSK, Terraform, Hive, Hudi, Presto, Airflow, and cloud-based services like AWS EKS, Lake Formation, Glue, and Athena.
- Design and implement frameworks and solutions for performance, reliability, and cost-efficiency.
- Ensure data quality, integrity, and security throughout the data lifecycle.
- Stay current with emerging big data technologies and best practices.
- Mentor early-career engineers and contribute to a culture of continuous learning and improvement.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with a focus on infrastructure or backend systems.
- Strong experience operating production systems, including operational management, scaling, partitioning strategies, and tuning for performance and reliability.
- Hands-on experience with Kubernetes (preferably EKS), including deploying and managing stateful services and operators in Kubernetes environments.
- Deep understanding of AWS cloud services, particularly those relevant to data infrastructure (e.g., EC2, EBS, S3, IAM, MSK, CloudWatch, VPC, ALB/NLB).
- Proficiency in infrastructure-as-code tools, such as Terraform or CloudFormation, for managing and automating infrastructure.
- Expertise in observability tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog) to monitor distributed systems and set up alerting for reliability and latency.
- Proficiency in at least one programming language (e.g., Go, Python, Java, or similar) for building automation, tooling, and contributing to platform services.
- Experience designing and implementing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations.
- Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.
- Proven track record of driving reliability improvements in high-scale, data-intensive systems and collaborating with platform and data engineering teams.
- Excellent problem-solving and analytical skills.
- Strong verbal & written communication skills, with the ability to work effectively in a cross-functional team environment.