Salary
💰 $99,601 - $149,401 per year
Tech Stack
Ansible, AWS, Azure, Cassandra, Cloud, Docker, Go, Google Cloud Platform, Grafana, Hadoop, HDFS, Java, Kafka, Kubernetes, MySQL, NoSQL, Postgres, Prometheus, Python, Scala, Spark, Terraform
About the role
- Design and implement monitoring and alerting systems to ensure the stability, reliability, and performance of data platforms.
- Join on-call shifts to respond to and resolve issues quickly.
- Develop and maintain automation tools and scripts for deployment, monitoring, backup, and disaster recovery.
- Analyze and optimize data storage, query performance, and data flows to process large-scale datasets efficiently, reduce latency, and improve processing speed.
- Respond quickly to platform failures, perform troubleshooting, and coordinate cross-team efforts to resolve issues and ensure high availability and reliability.
- Work with engineering teams to analyze and forecast capacity requirements and scale infrastructure accordingly.
- Support FreeWheel-powered Live events.
- Document the architecture, configurations, and operational procedures for platforms and provide relevant training.
- Ensure platforms meet security standards and compliance requirements.
- Collaborate with engineering, product, and project management teams to support product design and implementation and solve reliability-related issues.
Requirements
- At least 3 years of experience as an SRE, DevOps, or Operations Engineer.
- 5-7 years of relevant work experience.
- Experience with cloud platforms (e.g. AWS, OCI, GCP, Azure).
- Hands-on experience with Terraform and infrastructure-as-code principles.
- Proficiency in automation tools and frameworks (e.g. Ansible, Terraform, Kubernetes, Docker).
- Familiarity with modern data architectures and technologies, including big data platforms (e.g. Kafka, Hadoop, Spark) and distributed storage (e.g. Cassandra, HDFS, AWS S3).
- Extensive experience in database management (e.g. NoSQL databases, MySQL, PostgreSQL).
- Proficient in at least one programming language such as Python, Go, Java, or Scala.
- Familiarity with monitoring and log management tools such as Prometheus, Grafana, and the ELK Stack.
- Strong debugging and troubleshooting skills, with the ability to quickly identify and resolve production issues.
- Excellent communication skills; ability to convey technical information clearly to technical and non-technical stakeholders.
- Proactive learner eager to grow in operations and governance.
- Bachelor’s degree or higher in Computer Science, Software Engineering, or a related field.
- Willingness to join on-call shifts and support FreeWheel-powered Live events.