Site Reliability Engineer 3

NBCUniversal

full-time

Posted on: 9/24/2025

Origin: 🇺🇸 United States (Illinois, Virginia)

💰 $99,601 - $149,401 per year

Mid-Level, Senior

Ansible, AWS, Azure, Cassandra, Cloud, Docker, Go, Google Cloud Platform, Grafana, Hadoop, HDFS, Java, Kafka, Kubernetes, MySQL, NoSQL, Postgres, Prometheus, Python, Scala, Spark, Terraform

About the role

Design and implement monitoring and alerting systems to ensure the stability, reliability, and performance of data platforms.
Join on-call shifts to respond to and resolve issues quickly.
Develop and maintain automation tools and scripts for deployment, monitoring, backup and disaster recovery.
Analyze and optimize data storage, query performance, and data flows to process large-scale datasets efficiently, reduce latency, and improve processing speed.
Respond quickly to platform failures, perform troubleshooting, and coordinate cross-team efforts to resolve issues and ensure high availability and reliability.
Work with engineering teams to analyze and forecast capacity requirements and scale infrastructure accordingly.
Support FreeWheel-powered live events.
Document the architecture, configurations, and operational procedures for platforms and provide relevant training.
Ensure platforms meet security standards and compliance requirements.
Collaborate with engineering, product, and project management teams to support product design and implementation and solve reliability-related issues.

At least 3 years of experience as an SRE, DevOps or Operations Engineer.
Relevant work experience: 5-7 years.
Experience with cloud platforms (e.g. AWS, OCI, GCP, Azure).
Hands-on experience with Terraform and infrastructure-as-code principles.
Proficiency in automation tools and frameworks (e.g. Ansible, Terraform, Kubernetes, Docker).
Familiarity with modern data architectures and technologies, including big data platforms (e.g. Kafka, Hadoop, Spark) and distributed storage (e.g. Cassandra, HDFS, AWS S3).
Extensive experience in database management (e.g. NoSQL databases, MySQL, PostgreSQL).
Proficient in at least one programming language such as Python, Go, Java, or Scala.
Familiarity with monitoring and log management tools such as Prometheus, Grafana, and the ELK Stack.
Strong debugging and troubleshooting skills, with the ability to quickly identify and resolve production issues.
Excellent communication skills; ability to convey technical information clearly to technical and non-technical stakeholders.
Proactive learner eager to grow in operations and governance.
Bachelor’s degree or higher in Computer Science, Software Engineering, or a related field.
Willingness to join on-call shifts and support FreeWheel-powered live events.