Tech Stack
AWS, Azure, Cloud, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Splunk
About the role
- Lead the design, implementation, and continuous improvement of the observability stack, including monitoring, logging, and tracing systems.
- Define and enforce observability standards and best practices across engineering teams to ensure consistent instrumentation and visibility.
- Build scalable monitoring solutions that provide real-time insights into system health, performance, and availability.
- Develop and maintain dashboards, alerts, and automated responses to proactively detect and resolve issues before they impact users.
- Collaborate with development, infrastructure, and SRE teams to integrate observability into CI/CD pipelines and production workflows.
- Conduct root cause analysis and post-incident reviews to identify observability gaps and drive improvements.
- Evaluate and implement tools such as Splunk, Splunk Observability Cloud, and Netreo to support monitoring and alerting needs.
- Champion a culture of data-driven decision-making by enabling teams to access and interpret observability data effectively.
- Automate observability pipelines and alerting mechanisms.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Observability roles.
- 3+ years of experience in SRE/DevOps.
- Demonstrated success in deploying and managing monitoring tools and observability solutions at scale.
- Hands-on experience with monitoring and observability platforms such as Splunk, Splunk Observability Cloud (O11y), Grafana, Prometheus, and Datadog.
- Proven ability to design and implement SLOs/SLIs, dashboards, and alerting strategies that align with business and operational goals.
- Familiarity with incident response, alert tuning, and postmortem analysis.
- Strong scripting or programming skills (e.g., Python, Go, Bash).
- Excellent communication and collaboration skills, with a focus on knowledge sharing and mentorship.
- Strong understanding of distributed tracing tools like OpenTelemetry, Jaeger, or Zipkin.
- Experience integrating observability into CI/CD pipelines and Kubernetes environments.
- Contributions to open-source observability tools or frameworks.
- Strong knowledge of cloud platforms (AWS, Azure, or GCP).