Tech Stack
AWS, Grafana, Kubernetes, Prometheus, Splunk
About the role
- Hunt down and decommission all high-cardinality custom metrics
- Audit the log ingestion for every service
- Analyze our APM and trace ingestion
- Use automation to enforce cost-saving policies
- Create internal documentation and run "Datadog Dojo" workshops
Requirements
- 3+ years as an Infrastructure, DevOps, or Site Reliability Engineer
- Expert-level knowledge of Datadog's pricing model and platform architecture
- Deep proficiency with AWS and Kubernetes
- Strong programming skills for infrastructure automation
- The courage to tell a founder or principal engineer that their favorite metric is financially irresponsible
- Experience with other monitoring/observability tools (Prometheus, Grafana, Honeycomb, Splunk)
- Experience implementing OpenTelemetry standards and agents
Benefits
- Competitive total rewards package
- Market-benchmarked salary and equity
- Comprehensive health benefits
- Flexible paid time off
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
AWS, Kubernetes, infrastructure automation, OpenTelemetry, Datadog, Prometheus, Grafana, Honeycomb, Splunk
Soft skills
communication, courage, analytical skills