
Lead Observability Engineer
Vantage Data Centers
full-time
Location Type: Remote
Location: Colorado • United States
Salary
$135,000 - $145,000 per year
About the role
- Design and operate a scalable observability platform with a primary focus on the Elastic Stack (Elasticsearch, Logstash, Kibana)
- Build and maintain log ingestion and enrichment pipelines (Logstash) including parsing, normalization, and routing standards
- Create, curate, and govern Kibana assets (dashboards, visualizations, Lens, Discover views) that support operations and engineering use cases
- Define and implement metrics and alerting standards (SLIs/SLOs, thresholds, burn-rate alerts) to improve detection and reduce MTTR
- Develop observability metrics for Operational Technology (OT) environments (e.g., BMS/EPMS/SCADA and other OT telemetry) to track availability, performance, alarms, and operational KPIs
- Partner across teams to instrument services and infrastructure, troubleshoot incidents using telemetry, and continuously improve reliability, performance, and cost
- Engineer and operate Elasticsearch clusters (or Elastic Cloud) including sizing, scaling, sharding/ILM, retention, backup/restore, and performance tuning
- Develop and maintain Logstash pipelines (inputs/filters/outputs) to ingest logs/metrics from servers, network devices, virtualization platforms, containers, and cloud services
- Create telemetry standards: field naming conventions, ECS alignment where appropriate, parsing/grok patterns, enrichment lookups, and data quality checks
- Build and maintain Kibana dashboards, visualizations, and alerting rules; publish curated views for NOC/operations and engineering teams
- Create and operationalize metrics: define SLIs/SLOs, implement metric collection/export, and ensure actionable alerting with runbooks and escalation paths
- Partner with facilities/critical infrastructure teams to define OT-focused SLIs/SLOs and metrics (e.g., alarm rates, sensor health, control loop status, device/point availability), normalize and tag OT telemetry, and build dashboards/alerts that support 24x7 operations
- Automate configuration and deployment of observability components using infrastructure-as-code and configuration management (e.g., Terraform, Ansible) and CI/CD pipelines
- Implement security best practices for telemetry platforms including role-based access control, data handling/PII controls, encryption, and auditability
- Participate in incident response and post-incident reviews; use logs and metrics to identify root cause, document findings, and drive preventive improvements
Requirements
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience
- 7+ years of experience in observability, SRE, platform, or DevOps engineering roles, with demonstrated ownership of production monitoring/logging systems
- Deep, hands-on expertise with Elastic Stack: Elasticsearch (cluster operations and tuning), Logstash (pipeline engineering), and Kibana (dashboards and alerting)
- Strong experience creating and operationalizing metrics (collection, aggregation, cardinality management, alerting strategy) and defining SLIs/SLOs
- Proficiency in at least one scripting/programming language (Python, Ruby, Go, or PowerShell) for automation, data parsing, and platform tooling
- Strong knowledge of log and event data modeling, parsing (grok/regex), enrichment, and schema management (e.g., ECS), including troubleshooting ingestion issues end-to-end
- Experience with query languages and analysis techniques (KQL/Lucene, Elasticsearch DSL), and ability to build actionable visualizations and detections from telemetry
- Hands-on experience with index lifecycle management (ILM), data streams, retention policies, and capacity planning for high-volume telemetry workloads
- Experience with infrastructure-as-code and automation (e.g., Terraform, Ansible) and CI/CD practices to deploy and manage observability components
- Solid understanding of Linux systems, networking fundamentals, and distributed systems concepts as they relate to telemetry, performance, and troubleshooting
- Experience integrating telemetry from hybrid environments (data center infrastructure, virtualization, containers/Kubernetes, and cloud services)
- Familiarity with Operational Technology (OT) / Industrial Control Systems (ICS) observability concepts and common BMS/EPMS applications such as Inductive Automation Ignition
- Working knowledge of complementary observability tooling (e.g., Beats/Elastic Agent, Prometheus, Grafana, OpenTelemetry) and of how to integrate telemetry between systems in line with event management practices
- Experience operating services with on-call practices, incident management, and post-incident review processes; ability to write clear runbooks
- Strong understanding of reliability engineering concepts including observability design, alert fatigue reduction, and measuring user/system impact
- Experience working within ITIL or similar operational practices (change, incident, problem management)
- Familiarity with regulatory/compliance expectations and secure handling of operational data (e.g., GDPR, PCI, SOX) as applicable
- Excellent written and verbal communication skills; ability to translate telemetry needs into standards, dashboards, and alerts that teams adopt
- Ability to work independently and collaboratively across infrastructure, network, security, and application teams
Benefits
- Medical, dental, and vision coverage
- Life and AD&D insurance
- Short- and long-term disability coverage
- Paid time off
- Employee assistance program
- Participation in a 401(k) program that includes company match
- Many additional voluntary benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Elastic Stack, Elasticsearch, Logstash, Kibana, Python, Ruby, Go, PowerShell, infrastructure-as-code, CI/CD
Soft Skills
communication, collaboration, incident management, problem management, reliability engineering, ownership, troubleshooting, documentation, alert fatigue reduction, independence
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Information Technology, Bachelor’s degree in Engineering