Vantage Data Centers

Lead Observability Engineer

full-time

Location Type: Remote

Location: Colorado, United States

Salary

💰 $135,000 - $145,000 per year

About the role

  • Design and operate a scalable observability platform with a primary focus on the Elastic Stack (Elasticsearch, Logstash, Kibana)
  • Build and maintain log ingestion and enrichment pipelines (Logstash) including parsing, normalization, and routing standards
  • Create, curate, and govern Kibana assets (dashboards, visualizations, Lens, Discover views) that support operations and engineering use cases
  • Define and implement metrics and alerting standards (SLIs/SLOs, thresholds, burn-rate alerts) to improve detection and reduce MTTR
  • Develop observability metrics for Operational Technology (OT) environments (e.g., BMS/EPMS/SCADA and other OT telemetry) to track availability, performance, alarms, and operational KPIs
  • Partner across teams to instrument services and infrastructure, troubleshoot incidents using telemetry, and continuously improve reliability, performance, and cost
  • Engineer and operate Elasticsearch clusters (or Elastic Cloud) including sizing, scaling, sharding/ILM, retention, backup/restore, and performance tuning
  • Develop and maintain Logstash pipelines (inputs/filters/outputs) to ingest logs/metrics from servers, network devices, virtualization platforms, containers, and cloud services
  • Create telemetry standards: field naming conventions, ECS alignment where appropriate, parsing/grok patterns, enrichment lookups, and data quality checks
  • Build and maintain Kibana dashboards, visualizations, and alerting rules; publish curated views for NOC/operations and engineering teams
  • Create and operationalize metrics: define SLIs/SLOs, implement metric collection/export, and ensure actionable alerting with runbooks and escalation paths
  • Partner with facilities/critical infrastructure teams to define OT-focused SLIs/SLOs and metrics (e.g., alarm rates, sensor health, control loop status, device/point availability), normalize and tag OT telemetry, and build dashboards/alerts that support 24x7 operations
  • Automate configuration and deployment of observability components using infrastructure-as-code and configuration management (e.g., Terraform, Ansible) and CI/CD pipelines
  • Implement security best practices for telemetry platforms including role-based access control, data handling/PII controls, encryption, and auditability
  • Participate in incident response and post-incident reviews; use logs and metrics to identify root cause, document findings, and drive preventive improvements

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or equivalent practical experience
  • 7+ years of experience in observability, SRE, platform, or DevOps engineering roles, with demonstrated ownership of production monitoring/logging systems
  • Deep, hands-on expertise with Elastic Stack: Elasticsearch (cluster operations and tuning), Logstash (pipeline engineering), and Kibana (dashboards and alerting)
  • Strong experience creating and operationalizing metrics (collection, aggregation, cardinality management, alerting strategy) and defining SLIs/SLOs
  • Proficiency in at least one scripting/programming language (Python, Ruby, Go, or PowerShell) for automation, data parsing, and platform tooling
  • Strong knowledge of log and event data modeling, parsing (grok/regex), enrichment, and schema management (e.g., ECS), including troubleshooting ingestion issues end-to-end
  • Experience with query languages and analysis techniques (KQL/Lucene, Elasticsearch DSL), and ability to build actionable visualizations and detections from telemetry
  • Hands-on experience with index lifecycle management (ILM), data streams, retention policies, and capacity planning for high-volume telemetry workloads
  • Experience with infrastructure-as-code and automation (e.g., Terraform, Ansible) and CI/CD practices to deploy and manage observability components
  • Solid understanding of Linux systems, networking fundamentals, and distributed systems concepts as they relate to telemetry, performance, and troubleshooting
  • Experience integrating telemetry from hybrid environments (data center infrastructure, virtualization, containers/Kubernetes, and cloud services)
  • Familiarity with Operational Technology (OT) / Industrial Control Systems (ICS) observability concepts and common BMS/EPMS applications such as Inductive Automation Ignition
  • Working knowledge of complementary observability tooling (e.g., Beats/Elastic Agent, Prometheus, Grafana, OpenTelemetry) and of how to integrate telemetry between systems in line with event management practices
  • Experience operating services with on-call practices, incident management, and post-incident review processes; ability to write clear runbooks
  • Strong understanding of reliability engineering concepts including observability design, alert fatigue reduction, and measuring user/system impact
  • Experience working within ITIL or similar operational practices (change, incident, problem management)
  • Familiarity with regulatory/compliance expectations and secure handling of operational data (e.g., GDPR, PCI, SOX) as applicable
  • Excellent written and verbal communication skills; ability to translate telemetry needs into standards, dashboards, and alerts that teams adopt
  • Ability to work independently and collaboratively across infrastructure, network, security, and application teams

Benefits

  • medical, dental, and vision coverage
  • life and AD&D insurance
  • short- and long-term disability coverage
  • paid time off
  • employee assistance program
  • participation in a 401(k) program that includes company match
  • many additional voluntary benefits