Salary
💰 CA$200,000 - CA$220,000 per year
Tech Stack
Ansible, AWS, Azure, Cloud, Docker, Elasticsearch, Google Cloud Platform, Grafana, Kubernetes, Logstash, Prometheus, Python, Terraform
About the role
- Design and maintain OpenTelemetry-based observability infrastructure for distributed AI systems and LLM applications
- Build and scale ELK stack deployments (Elasticsearch, Logstash, Kibana) for log aggregation, search, and visualization of AI application data
- Implement comprehensive tracing and monitoring solutions for LLM inference, RAG pipelines, and AI Agent workflows
- Develop and maintain data ingestion pipelines for processing high-volume telemetry data from AI applications
- Configure and optimize OpenSearch clusters for real-time analytics and trace reconstruction of conversational flows
- Deploy and manage LLM observability platforms like Langfuse, OpenLLMetry, and custom monitoring solutions
- Implement Infrastructure as Code (Terraform, CloudFormation) for reproducible observability and application stack deployments
- Build automated alerting and incident response systems for AI application performance and reliability
- Collaborate with engineering teams to instrument AI applications with proper telemetry and observability hooks
- Optimize data retention policies, indexing strategies, and query performance for large-scale observability data
Requirements
- 4+ years of DevOps/Infrastructure engineering experience with a focus on observability and monitoring
- Expert-level experience with OpenTelemetry implementation, configuration, and custom instrumentation
- Production experience with ELK stack (Elasticsearch, Logstash, Kibana) including cluster management and optimization
- Strong knowledge of distributed tracing, metrics collection, and log aggregation architectures
- Experience with container orchestration (Kubernetes, Docker) and cloud infrastructure (AWS/GCP/Azure)
- Proficiency with Infrastructure as Code tools (Terraform, Ansible, CloudFormation)
- Experience building high-throughput data ingestion pipelines and real-time analytics systems
- Strong scripting skills (Python, Bash/Sh) for automation and tooling
- Knowledge of observability best practices, SLI/SLO definitions, and incident response
- Experience with monitoring tools such as Prometheus, Grafana, or Datadog
Benefits
- Competitive salary
- Flexible work hours
- Professional development opportunities
- Remote work options
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
OpenTelemetry, ELK stack, Elasticsearch, Logstash, Kibana, Infrastructure as Code, Terraform, CloudFormation, Python, Bash
Soft skills
collaboration, automation, incident response