Tealium

Senior AI Observability Engineer

full-time

Location Type: Remote

Location: Remote • 🇺🇸 United States

Salary

💰 $165,000 - $200,000 per year

Job Level

Senior

Tech Stack

AWS, DynamoDB, Go, Grafana, Jenkins, Kubernetes, Prometheus, Python, Terraform

About the role

  • Lead end-to-end observability design for AI/ML features in production and internal usage (e.g., RAG, Copilots, LLM-enhanced customer experiences).
  • Instrument AI features in Tealium products (e.g., ML-powered segmentation, decisioning, or predictions) for latency, accuracy, drift, usage, and cost.
  • Implement monitoring and cost tracking for third-party AI services (OpenAI, Anthropic Claude, Amazon Q, etc.), including rate limiting, quota management, and failover strategies.
  • Build telemetry pipelines to track LLM request/response metrics, prompt engineering observability, token usage, hallucination detection, and failover.
  • Collaborate with data science and product teams to define and automate quality SLIs/SLOs for models.
  • Implement AI-aware tracing (e.g., OpenTelemetry + LangChain/LLM traces) into the broader observability stack.
  • Participate in on-call rotations and help triage AI-specific incidents related to model regressions, latency spikes, or API failures.
  • Automate validation pipelines to ensure AI features are robust across environments.
  • Establish dashboards and alerts for AI observability using tools like Datadog, Sumo Logic, Prometheus, OpenTelemetry, and Grafana.
  • Contribute to ethical AI monitoring practices: PII exposure detection, prompt abuse, fairness, and content compliance.
  • Help guide Tealium’s use of Generative AI developer tools (e.g., GitHub Copilot, Amazon Q Developer, Cursor) for coding efficiency and ensure telemetry around their use is captured appropriately.
  • Initial goal within 6 months: Establish baseline AI observability across 3+ production ML features.

Requirements

  • 6+ years in Site Reliability Engineering, Observability Engineering, or MLOps with a focus on production-grade AI/ML systems.
  • Deep experience in instrumenting AI pipelines (e.g., LLMs, recommender systems, ML APIs) for observability, including drift detection and cost tracking.
  • Familiarity with prompt engineering, embeddings, vector DBs (Neptune), and RAG-style architectures.
  • Hands-on experience with OpenTelemetry, Datadog, Sumo Logic, Prometheus, or similar.
  • Experience integrating observability into AI platforms: e.g., Bedrock, Neptune, LangChain, LlamaIndex, HuggingFace, SageMaker, etc.
  • Proficiency with Python, Go, or similar languages used in backend and ML infrastructure.
  • Familiarity with AWS services (especially those relevant to AI: SageMaker, Bedrock, Lambda, DynamoDB, etc.).
  • Experience deploying and observing third-party LLM APIs (OpenAI, Claude, Amazon Q).
  • Strong background in Infrastructure-as-Code (Terraform, ArgoCD) and CI/CD tooling (Jenkins, GitHub Actions).
  • Understanding of Kubernetes and container orchestration.
  • Experience with FinOps/cost optimization for AI workloads.
  • Strong understanding of ethical AI practices and responsible telemetry instrumentation.
  • Experience with data privacy and compliance.
  • Excellent collaboration skills and comfort leading across SRE, Data Engineering, and Product/ML teams.
  • Experience mentoring or leading technical initiatives.
  • Communication skills for explaining complex AI concepts to non-technical stakeholders.

Benefits

  • Employees are eligible to receive an annual bonus and stock options.
  • Employees and their families are eligible for medical, dental, vision, life, and disability insurance.
  • Employees have the option to enroll in our 401k plan and are eligible for company matching contributions.
  • Employees are eligible for flexible paid time-off and extended paid parental leave.
  • We offer 11 paid holidays annually.
  • We offer 15 hours of paid work time for volunteer activities and programs.
  • Sick leave accrues as follows:
    • Exempt CA employees (not including San Francisco), including NY: accrue 40 hours each year. Unused sick leave carries over into the next year. Employees cannot exceed 80 hours in a given year.
    • Exempt non-CA employees (not including NY), including SF: accrue 1 hour for every 30 hours worked. Cannot exceed 180 hours in the calendar year.
    • Non-exempt employees: accrue 1 hour for every 30 hours worked. Unused sick leave carries over to the next year. Not to exceed 108 hours in a calendar year.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
observability design, AI/ML features instrumentation, monitoring and cost tracking, telemetry pipelines, AI-aware tracing, validation pipelines automation, drift detection, Infrastructure-as-Code, CI/CD tooling, cost optimization
Soft skills
collaboration, mentoring, communication, leadership, problem-solving, triage, ethical AI practices, stakeholder engagement, technical initiative leadership, cross-team coordination