Lead end-to-end observability design for AI/ML features in production and internal use (e.g., RAG, Copilots, LLM-enhanced customer experiences).
Instrument AI features in Tealium products (e.g., ML-powered segmentation, decisioning, or predictions) for latency, accuracy, drift, usage, and cost.
Implement monitoring and cost tracking for third-party AI services (OpenAI, Anthropic Claude, Amazon Q, etc.), including rate limiting, quota management, and failover strategies.
Build telemetry pipelines to track LLM request/response metrics, prompt engineering observability, token usage, hallucination detection, and failover.
Collaborate with data science and product teams to define and automate quality SLIs/SLOs for models.
Implement AI-aware tracing (e.g., OpenTelemetry + LangChain/LLM traces) into the broader observability stack.
Participate in on-call rotations and help triage AI-specific incidents related to model regressions, latency spikes, or API failures.
Automate validation pipelines to ensure AI features are robust across environments.
Establish dashboards and alerts for AI observability using tools like Datadog, Sumo Logic, Prometheus, OpenTelemetry, and Grafana.
Contribute to ethical AI monitoring practices: PII exposure detection, prompt abuse, fairness, and content compliance.
Help guide Tealium’s use of Generative AI developer tools (e.g., GitHub Copilot, Amazon Q Developer, Cursor) for coding efficiency and ensure telemetry around their use is captured appropriately.
Initial goal within 6 months: Establish baseline AI observability across 3+ production ML features.
Requirements
6+ years in Site Reliability Engineering, Observability Engineering, or ML Ops with a focus on production-grade AI/ML systems.
Deep experience in instrumenting AI pipelines (e.g., LLMs, recommender systems, ML APIs) for observability, including drift detection and cost tracking.
Familiarity with prompt engineering, embeddings, vector DBs (Neptune), and RAG-style architectures.
Hands-on experience with OpenTelemetry, Datadog, Sumo Logic, Prometheus, or similar.
Experience integrating observability into AI platforms, e.g., Bedrock, Neptune, LangChain, LlamaIndex, Hugging Face, SageMaker.
Proficiency with Python, Go, or similar languages used in backend and ML infrastructure.
Familiarity with AWS services (especially those relevant to AI: SageMaker, Bedrock, Lambda, DynamoDB, etc.).
Strong background in Infrastructure-as-Code (Terraform, ArgoCD) and CI/CD tooling (Jenkins, GitHub Actions).
Understanding of Kubernetes and container orchestration.
Experience with FinOps/cost optimization for AI workloads.
Strong understanding of ethical AI practices and responsible telemetry instrumentation.
Experience with data privacy and compliance.
Excellent collaboration skills and comfort leading across SRE, Data Engineering, and Product/ML teams.
Experience mentoring or leading technical initiatives.
Communication skills for explaining complex AI concepts to non-technical stakeholders.
Benefits
Employees are eligible to receive an annual bonus and stock options.
Employees and their families are eligible for medical, dental, vision, life, and disability insurance.
Employees have the option to enroll in our 401k plan and are eligible for company matching contributions.
Employees are eligible for flexible paid time-off and extended paid parental leave.
We offer 11 paid holidays annually.
We offer 15 hours of paid work time for volunteer activities and programs.
Sick leave accrues as follows:
Exempt CA employees (not including San Francisco), including NY: accrue 40 hours each year. Unused sick leave carries over into the next year. Employees cannot exceed 80 hours in a given year.
Exempt non-CA employees (not including NY), including SF: accrue 1 hour for every 30 hours worked. Cannot exceed 180 hours in the calendar year.
Non-exempt employees: accrue 1 hour for every 30 hours worked. Unused sick leave carries over to the next year. Not to exceed 108 hours in a calendar year.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
observability design, AI/ML features instrumentation, monitoring and cost tracking, telemetry pipelines, AI-aware tracing, validation pipelines automation, drift detection, Infrastructure-as-Code, CI/CD tooling, cost optimization
Soft skills
collaboration, mentoring, communication, leadership, problem-solving, triage, ethical AI practices, stakeholder engagement, technical initiative leadership, cross-team coordination