Salary
💰 $187,000 - $240,000 per year
Tech Stack
Distributed Systems, Go, Python, Rust
About the role
- Design and prototype intelligent systems for AI-native observability, including cost-aware agent orchestration, adaptive query execution, and self-optimizing system components.
- Apply reinforcement learning, search, or hybrid approaches to infrastructure-level decision-making, such as autoscaling, scheduling, or load shaping.
- Collaborate with AI researchers and platform engineers to design experimentation loops and verifiers that guide LLM outputs using runtime metrics and formal models.
- Explore emerging paradigms like AI compilers, “programming after code,” and runtime-aware prompt engineering to inform Datadog’s infrastructure and product design.
- Help define the direction of BitsEvolve, Datadog’s optimization agent that uses LLMs and evolutionary search to discover code improvements, optimize GPU kernels, and tune configurations for better performance.
- Partner with product teams and platform stakeholders to ensure scientific advances translate into measurable improvements in cost, performance, and observability depth.
- Join the team evolving observability infrastructure for stochastic, self-improving systems and help build an intelligent control plane for production.
Requirements
- You have a BS/MS/PhD in a scientific field or equivalent experience
- You have 8+ years of experience in systems engineering, database internals, or infrastructure research, including hands-on experience in a production environment
- You have a strong software engineering foundation, ideally in C++, Rust, Go, or Python, and are comfortable writing performant, maintainable code
- You have deep expertise in at least one of the following areas: query optimization, data center scheduling, compiler design, reinforcement learning, or distributed systems design
- You have experience applying search, planning, or learning techniques to solve real-world optimization problems
- You are excited by systems that learn, adapt, and improve over time using feedback from runtime metrics and human-defined objectives
- You are hypothesis-driven and enjoy designing experiments and evaluation loops, whether through simulations, benchmarks, or live systems
- You thrive in ambiguity, enjoy reading papers and building prototypes, and want to help shape the future of infrastructure in the AI era
- You enjoy collaborating across research, engineering, and product to bring scientific insights to practical outcomes