Salary
💰 $234,000 - $300,000 per year
Tech Stack
Distributed Systems, Go, Python, Rust
About the role
- Design and prototype intelligent systems for AI-native observability, including cost-aware agent orchestration, adaptive query execution, and self-optimizing system components
- Lead efforts to apply reinforcement learning, search, or hybrid approaches to infrastructure-level decision-making (autoscaling, scheduling, load shaping)
- Collaborate with AI researchers and platform engineers to design experimentation loops and verifiers guiding LLM outputs using runtime metrics and formal models
- Explore emerging paradigms like AI compilers, “programming after code,” and runtime-aware prompt engineering to inform infrastructure and product design
- Help define the direction of BitsEvolve — Datadog’s optimization agent that uses LLMs and evolutionary search
- Partner with product teams and platform stakeholders to translate scientific advances into measurable improvements in cost, performance, and observability depth
Requirements
- BS/MS/PhD in a scientific field or equivalent experience
- 8+ years of experience in systems engineering, database internals, or infrastructure research, including hands-on experience in a production environment
- Strong software engineering foundation, ideally in C++, Rust, Go, or Python
- Deep expertise in at least one of: query optimization, data center scheduling, compiler design, reinforcement learning, or distributed systems design
- Experience applying search, planning, or learning techniques to solve real-world optimization problems
- Experience applying reinforcement learning, search, or hybrid approaches to infrastructure-level decision-making
- Hypothesis-driven: experience designing experiments, simulations, benchmarks, or live-system evaluation loops
- Comfortable reading research papers and building prototypes
- Ability to collaborate across research, engineering, and product teams