
Senior DevOps / SRE Engineer
MLabs
full-time
Location Type: Remote
Location: United States
Salary
💰 $120,000 - $150,000 per year
About the role
- Build and maintain the infrastructure for concurrent AI trading agents, managing complex cron schedules, state files, and trailing stop processes.
- Deploy and manage agent environments, including workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity.
- Design and operate pipelines for shipping trading skills and plugins to production without interrupting live trading activity.
- Execute deployment strategies (blue/green, canary) that keep active financial positions protected during every infrastructure change.
- Build comprehensive alerting across the full stack using metrics, logs, and traces to detect agent failures, state file corruption, or infrastructure regressions before financial loss occurs.
- Operate and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka.
- Maintain blockchain node infrastructure and ensure stable connectivity to exchange APIs and on-chain transaction systems.
- Lead incident response and on-call practices, including debugging, mitigation, and post-mortems to improve long-term platform reliability.
Requirements
- Extensive experience in DevOps, SRE, or Infrastructure Engineering, preferably within a startup environment where systems were built from the ground up.
- Proven track record of deploying, scaling, and debugging production workloads, specifically on AWS EKS.
- Proficiency with infrastructure-as-code and configuration management tools such as Terraform, Ansible, or equivalents.
- Hands-on experience with Docker and Helm for packaging production services.
- Experience operating production-grade data and messaging systems (Redis, Postgres/RDS, ClickHouse, Kafka).
- Strong experience with Prometheus, Grafana, Datadog, Loki, or OpenTelemetry to build proactive operational visibility.
- Ability to debug across multiple languages, including Python, Node.js, and Go.
- Understanding of systems where latency and reliability have direct financial consequences.
- Familiarity with blockchain node infrastructure, exchange APIs, wallet operations, and on-chain monitoring.
- Experience managing secrets, access controls, and production hardening for sensitive financial environments.
- Experience defining SLOs and building mature on-call practices.
Benefits
- Opportunity to build infrastructure for a new category of software (Autonomous AI Agents).
- High-autonomy environment with a focus on engineering excellence and technical ownership.
- Competitive compensation package commensurate with senior-level experience.
- Remote-first or flexible working arrangements (as specified by the client).
Applicant Tracking System Keywords
Hard Skills & Tools
DevOps, SRE, Infrastructure Engineering, AWS EKS, Docker, Helm, Terraform, Ansible, Python, Node.js
Soft Skills
incident response, debugging, mitigation, post-mortems, proactive operational visibility