Tech Stack
Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform
About the role
- About Trase Systems: AI, Uncomplicated. Trase empowers enterprise leaders to harness the full potential of AI without the associated complexity and risks, offering an end-to-end solution for deploying, managing, and optimizing AI in the enterprise.
- Trase is at the forefront of AI Agent innovation, topping the Hugging Face GAIA Leaderboard for General AI Assistants, ahead of Google, Meta, Microsoft, and OpenAI. We are leveraging our cutting-edge technologies to develop mission-critical agentic applications in Healthcare, Oil & Gas, and National Security.
- About the Role: Build and maintain the resilient, scalable infrastructure that powers cutting-edge AI; ensure reliability and performance of complex, distributed systems; automate, monitor, and optimize the platforms that enable ML innovation.
- You will work closely with ML engineers, software engineers, and product teams to build and operate the infrastructure that runs our advanced AI agents and machine learning models.
- Responsibilities: Design, Build, and Maintain Core Infrastructure; Automate Everything; Ensure System Reliability and Performance; Manage ML Infrastructure and Pipelines; Incident Response and Post-Mortems; Implement and Enhance Observability; Capacity Planning and Cost Optimization; Foster a Culture of Reliability.
- Requirements: Proven SRE and DevOps Experience; Cloud Infrastructure Expertise; Proficiency in Infrastructure as Code; Containerization and Orchestration Mastery; Strong Programming and Scripting Skills; Experience with Monitoring and Observability Tools; CI/CD Pipeline Development; Excellent Problem-Solving and Communication Skills; Educational Background.
- Benefits: 100% employer-paid health care including medical, dental, and vision for you and your family; Paid maternity and paternity leave; Unlimited PTO; Educational reimbursements; Optional 401(k), FSA, and equity incentives; Mental health benefits through TARA Mind.
- Some travel is required.
- We are an Equal Opportunity Employer: You’ll receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability.
Requirements
- Proven SRE and DevOps Experience: Demonstrated experience in a Site Reliability Engineering or DevOps role, managing complex, large-scale production environments.
- Cloud Infrastructure Expertise: Hands-on experience with one or more major cloud platforms (GCP, AWS, Azure).
- Proficiency in Infrastructure as Code: Strong skills with IaC tools such as Terraform, Ansible, or CloudFormation.
- Containerization and Orchestration Mastery: Deep knowledge of Docker and Kubernetes, including experience deploying and managing containerized applications in production.
- Strong Programming and Scripting Skills: Proficiency in languages such as Python, with a focus on automation and building reliable software.
- Experience with Monitoring and Observability Tools: Expertise in setting up and using monitoring and logging systems like Prometheus, Grafana, or the ELK stack.
- CI/CD Pipeline Development: A strong background in building and managing CI/CD pipelines for both software applications and machine learning models.
- Excellent Problem-Solving and Communication Skills: The ability to troubleshoot complex issues across the stack and clearly communicate technical concepts to both technical and non-technical stakeholders.
- Educational Background: A Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field.