Trase

Senior Site Reliability Engineer

full-time

Location: 🇺🇸 United States

Job Level

Senior

Tech Stack

Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform

About the role

  • Design, Build, and Maintain Core Infrastructure: Architect and implement scalable, highly available, and secure infrastructure on cloud platforms (GCP, AWS, Azure) to support our AI-driven applications and services.
  • Automate Everything: Develop and maintain automation tools and frameworks to eliminate manual effort in deployment, configuration, and management of our production environment.
  • Ensure System Reliability and Performance: Establish and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our production systems. Proactively identify and resolve performance bottlenecks and availability issues.
  • Manage ML Infrastructure and Pipelines: Collaborate with ML engineers to build and maintain robust CI/CD pipelines for machine learning models, ensuring seamless training, deployment, and monitoring.
  • Incident Response and Post-Mortems: Lead incident response efforts to minimize downtime and conduct thorough post-incident reviews to identify root causes and implement preventative measures.
  • Implement and Enhance Observability: Deploy and manage comprehensive monitoring, logging, and tracing solutions (e.g., Prometheus, Grafana, ELK stack) to provide deep visibility into system health.
  • Capacity Planning and Cost Optimization: Forecast infrastructure needs and optimize resource utilization to ensure our platform can scale efficiently and cost-effectively.
  • Foster a Culture of Reliability: Champion SRE best practices across the engineering organization and mentor team members on reliability, performance, and scalability.

Requirements

  • Proven SRE and DevOps Experience: Demonstrated experience in a Site Reliability Engineering or DevOps role, managing complex, large-scale production environments.
  • Cloud Infrastructure Expertise: Hands-on experience with one or more major cloud platforms (GCP, AWS, Azure).
  • Proficiency in Infrastructure as Code: Strong skills with IaC tools such as Terraform, Ansible, or CloudFormation.
  • Containerization and Orchestration Mastery: Deep knowledge of Docker and Kubernetes, including experience deploying and managing containerized applications in production.
  • Strong Programming and Scripting Skills: Proficiency in languages such as Python, with a focus on automation and building reliable software.
  • Experience with Monitoring and Observability Tools: Expertise in setting up and using monitoring and logging systems like Prometheus, Grafana, or the ELK stack.
  • CI/CD Pipeline Development: A strong background in building and managing CI/CD pipelines for both software applications and machine learning models.
  • Excellent Problem-Solving and Communication Skills: The ability to troubleshoot complex issues across the stack and clearly communicate technical concepts to both technical and non-technical stakeholders.
  • Educational Background: A Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field.
Granicus

Senior Site Reliability Engineer – AWS, AI/ML, APM

Granicus
Senior · full-time · $80k–$100k / year · 🇺🇸 United States
Posted: 6 hours ago · Source: careers-granicus.icims.com
Ansible, AWS, Azure, Chef, Cloud, ElasticSearch, Go, Java, Linux, Logstash, Puppet, Python, +2 more
Northwoods

Senior DevOps Engineer

Northwoods
Senior · full-time · Ohio · 🇺🇸 United States
Posted: 7 hours ago · Source: teamnorthwoods.bamboohr.com
Ansible, AWS, Chef, Docker, EC2, Go, Graphite, Linux, Prometheus, Puppet, Python, Ruby, +2 more
Color

Staff Software Engineer, DevOps/SRE

Color
Lead · full-time · $195k–$250k / year · 🇺🇸 United States
Posted: 21 hours ago · Source: jobs.lever.co
AWS, Cloud, Kubernetes, Linux, Terraform
Abbott

DevOps Manager

Abbott
Senior · Lead · full-time · $97k–$195k / year · 🇺🇸 United States
Posted: 1 day ago · Source: abbott.wd5.myworkdayjobs.com
Azure, Cloud, Cyber Security, Jenkins
CACI International Inc

Senior Infrastructure – DevOps Engineer

CACI International Inc
Senior · full-time · $99k–$207k / year · Virginia · 🇺🇸 United States
Posted: 1 day ago · Source: caci.wd1.myworkdayjobs.com
Ansible, Chef, Cyber Security, Kubernetes, Node.js, Puppet, TypeScript