Artisight

Senior Site Reliability Engineer

Full-time

Origin: 🇺🇸 United States

Job Level

Senior

Tech Stack

Ansible, AWS, Azure, Cloud, Django, DNS, Docker, Firewalls, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, TCP/IP, Terraform

About the role

  • Reliability & Support: Act as a primary point of contact for L2 support issues, troubleshooting complex problems across our stack and driving them to resolution. Implement permanent fixes and preventative measures to reduce recurrence.
  • Automation & Tooling: Design, develop, and implement automation solutions for common operational tasks, system provisioning, maintenance, and incident response. Reduce operational "toil" through smart tooling.
  • Observability & Monitoring: Lead the strategy and implementation of comprehensive monitoring, logging, and alerting systems. Enhance our observability stack to provide deep insights into system health and performance.
  • System Architecture & Design: Collaborate with development and product teams to design and build scalable, reliable, and secure infrastructure and applications. Provide SRE perspective on new features and architectural decisions.
  • Incident Management: Participate in on-call rotations, respond to incidents, perform root cause analyses, and implement post-incident actions to prevent future occurrences.
  • Mentorship & Leadership: Mentor junior SREs and other engineering team members, sharing best practices in reliability, operations, and software development. Potentially lead small projects or initiatives.
  • Performance Optimization: Identify and address performance bottlenecks across infrastructure and applications.
  • Documentation: Create and maintain thorough documentation for systems, processes, and playbooks.

Requirements

  • Expert-level proficiency in Python for scripting, automation, and tooling.
  • Experience with the Django framework is a strong plus.
  • 7+ years of experience in a Site Reliability Engineering, DevOps, or similar role with a strong focus on system reliability and automation.
  • Deep understanding of and extensive experience with Linux operating systems (Ubuntu preferred), including system administration, networking, and troubleshooting.
  • Extensive experience with containerization technologies, especially Docker.
  • Strong practical experience with container orchestration platforms, specifically Kubernetes, including deployment, management, and troubleshooting of clusters and applications.
  • Demonstrated experience in mentorship, team leadership, or technical management. You should be comfortable guiding, coaching, and developing less experienced engineers.
  • Methodical approach to problem-solving: Ability to systematically diagnose complex issues, analyze data, and propose effective solutions.
  • Self-motivated and proactive: Takes initiative, identifies areas for improvement, and drives projects to completion with minimal supervision.
  • Excellent communication skills: Ability to articulate complex technical concepts clearly to both technical and non-technical audiences, strong written communication, and ability to collaborate effectively across teams.
  • Experience with cloud platforms (e.g., AWS, GCP, Azure) is a significant advantage, with AWS preferred.
  • Experience with CI/CD pipelines and related tools.
  • Familiarity with infrastructure as code (e.g., Terraform, Ansible).
  • Understanding of networking concepts (TCP/IP, DNS, Load Balancing, Firewalls).
  • Experience with various monitoring and alerting tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, New Relic).