Senior Site Reliability Engineer

Artisight

full-time

Posted on: 8/28/2025

Location: 🇺🇸 United States

Job Level

Senior

Tech Stack

Ansible, AWS, Azure, Cloud, Django, DNS, Docker, Firewalls, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, TCP/IP, Terraform

About the role

Reliability & Support: Act as a primary point of contact for L2 support issues, troubleshooting complex problems across our stack and driving them to resolution. Implement permanent fixes and preventative measures to reduce recurrence.
Automation & Tooling: Design, develop, and implement automation solutions for common operational tasks, system provisioning, maintenance, and incident response. Reduce operational "toil" through smart tooling.
Observability & Monitoring: Lead the strategy and implementation of comprehensive monitoring, logging, and alerting systems. Enhance our observability stack to provide deep insights into system health and performance.
System Architecture & Design: Collaborate with development and product teams to design and build scalable, reliable, and secure infrastructure and applications. Provide SRE perspective on new features and architectural decisions.
Incident Management: Participate in on-call rotations, respond to incidents, perform root cause analyses, and implement post-incident actions to prevent future occurrences.
Mentorship & Leadership: Mentor junior SREs and other engineering team members, sharing best practices in reliability, operations, and software development. Potentially lead small projects or initiatives.
Performance Optimization: Identify and address performance bottlenecks across infrastructure and applications.
Documentation: Create and maintain thorough documentation for systems, processes, and playbooks.

Requirements

Expert-level proficiency in Python for scripting, automation, and tooling.
Experience with the Django framework is a strong plus.
7+ years of experience in a Site Reliability Engineering, DevOps, or similar role with a strong focus on system reliability and automation.
Deep understanding and extensive experience with Linux operating systems (Ubuntu preferred), including system administration, networking, and troubleshooting.
Extensive experience with containerization technologies, especially Docker.
Strong practical experience with container orchestration platforms, specifically Kubernetes, including deployment, management, and troubleshooting of clusters and applications.
Demonstrated experience in mentorship, team leadership, or technical management. You should be comfortable guiding, coaching, and developing less experienced engineers.
Methodical approach to problem-solving: Ability to systematically diagnose complex issues, analyze data, and propose effective solutions.
Self-motivated and proactive: Takes initiative, identifies areas for improvement, and drives projects to completion with minimal supervision.
Excellent communication skills: Ability to articulate complex technical concepts clearly to both technical and non-technical audiences, strong written communication, and ability to collaborate effectively across teams.
Experience with cloud platforms (e.g., AWS, GCP, Azure) is a significant advantage, with AWS preferred.
Experience with CI/CD pipelines and related tools.
Familiarity with infrastructure as code (e.g., Terraform, Ansible).
Understanding of networking concepts (TCP/IP, DNS, Load Balancing, Firewalls).
Experience with various monitoring and alerting tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, New Relic).