Principal Site Reliability Engineer

SonicWall

SonicWall is hiring a Principal Site Reliability Engineer to own the reliability of its cloud-based services and drive SRE practices across engineering. The role leads incident response and observability work and mentors engineering teams.

Posted 5/21/2026 • full-time • Remote • 🇺🇸 United States • Lead

Tech Stack

Tools & technologies
AWS, Cloud, Distributed Systems, Docker, DynamoDB, Grafana, Kubernetes, MongoDB, Postgres, Prometheus, Python, Redis, Terraform

About the role

Key responsibilities & impact
  • Own the reliability, scalability, and operational excellence of our cloud-based services.
  • Define and enforce reliability standards.
  • Drive the adoption of SRE practices across engineering teams.
  • Build the systems and tooling that keep our production infrastructure healthy.
  • Define, publish, and continuously refine Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for all critical services, partnering with product and engineering leadership.
  • Own the error budget framework: track consumption, enforce error budget policies, and drive reliability investments when budgets are at risk.
  • Lead the design and implementation of comprehensive observability platforms — metrics, structured logging, and distributed tracing — to ensure full visibility into production systems.
  • Drive toil reduction initiatives by identifying and automating repetitive, manual operational work, targeting measurable reduction in operational burden across teams.
  • Design and execute chaos engineering programs to proactively uncover reliability weaknesses in our infrastructure and services before they impact customers.
  • Lead blameless postmortem culture: facilitate incident retrospectives, extract systemic learnings, and track corrective action items to completion.
  • Build and improve on-call incident response processes, runbooks, and escalation paths; manage and optimize on-call rotation health to prevent burnout.
  • Help design, build, and support infrastructure and security technologies within the cloud that offer resiliency, observability, and optimized cost.
  • Develop solutions for automated deployment of software and services on our production infrastructure hosted on AWS, applying reliability engineering principles throughout.
  • Shape how mission-critical enterprise software solutions are developed and deployed using optimized CI/CD pipelines that embed reliability and quality gates.
  • Develop management solutions for services across multiple cloud platforms and data centers, with a focus on fault tolerance and graceful degradation.
  • Collaborate with developers to bring new features and services into production using production-readiness reviews and launch checklists.
  • Champion reliability engineering best practices across the organization, embedding SRE principles into the software development lifecycle.
  • Mentor team members on SRE philosophy, technical decision-making, code reviews, and cloud engineering best practices.
  • Participate in roadmap planning, identify areas of improvement, and perform technology evaluation and selection.
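
For readers less familiar with the error budget framework named in the responsibilities above, here is a minimal sketch of how budget consumption might be tracked for a request-based availability SLO. It is an illustration only and does not describe SonicWall's tooling: the `ErrorBudget` class, the `budget_status` function, and the 0.75 "at risk" threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based availability SLO over one window.

    All names here are hypothetical and exist only for illustration.
    """

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLI in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # Fraction of the budget already spent; 1.0 or more means it is used up.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / self.allowed_failures


def budget_status(budget: ErrorBudget, at_risk_threshold: float = 0.75) -> str:
    """Classify the budget; the 0.75 threshold is an assumed policy value."""
    consumed = budget.consumed_fraction
    if consumed >= 1.0:
        return "exhausted"
    if consumed >= at_risk_threshold:
        return "at_risk"
    return "healthy"


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=800)
    print(budget_status(window))
```

In a policy of this shape, an "at_risk" result would be the trigger for shifting effort toward reliability work, and "exhausted" for stricter measures such as a release freeze; the actual thresholds and responses are a decision for each organization.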

Requirements

What you’ll need
  • 7+ years of experience in scalable, distributed systems architecture.
  • 3+ years of hands-on Site Reliability Engineering experience, including ownership of SLOs and error budget management.
  • 4+ years of experience with cloud platforms, including AWS.
  • 4+ years of experience in infrastructure as code (Terraform, AWS CDK).
  • 5+ years of experience in scripting using Python, Shell, or a similar language.
  • 3+ years of experience with containerization technologies, including Docker.
  • 4+ years of experience with orchestration technologies, including Kubernetes.
  • Demonstrated experience designing and operating observability stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, or equivalent).
  • Experience with incident management platforms and on-call tooling (e.g., PagerDuty, OpsGenie).
  • Experience defining and implementing automated service deployments, including provisions for networking, security, reliability, management, reporting, and configuration management.
  • Experience with chaos engineering principles and tools (e.g., Chaos Monkey, LitmusChaos, Gremlin, or equivalent).
  • Experience managing databases — PostgreSQL, Redis, DynamoDB, MongoDB.
  • In-depth understanding of best practices for deployment automation and production-readiness reviews.
  • Experience using Git in a team environment (merge requests, branching, pushes, and pulls).
  • CS degree or equivalent experience.

Benefits

Comp & perks
  • Health insurance
  • Professional development opportunities
