Senior Staff Production Engineer – Cloud Platform, Reliability, Machine Identity Security

Palo Alto Networks

The Senior Staff Production Engineer at Palo Alto Networks is responsible for cloud platform reliability and operational excellence, leading design and improvements across production environments while mentoring engineers.

Posted 6/11/2026 • Full-time • Santa Clara, California, United States • Senior • $126,000 - $203,500 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

cloud infrastructure, Infrastructure as Code, CI/CD pipelines, Kubernetes, Terraform, Ansible, Jenkins, Python, Go, Linux systems

Soft Skills

problem-solving, mentoring, collaboration, incident response, root cause analysis, automation, communication, leadership, organizational skills, scalability

Tools & Technologies

AWS, Azure, GCP, EKS, AKS, GKE, GitLab CI, ArgoCD, GitHub Actions, VPC

Certifications & Qualifications

CKA, CKAD, AWS Solutions Architect, Azure Administrator

Industry Keywords

DevOps, Platform Engineering, Site Reliability Engineering, cloud-native design patterns, monitoring, alerting, observability, high-scale production environments, DevSecOps, operational overhead

Tech Stack

Tools & technologies

Ansible, AWS, Azure, Cloud, Distributed Systems, DNS, Go, Google Cloud Platform, Jenkins, Kubernetes, Linux, Python, TCP/IP, Terraform

About the role

Key responsibilities & impact

Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
Lead improvements across production systems, including performance, availability, and incident response
Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
Partner with development teams to improve system reliability, observability, and cloud-native design patterns
Define and implement monitoring, alerting, and observability strategies across distributed systems
Lead incident response efforts, including root cause analysis and long-term remediation strategies
Identify and eliminate operational toil through automation and system improvements
Mentor engineers and contribute to raising the bar for production engineering practices

Requirements

What you’ll need

5+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions)
Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
Strong problem-solving skills and ability to work across teams
Nice to have: experience implementing DevSecOps practices in cloud environments; professional certifications (CKA/CKAD, AWS Solutions Architect, Azure Administrator)

Benefits

Comp & perks

Employee benefits may include restricted stock units and a bonus.