Site Reliability Engineer 3

Granicus

full-time

Posted on: 8/16/2025

Location: 🇺🇸 United States

Job Level

Mid-Level, Senior

Tech Stack

Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, Docker, Go, Grafana, Java, Kubernetes, Linux, NoSQL, Prometheus, Puppet, Python, Ruby, Splunk, SQL, Terraform, Unix

About the role

On-call Production Support: Provide production support in shifts according to the team's on-call roster.
Ticket Work: When not on call for production support, work on tickets raised by customers and by internal engineering and implementation teams. For example, a client may request a data correction on the database server that cannot be made through the web interface.
Backlog: Work on SRE backlog items.
Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability.
Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
Security: Implement and adhere to security best practices to protect our systems and data.

Requirements

Technical Skills: Good understanding of Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud). Experience with scripting languages such as Python, Bash, or Ruby.
Education: Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
Experience: 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
Technical Skills: Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud). Proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
Tools and Technologies: Advanced knowledge of monitoring and logging tools (Prometheus, Grafana, Splunk), configuration management (Ansible, Chef, Puppet), and CI/CD pipelines.
Problem-Solving: Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently.
Communication: Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
Leadership: Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives.
Experience: 5+ years of experience in an SRE, DevOps, or Software Engineering role.
Certifications: Relevant certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or similar.
Knowledge: In-depth understanding of containerization (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation).
Experience: Experience with database management (SQL, NoSQL), load balancing, and distributed systems.
