Granicus

Site Reliability Engineer 3

full-time

Origin: 🇺🇸 United States

Job Level

Mid-Level, Senior

Tech Stack

Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, Docker, Go, Grafana, Java, Kubernetes, Linux, NoSQL, Prometheus, Puppet, Python, Ruby, Splunk, SQL, Terraform, Unix

About the role

  • On-call Production Support: Provide production support in shifts according to the team's on-call roster.
  • Work on tickets raised by customers and by internal engineering/implementation teams while not on call for production support. For example, a client may request a data correction on the database server that cannot be made through the web interface.
  • Work on SRE backlog items.
  • Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability.
  • Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
  • Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
  • System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
  • Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
  • Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
  • Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
  • Security: Implement and adhere to security best practices to protect our systems and data.

Requirements

  • Technical Skills: Good understanding of Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud). Experience with scripting languages such as Python, Bash, or Ruby.
  • Education: Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
  • Experience: 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
  • Technical Skills: Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud). Proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
  • Tools and Technologies: Advanced knowledge of monitoring and logging tools (Prometheus, Grafana, Splunk), configuration management (Ansible, Chef, Puppet), and CI/CD pipelines.
  • Problem-Solving: Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently.
  • Communication: Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
  • Leadership: Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives.
  • 5+ years of experience in an SRE, DevOps, or Software Engineering role.
  • Certifications: Relevant certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or similar.
  • Knowledge: In-depth understanding of containerization (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation).
  • Experience: Hands-on work with database management (SQL, NoSQL), load balancing, and distributed systems.