
Intermediate Site Reliability Engineer, Database Operations
GitLab
full-time
Location Type: Remote
Location: Remote • 🇪🇺 Anywhere in Europe
Job Level
Mid-Level, Senior
Tech Stack
Ansible, Chef, Distributed Systems, Go, Kubernetes, Postgres, Puppet, Ruby, SQL, Terraform
About the role
- Automate every operational task, which is a core requirement for this role: for example, package updates, configuration changes across all environments, and tools for automatic provisioning of user-facing services.
- Respond to platform emergencies, alerts, and escalations from Customer Support.
- Ensure systems exist to manage software life cycles (e.g. operating systems) with a minimum of manual effort.
- Develop a fully automated multi-environment observability stack based on the existing SaaS system, and extend it to predict capacity needs based on the usage patterns.
- Plan for new service roll-outs, expansion and capacity management of existing services, and work with users to optimize their resource consumption.
- Work on database reliability and performance aspects for GitLab.com from within the SRE team as well as work on shipping solutions with the product.
- Analyze solutions and implement best practices for our PostgreSQL database clusters and their components.
- Work on observability of relevant database metrics and make sure we reach our database objectives.
- Work with peer SREs to roll out changes to our production environment and help mitigate database-related production incidents.
- Provide on-call support in rotation with the team.
- Provide database expertise to engineering teams (for example through reviews of database migrations, queries and performance optimizations).
- Work on automation of database infrastructure and help engineering succeed by providing self-service tools.
- Use the GitLab product to run GitLab.com as a first resort and improve the product as much as possible.
- Plan the growth of GitLab's database infrastructure.
- Design, build and maintain core database infrastructure components that allow GitLab to scale to support hundreds of thousands of concurrent users.
- Support and debug database production issues across services and levels of the stack.
- Make monitoring and alerting trigger on symptoms and not on outages.
- Document every action so your learnings turn into repeatable actions and then into automation.
Requirements
- Have primary experience running PostgreSQL in high-growth, large production environments using both self-managed deployments (VM, Kubernetes with modern PostgreSQL Operators) and DBaaS services.
- Have hands-on experience using data from PostgreSQL internals to design, build and troubleshoot systems.
- Have primary experience with infrastructure automation, orchestration and configuration management (Chef, Ansible, Puppet, Terraform).
- Have a solid understanding of SQL and PL/pgSQL.
- Have significant experience working in a large SaaS distributed systems production environment.
- Share our values, and work in accordance with those values.
- Have excellent written and verbal English communication skills, with an urge to collaborate and communicate asynchronously.
- Have an urge to document all the things so you don't need to learn the same thing twice, and an urge for delivering quickly and iterating fast.
- Have a proactive, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have solid data modeling and data structure design skills.
- Bonus: Solid programming skills as a (former) backend engineer, preferably with Ruby and/or Go.
- Bonus: Experience with ClickHouse or another modern OLAP database.
Benefits
- GitLab is proud to be an equal opportunity workplace
- GitLab’s policies and practices are based solely on merit
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
PostgreSQL, SQL, PL/pgSQL, infrastructure automation, orchestration, configuration management, data modeling, data structure design, Ruby, Go
Soft skills
written communication, verbal communication, collaboration, documentation, proactive attitude, problem-solving, iteration, asynchronous communication, teamwork, urgency