Vultr

Senior Site Reliability Engineer, Core Cloud Engineering

full-time

Location Type: Remote

Location: United States

Salary

💰 $120,000 - $130,000 per year

About the role

  • Operate and scale Vultr’s control plane, ensuring availability, correctness, and performance across global datacenters.
  • Design, implement, and maintain automation to manage hypervisor fleets (KVM, QEMU, libvirt) and supporting infrastructure at scale.
  • Develop tooling and automation for Open vSwitch (OVS), BGP routing, and other networking components to ensure resilient and self-healing network operations.
  • Continuously analyze and improve system performance across compute, storage, and network layers, with an emphasis on reducing toil and eliminating single points of failure.
  • Implement advanced monitoring, logging, and tracing solutions (Grafana, Sentry, SumoLogic) while leading incident response to minimize impact and drive postmortem culture.
  • Maintain and evolve infrastructure pipelines (GitLab CI/CD, Puppet) to enable safe, fast, and reliable changes to both control plane and hypervisor infrastructure.
  • Work closely with Software Engineers, Network Engineers, and Product teams to align platform reliability with business and user needs.
  • Produce clear technical documentation for runbooks, operational procedures, and automation frameworks to improve team efficiency and reliability standards.
  • Coach and mentor team members in best practices for site reliability, incident handling, automation, and low-level Linux systems debugging.

Requirements

  • Proficiency in PHP with strong scripting and automation skills.
  • Experience running large-scale distributed systems and control plane infrastructure in production.
  • Strong background in hypervisor technologies (libvirt, QEMU, KVM) and Linux systems administration.
  • Expertise in networking protocols and tools, particularly BGP and Open vSwitch (OVS), with automation experience.
  • Deep knowledge of observability and monitoring frameworks (Grafana, Sentry, SumoLogic) and incident management.
  • Advanced troubleshooting skills across compute, networking, and storage subsystems.
  • Experience building and maintaining CI/CD pipelines (GitLab) and configuration management (Puppet).
  • Familiarity with MySQL or similar databases, with an understanding of operational considerations for reliability and scale.
  • Strong problem-solving abilities and the drive to tackle complex, low-level reliability challenges.
  • Effective cross-team communication and collaboration skills.
  • A commitment to continuous improvement and fostering a culture of operational excellence.

Benefits

  • Excellent medical benefits with 100% company-paid premiums for the employee-only plan, plus 100% company-paid dental and vision premiums
  • 401(k) plan that matches 100% up to 4% with immediate vesting
  • Professional Development Reimbursement of $2,500 each year
  • 11 Holidays + Paid Time Off Accrual + Rollover Plan
  • Increased PTO at the 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
  • $500 first year remote office setup + $400 each following year for new equipment
  • Internet reimbursement up to $75 per month
  • Gym membership reimbursement up to $50 per month
  • Company paid Wellable subscription

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

PHP, scripting, automation, hypervisor technologies, libvirt, QEMU, KVM, BGP, Open vSwitch, CI/CD

Soft Skills

problem-solving, cross-team communication, collaboration, coaching, mentoring, continuous improvement, operational excellence