The Site Reliability Engineer II will be responsible for supporting, enhancing, and maintaining Restaurant365’s cloud infrastructure and applications.
Collaborate with DevOps, development, and infrastructure teams to resolve moderately complex issues, propose improvements, and strengthen the reliability, scalability, and security of our SaaS platform.
Respond to production incidents, perform triage and troubleshooting, and contribute to post-incident analysis.
Identify and automate manual processes to improve efficiency and reduce risk.
Enhance and evolve monitoring tools and platforms to improve observability.
Promote and apply best practices for reliability, scalability, and performance across engineering.
Implement and support cloud automation using Terraform, Ansible, or CloudFormation.
Work within change management protocols to maximize uptime for production systems.
Participate in on-call rotation, providing 24x7 support for incidents and contributing to root cause analysis.
Partner with developers, architects, vendors, and IT teams to ensure reliable system operations.
Research and remediate vulnerabilities in coordination with security teams.
Maintain documentation of infrastructure, monitoring, runbooks, and incident response procedures.

Requirements

BS in Computer Science, Information Systems, or related field (or equivalent experience).
2–4 years of experience in site reliability engineering, DevOps, or cloud operations.
Experience with cloud platforms (Azure or AWS), including services such as AKS, ECS/EKS, Functions/Lambda, S3, and Blob storage.
Proficiency with infrastructure-as-code and automation (Terraform, Ansible, YAML, Python, Bash, PowerShell).
Strong Linux engineering skills; working knowledge of Windows administration.
Experience supporting production environments and participating in on-call rotations.
Familiarity with web servers and middleware (Nginx, Apache Tomcat).
Experience with CI/CD tools (GitLab, Git, or similar).
Strong written, oral, and interpersonal communication skills.

Preferred Qualifications

Experience with monitoring tools (Prometheus, Grafana, ELK, Site24x7, Nagios).
Knowledge of performance analysis and system vulnerability remediation.
Cloud certification (AWS or Azure).
Familiarity with restaurant industry SaaS platforms and customer-facing applications.

Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

site reliability engineering, DevOps, cloud operations, infrastructure-as-code, automation, Linux engineering, Windows administration, performance analysis, vulnerability remediation, monitoring

Soft Skills

communication, interpersonal skills, troubleshooting, problem-solving, collaboration, documentation, incident response, root cause analysis, efficiency improvement, change management

Certifications

AWS certification, Azure certification