Monitor Production and Enterprise Infrastructure and react to alarms according to documented SLAs
Work with Engineering and Customer Support teams to remediate alarms and incidents
Continually strive to improve the environment through optimization and automation
Create and update documentation as necessary to share new methods and knowledge around troubleshooting
Perform operational tasks as assigned by Engineering and Customer Support teams
Support incident response, deployments, and infrastructure training as the role evolves
Work with international teams to diagnose and resolve critical issues
Build, tune, and maintain alerting rules and monitors so that every alert is actionable, and investigate root causes rather than only mitigating symptoms
Participate in post-incident reviews and contribute to blameless post-mortems

Requirements

2+ years of experience in a help desk environment or NOC role, ideally in a cloud-based environment
Experience creating and managing alerts and monitors using enterprise monitoring tools such as Nagios, Zabbix, SolarWinds, or Datadog (Datadog preferred)
Experience with incident management platforms such as PagerDuty, Opsgenie, or FireHydrant
Experience working with ticketing systems such as Jira and Zendesk
Experience following runbooks and troubleshooting guides to remediate infrastructure or application issues
Experience with infrastructure operations (cloud infrastructure on AWS or Azure preferred)
Technical aptitude, with the ability and willingness to quickly learn and understand complex products or services
Highly self-motivated, with a strong work ethic and the ability to multitask in a fast-paced environment
Demonstrated problem-solving and organizational skills that support successful outcomes and efficient execution of incident response and other initiatives
Strong written and verbal communication skills in English
Ability to work flexible shifts and participate in a 24x7 on-call rotation
Experience building log-based alert rules (e.g., ElastAlert or equivalent) and investigating issues using centralized logging platforms (e.g., ELK/Kibana or equivalent)
Comfort with Kubernetes and Docker container-based environments, including pod-level health triage
Comfort working in a command-line environment (Linux/bash, Windows CMD/PowerShell, or equivalent); the team regularly uses CLI tools for infrastructure triage, pod inspection, and operational scripts

Benefits

Competitive salary, benefits, and equity participation
A dynamic, flexible and fun start-up work environment with a highly talented team

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

incident management, alert management, monitoring tools, troubleshooting, infrastructure operations, log-based alert rules, Kubernetes, Docker, command-line environment, cloud infrastructure

Soft Skills

problem-solving, organizational skills, communication skills, self-motivated, multitasking, adaptability, team collaboration, fast-paced environment, attention to detail, willingness to learn