Support production applications and participate in on-call rotation for incident response
Proactively automate issue discovery, and reduce recurring incidents and recovery time
Improve application availability, latency, performance, efficiency, and proactive monitoring
Interface with business users, development teams, and system administrators to meet business needs
Develop, coordinate, and conduct technical reliability studies on engineering designs
Measure and analyze reliability of designs, materials, processes, cost, and final products
Recommend design or test methods and statistical process control procedures to achieve required reliability
Complete risk analysis studies of new designs and processes
Undertake testing and analysis on failures and propose changes to improve system/process reliability
Requirements
Bachelor's degree, or equivalent work experience
Five to seven years of relevant work experience in business and risk analysis, IT Service Management, production support, product/project management, or application development
Proven experience as a Site Reliability Engineer
Strong knowledge of monitoring tools and incident management
Proficiency with database technologies (DB2, Oracle, Postgres, SQL scripting)
Strong Linux skills (command line, scripting, cron)
System administration skills (restarting JVMs, F5 pool management, AutoSys, etc.)
Experience with observability, monitoring, and logging tools such as Datadog, Splunk, AppDynamics, and Kibana
Experience with AWS or Azure services
Experience with Docker and container clustering technologies like AWS ECS or Kubernetes
Experience using GitLab/GitHub for version control
Strong communication and collaboration abilities
Must be open to production support, on-call rotation, and occasional after-hours work
Benefits
Healthcare (medical, dental, vision)
Basic term and optional term life insurance
Short-term and long-term disability
Pregnancy disability and parental leave
401(k) and employer-funded retirement plan
Paid vacation (from two to five weeks depending on salary grade and tenure)
Up to 11 paid holiday opportunities
Adoption assistance
Sick and Safe Leave accruals of one hour for every 30 hours worked, up to 80 hours per calendar year unless otherwise provided by law
Hybrid/flexible schedule (in-office expectation of 3+ days per week)
Incentive and recognition programs, equity stock purchase, 401(k) contribution and pension
Disability accommodations during application/hiring process