Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
Contribute to increasing system efficiency and reducing the manual intervention time required for related tasks
Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
Lead the team in meeting technical deliverables, drawing on a solid understanding of technical process controls and standards
Collaborate with vendors and other technical personnel to resolve technical issues and achieve the highest levels of systems and infrastructure availability
Incident Management: Triage and resolve production incidents, engage partner teams, and communicate status updates to business users
Problem Management: Manage support tickets and perform root cause analysis to drive long-term solutions
Monitoring & Alerting: Implement and customize alerting tools based on application thresholds; enable business transaction monitoring
BCP Support: Coordinate and document efforts to ensure application resiliency; participate in scheduled BCP test events
Capacity Management: Support capacity planning and provide application metrics to planning teams
Audit & Compliance: Participate in audit activities and provide production environment data to auditors
Automation: Develop scripts and dashboards to automate routine platform tasks
On-call Support: Provide deployment support and carry a pager for after-hours incident response
Requirements
4+ years of Systems Engineering or Technology Architecture experience, or equivalent, demonstrated through one or a combination of the following: work experience, training, military experience, education
4+ years of experience in IT Service Management (ITSM), with a strong background in incident, problem, and change management processes
3+ years of experience with observability platforms such as BigPanda, ThousandEyes, Grafana, Prometheus, Splunk Observability, and AppDynamics
3+ years of experience working with Red Hat Enterprise Linux and Kubernetes, with a strong focus on Red Hat OpenShift Container Platform (OCP)
Strong skills in deploying, managing, and troubleshooting containerized applications in hybrid cloud environments, ensuring high availability and scalability
Experience in project management and stakeholder engagement (desired)
Excellent problem-solving skills (desired)
Strong decision-making abilities (desired)
Excellent communication and collaboration skills (desired)
Must be available for on-call support and flexible enough to work ad hoc shifts
This position is not eligible for visa sponsorship
Benefits
Health benefits
401(k) Plan
Paid time off
Disability benefits
Life insurance, critical illness insurance, and accident insurance
Parental leave
Critical caregiving leave
Discounts and savings
Commuter benefits
Tuition reimbursement
Scholarships for dependent children
Adoption reimbursement
Hybrid work schedule