Tech Stack
Ansible, AWS, Azure, Cassandra, Chef, Cloud, Cyber Security, DNS, Firewalls, Google Cloud Platform, Grafana, Linux, MongoDB, MySQL, NoSQL, Oracle, Prometheus, Puppet, Python, Redis, ServiceNow, Splunk, SQL, Tableau, TCP/IP, Terraform, VMware
About the role
- Handle escalated incidents from OCCE1 with advanced troubleshooting and problem resolution across network, system, and cloud platforms
- Proactively monitor system health, performance, and uptime using monitoring and observability tools
- Identify recurring incidents and perform root cause analysis for long-term resolution
- Collaborate with Applications, Infrastructure, Security, and Cloud teams to resolve incidents
- Configure, troubleshoot, and maintain network devices (routers, switches, firewalls) and secure remote access (VPN, RDP)
- Manage and maintain cloud infrastructure (AWS, Azure, GCP), virtualization (VMware, Hyper-V), and automation (Terraform, Ansible)
- Develop and refine runbooks, playbooks, and response procedures; improve cloud governance and security
- Participate in on-call rotations and prepare post-incident reports, root cause analysis, and lessons learned
- Ensure SLAs for response times, escalation, and ticket handling are met and coordinate shift handovers
- Lead system administration efforts (Windows, Linux, macOS), backup and disaster recovery, and server management
- Contribute to monitoring tool improvements, capacity planning, risk management, and project work
Requirements
- High school diploma or equivalent required (Bachelor's degree preferred)
- Minimum of 3 years (36 months) of experience in IT operations, help desk, or similar roles
- Minimum of 1 year of experience with VPN, remote access technologies, and network monitoring
- Experience with Windows, Linux, and/or macOS administration
- Network configuration and troubleshooting, DNS, DHCP, TCP/IP
- Experience with cloud platforms (AWS, Azure, GCP) and cloud networking
- Virtualization experience (VMware, Hyper-V, KVM)
- Experience with automation tools (Terraform, Ansible) and scripting (Python, Bash, PowerShell)
- Monitoring and observability tools (Prometheus, Grafana, Nagios, Zabbix, SolarWinds)
- Incident and change management, on-call experience
- Familiarity with security tools and practices (firewalls, IDS/IPS, SIEMs, vulnerability management)
- Familiarity with backup, disaster recovery, and server management
- Preferred: certifications such as CompTIA Network+, CCNA, Azure Administrator, AWS Solutions Architect, RHCSA, CEH, CySA+, CISA, GSEC