Tech Stack
Ansible, AWS, Azure, Cloud, Firewalls, Grafana, Linux, Python, SQL, Switching, Terraform, VMware
About the role
- Monitor cloud and on-prem infrastructure for errors or problems and resolve them in a timely manner.
- Work with Development and Product to design and implement strategies to increase performance, reliability, and scalability of the infrastructure.
- Identify single points of failure in platform design and make cost-effective recommendations for remediation.
- Stay up to date with the latest technologies and advancements in cloud computing and infrastructure operations to improve resiliency and security.
- Develop, document, and enforce policies, standards, and procedures for cloud and infrastructure maintenance, change management, and security.
- Participate in organizational Change Management activities, including risk assessments, change approval reviews, and post-change validation.
- Ensure timely patching and lifecycle management across hardware, operating systems, virtualization platforms, and cloud resources, following security and compliance requirements.
- Collaborate with Security and Compliance teams to remediate vulnerabilities outside normal patch cycles, including emergency fixes, configuration changes, and compensating controls.
- Manage cloud-based data backup and disaster recovery procedures to ensure business continuity.
- Maintain and support physical server infrastructure (Dell PowerEdge and related hardware), ensuring firmware, drivers, and hardware components are kept current.
- Document and maintain infrastructure architecture diagrams, configuration details, SOPs, and runbooks and ensure knowledge is captured in Atlassian (Jira/Confluence) and ITGlue.
- Manage vendor relationships and contracts related to infrastructure and cloud services, ensuring SLA adherence and effective escalation.
- Optimize the telephony carrier network for cost-effectiveness, capacity, flexibility, and resiliency.
- Create and manage budgets for cloud and infrastructure, recommending cost savings where appropriate.
- Contribute to monitoring, alerting, and observability practices using tools such as Grafana, Zabbix, Site24x7, or similar to reduce MTTD/MTTR.
- Serve as a point of contact for other employees on infrastructure-related issues, providing technical support and guidance.
- Actively participate in incident response, root cause analysis, and post-incident reviews; ensure lessons learned feed into continuous improvement.
- Contribute to capacity planning and forecasting to anticipate future growth and resource needs.
- Mentor and delegate work to IT staff; provide training on infrastructure tools, processes, and best practices.
- Maintain overall accountability for the performance, availability, and security of the cloud and infrastructure platform.
Requirements
- Must be local to NH or MA.
- 24/7 on-call availability.
- Experience with Active Directory and Windows/Linux system administration.
- Strong knowledge of VMware technologies (ESXi, vCenter, vSphere) for virtualization and datacenter management.
- Experience with Nutanix Enterprise Cloud (AHV, Prism, cluster management) for virtualization and hyperconverged infrastructure.
- Hands-on experience with Dell PowerEdge server hardware, including installation, firmware updates, lifecycle management, and integration with Nutanix/VMware environments.
- Advanced understanding of Microsoft SQL Server.
- Automation technologies experience (Terraform, Ansible, or similar).
- Hands-on scripting experience (Python, PowerShell, or Bash) to support automation and integration.
- Network engineering experience (routing, switching, firewalls, VPN/IPSec).
- Knowledge of cloud technologies (AWS, Azure) and hybrid integration.
- Telephony technologies experience (SIP, DID).
- Experience with patch management tools and processes (WSUS, SCCM, Ansible, or equivalent) for both Windows and Linux environments.
- Familiarity with vulnerability management tools and ability to work with security teams on remediation.
- Experience with Atlassian tools (Jira, Confluence) and ITGlue or equivalent knowledge/documentation platforms.
- Familiarity with monitoring and observability tools (Grafana, Zabbix, Site24x7, or similar).
- Understanding of ITIL practices in Change, Incident, and Problem Management.
- Exposure to compliance frameworks (ISO 27001, SOC 2, NIST) is a plus.
- Excellent documentation and communication skills; ability to explain technical issues in business terms.
- Ability to prioritize effectively and recognize opportunities to delegate.
- Collaborative mindset in working with product, engineering, IT, and security teams.
- Hands-on engineer who can also bring a strategic mindset.