
Lead Systems Engineer – HPC, AI
CARBON3
full-time
Location Type: Hybrid
Location: 🇬🇧 United Kingdom
Job Level
Senior
Tech Stack
Ansible, Cloud, Distributed Systems, Grafana, Kubernetes, Linux, Splunk
About the role
- Escalate complex (L3) issues to vendor product/engineering teams, coordinate their resolution, and manage vendor responses
- Monitor system health, alerts, and customer usage patterns
- Document solutions and workarounds, create and maintain knowledge base content, and document support procedures
- Automate common tasks and fixes
- Configure and integrate tooling to support optimal operation of the platform, and support tool selection
- Assist customers with platform configuration, onboarding, and usage best practices
- Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues
- Ensure SLAs and customer satisfaction targets are met
- Provide L1 support for customer-reported issues and requests
- Provide L2 support by diagnosing, replicating, and troubleshooting issues across platform and infrastructure
- Work with customers and multiple stakeholders to understand requirements and challenges, and provide reporting on usage, workflow, and billing
Requirements
- Extensive experience in technical support, system engineering, or platform operations
- Solid understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting)
- Familiarity with cloud-based platforms, APIs, and distributed systems
- Understanding of AI/ML concepts and tooling (basics of model training, inference, and data pipelines)
- Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk)
- Excellent communication skills to interface with both customers and internal / vendor teams
- Good understanding of the tooling requirements of ML engineers and data scientists, and how to optimize their experience
- System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning
- Proficiency with Ansible, the NVIDIA and CUDA toolkits, Kubernetes, and container orchestration
- Understanding of automation, monitoring, and security for GPU as a service
Benefits
- Health insurance
- Professional development opportunities
Applicant Tracking System Keywords
Hard skills
technical support, system engineering, platform operations, L1 support, L2 support, troubleshooting, automation, system administration, Ansible, Kubernetes
Soft skills
communication, collaboration, customer satisfaction, problem-solving, documentation