Salary
💰 $144,000 - $228,000 per year
Tech Stack
LinuxOraclePrometheusUnix
About the role
- Coordinate physical hardware activities and serve as OpenAI’s primary on-site hardware contact at a partner-operated campus
- Collaborate with Oracle teams, vendors, and delivery teams to plan and coordinate maintenance, repairs, and lifecycle activities
- Share technical requirements and verify that partner work supports OpenAI’s compute needs and quality targets
- Coordinate schedules, spare-parts planning, and issue escalation to minimize downtime
- Work with OpenAI fleet-health engineers to translate software-detected issues into on-site hardware actions
- Track hardware trends and provide joint recommendations for design or operational improvements
- Prepare documentation and runbooks that capture joint best practices for additional campuses
- Offer technical guidance and context to partner personnel while respecting their operational ownership
- Collaborate with supply-chain teams to plan spares and manage hardware lifecycle activities
- Capture lessons learned and help develop standards and playbooks for future OpenAI infrastructure projects
Requirements
- 7+ years of experience in datacenter hardware operations, hardware engineering, or large-scale server maintenance
- At least 2 years in a senior or lead technician capacity
- Deep knowledge of high-density server hardware, including x86 platforms, GPUs, storage devices, and power/cooling systems
- Ability to diagnose hardware issues and coordinate complex repairs
- Experience maintaining strong working relationships across organizations and vendors
- Comfortable setting technical expectations and validating outcomes through collaboration (not direct management)
- Adaptability to changing operational conditions and strategic and on-site problem solving
- Clear communication skills and ability to build trust across partner teams, vendors, and internal stakeholders
- Willingness and ability to be based full-time at a partner-operated campus
- Candidates must be able to sit onsite in Abilene, Texas 5 days per week
- Familiarity with large-scale cluster management or monitoring tools (IPMI, BMC, Prometheus, Nagios)
- Experience with GPU-accelerated compute clusters or other high-performance computing hardware
- Knowledge of Linux/Unix system administration and command-line diagnostic tools for hardware validation
- Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent
- Experience applying Environmental Health and Safety best practices in mission-critical environments