OpenAI

Datacenter Hardware Operations Technician, AI Compute Infrastructure

OpenAI

full-time

Posted on:

Origin:  • 🇺🇸 United States

Visit company website
AI Apply
Apply

Salary

💰 $144,000 - $228,000 per year

Job Level

SeniorLead

Tech Stack

LinuxOraclePrometheusUnix

About the role

  • Coordinate physical hardware activities and serve as OpenAI’s primary on-site hardware contact at a partner-operated campus
  • Collaborate with Oracle teams, vendors, and delivery teams to plan and coordinate maintenance, repairs, and lifecycle activities
  • Share technical requirements and verify that partner work supports OpenAI’s compute needs and quality targets
  • Coordinate schedules, spare-parts planning, and issue escalation to minimize downtime
  • Work with OpenAI fleet-health engineers to translate software-detected issues into on-site hardware actions
  • Track hardware trends and provide joint recommendations for design or operational improvements
  • Prepare documentation and runbooks that capture joint best practices for additional campuses
  • Offer technical guidance and context to partner personnel while respecting their operational ownership
  • Collaborate with supply-chain teams to plan spares and manage hardware lifecycle activities
  • Capture lessons learned and help develop standards and playbooks for future OpenAI infrastructure projects

Requirements

  • 7+ years of experience in datacenter hardware operations, hardware engineering, or large-scale server maintenance
  • At least 2 years in a senior or lead technician capacity
  • Deep knowledge of high-density server hardware, including x86 platforms, GPUs, storage devices, and power/cooling systems
  • Ability to diagnose hardware issues and coordinate complex repairs
  • Experience maintaining strong working relationships across organizations and vendors
  • Comfortable setting technical expectations and validating outcomes through collaboration (not direct management)
  • Adaptability to changing operational conditions and strategic and on-site problem solving
  • Clear communication skills and ability to build trust across partner teams, vendors, and internal stakeholders
  • Willingness and ability to be based full-time at a partner-operated campus
  • Candidates must be able to sit onsite in Abilene, Texas 5 days per week
  • Familiarity with large-scale cluster management or monitoring tools (IPMI, BMC, Prometheus, Nagios)
  • Experience with GPU-accelerated compute clusters or other high-performance computing hardware
  • Knowledge of Linux/Unix system administration and command-line diagnostic tools for hardware validation
  • Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent
  • Experience applying Environmental Health and Safety best practices in mission-critical environments
DDN

Senior Staff Engineer – AI In-Market Engineering

DDN
Seniorfull-time🇺🇸 United States
Posted: 17 days agoSource: careers-ddn.icims.com
CloudDistributed SystemsGoGrafanaKubernetesLinuxNFSPrometheusPythonTCP/IP
NVIDIA

Senior Platform Telemetry Engineer

NVIDIA
Seniorfull-time$148k–$288k / yearCalifornia · 🇺🇸 United States
Posted: 2 days agoSource: nvidia.wd5.myworkdayjobs.com
GrafanaPrometheusPython
Pythian

Senior Zabbix Administrator

Pythian
Seniorfull-time🇮🇳 India
Posted: 25 days agoSource: jobs.lever.co
AnsibleAWSCloudGrafanaLinuxMySQLOraclePostgresPuppetPythonVMware
DDN

Senior Staff Engineer – AI In-Market Engineering

DDN
Seniorfull-time🇺🇸 United States
Posted: 17 days agoSource: referrals-ddn.icims.com
CloudDistributed SystemsGoGrafanaKubernetesLinuxNFSPrometheusPythonTCP/IP
NVIDIA

Senior Platform Telemetry Engineer

NVIDIA
Seniorfull-time$148k–$288k / yearCalifornia · 🇺🇸 United States
Posted: 22 days agoSource: nvidia.wd5.myworkdayjobs.com
GrafanaPrometheusPython