High-Performance Computing Engineer

OpenAI
Apply Now

Job Description

Help OpenAI build some of the biggest & fastest AI supercomputing clusters in the world! As a High-Performance Computing engineer, you’ll work at the intersection of hardware and software, designing systems that deliver the maximum possible performance for running large-scale AI models. We work at the very cutting edge of speed and scale, combining the traditions of High-Performance Computing (HPC) in a modern cloud and containerized environment.

For this role, it’s important you understand how to combine CPUs, GPUs, and network devices into systems that are then deployed at a large scale to peak efficiency. You understand the lowest levels of the software platforms that sit on top of this hardware, including how to best optimize the Linux kernel and user-space code. You are capable of writing code to automate the health checks and healing of these systems, commanding a large number of servers with few people.

We recently launched our newest cluster, “Owl” with over 250K cores, 10K GPUs, and 400Gbps of networking per node. This would be in the top 5 of the TOP500 supercomputers in the world. See this blog post to get a sense of what kind of challenges we solve in our day-to-day work.

In this role, you will work closely with and directly accelerate machine learning researchers, but don't need to be a machine learning expert yourself. We value people who can quickly obtain a deep technical understanding of new domains and enjoy being self-directed and identifying the most important problems to solve. We believe that increasing compute is a huge lever to AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and likewise produce unprecedented results.

We’re looking for a blend of:

  • Expertise with high-performance networking, ideally with MPI, NCCL, RDMA, and/or Infiniband
  • Experience with GPUs in large scale networks strongly preferred
  • Deep understanding of TCP/IP and the Linux networking stack
  • Experience developing high-quality software in a general-purpose programming language (Python, C, C++, Go, etc)
  • Experience with virtualization and container architecture in cloud environments
  • Tenacious at troubleshooting hardware and network topology failures in distributed systems
  • Independently driven and able to own problems and build solutions from end-to-end
  • Experience with large scale data center operations, particularly thermal and power constraints
  • Familiar with Kubernetes, Terraform, and/or Chef as orchestration and system tools
  • Experience with Azure

About OpenAI

We’re building safe Artificial General Intelligence (AGI), and ensuring it leads to a good outcome for humans. We believe that unreasonably great results are best delivered by a highly creative group working in concert. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

This position is subject to a background check for any convictions directly related to its duties and responsibilities. Only job-related convictions will be considered and will not automatically disqualify the candidate. Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodations via accommodation@openai.com.

Benefits

- Health, dental, and vision insurance for you and your family

- Unlimited time off (we encourage 4+ weeks per year)

- Parental leave

- Flexible work hours

- Lunch and dinner each day

- 401(k) plan with matching

Company Info.

OpenAI

OpenAI is an American artificial intelligence (AI) research laboratory. OpenAI conducts AI research with the declared intention of promoting and developing a friendly AI. OpenAI systems run on the fifth most powerful supercomputer in the world. Microsoft announced that it is building AI technology based on the same foundation as ChatGPT into Microsoft Bing, Edge, Microsoft 365 and other products.

  • Industry
    Artificial intelligence,Computer software
  • No. of Employees
    380
  • Location
    San Francisco, CA, USA
  • Website
  • Jobs Posted

Get Similar Jobs In Your Inbox

OpenAI is currently hiring High Performance Computing Engineer Jobs in San Francisco, CA, USA with average base salary of $120,000 - $190,000 / Year.

Similar Jobs View More