AI/HPC System Engineer

Meta Platforms, Inc.

Job Description

The AI/HPC System Engineering team is looking for a System Hardware Design Engineer to design, implement, grow, and maintain the proof-of-concept development, test, and emulation platforms for Meta's AI/HPC systems, which comprise Scalar Compute, Accelerated Compute, Network Technologies, and Software. This team is working on the design, implementation, and support of one of the world’s largest AI/HPC deployments. In this role, you will have a unique opportunity to shape the future of AI/HPC at Meta by specifying technical requirements, driving specifications and implementations, and steering the industry and ecosystem.

The ideal candidate will be able to operate in a highly multi-tasked, fast-paced, and very cross-functional engineering environment. They will have experience with hardware development, software development, and integration of HPC and machine learning clusters, both server and fabric. This role works across many projects in our AI/HPC System team to shape our system design and development requirements across System and Hardware projects. A successful candidate will be a HW system and platform builder, equally comfortable working on rack-scale solutions, including server and mechanical design involving thermals (Air-Cooling, Air Assist, and Direct Liquid Cooling), boards, sensors, FPGA RTL design/verification, performance and power test and evaluation, and OS/RTOS kernel and driver software and architectures. The position requires a senior lead developer and architect who can debug and extract solutions from vague descriptions of system architecture and workloads, define such system architecture by deriving requirements from internal customers, industry strategies, and partners, and build cross-functional relationships across teams to find ideas, assets, and assistance to explore faster.

The AI Systems and Accelerated Platforms team designs, builds, brings-up, tests and integrates hardware systems that power Meta’s platforms, deployed in data centers worldwide, from silicon to sheet metal. Designs are published through the Open Compute Project Foundation.

This is a rare opportunity to join our team and help us build some of the world’s most open and efficient machine learning platforms.

AI/HPC System Engineer Responsibilities

  • Work as part of the AI Systems and Accelerated Platforms team to design, develop, test and integrate Meta AI hardware platforms.
  • Collect requirements and develop specifications for rack-scale AI/HPC systems; develop and maintain code, tooling, practices, and infrastructure to collect, analyze, and interpret data for emulation and validation platforms; and derive feature and performance requirements for our hardware systems and ASICs.
  • Collaborate with multidisciplinary Hardware and Software Engineering and Capacity Management teams to develop detailed hardware and operating specifications for our platforms, including in-depth performance, operational, and reliability analysis.
  • Work cross-functionally with stakeholders and partner teams to develop and optimize full stack solutions, from hardware up to software
  • Design, bring up, and integrate HW prototypes with firmware and drivers, and deliver a working end-to-end system that is ready for proof-of-concept performance validation and tests.
  • Work with Release to Production, Data Center Operations, and Production Engineering teams to understand installation, operation and maintenance considerations within Meta data centers and incorporate feedback into current and future hardware designs.
  • Work closely with software and hardware subsystem subject matter experts to bring disparate technologies together to produce highly efficient and powerful systems.
  • Change the landscape of datacenter computing through development and collaboration with open source hardware and software communities.
  • Specify and design clustered AI/HPC systems so that they integrate well with the data centers and the fleet. Develop best practices and guidelines for doing so.

Minimum Qualifications

  • Master’s/PhD degree in Computer Engineering, Electrical Engineering, or a similar engineering field, or a BS and 5+ years of industry experience.
  • Experience with implementing accelerated compute server systems for AI / HPC
  • English language communications skills, experience tailoring communication style and depth for audiences with varied levels of subject matter expertise.
  • Experience discovering problem statements in large-scale, complex AI/HPC systems, devising solutions, and prototyping, modeling, and emulating the desired end systems (e.g. using existing technology or new FPGAs).
  • Familiarity with hardware design and system-level troubleshooting as part of new hardware development processes (e.g. board bring-up, system and cluster validation).
  • System level design, bring-up, software integration, and debug experience.
  • Troubleshooting skills and the experience diving into software, firmware, hardware, and network problems. Debug wherever the problem leads.
  • Basic proficiency in C++ and/or Python.
  • General understanding of architectural trade-offs, including board design, high-speed signal integrity, bus architecture, rack design, and topology considerations.
  • Experience with management under ambiguity in a fast changing field.
  • Experience working effectively as an individual and in a multidisciplinary team.
  • Must obtain work authorization in the country of employment at the time of hire and maintain ongoing work authorization during employment.

Preferred Qualifications

  • Solid hands-on experience in designing, building and deploying large-scale AI/HPC systems consisting of 1000s of processors, air cooled and liquid cooled, supporting 100s of Terabytes of Storage
  • Network/fabric bus design & operation experience: Ethernet, Infiniband, RoCE networks, popular HPC fabrics.
  • Core domain knowledge in servers and networking and at least one (1) additional domain: high performance computing, storage, or silicon (ASIC or FPGA) development.
  • Experienced in complex, multi-subsystem and system-level troubleshooting.
  • Experience in power test and evaluation in prototyping platforms.
  • Experienced with system performance analysis, debug, and optimization practices.
  • Experience leading small technical teams and managing larger programs.
  • Experience with lab system debug with logic analyzers, scopes, meters, etc.
  • Familiarity with Cluster management tools and IaaS architecture
  • Basic knowledge about chip architecture, µarchitecture, design and interconnects
  • Familiarity with GPU/accelerator low level programming and performance validation.
  • Knowledge of system and memory buses (including e.g. UPI, XGMI, NVLink).
  • Familiarity with FPGA hardware tuning (SerDes, voltage, etc.).
  • Knowledge of typical system IO and management buses (PCIe, I2C and/or LPC).
  • Knowledge of coherent accelerator buses (including IAL, CCIX, CAPI, NVLink, CXL).
  • Scripting and configuration automation in one or more environments: Chef, Ansible (or similar environments), or Bash.
  • Firmware development and debugging (especially BMC and system firmware).
  • RTL language experience: SystemVerilog, Verilog.
  • Understanding of storage protocols such as NVMe, NVMeoF, RDMA with storage, SATA, and SAS.
  • Detail oriented with careful and balanced rapid execution in a fast-paced environment.
  • Familiarity with implementing, debugging, and performance-optimizing systems that leverage popular machine learning tool chains (including communication libraries).
  • Experience with Liquid Cooling

Company Info.

Meta Platforms, Inc.

Meta Platforms (formerly known as Facebook Inc.) is a large technology company that was founded in 2004 by Mark Zuckerberg and several of his college roommates. The company is based in Menlo Park, California, and is primarily known for its flagship social media platform, Facebook. In addition to its consumer-facing products, Meta Platforms also offers a range of advertising and marketing services to businesses and organizations.

  • Industry
    Advertising, Consumer electronics, Social media company, Artificial intelligence
  • No. of Employees
    76,000
  • Location
    1 Hacker Way, Menlo Park, CA 94025, USA


Meta Platforms, Inc. is currently hiring High Performance Computing Engineer Jobs in Oslo, Norway with average base salary of kr510,500 - kr730,800 / Year.
