Reliability, Availability and Serviceability Expert, Datacenter AI Products Development

NVIDIA

Job Description

What you’ll be doing:

  • Serve as the focal-point SME for manufacturing test requirements, test methodology, test plans, and test flows for AI system RAS/Resilience features, ensuring good test coverage and successful production ramp-ups.
  • Own the AI system RAS/Resilience models, benchmarking, and risk assessment.
  • Own the troubleshooting and root-causing of AI system RAS/Resilience-related failures at the factory and in the field.
  • Drive end-to-end RAS efforts across the chip, board, and system levels to reduce FIT (failures in time) rates.
  • Lead the data analysis of RAS/Resilience logs to refine, revise and overhaul test methodology and manufacturing flows; influence and drive software tools/infrastructure required for new product development, validation, and productization.
  • Work closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.
  • Be ready to be challenged to assess new hardware features and to architect manufacturing RAS tests, flows, and methodologies.
  • Develop a deep understanding of NVIDIA's AI hardware and software architecture.

What we need to see:

  • BS or higher in EE, CE, CS, Mathematics, or equivalent experience.
  • 12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resiliency features for large compute, AI, or HPC systems.
  • Proficient in Compute System RAS/Resilience model theory and methodology.
  • Proficient in HPC or AI system architecture and Cluster Interconnect technologies.
  • Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS and Resiliency features.
  • Strong problem-solving and troubleshooting expertise, with experience institutionalizing root-cause analysis.
  • Self-motivation, strong interpersonal skills, and flexibility to adapt to new technologies.
  • Solid knowledge of, or experience in, HPC or MLPerf benchmarking is a plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers! We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

Company Info.

NVIDIA

NVIDIA’s invention of the GPU sparked the PC gaming market. The company’s pioneering work in accelerated computing—a supercharged form of computing at the intersection of computer graphics, high performance computing and AI—is reshaping trillion-dollar industries, such as transportation, healthcare and manufacturing, and fueling the growth of many others.

  • Industry
    Cloud computing, Video games, Computer software, Semiconductors, Computer hardware, Consumer electronics, Artificial intelligence
  • No. of Employees
    22,473
  • Location
    2701 San Tomas Expressway, Santa Clara, CA 95050, USA

NVIDIA is currently hiring for AI Product Engineer jobs in Austin, TX, USA, with an average base salary of $180,000 - $339,250 per year.
