Senior Infrastructure System Software Engineer

NVIDIA

Job Description

We are seeking a Senior Infrastructure System Software Engineer with profound expertise in High-Performance Computing (HPC) and AI workload management, as well as Kubernetes-based infrastructure, to join our Omniverse Infrastructure team. The ideal candidate will have a strong understanding of system software design principles and extensive experience in deploying, managing, optimizing, and scaling sophisticated AI and cloud environments using workload management software systems such as SLURM, Flux, PBS Pro, and Kubernetes. Proficiency in integrating HPC/AI workload managers with Cloud resource provisioning APIs (e.g., AWS EC2 with topology-aware configurations) is desirable.

As a key member of the NVIDIA Omniverse™ Cloud team, you will be tasked with designing and developing advanced system software solutions within large AI clusters to efficiently manage and schedule resources for converged HPC/AI and cloud-native workloads. This role requires close collaboration with multi-functional teams to ensure our infrastructure meets the stringent demands of advanced AI workloads, including extreme scalability, elasticity, multi-tenancy, high availability, and the optimization of large-scale applications and workflows.

What you will be doing:

  • Architect and implement system software within a converged environment that combines HPC/AI workload managers such as SLURM with services running on Kubernetes, enhancing resource/process management, scheduling, and resilience of large AI workloads.
  • Develop long-running system service solutions to accelerate the training of extensive AI models.
  • Work closely with Omniverse infrastructure teams and customers to fully understand and meet their compute and storage needs, ensuring seamless integration with AI/HPC workload managers and cloud APIs.
  • Tackle key system software challenges in compute, networking, and storage to enhance the overall performance, efficiency, and resilience of large language model training and other computational tasks.

What we need to see:

  • 8+ years of experience in system software engineering with a focus on developing and enhancing AI/HPC workload management software such as SLURM and the Flux framework.
  • BS degree or equivalent experience.
  • Demonstrated ability in developing fault-tolerant, distributed services at scale.
  • Strong familiarity with HPC workload managers such as SLURM, Flux, and PBS Pro, and with their integration with cloud APIs to create large AI/HPC cluster instances within major CSPs, for example by using AWS EC2 provisioning, reservation, and topology-aware configuration APIs.
  • Proficiency in Python, C/C++, with a solid background in systems programming, including event-based programming, multi-threading, concurrency, and parallelism.
  • A deep understanding of cloud technologies, including managed cloud services, distributed computing systems, and microservices architecture.
  • An advanced degree in Computer Science or a related field, or equivalent professional experience.
  • Excellent collaborative skills and the ability to work effectively across multi-functional teams and geographies.

Ways to stand out from the crowd:

  • In-depth, practical knowledge of significantly modifying HPC workload managers such as SLURM and Flux.
  • Demonstrated experience in making the most of advanced cloud services that use NVIDIA high-end GPUs for AI optimization.
  • Experience with low-level system provisioning tools such as Bright Cluster Manager (BCM).

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most hard-working and dedicated people in the world working for us. If you're creative and passionate about developing cloud services, we want to hear from you!

The base salary range is 176,000 USD - 333,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Company Info.

NVIDIA

NVIDIA’s invention of the GPU sparked the PC gaming market. The company’s pioneering work in accelerated computing—a supercharged form of computing at the intersection of computer graphics, high performance computing and AI—is reshaping trillion-dollar industries, such as transportation, healthcare and manufacturing, and fueling the growth of many others.

  • Industry
    Cloud computing, Video games, Computer software, Semiconductors, Computer hardware, Consumer electronics, Artificial intelligence
  • No. of Employees
    22,473
  • Location
    2701 San Tomas Expressway, Santa Clara, CA 95050, USA
