Senior Machine Learning Engineer

NVIDIA

Job Description

NVIDIA is looking for a Machine Learning Engineer to join the E2E Verification group and profile innovative large-scale distributed training on NVIDIA AI end-to-end solutions in large-scale supercomputing clusters. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest Accelerated Computing and Deep Learning software and hardware platforms, and with researchers, developers, and customers, to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, switch, HCA, CPU and GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.

What you’ll be doing:

  • Implementing Machine Learning algorithms and models specifically tailored for large-scale deep learning training on NVIDIA supercomputers, with a focus on high-performance networking.
  • Profiling, Benchmarking, and Analyzing Deep Learning models to identify areas for optimization and improvement in terms of performance, efficiency, and accuracy, with a strong emphasis on networking aspects.
  • Collaborating closely with Architecture, Research, and Development teams to design and implement scalable training pipelines and frameworks that leverage high-performance networking capabilities.
  • Staying up-to-date with the latest advancements in deep learning algorithms, architectures, NVIDIA GPU technologies, and high-performance networking solutions.
  • Optimizing high-performance networking solutions on NVIDIA supercomputers for the best deep learning model performance.
  • Providing insights and recommendations based on the analysis of large-scale training results, specifically focusing on networking bottlenecks and optimizations, to improve model outcomes and achieve business objectives.
  • Collaborating with Software and Hardware engineers to guide the development and integration of efficient networking solutions for deep learning, including exploring network architecture optimizations and leveraging technologies such as RoCE or InfiniBand.

What we need to see:

  • Sc in Computer Science, Software Engineering, or 12+ years equivalent experience.
  • Strong understanding and practical experience with Machine Learning algorithms and techniques, with a specialization in deep learning and expertise in high-performance networking.
  • Proficiency in programming with NVIDIA GPUs and experience with CUDA programming for deep learning frameworks such as TensorFlow and PyTorch, combined with expertise in networking libraries (such as NCCL) and protocols (such as RoCE and RDMA).
  • Ability to profile and optimize deep learning workflows, focusing on networking-related bottlenecks and optimizations, to improve overall performance and efficiency.
  • Exceptional analytical and problem-solving skills, with keen attention to detail, particularly in identifying and resolving networking performance issues.
  • Excellent communication and collaboration skills, enabling effective teamwork and cooperation.
  • Familiarity with supercomputers, parallel computing, distributed systems, and high-performance networking technologies like RDMA or InfiniBand.

Ways to stand out from the crowd:

  • Demonstrated experience in successfully profiling and optimizing large-scale Deep Learning training on NVIDIA supercomputers, with a significant focus on high-performance networking enhancements.
  • Experience with distributed Deep Learning, distributed training frameworks, or large-scale data pipelines enhanced by high-performance networking solutions.
  • Expertise in optimizing networking parameters, such as bandwidth, latency, or congestion control, for Deep Learning workloads.
  • Familiarity with NVIDIA's networking technologies, such as Mellanox InfiniBand, and their integration with Deep Learning workflows.
  • Strong understanding of high-performance networking protocols and standards and their application to Deep Learning.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.

Company Info.

NVIDIA

NVIDIA’s invention of the GPU sparked the PC gaming market. The company’s pioneering work in accelerated computing—a supercharged form of computing at the intersection of computer graphics, high performance computing and AI—is reshaping trillion-dollar industries, such as transportation, healthcare and manufacturing, and fueling the growth of many others.

  • Industry
    Cloud computing, Video games, Computer software, Semiconductors, Computer hardware, Consumer electronics, Artificial intelligence
  • No. of Employees
    22,473
  • Location
    2701 San Tomas Expressway, Santa Clara, CA 95050, USA

NVIDIA is currently hiring for Senior Machine Learning Engineer jobs in Yokneam, Israel, with an average base salary of ₪360,000 - ₪500,000 / year.