Lead MLOps / LLM - AI Platform/HPC GPU Cluster

Qualcomm

Job Description

Architect, deploy, and optimize the ML platform that supports training of Large Language Models (LLMs) and related generative models on high-performance GPU clusters. Become an integral part of a cross-functional team of data scientists, software engineers, and IT specialists, ensuring the smooth operation and scalability of our ML infrastructure.

Minimum Qualifications:

  • Bachelor's degree in Computer Science, Engineering, Information Systems, or related field and 6+ years of Hardware Engineering, Software Engineering, Systems Engineering, or related work experience.

OR

Master's degree in Computer Science, Engineering, Information Systems, or related field and 5+ years of Hardware Engineering, Software Engineering, Systems Engineering, or related work experience.

OR

PhD in Computer Science, Engineering, Information Systems, or related field and 4+ years of Hardware Engineering, Software Engineering, Systems Engineering, or related work experience.

Responsibilities:

  • Architect, develop, and maintain the ML platform to support training of Large Language Models.
  • Design and implement scalable and reliable infrastructure solutions for GPU clusters.

  • Collaborate with data scientists and software engineers to define requirements and ensure seamless integration of ML workflows into the platform.
  • Optimize the platform's performance and scalability, considering factors such as GPU resource utilization, data ingestion, model training, and deployment.
  • Monitor and troubleshoot system performance, identifying and resolving issues to ensure the availability and reliability of the ML platform. Create and maintain dashboards for tracking performance and utilization.
  • Implement and maintain CI/CD pipelines for automated model training, evaluation, and deployment.
  • Stay updated with the latest advancements in MLOps/DevOps, distributed computing, and GPU acceleration technologies, and proactively propose and implement improvements to enhance the ML platform.
  • Provide technical guidance and mentorship to junior team members.

Ideal Candidates will demonstrate the following:

  • Proven experience as an MLOps/DevOps Engineer or similar role, with a focus on large-scale ML infrastructure and GPU clusters.
  • Strong expertise in configuring and optimizing clusters for deep learning workloads, ideally NVIDIA DGX including Bright Cluster Manager, Megatron, etc.
  • Proficient in using the Slurm scheduler or similar job scheduling systems.
  • Solid programming skills in Python or another scripting language.
  • Experience with relevant ML frameworks (e.g., TensorFlow, PyTorch).
  • In-depth understanding of distributed computing, parallel computing, and GPU acceleration techniques (e.g., DeepSpeed).
  • Advanced-level skills with containerization technologies, specifically Docker including components, installation, deployment, scaling workflow, Docker images, Docker containers, port mapping, logging, creating a CI pipeline, etc.
  • Advanced-level skills with orchestration tools, for example Kubernetes.
  • Advanced-level skills with visualization and monitoring tools (e.g., Grafana), including experience using those tools for communication, cluster optimization, and root-cause analysis of infrastructure issues.
  • Advanced-level skills with Linux.
  • Experience with CI/CD pipelines and automation tools for ML workflows (e.g., Jenkins, GitLab CI).
  • Strong problem-solving skills and the ability to troubleshoot complex technical issues.
  • Excellent communication and collaboration skills to work effectively within a cross-functional team.

Preferred Skills:

  • Experience with training and deploying Large Language Models (LLMs).
  • Knowledge of ML model optimization techniques and memory management on GPUs.
  • Understanding of security and compliance requirements in ML infrastructure.

Although this role has some expected minor physical activity, this should not deter otherwise qualified applicants from applying. If you are an individual with a physical or mental disability and need an accommodation during the application/hiring process, please call Qualcomm’s toll-free number found here for assistance. Qualcomm will provide reasonable accommodations, upon request, to support individuals with disabilities as part of our ongoing efforts to create an accessible workplace.

Qualcomm is an equal opportunity employer and supports workforce diversity.

To all Staffing and Recruiting Agencies: Our Careers Site is only for individuals seeking a job at Qualcomm. Staffing and recruiting agencies and individuals being represented by an agency are not authorized to use this site or to submit profiles, applications or resumes, and any such submissions will be considered unsolicited. Qualcomm does not accept unsolicited resumes or applications from agencies. Please do not forward resumes to our jobs alias, Qualcomm employees or any other company location. Qualcomm is not responsible for any fees related to unsolicited resumes/applications.

EEO Employer: Qualcomm is an equal opportunity employer; all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or any other protected classification.

Qualcomm expects its employees to abide by all applicable policies and procedures, including but not limited to security and other requirements regarding protection of Company confidential information and other confidential and/or proprietary information, to the extent those requirements are permissible under applicable law.

Pay range:

$181,000.00 - $272,000.00

The above pay scale reflects the broad, minimum to maximum, pay scale for this job code for the location for which it has been posted. Even more importantly, please note that salary is only one component of total compensation at Qualcomm. We also offer a competitive annual discretionary bonus program and opportunity for annual RSU grants (employees on sales-incentive plans are not eligible for our annual bonus). In addition, our highly competitive benefits package is designed to support your success at work, at home, and at play. Your recruiter will be happy to discuss all that Qualcomm has to offer!

If you would like more information about this role, please contact Qualcomm Careers.

Company Info.

Qualcomm

Qualcomm is an American multinational corporation headquartered in San Diego, California, and incorporated in Delaware. It creates semiconductors, software, and services related to wireless technology. It owns patents critical to the 5G, 4G, CDMA2000, TD-SCDMA and WCDMA mobile communications standards.

  • Industry
    Semiconductors, Computer hardware, Computer software
  • No. of Employees
    45,000
  • Location
    San Diego, California, USA


Qualcomm is currently hiring for MLOps Engineer roles in Santa Clara, CA, USA, with a base salary range of $181,000 - $272,000 per year.
