NVIDIA Senior Site Reliability Engineer Salary?

The base salary range for a Senior Site Reliability Engineer at NVIDIA is $164,000 - $316,250 per year.

Senior Site Reliability Engineer Salary in Santa Clara, CA, USA

The base salary range for a Senior Site Reliability Engineer in Santa Clara, CA, USA is $164,000 - $316,250 per year.

Technical Skills Required to Become a Senior Site Reliability Engineer?

To work as a Senior Site Reliability Engineer, you must have a degree in a relevant discipline, such as Artificial Intelligence (AI) or Machine Learning.

Looking for a Senior Site Reliability Engineer in Santa Clara, CA, USA?

NVIDIA is currently hiring a Senior Site Reliability Engineer in Santa Clara, CA, USA, and is looking for candidates with 8-10 years of relevant skills and work experience.

Is NVIDIA hiring Senior Site Reliability Engineers now?

NVIDIA seeks to hire a qualified Senior Site Reliability Engineer with 8-10 years of experience.

Posted on: 16 Jan 2024

Senior Site Reliability Engineer

NVIDIA

Job Type
Full Time
Experience
8-10 years
Salary
$164,000 - $316,250 / Year
Location

Santa Clara, CA, USA
Job Function

Senior Site Reliability Engineer
Industry
Information Technology
Qualification

Degree in Artificial intelligence- AI
Degree in Machine Learning

Key Skills

Artificial intelligence, AWS, Azure, Computer Graphics, Continuous Integration & Continuous Delivery - CI/CD, Deep Learning, Effective communication skills, Generative AI, Google Cloud Platform (GCP), GPUs, KPIs, Leadership Skill, Machine learning techniques, Presentation skills, Python Programming, PyTorch, Site Reliability Engineering (SRE), TensorFlow, TensorRT

Job Description

We are now looking for a Sr. Site Reliability Engineer (SRE)! NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s motivated by outstanding technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. NVIDIA is at the forefront of generative AI models, from language to images. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work.

NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join its cloud service team for supporting, triaging, and building generative AI-powered visual applications. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. We live SRE practices that are key to product quality, such as limiting time spent on reactive operational work, blameless postmortems, proactive identification of potential outages, and iterative improvements, which all make for interesting and dynamic day-to-day work. The person in this position will be responsible for Service Response and workflow and will drive tools/service development to maintain and improve service SLOs. We partner with Service Owners to drive the reliability of the service.

What you will be doing:

Support and work on groundbreaking Generative AI inferencing workloads running in a globally-distributed heterogeneous environment spanning 60+ edge locations plus all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.

Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.

Monitor and support critical high-performance, large-scale services running across multiple clouds.

Participate in the triage & resolution of complex infra-related issues.

Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.

Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

Practice balanced incident response and blameless postmortems.

Be part of an on-call rotation to support production systems.

Lead significant production improvement around tooling, automation, and process.

Architect, design, and code using your expertise to optimize, deploy and productize services.

What we need to see:

8+ years of experience operating & owning end-to-end availability and performance of mission-critical services in a live-site production environment, either as an SRE or Service Owner.

Bachelor's degree or equivalent experience.

Solid understanding of containerization and microservices architecture, K8s. Excellent understanding of the Kubernetes ecosystem and best practices with K8s.

Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.

Technical leadership beyond development that includes scoping, requirements capturing, leading and influencing multiple teams of engineers on broad development initiatives.

Lead significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD, auto-remediation, alert correlation).

Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and experience configuring them for highly complex services.

Experience with the ELK and Prometheus stacks as a power user and administrator.

Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.

Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

Exposure to containerization and cloud-based deployments for AI models.

Excellent coding skills in Python, Go, or a similar language.

Understanding of Deep Learning / Machine Learning / AI.

Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.

Excellent communication, presentation, social, and analytical skills; the ability to communicate complex concepts clearly and persuasively across different audiences and varying levels of the organization.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 164,000 USD - 316,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Company Info.

NVIDIA

NVIDIA’s invention of the GPU sparked the PC gaming market. The company’s pioneering work in accelerated computing—a supercharged form of computing at the intersection of computer graphics, high performance computing and AI—is reshaping trillion-dollar industries, such as transportation, healthcare and manufacturing, and fueling the growth of many others.

Industry

Cloud computing, Video games, Computer software, Semiconductors, Computer hardware, Consumer electronics, Artificial intelligence
No. of Employees

22,473
Location

2701 San Tomas Expressway, Santa Clara, CA 95050, USA
Website

https://www.nvidia.com/