Site Reliability Engineer

Adobe

Job Description

We're looking for an outstanding, hands-on SRE for Adobe's AI inference platform, Adobe Firefly. You will work closely with the Engineering teams on building, scaling, and securing the AI superhighway. This enables Adobe product teams to easily manage and deploy Machine Learning capabilities and Generative AI features used by Adobe client applications. The Applied Research groups from Adobe Research and other App Teams in Adobe will deploy thousands of models onto this platform in a variety of lifecycle stages (early research, development, productization, optimization, etc.). This platform will offer ML model serving at scale, with high cost efficiency, on a wide variety of hardware platforms across multiple clouds.

What You'll Do

  • Engage with Firefly Engineering teams to understand their needs and goals to drive the platform's reliability.
  • Identify and implement methodologies and solutions to increase the reliability, scalability, security, and efficiency of our fast-developing Generative AI platform.
  • Implement and maintain robust monitoring, alerting, and incident response to ensure the highest level of uptime and Quality of Service (QoS) to Adobe’s customers through operational excellence.
  • Collaborate with cross-functional teams to identify and address reliability issues, ensuring a flawless user experience.
  • Lead incident response efforts during outages or performance degradation, and conduct thorough post-mortem analyses to prevent recurrence.
  • Develop and maintain automated deployment, scaling, and configuration management solutions using current technologies (e.g., Kubernetes, Helm, and ArgoCD).
  • Champion the use of infrastructure-as-code practices to streamline and optimize operational processes.
  • Collaborate with stakeholders to forecast capacity needs based on usage trends, ensuring the scalability of Adobe Firefly to meet growing demands.
  • Define service level objectives (SLOs) and service level indicators (SLIs) to represent and measure service quality.
  • Support and maintain globally distributed, multi-cloud (public and/or private) environments.
  • Automate common, repeatable tasks at a large scale to streamline operational procedures.
  • Identify areas to improve service resiliency through techniques such as chaos engineering and performance/load testing.
  • Collaborate within the global Reliability organization to drive Developer Platforms' core mission of delivering better software faster.

What You’ll Need to Succeed

  • Experience in building and scaling distributed systems, as well as experience with containerization and orchestration technologies like Kubernetes.
  • Strong communication and collaboration skills, including the ability to build strong relationships with internal customers and external partners.
  • Dedication to teamwork, self-organization, and continuous improvement.
  • Strong analytical and problem-solving skills, with the ability to think strategically and make data-driven decisions.
  • Experience with monitoring and observability tools (e.g., Cortex, Prometheus, Grafana).
  • A passion for staying up to date with the latest trends and technologies in AI and Machine Learning, especially in operating AI & ML in public cloud environments.
  • Production-level expertise with containerization technologies (e.g., Docker, Kubernetes) and proven understanding of modern, continuous development techniques and pipelines (IaC, CI/CD, ArgoCD, Git).
  • Fundamental programming skills, ideally with practical experience in Python.
  • Fundamental knowledge of web services and related technologies (e.g., HTTP, JSON, Envoy).
  • Familiarity with Generative AI frameworks and libraries (e.g., TensorFlow, PyTorch) is highly desirable. An understanding of AI/ML, including ML frameworks, public cloud, and commercial AI/ML solutions is a plus.
  • A Bachelor's or Master's degree in Computer Science, Electrical Engineering, a related field, or equivalent industry experience.

Our compensation reflects the cost of labor across several U.S. geographic markets, and we pay differently based on those defined markets. The U.S. pay range for this position is $108,000 - $198,500 annually. Pay within this range varies by work location and may also depend on job-related knowledge, skills, and experience. Your recruiter can share more about the specific salary range for the job location during the hiring process.

At Adobe, for sales roles, starting salaries are expressed as total target compensation (TTC = base + commission), and short-term incentives are in the form of sales commission plans. For non-sales roles, starting salaries are expressed as base salary, and short-term incentives are in the form of the Annual Incentive Plan (AIP).

In addition, certain roles may be eligible for long-term incentives in the form of a new hire equity award.

Adobe is proud to be an Equal Employment Opportunity and affirmative action employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other applicable characteristics protected by law.

Adobe aims to make Adobe.com accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, email accommodations@adobe.com or call (408) 536-3015.

Company Info

Adobe

Adobe is the global leader in digital media and digital marketing solutions. Our creative, marketing and document solutions empower everyone – from emerging artists to global brands – to bring digital creations to life and deliver immersive, compelling experiences to the right person at the right moment for the best results. In short, Adobe is everywhere, and we’re changing the world through digital experiences.

  • Industry
    Media, Artificial intelligence, Computer software
  • No. of Employees
    25,988
  • Location
    San Jose, CA, USA

Adobe is currently hiring for Site Reliability Engineer jobs in Lehi, UT, USA, with an average base salary of $108,000 - $198,500 per year.