Principal AI Operations Engineer

The Walt Disney Company

Job Description

Within Disney Enterprise Technology, the Disney Technology Operations Command Center (DTOC) is a 24x7x365 critical services operations center responsible for service availability, with a primary focus on rapidly responding to, correlating, and reducing the impact of outages. We are accountable for identifying and facilitating the resolution of service-impacting events, and for collaborating with other technology teams to prevent future impact through proactive event management and incident and problem analysis. Our DTOC team drives the execution of the major incident process, including communication to executives and key partners, and owns and implements Crisis Management plans and processes.

We are seeking a Principal AI Operations (AIOPS) Engineer to join our team! In this role, you will work collaboratively across all levels and technology organizations, using your experience to apply innovative AIOPS techniques and platforms to measurably improve the reliability of Disney’s systems, applications, and infrastructure – all while enabling predictive and self-healing capabilities for our operations organization. This role will also drive improved technology stability while increasing operational efficiency.

Role Responsibilities:

  • Collaborating with multi-functional and IT Infrastructure and Service teams to create standardized operational data models, which will facilitate the collection of relevant telemetry data, enable valuable insights to be derived from complex datasets, and optimize IT operations
  • Designing, developing, and deploying AIOPS strategies to enhance IT infrastructure monitoring, incident prediction, and automated resolution capabilities
  • Implementing, maintaining, and optimizing AIOPS solutions to efficiently capture operational, application, and infrastructure telemetry data at scale from a diverse range of sources
  • Analyzing sophisticated data sets to uncover trends, patterns, and anomalies, applying insights to develop proactive operational reliability solutions and improvements for IT infrastructure management
  • Applying Machine Learning and AIOPS tools to significantly reduce IT operational noise and improve real-time detection and prediction of critical operational events
  • Streamlining incident resolution by improving and accelerating root cause analysis, thereby reducing the time required to address issues
  • Designing and implementing operational automation platforms to boost service availability through the use of auto-remediation runbooks
  • Integrating with, and developing full interoperability across, various operations management platforms, including service management and privileged access management systems
  • Building comprehensive documentation and training materials to ensure smooth knowledge transfer and effective use of AIOPS technologies across the organization
  • Staying ahead of emerging trends and advancements in AIOPS, machine learning, and related fields to ensure the organization remains at the forefront of technology adoption

Basic Requirements:

  • Proven experience with, and understanding of, AIOPS technologies and platforms (BigPanda, Moogsoft)
  • Expertise in IT Operations Telemetry platforms (Datadog, New Relic, Splunk, AppDynamics)
  • Deep background and understanding of Machine Learning: developing, training, and applying machine learning models across large operational datasets
  • Proven experience as an AIOPS Engineer, SRE, or Data Scientist, preferably in an enterprise IT operations environment
  • Excellent knowledge of IT observability and operations event management solutions, and of telemetry data and its management
  • Proficiency in defining, implementing, and measuring operational service level indicators and objectives
  • Solid understanding of IT operations, including infrastructure, networks, applications, and services
  • Expertise in data visualization tools (R, Grafana) and programming/scripting languages (Python, R, Java)
  • Bachelor's degree in Computer Science, Data Science, Applied Mathematics, AI, ML, or related; or equivalent work experience

Preferred:

  • Technical designation(s) in operating systems, virtualization, or hardware platforms
  • ITIL v3 Certification
  • Master’s Degree in IT Systems, Business Administration (MBA), or a technical field

Company Info.

The Walt Disney Company

From classic animated features and exhilarating theme park attractions to cutting-edge sports coverage and the hottest shows on television, The Walt Disney Company has been making magic since 1923, creating unforgettable stories that connect with audiences around the world. And we’re just getting started! Disney Streaming Services is a business unit within Disney’s Direct-to-Consumer and International (DTCI) segment that oversees all consumer


The Walt Disney Company is currently hiring for Principal AI Engineer jobs in Lake Buena Vista, FL, USA, with an average base salary of $110,000 - $220,000 per year.
