
Job Description
What You’ll Do:
- Design and implement scalable, reliable systems for training orchestration, artifact tracking, and model registration across multiple data centers and cloud regions.
- Improve and streamline ML experimentation workflows by integrating tooling like Ray, Airflow, and interactive notebooks.
- Develop APIs and services that enable applied scientists to seamlessly launch, debug, and track training jobs.
- Ensure reproducibility and traceability by building robust version control and metadata systems for model artifacts.
- Collaborate with AI infra teams (LLMObs, Compute, etc.) to deliver consistent user experiences and integrated telemetry.
- Mentor engineers and help drive architectural decisions and technical standards.
Who You Are:
- You have 6+ years of experience in backend, distributed systems, or platform engineering roles.
- You have worked on ML platforms or infrastructure, ideally supporting real-world training or model lifecycle workflows.
- You’re comfortable designing APIs, managing data at scale, and architecting systems for reliability and observability.
- You’re fluent in Python or Go and have experience with cloud-native tools (e.g., Kubernetes, object stores, queueing systems).
- You’re comfortable navigating cross-functional environments and translating scientific requirements into reliable systems.
- Bonus points: experience with model registries, experiment tracking tools (e.g., MLflow, Weights & Biases), or distributed training frameworks.
Datadog values people from all walks of life. We understand not everyone will meet all the above qualifications on day one. That's okay. If you’re passionate about technology and want to grow your skills, we encourage you to apply.
Benefits and Growth:
- New hire stock equity (RSUs) and employee stock purchase plan (ESPP)
- Continuous professional development, product training, and career pathing
- Intradepartmental mentor and buddy program for in-house networking
- An inclusive company culture, with the ability to join our Community Guilds (Datadog employee resource groups)
- Access to Inclusion Talks, our internal panel discussions
- Free, global mental health benefits for employees and dependents age 6+
- Competitive global benefits
Benefits and Growth listed above may vary based on the country of your employment and the nature of your employment with Datadog.
Company Info:
Datadog, Inc.
Datadog is the essential monitoring platform for cloud applications. We bring together data from servers, containers, databases, and third-party services to make your stack entirely observable. These capabilities help DevOps teams avoid downtime, resolve performance issues, and ensure customers are getting the best user experience.
- Industry: Information Technology
- No. of Employees: 3,400
- Location: New York, NY, USA
Datadog, Inc. is currently hiring for Senior Software Engineer roles in Paris, France, with an average base salary of €77,600 - €127,500 / year.