Research Data Science Intern - Tabular Data Augmentation

Dataiku

Job Description

We are looking for a research intern to join Dataiku’s Lab in our Paris office for a 6-month internship on Tabular Data Augmentation (TDA). 

Data augmentation has been used successfully to improve the predictive power and generalisation of deep neural networks in visual tasks. However, the difficulty of defining invariances for tabular data, as well as of handling categorical variables, has long limited the use of TDA. Nevertheless, TDA in the latent space, based on generative models such as Variational Auto-Encoders or Generative Adversarial Networks, seems to overcome these difficulties by providing realistic synthetic samples, which are especially useful for augmenting minority classes.

In imbalanced classification tasks, or in settings where some groups are poorly represented, TDA should help improve local performance on those classes or groups. With this internship we aim to explore generative approaches to TDA and compare them to traditional oversampling techniques in imbalanced classification settings.
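As a point of reference for the "traditional oversampling" side of that comparison, here is a minimal sketch of random oversampling in plain Python. The function name `random_oversample` and the toy rows are illustrative assumptions, not part of the internship description; a real benchmark would more likely rely on an established library.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy table mixing a numeric and a categorical column (made-up data).
rows = [(0.1, "a"), (0.2, "b"), (0.3, "a"), (0.9, "c")]
labels = [0, 0, 0, 1]
aug_rows, aug_labels = random_oversample(rows, labels)
print(Counter(aug_labels))
```

Because this baseline only repeats existing rows, it adds no new information about the minority class, which is the gap generative approaches are meant to address.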

Several models have been proposed to generate synthetic data, such as TVAE [1], CTGAN [1], CopulaGAN, MixupGAN and GReaT [2], along with various strategies to augment data both in the input space and in the latent space (TailCalibration [3]), but an exhaustive benchmark on imbalanced data is currently missing.

To build a trustworthy TDA method, it is critical to provide practitioners with quality metrics that show how similar the synthetic distributions are to the real data. Because synthetic data is generated automatically, it may rely on perturbations that change the class of a sample and thus harm the training of ML models.
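To make the idea of a similarity metric concrete, the sketch below compares one real column with its synthetic counterpart. It assumes two common per-column measures that the posting does not itself name: the two-sample Kolmogorov-Smirnov statistic for numeric columns and the total variation distance for categorical ones. Both function names and the toy columns are illustrative.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(real), sorted(synth)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def total_variation(real, synth):
    """Total variation distance between the category frequencies
    of two columns (0 = same frequencies, 1 = no shared mass)."""
    p, q = Counter(real), Counter(synth)
    return 0.5 * sum(
        abs(p[k] / len(real) - q[k] / len(synth)) for k in set(p) | set(q)
    )


# Made-up columns: the numeric ones do not overlap at all,
# the categorical ones differ in their proportions.
print(ks_statistic([1, 2, 3, 4], [5, 6, 7, 8]))
print(total_variation(["a", "a", "b", "b"], ["a", "a", "a", "b"]))
```

Per-column scores like these ignore dependencies between columns and between features and the label, so they would be a starting point rather than a complete check against class-changing perturbations.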

This internship focuses on designing optimal strategies to perform data augmentation in imbalanced tabular classification tasks. We will first study state-of-the-art TDA models and how TDA can improve the local accuracy of ML models trained on real and synthetic data. In a second step, the intern will design the best TDA strategy, together with quality metrics that ensure synthetic and real data have similar distributions. This study will be used to recommend a TDA tool to be included in Dataiku’s data science software.
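One possible reading of "local accuracy" is accuracy measured separately on each class, that is, per-class recall. The sketch below assumes that reading; the helper name `per_class_recall` and the toy predictions are hypothetical.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Made-up predictions: class 1 is the minority class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
recalls = per_class_recall(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(recalls, overall)
```

In this toy case the overall accuracy looks good while half of the minority class is missed, which is why a benchmark on imbalanced data would report per-class figures alongside the global one.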

Your mission will be to:

  • Get familiar with the domain
  • Run a thorough benchmark of TDA strategies on datasets with varying degrees of imbalance
  • Define and validate quality metrics of synthetic/real data similarity
  • Identify/define a robust TDA strategy and check metrics

You are our ideal candidate if:

  • You are eager to get your hands dirty and dive into coding
  • You know that bagging and boosting trees is not about gardening

Ideal technical skills:

  • Good understanding of parametric machine learning algorithms and their optimisation
  • Good experience with Python development; alternatively experience with an object-oriented language such as Java or Scala
  • Some experience working with deep learning frameworks, especially Keras or PyTorch, for both supervised and unsupervised text/tabular learning

Company Info.

Dataiku

Dataiku is an artificial intelligence (AI) and machine learning company which was founded in 2013. In December 2019, Dataiku announced that CapitalG - the late-stage growth venture capital fund financed by Alphabet Inc. - joined Dataiku as an investor and that it had achieved unicorn status, valued at $1.4 billion. Dataiku currently employs more than 500 people worldwide between offices in New York, Paris, London, Munich, Sydney, Singapore, and D

  • Industry
    Information Technology
  • No. of Employees
    600
  • Location
    New York City, NY, United States


Dataiku is currently hiring for Research Data Science Internship roles in Paris, France, with an average base salary of €60,000 - €101,000 / Year.
