Java Programming, Python Programming, C++, C Programming, SQL, Cloud computing, Scala Programming, Machine learning techniques, Data science techniques, MATLAB Programming, PyTorch, TensorFlow, R Programming
We are looking for a research intern to join Dataiku’s Lab in our Paris office for a 6-month internship on Tabular Data Augmentation (TDA).
Data augmentation has been used successfully to improve the predictive power and generalisation of deep neural networks in visual tasks. However, the difficulty of defining invariances for tabular data, as well as of dealing with categorical variables, has long limited the use of TDA. Nevertheless, TDA in the latent space, based on generative models such as Variational Auto-Encoders or Generative Adversarial Networks, seems to overcome these difficulties by providing realistic synthetic samples, which are especially useful for augmenting minority classes.
In imbalanced classification tasks, or in settings where some groups are poorly represented, TDA should help improve local performance on those classes or groups. With this internship, we aim to explore generative approaches to TDA and compare them with traditional oversampling techniques in imbalanced classification settings.
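As an illustration of the kind of traditional oversampling baseline mentioned above, here is a minimal sketch of SMOTE-style interpolation between minority-class neighbours. It is not taken from the posting: the function name `smote_like_oversample` and its parameters are hypothetical, and it assumes purely numeric features (categorical variables, the difficulty noted earlier, are not handled).

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours.

    Hypothetical helper for illustration; assumes numeric features only.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new row.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Each synthetic row lies on the segment between base and neighbour.
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])
```

Because each synthetic row is a convex combination of two real minority rows, it stays inside the range of the observed data. Generative approaches would instead sample from a learned model of the minority distribution.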
Several models have been proposed to generate synthetic data, such as TVAE [1], CTGAN [1], CopulaGAN, MixupGAN and GReaT [2], along with various strategies for augmenting data in both the input space and the latent space (TailCalibration [3]). However, an exhaustive benchmark on imbalanced data is currently missing.
To build a trustworthy TDA method, it is critical to provide practitioners with quality metrics that show how similar the synthetic distributions are to the real data. Because synthetic data is generated automatically, it may be based on perturbations that change the class of a sample, which would be harmful for training ML models.
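One simple way to quantify the similarity between real and synthetic data is to compare marginal distributions column by column. The sketch below is an assumption for illustration, not a metric specified in the posting: it uses the two-sample Kolmogorov-Smirnov statistic for a numeric column and the total variation distance for a categorical column. Both function names are hypothetical, and both scores lie in [0, 1], with 0 meaning identical empirical marginals.

```python
from collections import Counter

import numpy as np


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic for one numeric column."""
    real = np.sort(np.asarray(real, dtype=float))
    synth = np.sort(np.asarray(synth, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real, synth])
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_synth = np.searchsorted(synth, grid, side="right") / len(synth)
    # The statistic is the largest gap between the two CDFs.
    return float(np.abs(cdf_real - cdf_synth).max())


def total_variation(real, synth):
    """Total variation distance between category frequencies of one column."""
    p, q = Counter(real), Counter(synth)
    n_p, n_q = len(real), len(synth)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))
```

Marginal checks like these do not detect broken dependencies between columns or samples whose class has changed, so they would be only one component of a quality report.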
This internship focuses on designing optimal strategies for data augmentation in imbalanced tabular classification tasks. We will first study state-of-the-art TDA models and how TDA can improve the local accuracy of ML models trained on real and synthetic data. In a second step, the intern will design the best TDA strategy, together with quality metrics that ensure synthetic and real data have similar distributions. This study will be used to recommend a TDA tool to be included in Dataiku’s data science software.
Your mission will be to:
You are our ideal candidate if:
Ideal technical skills:
Dataiku is an artificial intelligence (AI) and machine learning company which was founded in 2013. In December 2019, Dataiku announced that CapitalG - the late-stage growth venture capital fund financed by Alphabet Inc. - joined Dataiku as an investor and that it had achieved unicorn status, valued at $1.4 billion. Dataiku currently employs more than 500 people worldwide between offices in New York, Paris, London, Munich, Sydney, Singapore, and D