Aim
The goal is to design and validate AI solutions that drastically reduce the need for large amounts of labeled data and demonstrate strong generalization, so that companies with little labeled data can offer AI solutions that are competitive with those of the major players (GAFAM).
Stakes and challenges
The performance of AI-based systems relies mainly on supervised learning from very large amounts of carefully annotated data. However, problems related to data collection (existence, access, frequency of occurrence, lack of representativeness) make it difficult to build large databases in many domains, and the labeling process itself is costly in both time and money. It is therefore important to develop learning methods that minimize the need for annotation or that artificially increase the amount of annotated data.
Proposed solutions: weakly supervised learning
Reducing the amount of labeled data
- Pre-training with self-supervised learning, followed by fine-tuning on the target task (see the pretext-task sketch after this list)
- Semi-supervised learning, using data both with and without labels (see the self-training sketch after this list)
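
A minimal sketch of the pretext-task idea, assuming PyTorch: the encoder is pre-trained on unlabeled images by predicting which of four rotations was applied (a RotNet-style task, Gidaris et al. 2018), then fine-tuned on a small labeled set. The architecture, shapes, and hyperparameters below are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn as nn

# Tiny encoder; a real system would use a ResNet or ViT backbone.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# --- Pretext phase: predict which of 4 rotations was applied (labels are free).
rot_head = nn.Linear(16, 4)
opt = torch.optim.Adam(list(encoder.parameters()) + list(rot_head.parameters()))
loss_fn = nn.CrossEntropyLoss()
unlabeled = torch.rand(64, 3, 32, 32)  # stand-in for unlabeled images

for _ in range(10):
    k = torch.randint(0, 4, (unlabeled.size(0),))  # rotation class per image
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                           for img, r in zip(unlabeled, k)])
    loss = loss_fn(rot_head(encoder(rotated)), k)
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Fine-tuning phase: reuse the pre-trained encoder on the small labeled set.
clf_head = nn.Linear(16, 10)  # e.g. 10 target classes
ft_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(clf_head.parameters()), lr=1e-4)
labeled_x = torch.rand(16, 3, 32, 32)
labeled_y = torch.randint(0, 10, (16,))

for _ in range(10):
    loss = loss_fn(clf_head(encoder(labeled_x)), labeled_y)
    ft_opt.zero_grad()
    loss.backward()
    ft_opt.step()
```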
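For the semi-supervised case, a self-training sketch assuming scikit-learn: unlabeled samples are marked with -1, and SelfTrainingClassifier iteratively pseudo-labels the points the base model is most confident about. The 90% label-hiding rate is an arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Hide 90% of the labels to simulate a mostly-unlabeled dataset (-1 = unlabeled).
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1

# The wrapper iteratively pseudo-labels unlabeled points the base model
# predicts with probability above the threshold, then retrains.
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y_partial)
print("accuracy on full ground truth:", model.score(X, y))
```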
Improving the learning process
- Active learning (oracle-in-the-loop), where the model selects which samples are worth sending to a human annotator (see the uncertainty-sampling sketch below)
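
A minimal uncertainty-sampling loop, assuming scikit-learn; the oracle is simulated here by held-back ground-truth labels, whereas in practice each query would go to a human annotator. The seed-set size and query budget are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X), size=20, replace=False))  # small seed set
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression()
for _ in range(50):  # annotation budget: 50 oracle queries
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # Query the pool sample the current model is least confident about.
    query = pool[int(np.argmax(1 - proba.max(axis=1)))]
    labeled.append(query)  # the oracle supplies y[query]
    pool.remove(query)

print("accuracy:", model.score(X, y))
```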
"Artificial" increase in data quantity
- Transformations of existing samples (data augmentation; see the sketch after this list)
- Data generation by simulation or by generative AI models (a toy simulator sketch follows this list)
- Use of data with lower-quality labels:
  ◦ Multiple labels per sample (consensus among annotators; see the majority-vote sketch after this list)
  ◦ Labels with errors (noisy-label learning)
  ◦ Weak labels (partial annotations)
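
For the transformation bullet, a sketch using standard torchvision transforms: each pass through the pipeline yields a new randomized, label-preserving variant of the same image. The specific transforms and parameters are illustrative assumptions.

```python
import torch
from torchvision import transforms

# Randomized, label-preserving transformations chained into one pipeline.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=32, scale=(0.8, 1.0)),
])

image = torch.rand(3, 32, 32)  # stand-in for a real training image
# Each call draws new random parameters, so one image yields many variants
# that all keep the original label.
variants = [augment(image) for _ in range(4)]
```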
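For generation by simulation, a toy example: when the data-producing process can be modeled, labeled examples come for free because the label is known by construction. The sensor-like simulator below is entirely hypothetical.

```python
import numpy as np

def simulate_sensor(n, rng):
    """Toy simulator: the class sets the signal frequency, so labels are
    known by construction and annotation costs nothing."""
    labels = rng.integers(0, 2, size=n)
    t = np.linspace(0, 1, 64)
    freq = np.where(labels == 0, 3.0, 7.0)[:, None]
    signals = np.sin(2 * np.pi * freq * t) + 0.1 * rng.standard_normal((n, 64))
    return signals, labels

X_synth, y_synth = simulate_sensor(10_000, np.random.default_rng(0))
```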
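For the multiple-labels case, a majority-vote consensus sketch in pure NumPy; the three-annotator matrix is synthetic. Aggregating redundant noisy labels filters out part of the annotation errors.

```python
import numpy as np

# Rows = samples, columns = labels from three annotators who sometimes disagree.
annotations = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [2, 1, 2],
    [1, 2, 0],  # full disagreement: argmax breaks the tie (here, label 0)
])
# Most frequent label per row becomes the consensus label.
consensus = np.array([np.bincount(row).argmax() for row in annotations])
print(consensus)  # [1 0 2 0]
```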
Domain Adaptation
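
A simple instance of domain adaptation is CORAL (Sun et al., 2016): align the second-order statistics of labeled source-domain features with those of the unlabeled target domain before training. The sketch below, assuming NumPy and SciPy, uses synthetic features.

```python
import numpy as np
from scipy import linalg

def coral(Xs, Xt, eps=1e-5):
    """Whiten source features, then re-color them with the target covariance."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    Xs_white = Xs @ linalg.fractional_matrix_power(Cs, -0.5)
    return np.real(Xs_white @ linalg.fractional_matrix_power(Ct, 0.5))

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (200, 8))  # labeled source-domain features
Xt = rng.normal(0.5, 2.0, (200, 8))  # unlabeled target-domain features
Xs_aligned = coral(Xs, Xt)           # train any classifier on Xs_aligned
```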