Doctoral subjects | TRAIL Factory

Data augmentation

Description

In today's world, artificial intelligence (AI) is ubiquitous, but its development often runs into a major obstacle: the lack of data. The trend toward ever larger models requires growing quantities of training data. Faced with this shortage, synthetic data generation appears to be a promising solution.
This research area explores how various generative AI systems can help address current machine learning challenges. Particular emphasis is placed on studying new architectures, such as diffusion models, and on developing quality metrics to validate the relevance and effectiveness of the generated synthetic data.

Tasks in 2023:

Data augmentation with Stable Diffusion + ControlNet

Object detection
Medical image classification

Tasks in 2024:

a) Common data

One group of researchers focused on data augmentation tasks for general-purpose applications such as the detection of cars, people, or buildings, as well as facial emotion recognition. Notable achievements include:
- the presentation of the paper "CIA: Controllable Image Augmentation Framework Based on Stable Diffusion" at the international conference **MIPR 2024** in the United States.
- a project entitled **Data Augmentation for Face Emotion Recognition**, carried out at the **TRAIL Summer Workshop 2024 in Lisbon**. This project, co-authored by Multitel, aimed to create a dataset balanced in age and ethnicity for a facial emotion classification task.
- the continued improvement of the image generation framework started at the **TRAIL Workshop 2023 in Nantes**, notably by integrating methods that filter the data generated by the framework so that only good-quality data is kept.

b) Medical data

A second group worked on medical images from different modalities such as MRI, radiography, ultrasound, histopathology images, and positron emission tomography (PET scan). Because of their complex and distinct nature, medical data required specific and more classical (non-GenAI) data augmentation approaches. This group worked on a project entitled "**Benchmarking Data Augmentation Techniques Across Various Medical Imaging Modalities**" at the Lisbon workshop.
A paper is being written to compare the different data augmentation approaches according to the type of medical image.
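
As an illustration of what "classical" (non-generative) augmentation can mean, the sketch below applies simple geometric transforms to a grayscale image stored as a list of rows. The function names and the choice of transforms are assumptions made for this example; the source does not list the techniques that were actually benchmarked, and some transforms (for instance flips) may not preserve the label for every medical modality.

```python
from typing import List

# A grayscale image represented as rows of pixel intensities (illustrative type).
Image = List[List[float]]


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rot90(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the original image followed by three transformed copies."""
    return [img, hflip(img), vflip(img), rot90(img)]


if __name__ == "__main__":
    sample = [[1.0, 2.0], [3.0, 4.0]]
    for variant in augment(sample):
        print(variant)
```

Each transform returns a new image and leaves its input unchanged, so the same source image can be reused across several augmentation passes.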

Uncertainty Estimation in GenAI

Task: Uncertainty Estimation in QA Task

The goal of this task is to estimate the uncertainty of different Large Language Models (LLMs)
in a Question Answering (QA) task. To achieve this, the CRAG dataset—one of the most recent
datasets in this domain—is utilized. Originally released by Meta for Retrieval-Augmented
Generation (RAG) benchmarks, CRAG provides valuable classes for evaluating uncertainty
methods. The dataset categorizes questions across five domains (finance, sports, music, movies, and
open) and four temporalities (static, slow-changing, fast-changing, and real-time).
This temporal classification facilitates the study of the relationship between a model's uncertainty
and its access to external information. For instance, questions requiring real-time information are
expected to yield higher uncertainty when the model lacks access to up-to-date external data.
Uncertainty estimation methods typically rely on sampling strategies. For a single question, the
internal stochasticity of the model is leveraged to generate a finite set of answers. Subsequently,
various strategies can quantify the consistency of these answers, thus assessing the model's
uncertainty. These strategies may focus on diverse aspects, given the complexity of natural language
data. They may evaluate lexical similarity, semantic meaning, or logit probabilities, or employ
hybrid or abstract representations. Some methods are applicable to black-box models, while others
require access to intermediate states, i.e., white-box models.
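
A minimal black-box sketch of the sampling idea described above: given several answers sampled for the same question, it scores their consistency in two ways, by mean pairwise lexical (Jaccard) overlap and by the entropy of the normalized answer strings. Both scores and all function names are illustrative assumptions, not the metrics used in the project; semantic or logit-based methods would need an additional model or white-box access.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(answer.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two answers."""
    tokens_a, tokens_b = set(normalize(a).split()), set(normalize(b).split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def lexical_uncertainty(answers: List[str]) -> float:
    """One minus the mean pairwise Jaccard similarity (0 means fully consistent)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy (in nats) of the distribution of distinct normalized answers."""
    counts = Counter(normalize(a) for a in answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical samples for one question, e.g. obtained with temperature > 0.
    samples = ["Paris", "paris", "Lyon", "Paris"]
    print(lexical_uncertainty(samples), answer_entropy(samples))
```

Exact-match entropy treats paraphrases as different answers, which is one reason methods based on semantic meaning are also considered.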
This task seeks to provide a comprehensive understanding of LLM uncertainty in QA tasks by
addressing several aspects:
1. Correlation Between Uncertainty and Question Temporality: The task will investigate
this correlation when models lack access to external information.
2. Impact of RAG Techniques on Model Uncertainty: By incorporating additional context
through RAG techniques, the project will explore how such methods influence model
uncertainty.
3. Comparison of Uncertainty Estimation Methods: A systematic evaluation of uncertainty
methods, based on their requirements (white-box vs. black-box), will be conducted to
determine whether greater access to model internals improves uncertainty estimation.
Ultimately, this enhanced understanding may lead to the development of a novel metric for more
reliable uncertainty estimation, grounded in the insights gained from this research.

Explainable recommender system

Task: Empirical Study of LLM-Enhanced Explainable Recommendation From an HCI and ML Perspective

Recommender systems (RS) comprise a set of information filtering techniques whose purpose is to
recommend to a user a selection of items from a generally large corpus. These items are chosen based
on the user's preferences and characteristics, deduced from the history of their interactions with items
of the given corpus. Explainable RS additionally provide explanations of the recommendation process to
help achieve goals such as trust, acceptance, and transparency.
Our project aims to use Large Language Models (LLMs) to create clearer and
more meaningful explanations for recommendations. Our initial research, including a pilot study
conducted at TRAIL'23, showed promising results: LLM-enhanced explanations were more detailed
and engaging than those produced by traditional methods [1].
We therefore focus this year on the design and the implementation of a working prototype that allows
real-time user interaction with an explainable recommender system (providing movie
recommendations). This prototype is based on a two-component recommendation pipeline: a graph-based
explainable recommendation component and an LLM-based explanation-enhancing component.
Various methods have been investigated for both components. The prototype is available on the TRAIL
Factory [2] and on GitHub [3].
In 2025, we plan to conduct a mixed-methods evaluation of our prototype. We want to combine user-based
and heuristic-based methods to offer a more complete picture of LLM-enhanced explainable
recommendation.

____________________________________________________________________________________________

1. Albert, Julien, et al. "User Preferences for Large Language Model versus Template-Based Explanations of Movie Recommendations: A Pilot Study." arXiv preprint arXiv:2409.06297 (2024). https://arxiv.org/abs/2409.06297.
2. https://factory.trail.ac/fr/sofware-package/explainflix-interactive-recommender-system-llm-generated-explanations
3. https://github.com/balfroim/TRAIL24

Domain adaptation: a key to unlocking the potential of AI in industry

In the world of artificial intelligence (AI), one of the most exciting and promising areas is domain adaptation. But what exactly is it, and how could it benefit your company? Let us look at it in detail.
What is domain adaptation?

Domain adaptation is a machine learning technique in which a predictive model trained on one domain (the source) is adapted to perform well on a different but related domain (the target). Imagine you have a model that is excellent at recognizing dogs in images taken during the day. But what happens if you want it to recognize dogs in images taken at night? The model may struggle because the lighting conditions are different. This is where domain adaptation comes in. It helps the model adapt and perform well in the new domain (night images), even though it was initially trained on a different domain (day images).
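
To make the day/night example concrete, here is a deliberately simple sketch of one possible adaptation step: matching the mean and standard deviation of a source feature (here, a hypothetical per-image brightness value) to those observed in the target domain. This moment-matching approach and the names `domain_stats` and `align_to_target` are assumptions chosen for illustration; real domain adaptation methods are typically far richer, and the text does not prescribe a specific one.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def domain_stats(values: List[float]) -> Tuple[float, float]:
    """Summarize a domain by the mean and standard deviation of one feature."""
    return mean(values), pstdev(values)


def align_to_target(source: List[float], target: List[float]) -> List[float]:
    """Rescale source values so their mean and spread match the target domain."""
    s_mean, s_std = domain_stats(source)
    t_mean, t_std = domain_stats(target)
    # If the source has no spread, only shift it (avoids division by zero).
    scale = t_std / s_std if s_std > 0 else 1.0
    return [(x - s_mean) * scale + t_mean for x in source]


if __name__ == "__main__":
    # Hypothetical average brightness of day images (source) and night images (target).
    day = [0.6, 0.7, 0.8, 0.9]
    night = [0.10, 0.15, 0.20, 0.25]
    print(align_to_target(day, night))
```

A model trained on the aligned source values would see inputs whose basic statistics resemble the target domain, which is the intuition behind many feature-alignment techniques.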
The role of domain characterization

So how does domain characterization fit into this picture? Domain characterization is the process of understanding and describing the characteristics of a domain. It is like creating a profile for a domain. In our dog recognition example, domain characterization would consist of understanding characteristics such as the lighting conditions, the presence of other objects, the typical colors in the image, and so on. This understanding can then be used to guide the adaptation process.

Business impact

Domain adaptation can be a real advantage for companies. It makes AI models more flexible and better suited to different situations. A retailer could, for example, adapt a model trained on online purchasing behavior to predict in-store purchasing behavior. A manufacturer could adapt a model trained on one production line to optimize another. The possibilities are numerous.
Conclusion

In conclusion, domain adaptation, supported by domain characterization, can help companies fully exploit the potential of AI by making models more versatile and effective. It is a powerful tool in the field of artificial intelligence that can foster innovation and growth in many industrial sectors.

Reinforcement learning for local search methods in combinatorial optimization

In this subject, we investigate ways to accelerate local search-based algorithms for combinatorial optimization problems through reinforcement learning techniques.

Combined active learning and semi-supervised learning methods

Active learning and semi-supervised learning are important techniques when labeled data is scarce. Semi-supervised learning combines labeled and unlabeled examples to train a better classifier. Active learning is the process of prioritizing which instances in an unlabeled set should be labeled by experts in order to have the greatest impact on the training of a classifier. It can be worthwhile to use active learning in conjunction with semi-supervised learning to improve performance. More precisely, we first select a set of unlabeled examples to be labeled by experts. Then, both the labeled and the unlabeled examples are used to train the classifiers through semi-supervised learning.
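
The two-step procedure above can be sketched as a single selection step over the unlabeled pool. This example assumes least-confidence sampling for the active learning part and confidence-thresholded pseudo-labeling for the semi-supervised part; both are common choices but are assumptions here, as are the function name `split_pool` and the threshold value.

```python
from typing import Dict, List, Tuple


def split_pool(
    probs: List[List[float]], k: int, threshold: float = 0.95
) -> Tuple[List[int], Dict[int, int]]:
    """Split an unlabeled pool using the classifier's predicted probabilities.

    probs[i] holds the class probabilities predicted for unlabeled example i.
    Returns the indices to send to experts (the k least confident predictions)
    and a mapping from index to pseudo-label for confident predictions.
    """
    confidence = [max(p) for p in probs]
    # Active learning: query the examples the model is least sure about.
    by_confidence = sorted(range(len(probs)), key=lambda i: confidence[i])
    to_label = by_confidence[:k]
    queried = set(to_label)
    # Semi-supervised learning: keep confident predictions as pseudo-labels.
    pseudo_labels = {
        i: probs[i].index(confidence[i])
        for i in range(len(probs))
        if i not in queried and confidence[i] >= threshold
    }
    return to_label, pseudo_labels


if __name__ == "__main__":
    predictions = [[0.5, 0.5], [0.99, 0.01], [0.6, 0.4], [0.02, 0.98]]
    print(split_pool(predictions, k=1))
```

In a full loop, the expert labels and pseudo-labels would be added to the training set, the classifier retrained, and the selection repeated on the remaining pool.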
The objective of this internship is to implement a deep learning AL-SSL method for image classification. First, the intern will carry out a brief review of the state of the art on existing SSL and AL methods and choose the method(s) to implement, preferably based on existing open-source implementations. The next step will be to improve the AL-SSL algorithm by trying to combine several semi-supervised learning and active learning methods.

Perturbation-based XAI methods for Visual Transformers

Objectives:

We develop an XAI technique for Visual Transformers, named Transformers Input Sampling (TIS), and compare it to state-of-the-art methods (ViT-CX, G-LIME, TAM, Attention rollout, …). The comparison covers several metrics (Insertion/Deletion, Pointing Game, …) and two visual transformer networks: the vanilla Vision Transformer (ViT) and the Data-efficient Image Transformer (DeiT).
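
For readers unfamiliar with perturbation-based explanations, the sketch below shows the general principle on a toy example: each input token is replaced by a baseline value in turn, and the drop in the model's score is taken as that token's importance. This is a generic occlusion scheme, not TIS itself, whose sampling strategy is not described here; `toy_model` and its weights are hypothetical stand-ins for a real transformer.

```python
from typing import Callable, List, Sequence


def occlusion_saliency(
    tokens: Sequence[float],
    model: Callable[[List[float]], float],
    baseline: float = 0.0,
) -> List[float]:
    """Score each token by how much the model output drops when it is masked."""
    reference = model(list(tokens))
    saliency = []
    for i in range(len(tokens)):
        perturbed = list(tokens)
        perturbed[i] = baseline  # mask a single token
        saliency.append(reference - model(perturbed))
    return saliency


def toy_model(tokens: List[float]) -> float:
    """Hypothetical scorer: a fixed weighted sum standing in for a class score."""
    weights = [0.0, 2.0, 0.5, 0.0]
    return sum(w * t for w, t in zip(weights, tokens))


if __name__ == "__main__":
    print(occlusion_saliency([1.0, 1.0, 1.0, 1.0], toy_model))
```

Metrics such as Insertion/Deletion build on the same idea by removing or adding tokens in order of estimated importance and tracking how the score changes.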

Roadmap:

- Review of the state of the art: current explainability techniques for transformer networks
- Search for publicly available code on GitHub: efficient implementations of the different XAI methods and metrics
- Development of a Python framework for TIS
- Comparison and evaluation of TIS against the state of the art
- Writing of the paper
- Improvement and comparison of TIS for multimodal transformers

Expected deliverables:

- Paper submitted to an international conference
- Public GitHub repository for TIS
- Extension of TIS to multimodal transformer networks

Physical Field Prediction

Propose an ML-based methodology to predict physical (2D/3D) fields

- "Real-time" physical field prediction
- Merge heterogeneous data (experimental data, simulated data of different levels of fidelity)
- Hybridize expert knowledge with ML via Informed Neural Networks

Clustering and forecasting of time series

The main purpose of this topic is to develop, test and analyse clustering methods for residential energy consumption data to:

- Decrease the training time of time series forecasters with transfer learning
- Increase forecasting accuracy through a better understanding of consumer behaviour
- Identify and interpret energy consumption patterns in the data
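
As a rough illustration of clustering consumption data, the sketch below groups short load profiles with a plain k-means loop. The choice of k-means, the Euclidean distance, the deterministic initialization from the first k profiles, and the toy data are all assumptions for this example; the source does not name the clustering methods under study, and time series often call for dedicated distances.

```python
from typing import List, Sequence, Tuple

Profile = Sequence[float]  # e.g. consumption values over one day (illustrative)


def sq_dist(a: Profile, b: Profile) -> float:
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    profiles: List[Profile], k: int, iterations: int = 20
) -> Tuple[List[int], List[List[float]]]:
    """Cluster profiles; returns a cluster label per profile and the centroids."""
    # Deterministic (and simplistic) initialization: the first k profiles.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical profiles: evening-heavy versus morning-heavy consumers.
    data = [[1, 1, 5, 5], [5, 5, 1, 1], [1, 2, 5, 4], [4, 5, 2, 1]]
    print(kmeans(data, k=2))
```

The resulting centroids can be read as typical consumption patterns, and a forecaster trained on one cluster could serve as a starting point for households in the same cluster.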

Federated Inductive Logic Programming

It is a commonly accepted fact that machine learning requires large amounts of data. Fortunately, sources of information are increasingly numerous and the amount of data available in all domains is constantly growing. However, this evolution has reached a point where it is no longer realistic to store the whole set of data needed for a machine learning task on a single computer. This has led J. Konečný, H.B. McMahan and D. Ramage to propose a new learning model in which the data is spread across distributed nodes and the model is learned in a distributed manner. This technique is known as federated learning.

In addition to solving a data storage problem, federated learning makes sense in Wallonia, where economic entities are small and each hold relatively little data, but data of such quality that pooling it yields a considerable mass of data. Hospitals are a good example: each hospital has high-quality data, but in relatively small quantities. As the INAH project has demonstrated, pooling this data allows for an appreciable level of quality.

Federated learning is classically implemented for training neural networks. It involves sharing information derived from data that is sometimes very sensitive, as in the medical domain, and consequently raises privacy concerns. Some anonymization techniques have been proposed in this context, but they remain vulnerable to attacks that can potentially de-anonymize the data used.

Building on symbolic artificial intelligence, Inductive Logic Programming offers a more secure framework that naturally supports encryption approaches. Starting from a background theory and from positive and negative examples, Inductive Logic Programming aims at producing a set of rules explaining the positive and negative examples. Our work aims at showing that it is possible to learn a mini-theory on each distributed node and to combine these mini-theories to produce a general theory. This new form of federated learning offers two important advantages. First, the communication of mini-theories is significantly less sensitive than the communication of individual data. By learning the mini-theories locally, it is thus possible to provide a naturally privacy-friendly framework. Second, Inductive Logic Programming is based on symbols and not on particular values, and therefore the learning of mini-theories can naturally be conceived on encrypted data.