Internal seminar - Emanuele Loffredo and Mauro Pastore (Department of Physics)

Monday, June 3, 2024
FRESK building (2 - 10 Rue d'Oradour-sur-Glane, 75015 Paris)

Next Monday we welcome Emanuele Loffredo and Mauro Pastore, from the Department of Physics at ENS. The title of the seminar is "Restoring data balance for classification: from theory to application for antigen-binding prediction".

Abstract: Class imbalance in real-world data is a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under- or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood.

In the first part of this talk, we will review the ubiquitous problem of supervised learning with imbalance and obtain analytical expressions for the generalization curves of linear classifiers (Support Vector Machines) in the high-dimensional regime. We will also provide a sharp prediction of the effects of under- and oversampling strategies as a function of class imbalance, showing that mixed strategies combining under- and oversampling of the data improve performance.
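
For readers unfamiliar with the terminology, a mixed strategy can be pictured as bringing every class to a common size that lies between the minority and majority counts. The following is only a minimal illustrative sketch, not the speakers' method: the function name `mixed_resample` and the parameter `mix` are hypothetical, and oversampling is done here by plain random duplication.

```python
import numpy as np

def mixed_resample(X, y, mix=0.5, seed=0):
    """Resample every class to a common size (illustrative assumption).

    mix=0.0 -> pure undersampling (all classes cut to the minority size)
    mix=1.0 -> pure oversampling (all classes raised to the majority size)
    Values in between shrink large classes and grow small ones.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min, n_max = int(counts.min()), int(counts.max())
    target = int(round(n_min + mix * (n_max - n_min)))
    chosen = []
    for c, n_c in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Draw with replacement only when the class has to grow.
        grow = bool(n_c < target)
        chosen.append(rng.choice(idx, size=target, replace=grow))
    chosen = np.concatenate(chosen)
    return X[chosen], y[chosen]

# Toy data with a 9:1 imbalance (synthetic, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)
Xb, yb = mixed_resample(X, y, mix=0.5)
```

With `mix=0.5` on this toy set, both classes end up at the midpoint between 100 and 900 examples; the rebalanced arrays can then be passed to any classifier.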

In the second part of this talk, we will apply this balance-restoring mindset to the problem of classifying biological sequences from the adaptive immune system (T-cell receptors) that recognize pathogens and protect the host from disease. Predicting whether a receptor binds a pathogenic peptide is a fundamental computational problem, made difficult by the imbalance in available data: relatively few binding pairs are known compared to all possible pairs of receptors and peptides. We will show how to improve performance by generating putative binding pairs with state-of-the-art machine-learning methods for data augmentation.
