Multimodal Representation Learning

Multimodal representation learning (MRL) is the process of learning a shared representation that spans multiple data modalities. It typically involves integrating and encoding information from different sources or data types, such as video and text, or images and text. Our research advances MRL for these tasks:

  1. Zero-shot video classification: Zero-shot video classification assigns class labels to videos from categories that were not encountered during training. To enhance zero-shot generalization, we propose a new uniformity property that improves MRL, in addition to the conventional alignment property. By encouraging semantic features to be distributed as uniformly as possible over the representation space, our approach improves performance even on unseen categories (check our paper at CVPR 2022 and code; a minimal sketch of the two losses follows this list).
  2. Learning with noisy data: Noisy images are images that may exhibit corruption or interference, including out-of-distribution (OOD) images or images with noisy labels. Our research develops techniques that leverage large-scale pretrained multimodal representations, such as CLIP, to improve the training of downstream models on noisy datasets collected from sources like websites or weak annotations (check the progress; a sketch of CLIP-based label filtering follows this list).
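
The alignment and uniformity objectives can be made concrete with a short sketch. Below is a minimal PyTorch example of the two losses on the unit hypersphere, following the standard formulations (pairwise distance for alignment, log of the average Gaussian potential for uniformity); the batch size, embedding dimension, and loss weighting are illustrative assumptions, not the exact configuration used in our CVPR 2022 method.

```python
import torch
import torch.nn.functional as F

def alignment_loss(video_emb, text_emb, alpha=2):
    """Alignment: paired video/text embeddings should be close.
    Both inputs are assumed L2-normalized, shape (N, D)."""
    return (video_emb - text_emb).norm(dim=1).pow(alpha).mean()

def uniformity_loss(emb, t=2):
    """Uniformity: embeddings should spread evenly over the unit hypersphere.
    Computed as the log of the average pairwise Gaussian potential."""
    sq_dists = torch.pdist(emb, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

# Toy usage with random L2-normalized features (batch of 8, dim 128);
# the 0.5 weight on uniformity is an arbitrary illustrative choice.
video = F.normalize(torch.randn(8, 128), dim=1)
text = F.normalize(torch.randn(8, 128), dim=1)
loss = alignment_loss(video, text) + 0.5 * (uniformity_loss(video) + uniformity_loss(text))
```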
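As a rough illustration of the noisy-data direction, the sketch below uses a pretrained CLIP model's zero-shot predictions to flag web-crawled samples whose given labels receive low confidence; the class names, prompt template, and confidence threshold are hypothetical placeholders rather than our actual pipeline.

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "bird"]  # hypothetical downstream label set
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def filter_noisy_batch(images, labels, threshold=0.5):
    """Keep samples whose web-crawled label is assigned high enough
    confidence by CLIP's zero-shot classifier.
    `images`: preprocessed image tensors (N, 3, 224, 224); `labels`: integer indices."""
    image_feat = model.encode_image(images)
    text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
    label_conf = probs[torch.arange(len(labels)), labels]
    keep = label_conf >= threshold  # low-confidence samples are treated as noisy
    return keep
```

The retained samples can then be used to train the downstream model as usual, while the flagged ones are dropped or relabeled.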