Human Perception

Human perception is the ability to interpret visual cues from human bodies and faces in images or videos, providing insight into human dynamics and behavior. Our work makes the following contributions:

  1. Human detection: Our work focuses on detecting humans with diverse poses and aspect ratios (e.g., pedestrians under occlusion or in crowded scenes). To address partial occlusions, we propose a single-stage anchor-based detector that achieves fast and accurate detection by seeding anchors from high-confidence visible parts (see our ECCV20 paper and code; a minimal sketch of this anchor-seeding idea follows the list). We further improve robustness to varying aspect ratios and occlusions with a recursive anchor-free detector.
  2. Crowd counting: Crowd counting is the task of estimating the number of individuals in a crowd from images or videos. We developed a lightweight ConvNet that handles varying crowd densities through multi-channel feature aggregation (see the counting sketch below). Our model achieves an MAE of 10 in scenes with ∼200 people while using only 0.82M parameters, roughly 20× fewer than the 16.2M parameters of state-of-the-art methods (check the demo).
  3. Facial expression & Action Unit (AU) analysis: Facial expression and AU analysis interprets facial muscle movements and expressions to understand human emotions, intentions, and communication signals (see the multi-label sketch below). We have studied automated techniques based on weakly supervised spectral clustering over 1M web images (CVPR18, code), deep region and multi-label learning (CVPR16, code), and structured multi-label learning (CVPR15, code; TIP16).
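
The following is a minimal, hypothetical sketch of the anchor-seeding idea from contribution 1: a high-confidence visible-part detection (e.g., a head or torso box) is expanded into a full-body anchor. The expansion ratio, score threshold, and body aspect ratio below are illustrative placeholders, not the values or method from the ECCV20 paper.

```python
import numpy as np

def anchors_from_visible_parts(part_boxes, part_scores,
                               score_thresh=0.7, body_aspect=0.41):
    """part_boxes: (N, 4) array of [x1, y1, x2, y2] visible-part boxes.
    Returns full-body anchor boxes grown around confident parts."""
    keep = part_scores >= score_thresh
    anchors = []
    for x1, y1, x2, y2 in part_boxes[keep]:
        cx = 0.5 * (x1 + x2)        # part centre
        h = (y2 - y1) / 0.5         # assume the part covers ~half the body height
        w = body_aspect * h         # pedestrian-like width/height ratio
        anchors.append([cx - w / 2, y1, cx + w / 2, y1 + h])
    return np.array(anchors).reshape(-1, 4)

# Example: one confident upper-body box seeds one full-body anchor.
parts = np.array([[100.0, 50.0, 140.0, 130.0]])
scores = np.array([0.9])
print(anchors_from_visible_parts(parts, scores))
```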
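
For contribution 2, the sketch below illustrates (not the published architecture) how a density-map counter with multi-channel feature aggregation works: two lightweight branches with different receptive fields are concatenated, a single-channel density map is regressed, and its sum gives the estimated count.

```python
import torch
import torch.nn as nn

class TinyCounter(nn.Module):
    def __init__(self):
        super().__init__()
        # Two lightweight branches with different kernel sizes (receptive fields).
        self.branch_small = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.branch_large = nn.Sequential(nn.Conv2d(3, 8, 7, padding=3), nn.ReLU())
        # Aggregate the channels and regress a single-channel density map.
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        feats = torch.cat([self.branch_small(x), self.branch_large(x)], dim=1)
        density = torch.relu(self.head(feats))  # non-negative density map
        count = density.sum(dim=(1, 2, 3))      # estimated people per image
        return density, count

model = TinyCounter()
density, count = model(torch.rand(1, 3, 256, 256))
print(density.shape, count.shape)
```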
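
Finally, for contribution 3, AU analysis is naturally a multi-label problem: a face can activate several Action Units at once, so the model predicts one independent probability per AU. The backbone, AU count, and loss below are a generic illustration, not the models from the cited papers.

```python
import torch
import torch.nn as nn

NUM_AUS = 12  # e.g. AU1, AU2, AU4, ... (illustrative subset)

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
au_head = nn.Linear(128, NUM_AUS)
criterion = nn.BCEWithLogitsLoss()  # one binary decision per AU

faces = torch.rand(4, 3, 64, 64)                    # batch of face crops
labels = torch.randint(0, 2, (4, NUM_AUS)).float()  # multi-hot AU annotations

logits = au_head(backbone(faces))
loss = criterion(logits, labels)
probs = torch.sigmoid(logits)  # per-AU activation probabilities
print(loss.item(), probs.shape)
```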