Understanding what makes a video memorable has important applications in advertising and education technology. Toward this goal, we investigate the spatio-temporal attention mechanisms underlying video memorability. Unlike previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected in a small-scale eye-tracking study in which participants perform the video memory task. We uncover the following insights: (i) quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits spatial attention patterns similar to human gaze, especially for more memorable videos; (ii) the model assigns greater importance to the initial frames of a video, mimicking human attention patterns; and (iii) panoptic segmentation reveals that both the model and humans assign a greater share of attention to *things* and less to *stuff*, relative to their occurrence probability.
For more details, please visit our project website or read our paper.
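The spatial comparison in (i) relies on standard saliency metrics. As a minimal sketch of what such a comparison looks like, the snippet below computes two common ones, Pearson correlation (CC) and normalized scanpath saliency (NSS), between a model attention map and a binary human fixation map; the array shapes and names are purely illustrative, not this repository's API.

```python
# Illustrative saliency metrics for comparing a model attention map with a
# human fixation map. Shapes and values below are made up for the example.
import numpy as np

def correlation_coefficient(saliency: np.ndarray, fixation_map: np.ndarray) -> float:
    """Pearson correlation (CC) between two spatial maps of equal shape."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    f = (fixation_map - fixation_map.mean()) / (fixation_map.std() + 1e-8)
    return float((s * f).mean())

def normalized_scanpath_saliency(saliency: np.ndarray, fixations: np.ndarray) -> float:
    """NSS: mean of the z-scored saliency map at binary fixation locations."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations > 0].mean())

rng = np.random.default_rng(0)
attn = rng.random((27, 48))             # model attention map (e.g., a patch grid)
fix = rng.random((27, 48)) > 0.98       # binary human fixation map
print(correlation_coefficient(attn, fix.astype(float)))
print(normalized_scanpath_saliency(attn, fix))
```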
- Clone the repository:

  ```bash
  git clone [repository-url]
  cd [repository-name]
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

The Memento dataset can be downloaded from http://memento.csail.mit.edu/#Dataset.
Repository structure:

```
.
├── main.py              # Main training script
├── embed.py             # Video embedding generation
├── attention.py         # Attention matrix extraction
├── panoptic.py          # Panoptic segmentation
├── requirements.txt     # Python dependencies
├── eyetracking/         # Eye-tracking data and related processing
└── utils/
    ├── model.py         # Transformer model implementation
    └── dataset.py       # Dataset handling
```
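As a rough illustration of what `embed.py` produces, here is a minimal sketch of per-frame CNN feature extraction; the ResNet-50 backbone and 2048-d pooled features are assumptions, and the actual script may use a different backbone or feature layer.

```python
# Sketch: turn a stack of video frames into per-frame CNN embeddings.
# ResNet-50 and the 2048-d pooled feature are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
backbone = resnet50(weights=weights)
backbone.fc = nn.Identity()         # keep the 2048-d pooled features
backbone.eval()

preprocess = weights.transforms()   # resize / crop / normalize for this model

@torch.no_grad()
def embed_frames(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) image tensors -> (T, 2048) frame embeddings."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)
```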
Generate video embeddings:

```bash
python embed.py --path /path/to/videos
```

Train the model on the precomputed embeddings:

```bash
python main.py \
    --path /path/to/embeddings \
    --train_data_path /path/to/train.csv \
    --val_data_path /path/to/val.csv
```
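For orientation, the sketch below shows a generic Transformer-based memorability regressor over precomputed frame embeddings, in the spirit of `utils/model.py`; the dimensions, layer counts, and learned [CLS] token are assumptions, not the exact implementation.

```python
# Sketch of a CNN+Transformer memorability model: frame embeddings go through
# a Transformer encoder, and a [CLS] token is read out as the score.
# All hyperparameters here are illustrative.
import torch
import torch.nn as nn

class MemorabilityTransformer(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, nhead=8, num_layers=4, max_len=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, frame_feats):                 # (B, T, feat_dim)
        x = self.proj(frame_feats)                  # (B, T, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)              # (B, T+1, d_model)
        x = x + self.pos[:, : x.size(1)]
        x = self.encoder(x)
        return self.head(x[:, 0]).squeeze(-1)       # one score per video

model = MemorabilityTransformer()
scores = model(torch.randn(2, 45, 2048))            # e.g., 45 frames per video
```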
Extract attention matrices to analyze the model's focus:

```bash
python attention.py \
    --model_path /path/to/trained/model.pt \
    --val_path /path/to/val.csv \
    --features_path /path/to/features
```
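Conceptually, the attention matrices come from the model's self-attention layers. Below is a minimal sketch, assuming a standard `nn.MultiheadAttention` layer and a [CLS]+frame token layout; the actual extraction in `attention.py` may differ.

```python
# Sketch: read attention weights out of a self-attention layer. The token
# layout ([CLS] followed by 45 frame tokens) is an assumption.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
tokens = torch.randn(1, 46, 512)    # [CLS] + 45 frame tokens, for example

# need_weights=True returns the attention matrix; averaging over heads gives
# one (tokens x tokens) map per example.
_, attn = mha(tokens, tokens, tokens, need_weights=True, average_attn_weights=True)
print(attn.shape)                   # (1, 46, 46)

# Row 0 is the [CLS] query: how strongly the video-level representation
# attends to each frame token. Dropping the [CLS] column leaves per-frame
# importance, the temporal signal compared against human gaze.
frame_importance = attn[0, 0, 1:]   # (45,)
```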
Generate panoptic segmentation results:

```bash
python panoptic.py \
    --video_path /path/to/videos
```
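As a stand-in for whatever segmentation model `panoptic.py` wraps, the sketch below runs an off-the-shelf Mask2Former checkpoint from Hugging Face Transformers on a single extracted frame; the checkpoint name and frame path are assumptions.

```python
# Sketch: off-the-shelf panoptic segmentation of one video frame. The COCO
# label set distinguishes "things" from "stuff", which is what the attention
# share analysis needs. Checkpoint and file name are illustrative.
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

ckpt = "facebook/mask2former-swin-tiny-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt).eval()

image = Image.open("frame_0001.jpg")        # one extracted video frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Merge predicted masks into a single (H, W) segment-id map plus per-segment
# metadata with COCO label ids.
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
segmentation = result["segmentation"]       # (H, W) tensor of segment ids
for seg in result["segments_info"]:
    print(seg["id"], model.config.id2label[seg["label_id"]])
```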
If you use this code in your research, please cite our paper:

```bibtex
@inproceedings{kumar2025eyetoai,
    title     = {{Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability}},
    author    = {Kumar, Prajneya and Khandelwal, Eshika and Tapaswi, Makarand and Sreekumar, Vishnu},
    year      = {2025},
    booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}
}
```