Welcome to the official repository of the Dargk team submission for the RecSys 2025 Challenge. This repository contains the full pipeline and implementation of our model, BEHAV-E: Behavioral Embedding via Hybrid Action Variational Encoder.
Our approach is centered on modeling user behavior through a hybrid representation that combines semantic, categorical, and temporal information, and on training via a contrastive learning framework.
BEHAV-E is a representation learning model designed to encode complex and multi-faceted user behavior. The model processes various user actions such as:
- Product buys
- Product add-to-cart
- Product remove-from-cart
- URL visits
- Search queries
Key modeling elements include:
- Kernel Density Estimation (KDE) to capture the temporal distribution of actions.
- LSTM-based Autoencoder to embed user search queries.
- Shared Embedding Bags with linear transformations to efficiently represent items, categories, and URLs.
- Variational Encoder to output user behavior embeddings using the reparameterization trick.
- Contrastive Learning (InfoNCE loss) with dual models to learn discriminative embeddings.
During inference, we concatenate the mean embeddings from two BEHAV-E models as an ensemble strategy.
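As a rough illustration of the last two elements above, the sketch below pairs a variational head (reparameterization trick) with an InfoNCE loss computed between the embeddings that two encoders produce for the same users. All class names, dimensions, and the temperature value are hypothetical; the actual implementation lives in the 04-* training scripts.

```python
# Minimal sketch (hypothetical names/dimensions) of the variational head and
# InfoNCE objective described above; the real logic lives in the 04-* scripts.
import torch
import torch.nn.functional as F
from torch import nn

class VariationalHead(nn.Module):
    def __init__(self, in_dim=512, latent_dim=1024):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(in_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss between embeddings of the same users from two encoders."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature             # similarity of every user pair
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)        # matching rows are the positives
```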
- `EmbeddingAnalysis`: Contains the code to reproduce the analysis, as well as more details than the ones presented in the paper.
- `data/`: Folder where the decompressed dataset should be placed.
- `01-TextEncodeTrain.py`: Trains the LSTM autoencoder for text embeddings.
- `02-TextEncodeProcess.py`: Encodes the search text using the trained autoencoder.
- `03-Polars_DS_enc_search.py`: Preprocesses the data into a one-row-per-client format.
- `04-Emb-8ShortVAECATURLFastNAEMA.py`: Trains BEHAV-E using EMA (Exponential Moving Average).
- `04-Emb-8ShortVAECATURLFastNAOneCycleLarge.py`: Trains BEHAV-E using a One Cycle LR schedule.
- `05-Emb-8ShortVAECATURLFastNAEMA_gen.py`: Generates embeddings using the EMA-trained model.
- `05-Emb-8ShortVAECATURLFastNAOneCycleLarge_gen.py`: Generates embeddings using the OneCycle-trained model.
- `06-Merger.py`: Concatenates embeddings for the final submission.
- `environment.yml`: Conda environment definition for Windows.
- `Dockerfile`: Dockerfile to build a Docker container for our model.
- `README.md`: This file.
Execution Note: Files should be executed sequentially from 01 to 06. Files with the same prefix number (e.g., both 04-*) can be run in any order.
There are two ways to set up the environment. The first is a local conda environment created from the provided environment.yml file; this has been tested only on Windows. The second is a Docker container built from the provided Dockerfile.
Note that this project depends only on the data; no pretrained model is required to run the pipeline end-to-end. However, we provide our trained models to reproduce the embeddings submitted to the challenge; the link is provided below.
- Clone the repository:

```bash
git clone https://github.com/<your-org>/recsys2025-dargk.git
cd recsys2025-dargk
```

- Set up the conda environment:

```bash
conda env create -f environment.yml
conda activate behav-e
```

Follow these instructions to build and run the RecSys2025 Docker container with GPU support.
- Build the Docker Image

```bash
docker build -t recsys2025 .
```

- Create the Docker Container

  - Windows:

    ```bash
    docker container create -i -t -v "%CD%":/recsys2025 --gpus=all --name recsys2025 recsys2025
    ```

  - Linux/macOS:

    ```bash
    docker container create -i -t -v "$PWD":/recsys2025 --gpus=all --name recsys2025 recsys2025
    ```

- Start and Attach to the Container

```bash
docker container start --attach -i recsys2025
```

To run the system it is necessary to download the dataset, which is compressed in a file called ubc_data.tar.gz. This file must be extracted into the data/ directory.
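For example, assuming the archive has been downloaded to the repository root, it can be unpacked with a few lines of Python (any equivalent extraction tool works just as well):

```python
# Unpack ubc_data.tar.gz into data/ (assumed location of the downloaded archive).
import tarfile
from pathlib import Path

Path("data").mkdir(exist_ok=True)
with tarfile.open("ubc_data.tar.gz", "r:gz") as archive:
    archive.extractall(path="data")
```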
A detailed description of the dataset can be found on the challenge website. It contains six parquet files:
- `product_buy.parquet`
  - client_id (int64): Numeric ID of the client (user).
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - sku (int64): Numeric ID of the item.
- `add_to_cart.parquet`
  - client_id (int64): Numeric ID of the client (user).
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - sku (int64): Numeric ID of the item.
- `remove_from_cart.parquet`
  - client_id (int64): Numeric ID of the client (user).
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - sku (int64): Numeric ID of the item.
- `product_properties.parquet`
  - sku (int64): Numeric ID of the item.
  - category (int64): Numeric ID of the item category.
  - price (int64): Numeric ID of the item's price bucket.
  - embedding (object): A textual embedding of the product name, compressed using the product quantization method.
- `page_visit.parquet`
  - client_id (int64): Numeric ID of the client.
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm.
  - url (int64): Numeric ID of a visited URL. Explicit information about what (e.g., which item) is presented on a particular page is not provided.
- `search_query.parquet`
  - client_id (int64): Numeric ID of the client.
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - query (object): The textual embedding of the search query, compressed using the product quantization method.

The dataset also contains two more directories related to the task:
- `input` directory: Stores a NumPy file containing a subset of 1,000,000 client_ids for which Universal Behavioral Profiles should be generated:
  - relevant_clients.npy
- `target` directory: Stores the labels for the propensity tasks. For each propensity task, the target names are stored in NumPy files:
  - propensity_category.npy: Contains a subset of 100 categories for which the model is asked to provide predictions.
  - popularity_propensity_category.npy: Contains popularity scores for categories from the propensity_category.npy file. These scores are used to compute the Novelty measure.
  - propensity_sku.npy: Contains a subset of 100 products for which the model is asked to provide predictions.
  - popularity_propensity_sku.npy: Contains popularity scores for products from the propensity_sku.npy file. These scores are used to compute the Novelty measure.
  - active_clients.npy: Contains a subset of relevant clients with at least one product_buy event in history (data available for the participants). Active clients are used to compute the churn target.
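As a quick sanity check after extraction, the parquet files and the relevant-clients file can be inspected with Polars and NumPy. The paths below are assumptions about where the files end up under data/; adjust them to your actual layout.

```python
# Peek at the extracted dataset (assumed paths under data/; adjust if needed).
import numpy as np
import polars as pl

tables = ["product_buy", "add_to_cart", "remove_from_cart",
          "product_properties", "page_visit", "search_query"]
for name in tables:
    df = pl.read_parquet(f"data/{name}.parquet")
    print(name, df.shape, df.columns)

relevant_clients = np.load("data/input/relevant_clients.npy")
print("relevant clients:", relevant_clients.shape)
```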
Follow the steps below to reproduce the full embedding generation and submission process.
- Train the LSTM Autoencoder for Search Query Embeddings

```bash
python 01-TextEncodeTrain.py
```

- Encode the Search Texts

```bash
python 02-TextEncodeProcess.py
```

- Preprocess the Dataset to Generate One Row per User

```bash
python 03-Polars_DS_enc_search.py
```

- Train the BEHAV-E Models

  - Exponential Moving Average (EMA) Variant:

    ```bash
    python 04-Emb-8ShortVAECATURLFastNAEMA.py
    ```

  - One Cycle Learning Rate Variant:

    ```bash
    python 04-Emb-8ShortVAECATURLFastNAOneCycleLarge.py
    ```

- Generate User Embeddings from Trained Models

  - From EMA-trained model:

    ```bash
    python 05-Emb-8ShortVAECATURLFastNAEMA_gen.py
    ```

  - From OneCycle-trained model:

    ```bash
    python 05-Emb-8ShortVAECATURLFastNAOneCycleLarge_gen.py
    ```

- Merge the Generated Embeddings for Final Submission

```bash
python 06-Merger.py
```

At inference time:
- BEHAV-E uses the mean of the latent distribution, i.e., $\mu = \mathbb{E}[z]$, instead of sampling from $\mathcal{N}(\mu, \sigma^2)$.
- Dropout is disabled to ensure deterministic outputs.
- The final user embedding is the concatenation of the embeddings produced by both trained BEHAV-E models (EMA and OneCycle variants). Although both models are trained on the same data, this concatenation acts as a lightweight ensemble and improves robustness.
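As a rough sketch of what this concatenation amounts to (the file names and on-disk format below are assumptions; the actual logic is implemented in 06-Merger.py), the merger aligns the two embedding matrices by client and concatenates them along the feature axis:

```python
# Hypothetical sketch of the ensemble concatenation performed by 06-Merger.py.
import numpy as np

ids_ema = np.load("out_ema/client_ids.npy")          # assumed output files
emb_ema = np.load("out_ema/embeddings.npy")          # e.g. (N, 1024) mean vectors
ids_ocl = np.load("out_onecycle/client_ids.npy")
emb_ocl = np.load("out_onecycle/embeddings.npy")

# Both models must emit embeddings in the same client order before concatenating.
assert np.array_equal(ids_ema, ids_ocl)
merged = np.concatenate([emb_ema, emb_ocl], axis=1)  # (N, 2048) final user embedding
np.save("submission/client_ids.npy", ids_ema)
np.save("submission/embeddings.npy", merged)
```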
The pretrained models used to generate our submission can be downloaded from an external repository.
We would like to thank the organizers of the RecSys 2025 Challenge for providing a valuable dataset and a well-designed competition platform.
```bibtex
@inproceedings{10.1145/3758126.3758130,
author = {Rodriguez, Juan Manuel and Tommasel, Antonela},
title = {BEHAV-E! You are Not Just a Number to Us, but an {$\mathbb{R}^{2048}$} Embedding},
year = {2025},
isbn = {9798400720994},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3758126.3758130},
doi = {10.1145/3758126.3758130},
booktitle = {Proceedings of the Recommender Systems Challenge 2025},
pages = {16–20},
numpages = {5},
series = {RecSysChallenge '25}
}
```

For inquiries, collaboration, or questions regarding this submission, please contact the Dargk Team at:
📧 Antonela Tommasel (antonela.tommasel@isistan.unicen.edu.ar)