Welcome to the official repository of the Dargk team submission for the RecSys 2025 Challenge. This repository contains the full pipeline and implementation of our model, BEHAV-E: Behavioral Embedding via Hybrid Action Variational Encoder.
Our approach is centered on modeling user behavior through a hybrid representation that combines semantic, categorical, and temporal information, and on training via a contrastive learning framework.
BEHAV-E is a representation learning model designed to encode complex and multi-faceted user behavior. The model processes various user actions such as:
- Product buys
- Product add-to-cart
- Product remove-from-cart
- URL visits
- Search queries
Key modeling elements include:
- Kernel Density Estimation (KDE) to capture the temporal distribution of actions.
- LSTM-based Autoencoder to embed user search queries.
- Shared Embedding Bags with linear transformations to efficiently represent items, categories, and URLs.
- Variational Encoder to output user behavior embeddings using the reparameterization trick.
- Contrastive Learning (InfoNCE loss) with dual models to learn discriminative embeddings.
During inference, we concatenate the mean embeddings from two BEHAV-E models as an ensemble strategy.
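As a rough illustration of the last two elements above, the sketch below pairs a variational head (reparameterization trick) with an InfoNCE loss computed between the embeddings that two encoders produce for the same users. All class names, dimensions, and the temperature value are hypothetical; the actual implementation lives in the 04-* training scripts.

```python
# Minimal sketch (hypothetical names/dimensions) of the variational head and
# InfoNCE objective described above; the real logic lives in the 04-* scripts.
import torch
import torch.nn.functional as F
from torch import nn

class VariationalHead(nn.Module):
    def __init__(self, in_dim=512, latent_dim=1024):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(in_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss between embeddings of the same users from two encoders."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature             # similarity of every user pair
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)        # matching rows are the positives
```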
- `EmbeddingAnalysis`: Contains the code to reproduce the analysis, as well as more details than the ones presented in the paper.
- `data/`: Folder where the decompressed dataset should be placed.
- `01-TextEncodeTrain.py`: Trains the LSTM autoencoder for text embeddings.
- `02-TextEncodeProcess.py`: Encodes the search text using the trained autoencoder.
- `03-Polars_DS_enc_search.py`: Preprocesses the data into a one-row-per-client format.
- `04-Emb-8ShortVAECATURLFastNAEMA.py`: Trains BEHAV-E using EMA (Exponential Moving Average).
- `04-Emb-8ShortVAECATURLFastNAOneCycleLarge.py`: Trains BEHAV-E using a One Cycle LR schedule.
- `05-Emb-8ShortVAECATURLFastNAEMA_gen.py`: Generates embeddings using the EMA-trained model.
- `05-Emb-8ShortVAECATURLFastNAOneCycleLarge_gen.py`: Generates embeddings using the OneCycle-trained model.
- `06-Merger.py`: Concatenates embeddings for the final submission.
- `environment.yml`: Conda environment definition for Windows.
- `Dockerfile`: Dockerfile to build a Docker container for our model.
- `README.md`: This file.
Execution Note: Files should be executed sequentially from 01 to 06. Files with the same prefix number (e.g., both 04-*) can be run in any order.
There are two ways to set up the environment. The first is a local conda environment created from the provided environment.yml file; this has been tested only on Windows. The second is a Docker container built from the provided Dockerfile.
Note that this project depends only on the data; no pretrained model is required to run the pipeline end-to-end. However, we provide our trained models to reproduce the embeddings submitted to the challenge; the link is provided below.
- Clone the repository:

```bash
git clone https://github.com/<your-org>/recsys2025-dargk.git
cd recsys2025-dargk
```

- Set up the conda environment:

```bash
conda env create -f environment.yml
conda activate behav-e
```

Follow these instructions to build and run the RecSys2025 Docker container with GPU support.
- Build the Docker Image

```bash
docker build -t recsys2025 .
```

- Create the Docker Container

  - Windows:

    ```bash
    docker container create -i -t -v "%CD%":/recsys2025 --gpus=all --name recsys2025 recsys2025
    ```

  - Linux/macOS:

    ```bash
    docker container create -i -t -v "$PWD":/recsys2025 --gpus=all --name recsys2025 recsys2025
    ```

- Start and Attach to the Container

```bash
docker container start --attach -i recsys2025
```

To run the system it is necessary to download the dataset, which is compressed in a file called ubc_data.tar.gz. This file must be extracted into the data/ directory.
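For example, assuming the archive has been downloaded to the repository root, it can be unpacked with a few lines of Python (any equivalent extraction tool works just as well):

```python
# Unpack ubc_data.tar.gz into data/ (assumed location of the downloaded archive).
import tarfile
from pathlib import Path

Path("data").mkdir(exist_ok=True)
with tarfile.open("ubc_data.tar.gz", "r:gz") as archive:
    archive.extractall(path="data")
```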
A detailed description of the dataset can be found on the challenge website. It contains six parquet files:
- `product_buy.parquet`
  - client_id (int64): Numeric ID of the client (user).
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - sku (int64): Numeric ID of the item.
- `add_to_cart.parquet`
  - client_id (int64): Numeric ID of the client (user).
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - sku (int64): Numeric ID of the item.
- `remove_from_cart.parquet`
  - client_id (int64): Numeric ID of the client (user).
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - sku (int64): Numeric ID of the item.
- `product_properties.parquet`
  - sku (int64): Numeric ID of the item.
  - category (int64): Numeric ID of the item category.
  - price (int64): Numeric ID of the item's price bucket.
  - embedding (object): A textual embedding of the product name, compressed using the product quantization method.
- `page_visit.parquet`
  - client_id (int64): Numeric ID of the client.
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm.
  - url (int64): Numeric ID of a visited URL. Explicit information about what (e.g., which item) is presented on a particular page is not provided.
- `search_query.parquet`
  - client_id (int64): Numeric ID of the client.
  - timestamp (object): Date and time of the event in the format YYYY-MM-DD HH:mm:ss.
  - query (object): The textual embedding of the search query, compressed using the product quantization method.

The dataset also contains two more directories related to the task:
- `input` directory: Stores a NumPy file containing a subset of 1,000,000 client_ids for which Universal Behavioral Profiles should be generated:
  - relevant_clients.npy
- `target` directory: Stores the labels for the propensity tasks. For each propensity task, the target names are stored in NumPy files:
  - propensity_category.npy: Contains a subset of 100 categories for which the model is asked to provide predictions.
  - popularity_propensity_category.npy: Contains popularity scores for categories from the propensity_category.npy file. These scores are used to compute the Novelty measure.
  - propensity_sku.npy: Contains a subset of 100 products for which the model is asked to provide predictions.
  - popularity_propensity_sku.npy: Contains popularity scores for products from the propensity_sku.npy file. These scores are used to compute the Novelty measure.
  - active_clients.npy: Contains a subset of relevant clients with at least one product_buy event in history (data available for the participants). Active clients are used to compute the churn target.
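As a quick sanity check after extraction, the parquet files and the relevant-clients file can be inspected with Polars and NumPy. The paths below are assumptions about where the files end up under data/; adjust them to your actual layout.

```python
# Peek at the extracted dataset (assumed paths under data/; adjust if needed).
import numpy as np
import polars as pl

tables = ["product_buy", "add_to_cart", "remove_from_cart",
          "product_properties", "page_visit", "search_query"]
for name in tables:
    df = pl.read_parquet(f"data/{name}.parquet")
    print(name, df.shape, df.columns)

relevant_clients = np.load("data/input/relevant_clients.npy")
print("relevant clients:", relevant_clients.shape)
```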
Follow the steps below to reproduce the full embedding generation and submission process.
- Train the LSTM Autoencoder for Search Query Embeddings

```bash
python 01-TextEncodeTrain.py
```

- Encode the Search Texts

```bash
python 02-TextEncodeProcess.py
```

- Preprocess the Dataset to Generate One Row per User

```bash
python 03-Polars_DS_enc_search.py
```

- Train the BEHAV-E Models

  - Exponential Moving Average (EMA) Variant:

    ```bash
    python 04-Emb-8ShortVAECATURLFastNAEMA.py
    ```

  - One Cycle Learning Rate Variant:

    ```bash
    python 04-Emb-8ShortVAECATURLFastNAOneCycleLarge.py
    ```

- Generate User Embeddings from Trained Models

  - From EMA-trained model:

    ```bash
    python 05-Emb-8ShortVAECATURLFastNAEMA_gen.py
    ```

  - From OneCycle-trained model:

    ```bash
    python 05-Emb-8ShortVAECATURLFastNAOneCycleLarge_gen.py
    ```

- Merge the Generated Embeddings for Final Submission

```bash
python 06-Merger.py
```

At inference time:
- BEHAV-E uses the mean of the latent distribution, i.e., $\mu = \mathbb{E}[z]$, instead of sampling from $\mathcal{N}(\mu, \sigma^2)$.
- Dropout is disabled to ensure deterministic outputs.
- The final user embedding is the concatenation of the embeddings produced by both trained BEHAV-E models (EMA and OneCycle variants). Although both models are trained on the same data, this concatenation acts as a lightweight ensemble and improves robustness.
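As a rough sketch of what this concatenation amounts to (the file names and on-disk format below are assumptions; the actual logic is implemented in 06-Merger.py), the merger aligns the two embedding matrices by client and concatenates them along the feature axis:

```python
# Hypothetical sketch of the ensemble concatenation performed by 06-Merger.py.
import numpy as np

ids_ema = np.load("out_ema/client_ids.npy")          # assumed output files
emb_ema = np.load("out_ema/embeddings.npy")          # e.g. (N, 1024) mean vectors
ids_ocl = np.load("out_onecycle/client_ids.npy")
emb_ocl = np.load("out_onecycle/embeddings.npy")

# Both models must emit embeddings in the same client order before concatenating.
assert np.array_equal(ids_ema, ids_ocl)
merged = np.concatenate([emb_ema, emb_ocl], axis=1)  # (N, 2048) final user embedding
np.save("submission/client_ids.npy", ids_ema)
np.save("submission/embeddings.npy", merged)
```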
The pretrained models used to generate our submission can be downloaded from an external repository.
We would like to thank the organizers of the RecSys 2025 Challenge for providing a valuable dataset and a well-designed competition platform.
```bibtex
@inproceedings{10.1145/3758126.3758130,
author = {Rodriguez, Juan Manuel and Tommasel, Antonela},
title = {BEHAV-E! You are Not Just a Number to Us, but an {$\mathbb{R}^{2048}$} Embedding},
year = {2025},
isbn = {9798400720994},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3758126.3758130},
doi = {10.1145/3758126.3758130},
booktitle = {Proceedings of the Recommender Systems Challenge 2025},
pages = {16–20},
numpages = {5},
series = {RecSysChallenge '25}
}
```

For inquiries, collaboration, or questions regarding this submission, please contact the Dargk Team at:
📧 Antonela Tommasel (antonela.tommasel@isistan.unicen.edu.ar)