Multimedia recommendation methods aim to learn user preferences from multi-modal information to enhance collaborative filtering (CF) based recommender systems. However, they seldom consider the impact of feature extraction on user preference modeling and user-item interaction prediction, even though the extracted features contain abundant information irrelevant to recommendation. To capture the informative part of the extracted features, we resort to the Transformer model to establish correlations among the items historically interacted with by the same user. Considering its challenges in effectiveness and efficiency, we propose a novel Transformer-based recommendation model, termed Light Graph Transformer (LightGT). In LightGT, we develop a modal-specific embedding and a layer-wise position encoder for effective similarity measurement, and present a light self-attention block to improve the efficiency of self-attention scoring. Based on these designs, we can effectively and efficiently learn user preferences from off-the-shelf item features to predict user-item interactions. Extensive experiments on the MovieLens, Tiktok, and Kwai datasets demonstrate that LightGT significantly outperforms state-of-the-art baselines while requiring less training time. Our code is publicly available at: https://github.com/Liuwq-bit/LightGT.
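To make the idea of a light self-attention block over a user's interaction history more concrete, below is a minimal PyTorch sketch. It assumes the simplification consists of learning only query/key projections and passing the item features through as values; the class name `LightSelfAttention` and this particular simplification are illustrative assumptions, not the paper's exact design (see the official repository for the authors' implementation).

```python
import torch
import torch.nn as nn


class LightSelfAttention(nn.Module):
    """Hypothetical lightweight self-attention over a user's interacted items.

    Assumed simplification (not necessarily the paper's design): only the
    query/key projections are learned, and the raw item features serve as
    values, dropping the value/output projections of standard attention.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, item_feats: torch.Tensor) -> torch.Tensor:
        # item_feats: (batch, seq_len, dim) -- modality features of items
        # historically interacted with by the same user.
        q = self.q_proj(item_feats)
        k = self.k_proj(item_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Values are the raw item features (no value projection).
        return attn @ item_feats


if __name__ == "__main__":
    feats = torch.randn(2, 10, 64)   # 2 users, 10 interacted items, 64-d features
    out = LightSelfAttention(64)(feats)
    user_pref = out.mean(dim=1)      # pooled into a per-user preference vector
    print(user_pref.shape)           # torch.Size([2, 64])
```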


Datasets

MovieLens, Tiktok, Kwai

Results from the Paper


 Ranked #1 on Multi-Media Recommendation on Kwai (Recall@10 metric)

Task                        Dataset    Model    Metric     Metric Value  Global Rank
Multi-Media Recommendation  Kwai       LightGT  Recall@10  0.0546        #1
Multi-Media Recommendation  Kwai       LightGT  nDCG@10    0.0441        #1
Multi-Media Recommendation  MovieLens  LightGT  Recall@10  0.2650        #1
Multi-Media Recommendation  MovieLens  LightGT  nDCG@10    0.1771        #1
Multi-Media Recommendation  Tiktok     LightGT  Recall@10  0.1213        #1
Multi-Media Recommendation  Tiktok     LightGT  nDCG@10    0.0751        #1

Methods

Transformer, Self-Attention