Paper

Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes

In user-generated recipe websites, users post their-original recipes. Some recipes, however, are very similar in major components such as the cooking instructions to other recipes. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the "Word Mover's Distance", which calculates distances between texts based on word embedding, to character 3-gram embedding. Using a corpus of over 1.21 million recipes, we learned the word embedding and the character 3-gram embedding by using a Skip-Gram model with negative sampling and fastText to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. Our results demonstrated that near-duplicate recipes that were not detected by the comparison method were successfully detected by the proposed method.

Results in Papers With Code
(↓ scroll down to see all results)