A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Language identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must work on very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.

EACL 2021

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Language Identification | OpenSubtitles | Apple bi-LSTM | Accuracy | 91.37 | # 1 |
| Language Identification | Universal Dependencies | Apple bi-LSTM | Accuracy | 86.93 | # 1 |
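
As a rough illustration of how an accuracy figure like those above, and the related-language confusions mentioned in the abstract, could be tallied from a list of predictions, here is a minimal sketch. The function name `accuracy_and_confusions` and the toy labels are hypothetical and invented for this example; they are not the paper's evaluation code or data.

```python
from collections import Counter


def accuracy_and_confusions(gold, predicted):
    """Return (accuracy in percent, Counter of (gold, predicted) error pairs).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    pairs = list(zip(gold, predicted))
    correct = sum(g == p for g, p in pairs)
    # Count only the misclassified pairs, keyed by (true label, predicted label).
    confusions = Counter((g, p) for g, p in pairs if g != p)
    return 100.0 * correct / len(pairs), confusions


# Toy language codes, invented for illustration only.
gold = ["nb", "da", "es", "pt", "en", "nb"]
pred = ["da", "da", "es", "es", "en", "nb"]

acc, confusions = accuracy_and_confusions(gold, pred)
print(f"accuracy: {acc:.2f}")
for (g, p), n in confusions.most_common():
    print(f"{g} -> {p}: {n}")
```

Sorting the error pairs by count makes it easy to see whether mistakes cluster among closely related languages, which is the kind of pattern the abstract reports.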

Methods