Language Identification

79 papers with code • 3 benchmarks • 13 datasets

Language identification is the task of determining the language of a text.
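
As a rough illustration of the task for written text, the sketch below scores an input against per-language character trigram counts with add-one smoothing and returns the best-scoring label. It is a toy baseline under stated assumptions, not the method of any paper listed here: the function names (`char_ngrams`, `train_profiles`, `identify`) and the three-language sample data are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Count character n-grams per language; `samples` maps a label to sentences."""
    profiles = {}
    for lang, sentences in samples.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(char_ngrams(sentence, n))
        profiles[lang] = counts
    return profiles


def identify(text, profiles, n=3):
    """Return the label whose add-one-smoothed n-gram model scores the text highest."""
    # Vocabulary size over all languages, used for add-one smoothing.
    vocab_size = len(set().union(*profiles.values()))
    grams = char_ngrams(text, n)
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        score = sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Hypothetical toy training data; a real system would train on a large corpus.
TOY_SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a simple sentence in the english language",
        "where is the nearest train station",
    ],
    "de": [
        "der schnelle braune fuchs springt über den faulen hund",
        "das ist ein einfacher satz in der deutschen sprache",
        "wo ist der nächste bahnhof",
    ],
    "es": [
        "el rápido zorro marrón salta sobre el perro perezoso",
        "esta es una frase sencilla en el idioma español",
        "dónde está la estación de tren más cercana",
    ],
}

PROFILES = train_profiles(TOY_SAMPLES)

if __name__ == "__main__":
    for sentence in ["the train is in the station", "wo ist der hund"]:
        print(sentence, "->", identify(sentence, PROFILES))
```

Character n-grams are used here because they need no tokenizer or dictionary; with so little training data the labels are only dependable for inputs that resemble the samples.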

Most implemented papers

The WiLI benchmark dataset for written language identification

birolkuyumcu/language_identification 23 Jan 2018

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification.

Universal Dependency Parsing for Hindi-English Code-switching

irshadbhat/nsdp-cs NAACL 2018

We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks.

Word-level Embeddings for Cross-Task Transfer Learning in Speech Processing

bepierre/SpeechVGG 22 Oct 2019

Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer.

VoxLingua107: a Dataset for Spoken Language Recognition

alumae/torch-xvectors-wav 25 Nov 2020

Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech.

Finding Structure in Text, Genome and Other Symbolic Sequences

rn123/japanese_text_analysis 8 Jul 2012

A variety of applications for these methods are examined in detail.

TweetCaT: a tool for building Twitter corpora of smaller languages

nljubesi/tweetcat LREC 2014

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages.

Automatic Dialect Detection in Arabic Broadcast Speech

Qatar-Computing-Research-Institute/dialectID 23 Sep 2015

We used these features in a binary classifier to discriminate between Modern Standard Arabic (MSA) and Dialectal Arabic, with an accuracy of 100%.

A Semisupervised Approach for Language Identification based on Ladder Networks

udibr/LRE 1 Apr 2016

In this study we address the problem of training a neural network for language identification using both labeled and unlabeled speech samples in the form of i-vectors.