Distributed Representations of Words and Documents for Discriminating Similar Languages.

Abstract

Discriminating between similar languages or language varieties requires detecting lexical and semantic variation in order to classify texts by variety. In this work we describe the system built by the Pattern Recognition and Human Language Technology (PRHLT) research center of the Universitat Politècnica de València and Autoritas Consulting for the Discriminating between Similar Languages (DSL) 2015 shared task. To determine the language group of a text, we first employ a simple approach based on distances to language prototypes, which achieves 99.8% accuracy on the test sets. To classify languages within a group, we use distributed representations of words and documents learned with the continuous Skip-gram model. Experimental results on classifying languages into 14 categories yielded accuracies of 92.7% and 90.8% on unmodified texts and on texts with hidden named entities, respectively.
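
As an illustration of the first stage, the following minimal sketch assigns a text to the language group whose prototype is nearest. The abstract does not specify the features or the distance used, so character trigram counts and cosine distance are assumptions made here for illustration only. The group labels, the toy training sentences, and the function names are hypothetical as well.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def build_prototype(texts, n=3):
    """Build a group prototype by summing the n-gram counts of its texts."""
    prototype = Counter()
    for text in texts:
        prototype.update(char_ngrams(text, n))
    return prototype


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def predict_group(text, prototypes, n=3):
    """Return the label of the prototype closest to the text."""
    vector = char_ngrams(text, n)
    return min(prototypes, key=lambda g: cosine_distance(vector, prototypes[g]))


# Hypothetical toy data; a real system would use the shared task training set.
TRAIN = {
    "spanish": [
        "el gato duerme en la casa",
        "la niña lee un libro en la escuela",
    ],
    "portuguese": [
        "o gato dorme na casa",
        "a menina lê um livro na escola",
    ],
    "czech_slovak": [
        "kočka spí v domě",
        "dívka čte knihu ve škole",
    ],
}

PROTOTYPES = {group: build_prototype(texts) for group, texts in TRAIN.items()}

if __name__ == "__main__":
    for sentence in ["el perro come en la calle", "o menino dorme na escola"]:
        print(sentence, "->", predict_group(sentence, PROTOTYPES))
```

In a two-stage setup of this kind, the predicted group would then select a group-specific classifier over the distributed representations, which this sketch does not cover.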