Bilingual sentence selection strategies: comparative and combination in statistical machine translation systems

UPV Authors

Chinea Ríos Mara, Sanchis Trilles Germán, Casacuberta Nolla Francisco

Year

2014

CONFERENCE

Bilingual sentence selection strategies: comparative and combination in statistical machine translation systems

Abstract

Bilingual corpora are an indispensable resource for training translation models in statistical machine translation (SMT). However, it is not clear whether including all the available training data actually improves translation quality. Bilingual sentence selection aims to choose, from an available pool of sentences, the best subset of bilingual sentences with which to train an SMT system. This article studies, compares, and combines two data selection methods: the first is based on cross-entropy difference, and the second on infrequent n-gram occurrence. Experimental results show improvements over a system trained only on in-domain data. In addition, the results obtained with a system trained on the selected data are comparable to those obtained with a system trained on all the available data.
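
As a rough illustration of the first method, the sketch below ranks pool sentences by cross-entropy difference. It is not taken from the paper. It assumes the common formulation in which each sentence is scored by its per-word cross-entropy under an in-domain language model minus its per-word cross-entropy under a language model trained on the pool, with lower scores treated as more in-domain. The add-one smoothed unigram models, the monolingual scoring (one side of the corpus only), and all function names are simplifying assumptions made for this example. Sentences are assumed to be non-empty and whitespace-tokenized.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Return log-probabilities of an add-one smoothed unigram model over vocab."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def cross_entropy(model, sentence):
    """Per-word cross-entropy (in nats) of a non-empty sentence under the model."""
    words = sentence.split()
    return -sum(model[w] for w in words) / len(words)


def select_by_cross_entropy_difference(in_domain, pool, k):
    """Return the k pool sentences with the lowest cross-entropy difference.

    Illustrative assumption: score = H_in(s) - H_pool(s), lower is better.
    """
    # Shared vocabulary so that both models assign a probability to every word.
    vocab = {w for s in in_domain + pool for w in s.split()}
    lm_in = train_unigram(in_domain, vocab)
    lm_pool = train_unigram(pool, vocab)
    ranked = sorted(
        pool,
        key=lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_pool, s),
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy data invented for this example only.
    in_domain = ["the patient received the drug", "the drug reduced pain"]
    pool = [
        "the patient took the drug",
        "stocks fell on monday",
        "the market closed higher",
    ]
    print(select_by_cross_entropy_difference(in_domain, pool, 1))
```

In a bilingual setting, one would typically compute such a score on both the source and the target side and add the two. A real system would also use higher-order n-gram language models rather than unigrams.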