Bilingual sentence selection strategies: comparative and combination in statistical machine translation systems

Conference paper

Abstract

Bilingual corpora are an indispensable resource for training translation models in statistical machine translation (SMT). However, it is unclear whether including all the available training data actually improves translation quality. Bilingual sentence selection aims to select, from an available pool of sentences, the best subset of bilingual sentences with which to train an SMT system. This article studies, compares, and combines two kinds of data selection methods: the first is based on cross-entropy difference, and the second on infrequent n-gram occurrence. Experimental results show improvements over a system trained only with in-domain data. In addition, the results obtained with a system trained on the selected data are comparable to those obtained with a system trained on all the available data.
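
To make the first selection criterion concrete, the following is a minimal sketch of cross-entropy difference scoring. It is an illustration, not the system described in the paper: it assumes add-one-smoothed unigram language models, whitespace tokenization, and scoring on the source side only, and the helper names (`train_unigram`, `cross_entropy`, `select`) and the toy sentences are hypothetical. Each pool sentence is scored by its cross-entropy under an in-domain model minus its cross-entropy under a model of the pool; lower scores indicate sentences that look more like in-domain text.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens in whitespace-tokenized sentences (assumed toy LM)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0  # empty sentences carry no information in this sketch
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select(pool, in_domain, k):
    """Return the k pool sentences with the lowest cross-entropy difference."""
    in_model = train_unigram(in_domain)
    pool_model = train_unigram(pool)
    # Shared vocabulary size, plus one slot reserved for unseen tokens.
    vocab_size = len(set(in_model[0]) | set(pool_model[0])) + 1

    def score(sentence):
        # In-domain cross-entropy minus pool cross-entropy; lower is better.
        return (cross_entropy(sentence, in_model, vocab_size)
                - cross_entropy(sentence, pool_model, vocab_size))

    return sorted(pool, key=score)[:k]


# Hypothetical toy data for illustration only.
in_domain = ["the patient received the drug", "the drug reduced the fever"]
pool = [
    "the patient took the drug",
    "the parliament approved the budget",
    "stocks fell on monday",
]
selected = select(pool, in_domain, 1)
ranked = select(pool, in_domain, 3)
```

In a bilingual setting, one plausible extension is to sum the source-side and target-side scores of each sentence pair, and to replace the unigram models with higher-order n-gram models; both choices are assumptions of this sketch rather than details given in the abstract.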