Abstract. Bilingual corpora constitute an indispensable resource for
translation model training in statistical machine translation (SMT). However,
it is unclear whether using all the available training data actually
improves translation quality. Bilingual sentence selection aims to
select, from an available pool of bilingual sentences, the best subset
with which to train an SMT system. This article studies,
compares, and combines two kinds of data selection methods: the first
method is based on cross-entropy difference, and the second method
is based on infrequent n-gram occurrence. Experimental results show
improvements over a system trained only on in-domain data.
In addition, a system trained on the selected data obtains results
comparable to those of a system trained on all
the available data.