Abstract
This work addresses cross-language high-similarity and near-duplicate
search, where, for a given document, a highly similar one is to be
identified in a large cross-language collection of documents. We propose
a concept-based similarity model for this problem that is very light in computation
and memory. We evaluate the model on three corpora of different nature
and two language pairs, English-German and English-Spanish, using the Eurovoc
conceptual thesaurus. We compare our model with two state-of-the-art models
and find that, although the proposed model is very generic, it produces competitive
results and is notably stable and consistent across the corpora.