Abstract
Web templates are one of the main development resources for website engineers. Templates allow
them to increase productivity by plugging content into already formatted and prepared pagelets.
Templates are also useful for end users, because they provide uniformity and a common look and
feel across all webpages. However, from the point of view of crawlers and indexers, templates pose a
significant problem, because they usually contain irrelevant information such as advertisements,
menus, and banners. Processing and storing this information is likely to waste resources
(storage space, bandwidth, etc.). Measurements indicate that templates account for between 40% and
50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work,
we propose a novel method for automatic template extraction based on similarity analysis
of the DOM trees of a collection of webpages that are detected using menu information. Our
implementation and experiments demonstrate the usefulness of the technique.
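
As a rough illustration of what a similarity analysis between DOM trees can look like, the following minimal sketch keeps only the nodes that occur at the same position, with the same tag and attributes, in every page of a collection. It is not the method proposed in this work: the names `Node`, `TreeBuilder`, `parse`, `common`, and `preorder_tags` are hypothetical, the position-wise comparison of children is a simplifying assumption, and the step that selects the pages using menu information is not covered.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM node: tag name, sorted attributes, ordered children."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = tuple(sorted((k, v or "") for k, v in attrs))
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML text (an assumption-laden stand-in for a real DOM)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", [("value", text)]))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def common(nodes):
    """Return the top-down subtree shared by all nodes, or None if the roots differ."""
    first = nodes[0]
    if any(n.tag != first.tag or n.attrs != first.attrs for n in nodes[1:]):
        return None
    shared = Node(first.tag, first.attrs)
    # Assumption: children are compared position by position.
    for group in zip(*(n.children for n in nodes)):
        child = common(group)
        if child is not None:
            shared.children.append(child)
    return shared


def preorder_tags(node):
    """Flatten a tree into a list of tag names, parents before children."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(preorder_tags(child))
    return tags


if __name__ == "__main__":
    pages = [
        '<body><ul class="menu"><li>Home</li></ul><p>First article</p></body>',
        '<body><ul class="menu"><li>Home</li></ul><p>Second article</p></body>',
    ]
    candidate = common([parse(p) for p in pages])
    print(preorder_tags(candidate))
```

In this sketch the shared menu survives the comparison while the differing paragraph text is dropped, which is the intuition behind treating the common structure as a template candidate. A position-wise comparison is brittle when pages insert or remove sibling elements, so a practical implementation would need a more tolerant node mapping.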