An unsupervised perplexity-based method for boilerplate removal

Published online by Cambridge University Press:  21 February 2023

Marcos Fernández-Pichel*
Affiliation:
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Galicia (Spain)
Manuel Prada-Corral
Affiliation:
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Galicia (Spain)
David E. Losada
Affiliation:
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Galicia (Spain)
Juan C. Pichel
Affiliation:
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Galicia (Spain)
Pablo Gamallo
Affiliation:
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Galicia (Spain)
*Corresponding author. E-mail: marcosfernandez.pichel@usc.es

Abstract

The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems and deep neural networks. However, leveraging this data is challenging, since web content is plagued by so-called boilerplate: ads, incomplete or noisy text, and remnants of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of language models and their implicit knowledge about correctly formed text, and we demonstrate that perplexity is a valuable artefact that can improve both effectiveness and efficiency. Indeed, removing noisy parts leads to lighter AI or search solutions that remain effective and require substantially fewer resources. We illustrate the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approach.
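
To make the idea of perplexity-based filtering concrete, the following is a minimal sketch and not the authors' implementation or the API of their Python package. It assumes a toy bigram language model with add-one smoothing, whitespace tokenisation, and a simple cut-off rule in which any sentence whose perplexity exceeds a threshold is treated as boilerplate. All function names, the training sentences, and the threshold value are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over lower-cased, whitespace-split sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(sentence, unigrams, bigrams):
    """Perplexity of a sentence under an add-one-smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen tokens
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    n_transitions = len(tokens) - 1
    return math.exp(-log_prob / n_transitions)


def remove_boilerplate(sentences, unigrams, bigrams, threshold):
    """Keep only sentences whose perplexity does not exceed the cut-off (assumed rule)."""
    return [s for s in sentences if perplexity(s, unigrams, bigrams) <= threshold]


if __name__ == "__main__":
    # Hypothetical "clean" training text standing in for a real corpus.
    clean_corpus = [
        "the model removes noisy text from web pages",
        "the method keeps well formed text from web pages",
        "language models assign low perplexity to well formed text",
    ]
    uni, bi = train_bigram(clean_corpus)
    page = [
        "the model keeps well formed text",   # resembles the training text
        "home login cart menu share tweet",   # menu-like residue
    ]
    for line in page:
        print(f"{perplexity(line, uni, bi):8.2f}  {line}")
    print(remove_boilerplate(page, uni, bi, threshold=15.0))
```

In this sketch the menu-like line scores a higher perplexity than the well-formed sentence because none of its bigrams were seen in training, so the cut-off discards it. A realistic system would use a language model trained on a large corpus, and the cut-off would need tuning for each task.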

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Figure 1. Perplexity model for boilerplate removal.

Table 1. General statistics of search collections

Table 2. TREC 2019 Decision Track. Effect of different perplexity cut-off values (reported in brackets) on retrieval performance. The $\uparrow$/$\downarrow$ symbols indicate whether or not the method significantly improves over the baseline (no perplexity-based removal; Wilcoxon test, $\alpha=0.05$)

Table 3. TREC 2013 Web Track. The $\uparrow$/$\downarrow$ symbols indicate whether or not the method significantly improves over the baseline (no perplexity-based removal; Wilcoxon test, $\alpha=0.05$)

Figure 2. Space savings derived from using our model in the IR search task.

Table 4. Time measurements (HH:MM:SS) for the baseline and some perplexity-based variants

Table 5. Carbon emissions in kg of CO$_2$ for the baseline and some perplexity-based variants

Table 6. Classification results for the WebKB dataset. For each block, the top performer (F1 macro) is bolded

Table 7. CleanEval shared task results for two perplexity-based methods and the jusText algorithm. The percentage scores represent the average similarity between the webpages cleaned automatically and the ground truth webpages

Figure 3. Some of the Pyplexity capabilities, shown running in a Jupyter Notebook.

Figure 4. Demo webpage that showcases the capabilities of the methodology presented in this study.