Distributed Subweb Specifications for Traversing the Web

Published online by Cambridge University Press:  25 April 2023

BART BOGAERTS
Affiliation:
Vrije Universiteit Brussel, Belgium (e-mails: Bart.Bogaerts@vub.be, bas.ketsman@vub.be, younes.zeboudj@vub.be)
BAS KETSMAN
Affiliation:
Vrije Universiteit Brussel, Belgium (e-mails: Bart.Bogaerts@vub.be, bas.ketsman@vub.be, younes.zeboudj@vub.be)
YOUNES ZEBOUDJ
Affiliation:
Vrije Universiteit Brussel, Belgium (e-mails: Bart.Bogaerts@vub.be, bas.ketsman@vub.be, younes.zeboudj@vub.be)
HEBA AAMER
Affiliation:
Universiteit Hasselt, Hasselt, Belgium (e-mail: heba.mohamed@uhasselt.be)
RUBEN TAELMAN
Affiliation:
Ghent University – imec – IDLab, Belgium (e-mails: ruben.taelman@ugent.be, ruben.verborgh@ugent.be)
RUBEN VERBORGH
Affiliation:
Ghent University – imec – IDLab, Belgium (e-mails: ruben.taelman@ugent.be, ruben.verborgh@ugent.be)

Abstract

Link traversal–based query processing (ltqp), in which a sparql query is evaluated over a web of documents rather than a single dataset, is often seen as a theoretically interesting yet impractical technique. However, at a time when the hypercentralization of data has increasingly come under scrutiny, a decentralized Web of Data with a simple document-based interface is appealing, as it enables data publishers to control their data and access rights. While ltqp allows evaluating complex queries over such webs, it suffers from performance issues (due to the high number of documents containing data) as well as information quality concerns (due to the many sources providing such documents). In existing ltqp approaches, the burden of finding sources to query is entirely in the hands of the data consumer. In this paper, we argue that to solve these issues, data publishers should also be able to suggest sources of interest and guide the data consumer toward relevant and trustworthy data. We introduce a theoretical framework that enables such guided link traversal and study its properties. We illustrate with a theoretical example that this can improve query results and reduce the number of network requests. We evaluate our proposal experimentally on a virtual linked web with specifications and indeed observe that not just the data quality but also the efficiency of querying improves.

Information

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Results 1: Possible results of ltqp of the query in Query 1 with https://uma.ex/ as seed.

Table 1. Value of link path expressions

Fig. 1. Example wold used in ldql inexpressivity proof.

Fig. 2. Schema of the virtual linked web used in the experiments (from https://www.npmjs.com/package/ldbc-snb-decentralized).

Table 2. Performance results for (Eval. Query 1). In evaluating ${c_\textsf{All}}$ for this query, the experiment yielded no results for one of the twelve tested seed sources: the number of triples collected during the link traversal phase was so large that the sparql query engine crashed. Hence, the number of triples, the query evaluation time, and the number of results for ${c_\textsf{All}}$ are averages over the other eleven runs.

Table 3. Performance results for (Eval. Query 2). In evaluating ${c_\textsf{All}}$ for this query, the experiment yielded no results for three of the twelve tested seed sources, for the same reason as given in Table 2. Hence, the number of triples, the query evaluation time, and the number of results for ${c_\textsf{All}}$ are averages over the other nine runs.

Table 4. Performance results for (Eval. Query 3).

Table 5. Performance results for (Eval. Query 4).

Table A1. The subweb specifications used in the experiment. The specifications marked with * are modified in swsl1 and swsl2.