Querying Incomplete Data: Complexity and Tractability via Datalog and First-Order Rewritings

To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints. With negation added, however, the problem becomes intractable. We concentrate on the complexity of certain answers under constraints and on efficiently answering queries outside the usual classes of (unions of) conjunctive queries by means of rewriting as Datalog and first-order queries. We first notice that there are three different ways in which query answering can be cast as a decision problem. We complete the existing picture and provide precise complexity bounds on all versions of the decision problem, for certain and best answers. We then study a well-behaved class of queries that extends unions of conjunctive queries with a mild form of negation. We show that for them, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general, Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first order.


Introduction
Answering queries over incomplete databases is crucial in many different scenarios such as data integration (Lenzerini 2002), data exchange (Arenas et al. 2014), inconsistency management (Bertossi 2011), data cleaning (Geerts et al. 2013), ontology-based data access (OBDA) (Bienvenu and Ortiz 2015), and many others. The common thread running through all these applications lies in computing certain answers (Imielinski and Lipski 1984). Intuitively, this produces answers that are true in all possible worlds, that is, in all complete databases that an incomplete database represents. An incomplete database in itself is a set of tuples with missing information, plus integrity constraints. One can think, for example, of relations with nulls on which keys can be specified. Then, a possible world is obtained by substituting values for nulls so that all the keys are satisfied.
The notion of certain answers is sometimes too restrictive (for example, for some queries no answers are certain). In that case, an alternative is best answers: for them, there is no other tuple that is an answer in more possible worlds. However, computationally one encounters serious problems with both approaches. To start with, computing certain answers and best answers is intractable for first-order queries (Abiteboul et al. 1991; Libkin 2018), already for data complexity. Finding such answers in restricted subclasses of first-order queries often relies on sophisticated algorithms (not naturally expressible by other queries) that are therefore difficult to implement in a DBMS. We know that restricting to unions of conjunctive queries allows one to overcome this difficulty by using naïve evaluation, which computes certain answers in polynomial time (Imielinski and Lipski 1984). This amounts to evaluating queries over incomplete databases as if nulls were usual data values, thus merely using the standard database query engine to compute certain answers.
We address these problems in the present paper whose goal is two-fold.
1. We start by filling gaps in our knowledge of the complexity of answering queries over incomplete databases. Intractable bounds on certain and best answers cited above were obtained under different formulations of query answering as a decision problem. We show that there are three natural ways to represent query answering as a decision problem and classify the complexity of certain and best answers for all of them.
2. We then look at a way of finding query answers by leveraging the existing database technology, namely by finding query rewritings which, when evaluated on the incomplete database, give us certain answers. We show that for a class extending unions of conjunctive queries with a form of negation (but still falling short of all first-order queries), such rewritings can be found in Datalog with negation, thus giving us a tractable complexity bound.
To elaborate on the first point, the two existing decision versions of the query answering problem are as follows: (a) is a tuple in the answer? and (b) is the answer a member of a given family of sets? We add a third: (c) is the answer equal to a given set? We then prove that for certain answers, the problem is coNP-complete for (a), $P^{NP[\log n]}$-complete for (b), and DP-complete for (c). The result for (a) has long been known of course. For best answers, the complexity is uniform: $P^{NP[\log n]}$-complete for all variations (the result for (b) was previously known). We shall define these complexity classes in the next section; for the reader not familiar with them, they all lie within the second level of the polynomial hierarchy.
For the second theme of the paper, we look at query rewritings. This is a standard way of leveraging database technology in the case of incomplete or imperfect information, and such rewritings were heavily used in data integration, data exchange, OBDA, query answering using views, consistent query answering, etc. (Calvanese et al. 2000; 2007; Calì et al. 2013; 2003b). First-order rewritings are particularly useful, as they allow one to use the power of standard database query engines. In fact, when they exist, the rewritten queries can be implemented in any relational query engine by expressing them in SQL, with no need to implement ad hoc algorithms. Next best are rewritings into Datalog (with negation): these let us express queries using recursive features of SQL.
As already mentioned, for unions of conjunctive queries (and even for some mild extensions with guarded negation; Gheerbrant et al. 2014) certain answers are computed by naïve evaluation in the absence of constraints. Under constraints, even such simple ones as keys, the picture is less complete. Indeed, keys and in general equality-generating dependencies (EGDs) change the syntactic shape of a query that makes naïve evaluation work.
• Certain answers to a conjunctive query Q (or a union of CQs) on a database D under key constraints Σ can be found by naïve evaluation of Q on the result of the chase of D with Σ. Mathematically, $\mathrm{cert}_\Sigma(Q, D) = Q(\mathrm{chase}_\Sigma(D))$, where on the left-hand side we have certain answers under constraints and on the right-hand side the naïve evaluation of Q over the result of the chase. Here, $\mathrm{chase}_\Sigma$ refers to the classical textbook chase procedure with keys or more generally functional dependencies. In fact, the above result applies when Σ is a set of functional dependencies or equality-generating dependencies (EGDs), not just keys.
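The identity $\mathrm{cert}_\Sigma(Q, D) = Q(\mathrm{chase}_\Sigma(D))$ can be illustrated on a toy instance. The following is a minimal sketch, not the textbook chase in full generality: it assumes a single binary relation R whose first attribute is a key, represents nulls as strings and constants as integers, and uses the hypothetical helper names `chase_key` and `phi`.

```python
def is_null(x):
    # Convention of this sketch: nulls are strings, constants are integers.
    return isinstance(x, str)

def chase_key(R):
    """Chase a binary relation R with the key 'first attribute determines second'."""
    R = set(R)
    while True:
        # Look for two tuples that agree on the key but differ on the second attribute.
        clash = next(((y1, y2) for (x1, y1) in R for (x2, y2) in R
                      if x1 == x2 and y1 != y2), None)
        if clash is None:
            return R
        a, b = clash
        if not is_null(a) and not is_null(b):
            raise ValueError("chase fails: two distinct constants would be equated")
        # Keep a constant if there is one; otherwise keep the smaller null label.
        if is_null(a) and is_null(b):
            keep, drop = min(a, b), max(a, b)
        elif is_null(a):
            keep, drop = b, a
        else:
            keep, drop = a, b
        # Replace the dropped null everywhere; each round removes one null.
        R = {tuple(keep if c == drop else c for c in t) for t in R}

def phi(R):
    """Naive evaluation of the CQ phi(x, y) = exists z (R(x, z) and R(z, y))."""
    return {(x, y) for (x, z) in R for (z2, y) in R if z == z2}

D = {(1, "⊥1"), ("⊥2", 3), (0, "⊥1"), (0, "⊥2")}
before = phi(D)             # naive evaluation ignoring the key
after = phi(chase_key(D))   # naive evaluation on the chased instance
```

On this instance the key forces the two nulls to be equal, so the chased instance yields the additional answer (1, 3) that naïve evaluation on D alone misses.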
Unfortunately, the above result does not work when we move outside the class of select-project-join-union queries or unions of CQs. In fact even without constraints, certain answers to a query of the form $Q_1 - Q_2$, where both $Q_1$ and $Q_2$ are CQs, are not necessarily produced by naïve evaluation. To see why, take a database containing one fact R(1, ⊥) where ⊥ is a null, and let $Q_1$ return R while $Q_2$ is given by the formula R(x, y) ∧ x = y. Here, naïve evaluation of $Q_1 - Q_2$ returns R (as 1 is not equal to ⊥), while the set of certain answers is empty (as 1 is a possible value for ⊥, in which case $Q_2$ returns the only tuple of R).
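The mismatch between naïve evaluation and certain answers for this difference query can be checked by brute force. The sketch below assumes a finite pool of constants {0, 1} for the null (enough to exhibit the valuation sending ⊥ to 1); the function names are illustrative and not taken from the paper.

```python
from itertools import product

def is_null(x):
    # Convention of this sketch: nulls are strings, constants are integers.
    return isinstance(x, str)

def q1_minus_q2(R):
    # Q1 returns R; Q2 returns the tuples of R whose two components are equal.
    return {(x, y) for (x, y) in R if not x == y}

def apply_valuation(v, t):
    return tuple(v.get(a, a) for a in t)

def certain_answers(R, pool):
    adom = {a for t in R for a in t}
    nulls = sorted(a for a in adom if is_null(a))
    valuations = [dict(zip(nulls, vals)) for vals in product(pool, repeat=len(nulls))]
    # A candidate tuple is certain if its image is an answer in every possible world.
    return {t for t in product(adom, repeat=2)
            if all(apply_valuation(v, t) in q1_minus_q2({apply_valuation(v, s) for s in R})
                   for v in valuations)}

R = {(1, "⊥")}
naive = q1_minus_q2(R)                  # the null behaves as a fresh constant
cert = certain_answers(R, pool=[0, 1])  # the valuation ⊥ -> 1 refutes every candidate
```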
This motivates our question whether we can extend the class of CQs and their unions to obtain tractable evaluation of certain answers under constraints such as functional dependencies and EGDs. The answer is positive; in fact the query of the form $Q_1 - Q_2$ above will be an example of a query in this class. To start with, the class must be such that finding certain answers for its queries without constraints is already tractable. We know one such class: it consists of arbitrary Boolean combinations of CQs, not just their union. We shall denote it by BCCQ. It was proved in Gheerbrant and Libkin (2015) that certain answers for it can be found in polynomial time (for data complexity), though the procedure was tableau-based and not suitable for implementation in a database system.
This is precisely what we do in the second part of this paper. We establish three main results:
1. For an arbitrary BCCQ Q and a set of EGDs Σ, one can construct a Datalog (with negation) query Q′ whose naïve evaluation computes certain answers, thereby ensuring their polynomial-time data complexity.
2. There are however simple BCCQs, in fact even CQs, and keys, such that certain answers cannot be expressed by first-order queries. Therefore, using Datalog was necessary.
3. Without constraints present, certain answers to BCCQs are not only polynomial-time computable, as had been shown previously, but can also be expressed by first-order queries and thus efficiently implemented in SQL databases without using recursion.
Sections 4.2, 4.3, and 4.4 address these items, respectively. Note that the material in this paper is based on the two conference papers Gheerbrant and Sirangelo (2019) and Gheerbrant et al. (2022).

Incomplete databases and constraints
We represent missing information in relational databases in the standard way using nulls (Abiteboul et al. 1995; Imielinski and Lipski 1984; van der Meyden 1998). Incomplete databases are populated by constants and nulls, coming respectively from two countably infinite sets Const and Null. We denote nulls by ⊥, sometimes with sub- or superscripts. We also allow them to repeat, thus adopting the model of marked nulls, as customary in the context of applications such as OBDA or data integration and exchange.
A relational schema, or vocabulary σ, is a set of relation names with associated arities. A database D over σ associates to each relation name of arity k in σ a k-ary relation which is a finite subset of $(\mathrm{Const} \cup \mathrm{Null})^k$. Sets of constants and nulls occurring in D are denoted by Const(D) and Null(D). A database is complete if it contains no nulls, that is, Null(D) = ∅. The active domain of D is the set of all values appearing in D, that is adom(D) = Const(D) ∪ Null(D).
A valuation v : Null(D) → Const on a database D is a map that assigns constant values to nulls occurring in D. By v(D) and v(ā), we denote the result of replacing each null ⊥ by v(⊥) in a database D or in a tuple ā. The semantics [[D]] of an incomplete database D is the set {v(D) | v is a valuation on D} of all complete databases it can represent. Here, as is common in research on incomplete data, we use the closed-world assumption (Imielinski and Lipski 1984; Reiter 1977), i.e., everything we do not know to be true is automatically assumed to be false and no new tuple can be added.
An equality-generating dependency (EGD) is a first-order sentence of the form $\forall \bar x\,(\varphi(\bar x) \to z = z')$, where $\varphi(\bar x)$ is a conjunction of atoms (without constants), each variable in $\bar x$ occurs in some atom of $\varphi$, and $z, z'$ are distinct variables in $\bar x$. As a special case, a functional dependency (FD) over a relation name R is of the form $\forall \bar x, \bar y, \bar y', z, z'\,(R(\bar x, \bar y, z) \wedge R(\bar x, \bar y', z') \to z = z')$. Throughout this paper, we will assume that a (possibly empty) set of EGDs Σ is associated with the database schema σ.
A valuation v is consistent with Σ (or just consistent, when Σ is clear from the context) if $v(D) \models \Sigma$. We denote by V(D) the set of all consistent valuations defined on D.
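To make valuations and the set V(D) concrete, here is a small brute-force sketch. It assumes a finite pool of constants standing in for Const, a single binary relation whose first attribute is a key, and nulls written as strings; all names are illustrative.

```python
from itertools import product

def is_null(x):
    # Convention of this sketch: nulls are strings, constants are integers.
    return isinstance(x, str)

def valuations(D, pool):
    """Enumerate all maps from the nulls of D to constants of the (finite) pool."""
    nulls = sorted({a for t in D for a in t if is_null(a)})
    for vals in product(pool, repeat=len(nulls)):
        yield dict(zip(nulls, vals))

def apply_valuation(v, D):
    """v(D): replace every null by its image under v."""
    return {tuple(v.get(a, a) for a in t) for t in D}

def satisfies_key(R):
    """The FD 'first attribute determines second' on a binary relation."""
    return all(y1 == y2 for (x1, y1) in R for (x2, y2) in R if x1 == x2)

D = {(1, "⊥1"), (1, "⊥2"), ("⊥2", 2)}
pool = [1, 2]
all_vals = list(valuations(D, pool))
consistent = [v for v in all_vals if satisfies_key(apply_valuation(v, D))]
```

Out of the four valuations over this pool, only the one mapping both nulls to 2 yields a complete database satisfying the key.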

Query answering
An m-ary query Q of active domain adom(Q) ⊆ Const is a map that associates with a database D a subset of $(\mathrm{adom}(D) \cup \mathrm{adom}(Q))^m$. To answer an m-ary query Q over an incomplete database D, we follow Lipski (1984) and adopt a slight generalization of the usual intersection-based certain answers notion, defined as $\bigcap_v Q(v(D))$, and furthermore incorporate constraints into query answering.
The set of certain answers to Q over D, with respect to a set of constraints Σ, is
$$\mathrm{cert}_\Sigma(Q, D) = \{\bar a \text{ over } \mathrm{adom}(D) \mid v(\bar a) \in Q(v(D)) \text{ for every } v \in V(D)\}.$$
For queries that explicitly use constants, we shall expand this to allow ā to range over adom(D) and those constants. The only difference with the usual notion is that we allow answers to contain nulls, to avoid pathological situations when answers known with certainty are not returned (e.g., in a query returning a relation R one would expect R to be returned, while the intersection-based certain answer will only return null-free tuples).
If the set of constraints Σ is empty, we omit it and write simply cert(Q, D). Of course, every valuation is consistent with the empty set of constraints.
Following Libkin (2018), given a query Q, a database D, a set of constraints Σ, and a tuple ā over adom(D) ∪ adom(Q), we let the support of ā be the set of all valuations that witness it:
$$\mathrm{Supp}_\Sigma(Q, D, \bar a) = \{v \in V(D) \mid v(\bar a) \in Q(v(D))\}.$$
Again, if Σ = ∅ we omit the subscript.
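Supports thus measure how close a tuple is to certainty. We consider one answer to be better than another if it has more support. That is, given a database D, a k-ary query Q, and k-tuples $\bar a, \bar b$ over adom(D) ∪ adom(Q), we let
$$\bar a \lhd_{Q,D} \bar b \iff \mathrm{Supp}_\Sigma(Q, D, \bar a) \subsetneq \mathrm{Supp}_\Sigma(Q, D, \bar b).$$
The set of best answers to Q over D is defined as the set of answers for which there is no better one:
$$\mathrm{best}_\Sigma(Q, D) = \{\bar a \mid \text{there is no } \bar b \text{ such that } \bar a \lhd_{Q,D} \bar b\}.$$
As the set of certain answers to Q over D is the set of answers that are witnessed by all valuations, note that it could also be defined using the notion of support. Namely,
$$\mathrm{cert}_\Sigma(Q, D) = \{\bar a \mid \mathrm{Supp}_\Sigma(Q, D, \bar a) = V(D)\}.$$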
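Supports, certain answers, and best answers can be computed by brute force on small instances. The sketch below assumes no constraints, unary relations R and S stored as sets of values, the query Q(x) = R(x) − S(x), and a finite pool made of the constants of D plus one fresh constant (an assumption that is adequate for this generic query with a single null). A valuation is identified with the tuple of values it assigns to the nulls.

```python
from itertools import product

def is_null(x):
    # Convention of this sketch: nulls are strings, constants are integers.
    return isinstance(x, str)

def query(db):
    # Q(x) = R(x) - S(x) over unary relations stored as sets of values.
    return db["R"] - db["S"]

def support(db, a, pool):
    """Valuations (as tuples of values for the sorted nulls) that witness a."""
    nulls = sorted({x for rel in db.values() for x in rel if is_null(x)})
    supp = set()
    for vals in product(pool, repeat=len(nulls)):
        v = dict(zip(nulls, vals))
        world = {name: {v.get(x, x) for x in rel} for name, rel in db.items()}
        if v.get(a, a) in query(world):
            supp.add(vals)
    return supp

db = {"R": {1, 2}, "S": {"⊥"}}
pool = [1, 2, 3]                       # constants of D plus one fresh constant
adom = {x for rel in db.values() for x in rel}
n_nulls = len({x for x in adom if is_null(x)})
supp = {a: support(db, a, pool) for a in adom}

# Certain answers are supported by every valuation.
cert = {a for a in adom if len(supp[a]) == len(pool) ** n_nulls}
# Best answers have a support that is not strictly contained in another one.
best = {a for a in adom if not any(supp[a] < supp[b] for b in adom)}
```

Here no answer is certain, yet both 1 and 2 are best answers: each is supported by two of the three valuations, and their supports are incomparable.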

Naïve evaluation and certain answers
For a query Q written in FO or Datalog, we write Q(D) to mean that such a query is evaluated naïvely. That is, if D contains nulls, nulls of D are treated as new constants in the domain of D, distinct from each other, and distinct from all the other constants in D and Q. For example, the query ϕ(x, y) = ∃z (R(x, z) ∧ R(z, y)), on the database $D = \{R(1, \bot_1), R(\bot_1, \bot_2), R(\bot_3, 2)\}$, selects only the tuple $(1, \bot_2)$.
There are known connections between naïve evaluation and certain answers. If Σ is empty and Q is a union of conjunctive queries, then $\mathrm{cert}_\Sigma(Q, D) = Q(D)$, see Imielinski and Lipski (1984). If Σ is a set of EGDs, then $\mathrm{cert}_\Sigma(Q, D) = Q(\mathrm{chase}_\Sigma(D))$; cf. Greco et al. (2012). Here $\mathrm{chase}_\Sigma$ refers to the standard chase procedure with a set of EGDs, see Abiteboul et al. (1995).

Query languages
Here we shall study best and certain answers to first-order (FO) queries, possibly in the presence of constraints, by means of their rewriting in FO and Datalog. FO queries of vocabulary σ use atomic relational and equality formulae and are closed under Boolean connectives ∧, ∨, ¬ and quantifiers ∃, ∀. We write $\varphi(\bar x)$ for an FO-formula ϕ with free variables $\bar x$. With slight abuse of notation, $\bar x$ will denote both a tuple of variables and the set of variables occurring in it. The set of constants used by ϕ is denoted by adom(ϕ). We interpret FO-formulas under active domain semantics, that is, quantified variables range over adom(D) ∪ adom(ϕ). Thus, an FO-formula $\varphi(\bar x)$ represents a query (of active domain adom(ϕ)) mapping each database D into the set of tuples $\{\bar t \text{ over } \mathrm{adom}(D) \cup \mathrm{adom}(\varphi) \mid D \models \varphi(\bar t)\}$.
To evaluate FO-formulas with free variables, we use assignments ν from variables to constants in the active domain. Note that with a little abuse of notation, we write $D \models \varphi(\bar t)$ for $D \models_\nu \varphi(\bar x)$ under the assignment ν sending $\bar x$ to $\bar t$.
Here it is important to note that the query associated with ϕ is a mapping defined on all databases D, possibly with nulls. If D contains nulls, $D \models \varphi(\bar t)$ is to be understood "naïvely," that is, nulls of D are treated as new constants in the domain of D, distinct from each other, and distinct from all the other constants in D and ϕ. For example, the query ϕ(x, y) = ∃z (R(x, z) ∧ R(z, y)), on the database $D = \{R(1, \bot_1), R(\bot_1, \bot_2), R(\bot_3, 2)\}$, selects only the tuple $(1, \bot_2)$.
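The naïve reading of this example can be reproduced directly: nulls are just treated as extra values that are equal only to themselves. The sketch below assumes nulls are encoded as strings and constants as integers; the function name `phi` is illustrative.

```python
# The database D = {R(1, ⊥1), R(⊥1, ⊥2), R(⊥3, 2)}, with nulls encoded as strings.
D = {(1, "⊥1"), ("⊥1", "⊥2"), ("⊥3", 2)}

def phi(R):
    """Naive evaluation of phi(x, y) = exists z (R(x, z) and R(z, y)).

    Nulls are compared syntactically: a null joins only with itself.
    """
    return {(x, y) for (x, z) in R for (z2, y) in R if z == z2}

answers = phi(D)
```

Only the first two facts join (on the shared null ⊥1), which gives the single answer (1, ⊥2).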
Conjunctive queries (CQs) are given by the ∃, ∧-fragment of FO, and their unions (UCQs) by the ∃, ∧, ∨-fragment of FO; these are also captured by the positive fragment of relational algebra (select-project-union-join queries).
We also consider Boolean combinations of conjunctive queries (BCCQs), that is, the closure of conjunctive queries under operations q ∩ q′, q ∪ q′, and q − q′.
A Datalog rule (Abiteboul et al. 1995) is an expression of the form $R_1(u_1) \leftarrow R_2(u_2), \ldots, R_n(u_n)$ where n ≥ 1, $R_1, \ldots, R_n$ are relation names and $u_1, \ldots, u_n$ are tuples of appropriate arities. Each variable occurring in $u_1$ must occur in at least one of $u_2, \ldots, u_n$. A Datalog program is a finite set of Datalog rules. The head of the rule is the expression $R_1(u_1)$, and $R_2(u_2), \ldots, R_n(u_n)$ forms the body. The semantics is the standard fixed-point semantics.
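As an illustration of the fixed-point semantics, the following sketch evaluates one particular two-rule program, T(x, y) ← E(x, y) and T(x, y) ← E(x, z), T(z, y), by iterating its immediate-consequence step until nothing new is derived. The program and the function name are chosen for illustration only.

```python
def transitive_closure(E):
    """Least fixed point of the rules
         T(x, y) <- E(x, y)
         T(x, y) <- E(x, z), T(z, y)
    computed by naive bottom-up iteration."""
    T = set()
    while True:
        # One application of both rules to the facts derived so far.
        new = set(E) | {(x, y) for (x, z) in E for (z2, y) in T if z == z2}
        if new == T:        # nothing new was derived: fixed point reached
            return T
        T = new

E = {(1, 2), (2, 3), (3, 4)}
T = transitive_closure(E)
```

Since the step is monotone and only finitely many facts over the active domain exist, the loop terminates after polynomially many rounds in the size of E.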
As the language of our rewritings, we shall be using FO, but also a fragment of stratified Datalog with negation in bodies that can be seen in two different ways.
1. A program is evaluated in two steps. First, we can have a Datalog program P defining new idb predicates $S_1, \ldots, S_\ell$. Then, we ask an FO query over the schema extended with these predicates $S_1, \ldots, S_\ell$.
2. We evaluate a stratified Datalog with negation program in which the first stratum has no negation (but may have recursion) and the second stratum has no recursion (but may have negation).
From the rewritings we produce, it will be clear that they fall in these classes. The key point about them is that they can be implemented in recursive SQL and that they both have PTIME data complexity, making their evaluation feasible. Note that recursive SQL as it is currently implemented, for example, in PostgreSQL 8.4, is actually Turing complete (Gierth 2011; Coelho 2013).
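The two-step shape described above can be sketched on a toy program: a first, negation-free recursive stratum computes reachability, and a second, non-recursive step applies negation to the result. The concrete query (pairs of nodes not connected by a path) and the function names are assumptions made for illustration.

```python
def reach(E):
    """First stratum (recursion, no negation): T is the transitive closure of E."""
    T = set()
    while True:
        new = set(E) | {(x, y) for (x, z) in E for (z2, y) in T if z == z2}
        if new == T:
            return T
        T = new

def not_reachable(E):
    """Second stratum (negation, no recursion): pairs of nodes with no path between them.

    This corresponds to the FO query  Node(x) and Node(y) and not T(x, y)
    asked over the schema extended with the idb predicate T.
    """
    T = reach(E)
    nodes = {a for edge in E for a in edge}
    return {(x, y) for x in nodes for y in nodes if (x, y) not in T}

E = {(1, 2), (2, 3)}
result = not_reachable(E)
```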

Complexity classes
In order to study the complexity of best and certain answer computation, we shall need two classes in the second level of the polynomial hierarchy. Both of these contain NP and coNP and are contained in $\Sigma^p_2 \cap \Pi^p_2$. The class DP consists of languages $L_1 \cap L_2$ where $L_1 \in$ NP and $L_2 \in$ coNP. This class has appeared in database applications (Fagin et al. 2005; Barceló et al. 2014). The class $P^{NP[\log n]}$ consists of problems that can be solved in polynomial time with a logarithmic number of calls to an NP oracle (Buss and Hay 1991). Equivalently, it can be described as the class of problems solved in P with an NP oracle where calls to the oracle are done in parallel, that is, independent of each other. This class has appeared in the context of AI, modal logic, OBDA (Gottlob 1995; Eiter and Gottlob 1997; Calvanese et al. 2006; Bienvenu and Bourgaux 2016), and data exchange (Arenas et al. 2013).

Complexity of best and certain answers
We start by looking at the complexity of certain and best answers of first-order queries and answer a few questions that are (perhaps somewhat surprisingly) missing in the literature. In this case, we look at arbitrary first-order queries; thus, we do not mention constraints since $\mathrm{cert}_\Sigma(Q, D) = \mathrm{cert}(\Sigma \to Q, D)$ and likewise for best answers. In the subsequent sections, when we consider rewritings for sublanguages of first order, we shall again mention constraints explicitly since queries of the form Σ → Q will normally not belong to the same syntactic class as Q itself. In this context, we will refer to $\mathrm{Answer}_\Sigma(Q)$.
As is common in database theory, we look at complexity in terms of complexity classes, which necessitates looking at decision versions of problems. The most common one that is found, stated here for ⋆ ∈ {Certain, Best}, is the following problem:
  Problem: ⋆Answer(Q)
  Input: A database D, a tuple ā
  Question: Is ā ∈ ⋆(Q, D)?
We are thus interested in data complexity: the query is fixed. We do not study combined complexity in this paper. In the remainder, we thus often omit the query and write ⋆Answer instead of ⋆Answer(Q). Recall that in this case, for a language L, we say that the problem ⋆Answer is C-complete in data complexity for a complexity class C, if ⋆Answer(Q) is solvable in C for every Q ∈ L, and there exists a specific $Q_0 \in L$ so that ⋆Answer($Q_0$) is hard for C. We know from Abiteboul et al. (1991) that CertainAnswer(Q) is coNP-complete in data complexity for first-order queries.
For best answers, it is a different version of the decision problem for which the complexity is known. Specifically, Libkin (2018) considered the problem of checking whether the set ⋆(Q, D) belongs to a specified family of sets:
  Problem: ⋆Answer$_\in$(Q)
  Input: A database D, a family $\mathcal{X}$ of sets of tuples
  Question: Is ⋆(Q, D) ∈ $\mathcal{X}$?
For this decision version, the complexity of the problem was shown to be $P^{NP[\log n]}$-complete. This version looks a bit artificial, but we include it for the sake of completeness, because it has appeared in the literature.
However, this version of the decision problem suggests another rather natural one, namely asking if a given set is ⋆(Q, D):
  Problem: ⋆Answer$_=$(Q)
  Input: A database D, a set X of tuples
  Question: Is ⋆(Q, D) = X?
Our current state of knowledge is the complexity of CertainAnswer (coNP-complete) and BestAnswer$_\in$ ($P^{NP[\log n]}$-complete). Thus, we now fill the gap and classify complexities of all the problems, for data complexity, in the case of FO queries.
We start by showing that all the alternatives for best answers (BestAnswer$_\Sigma$, BestAnswer$_=$, and BestAnswer$_\in$) are computationally equivalent.
Theorem 3.1 For FO queries, the problems BestAnswer$_\Sigma$, BestAnswer$_\in$, and BestAnswer$_=$ are $P^{NP[\log n]}$-complete in data complexity.

Proof
The upper bound for BestAnswer$_=$ immediately follows from the upper bound for BestAnswer$_\in$ (take the family $\mathcal{X}$ to be a singleton {X}). As for BestAnswer$_\Sigma$, we only need a slight modification of the upper bound proof in Libkin (2018). To check whether ā ∈ best(Q, D) we proceed as follows. Since the query is fixed, and therefore has fixed arity k, in polynomial time we can enumerate all the k-tuples of adom(D). Then, using parallel calls to the NP oracle, we can check for each such tuple $\bar b$ whether Supp(Q, D, ā) ⊆ Supp(Q, D, $\bar b$) and whether Supp(Q, D, $\bar b$) ⊆ Supp(Q, D, ā). With this information, in polynomial time we know whether $\bar a \lhd_{Q,D} \bar b$ for some $\bar b$.
Assuming Σ empty, we prove the two remaining lower bounds, reducing from the same $P^{NP[\log n]}$-complete problem (Wagner 1990): given an undirected graph G, is its chromatic number χ(G) odd? With each undirected graph G = ⟨N, E⟩ with nodes N and edges E, we associate a database $D_G$ over binary relations L, E and unary relations C, O as follows. We use a null
Remark that any valuation v of $D_G$ that maps each null into a constant of C represents an assignment of colors in $\{c_1, \ldots, c_m\}$ to nodes of G. Then, we define a query
For any valuation v, φ(c) holds in $v(D_G)$ iff
(1) $c = c_j$ for some j = 1..m (ensured by the first conjunct).
(2) For such a $c_j$, the valuation v maps each null into $\{c_1, \ldots, c_j\}$ (second conjunct), that is, v represents an assignment of colors to nodes of G, using at most the first j colors.
(3) Each color in $\{c_1, \ldots, c_j\}$ is used by v, that is, v represents an assignment of colors to nodes of G, using precisely the first j colors (third conjunct).
(4) There are no loops in E (fourth conjunct).
Thus, for a valuation v, the formula $\varphi(c_j)$ is true in $v(D_G)$ iff v represents a coloring of G using precisely the first j colors $\{c_1, \ldots, c_j\}$ (which in the sequel we refer to as an exact j-coloring of G).
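The notion of exact j-coloring, and the source problem of the reduction (parity of the chromatic number), can be made concrete with an exponential-time brute-force check. This sketch works directly on graphs rather than on the database encoding; colors are represented by the integers 1..j and the function names are illustrative.

```python
from itertools import product

def is_exact_coloring(edges, colouring, j):
    """True iff `colouring` is proper and uses precisely the colors 1..j."""
    used = set(colouring.values())
    proper = all(colouring[u] != colouring[w] for (u, w) in edges)
    return used == set(range(1, j + 1)) and proper

def chromatic_number(nodes, edges):
    """Smallest j admitting an exact j-coloring; None if there is a loop or no node."""
    nodes = sorted(nodes)
    for j in range(1, len(nodes) + 1):
        for colours in product(range(1, j + 1), repeat=len(nodes)):
            if is_exact_coloring(edges, dict(zip(nodes, colours)), j):
                return j
    return None

triangle = chromatic_number({1, 2, 3}, {(1, 2), (2, 3), (1, 3)})
square = chromatic_number({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 4), (4, 1)})
```

The triangle has an odd chromatic number and the 4-cycle an even one, so they are a positive and a negative instance of the parity problem, respectively.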
Next, we define:
For a valuation v, we have that $Q(c_i)$ holds in $v(D_G)$ iff either v represents an exact i-coloring of G; or v represents an exact j-coloring of G with j odd, and i ≤ j. In other words, valuations representing exact j-colorings, with j even, support only the maximal color $c_j$; while valuations representing exact j-colorings, with j odd, support all colors $\{c_1, \ldots, c_j\}$.
With this in place, we can conclude the reduction for the BestAnswer$_\Sigma$ problem:
Claim. $c_1 \in \mathrm{best}(Q, D_G)$ if and only if χ(G) is odd.
First, we prove the above claim. Let $\chi_G$ be the chromatic number of G. Then, there exist no exact colorings of G which are proper prefixes of $\{c_1, \ldots, c_{\chi_G}\}$, while $\{c_1, \ldots, c_{\chi_G}\}$ is an exact coloring of G.
Assume first that $\chi_G$ is even. Then, there exist no valuations representing the exact coloring $\{c_1\}$. Thus, the support of $c_1$ is the set of valuations representing an exact coloring $\{c_1, \ldots, c_j\}$ of G with j odd and $j > \chi_G$. This support is not maximal. In fact, the support of $c_{\chi_G}$ consists of:
• the valuations representing the exact coloring $\{c_1, \ldots, c_{\chi_G}\}$ (there exists at least one);
• the valuations representing an exact coloring $\{c_1, \ldots, c_j\}$ of G with j odd and $j > \chi_G$.
This support strictly contains the support of $c_1$; in fact valuations in the first item cannot also be in the second.
Assume now that $\chi_G$ is odd. Then, the support of $c_1$ is the set of valuations representing an exact coloring $\{c_1, \ldots, c_j\}$ of G with j odd and $j \ge \chi_G$. We show that this set is maximal; that is, no color $c_k$ can have a support strictly containing it.
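• if $k \le \chi_G$ then the support of $c_k$ is the set of valuations representing an exact coloring $\{c_1, \ldots, c_j\}$ of G with j odd, and $j \ge \chi_G$. So $c_k$ has the same support as $c_1$.
• if $k > \chi_G$ then the support of $c_k$ contains no valuation representing the exact coloring $\{c_1, \ldots, c_{\chi_G}\}$. There exists at least one such valuation and it belongs to the support of $c_1$. Thus, the support of $c_k$ does not contain the support of $c_1$.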
We now move to BestAnswer$_=$. With any undirected graph G, we associate a relational structure $D'_G$ obtained from $D_G$ by adding a new color $c_0$ in C with $L(c_0, c_i)$ for every 0 ≤ i ≤ m. We define a restriction ψ of the original formula φ by disallowing $c_0$ in colorings: to obtain ψ it suffices to replace L(y, x) in φ by $L(y, x) \wedge y \neq c_0$, and
We define a new query:
Note that x + 2 < y is used as a shorthand, as it is definable in our language.
• either i is odd and v represents an exact j-coloring of G, with j odd and i ≤ j;
• or i is even and: either v represents an exact coloring $\{c_1, \ldots, c_j\}$ of G with j odd, and i + 2 < j; or v represents an exact coloring $\{c_1, \ldots, c_i\}$ of G; or i < m and $v(\bot_j) = c_0$ for all 1 ≤ j ≤ m.
The following claim completes the reduction for BestAnswer$_=$:
In the following, we call $v_0$ the unique valuation such that $v_0(\bot_j) = c_0$ for all 1 ≤ j ≤ m. First assume that $\chi_G$ is even. For all 0 . The inclusion holds whenever $c_i \ge \chi(G)$, as $\mathrm{Supp}(c_i)$ contains all valuations representing exact colorings $\{c_1, \ldots, c_i\}$ of G, while no other $\mathrm{Supp}(c_j)$ with $i \neq j$ contains them. Now take $c_i < \chi(G)$ with i even, then $\mathrm{Supp}(c_i)$ contains $v_0$ together with all exact odd colorings (if there are any). First assume that there exist odd exact colorings of G, so there are χ(G) + 1 ones and valuations representing them are not contained in Supp(χ(G)). Also, $v_0 \notin \mathrm{Supp}(c_k)$ with k odd and k < χ(G). It follows that $\mathrm{Supp}(c_i)$, which is the union of $v_0$ and of all valuations representing odd exact colorings, is maximal. Now assume that there is no exact odd coloring. This corresponds to the special case χ(G) = m where $\mathrm{Supp}(c_m)$ contains only the exact colorings $\{c_1, \ldots, c_m\}$ of G, but not $v_0$; while $\mathrm{Supp}(c_j) = \emptyset$ whenever j is odd. In such a case, $\mathrm{Supp}(c_i) = \{v_0\}$ is also maximal.
We assume now χ(G) is odd and show $\{c_i \mid i \text{ is even}\} = \mathrm{best}(Q', D'_G)$. First notice that $\mathrm{Supp}(c_1)$ is maximal whenever χ(G) = 1, as neither $\mathrm{Supp}(c_0)$ nor any $\mathrm{Supp}(c_i)$ with i ≥ 2 contain valuations representing the exact $\{c_1\}$ colorings. So we assume χ(G) ≥ 3, from which it follows that there exists a constant $c_{\chi(G)-3}$ in the active domain whose support contains $v_0$ together with all valuations representing exact odd colorings. As $\mathrm{Supp}(c_{\chi(G)-1})$ contains exactly the same set of valuations, to the exclusion of those representing {c
Now that we showed that all three formulations of best answers actually collapse computationally, another natural question arises. Does a similar result hold for certain answers? It is well known that data complexity of CertainAnswer$_\Sigma$ is coNP-complete for FO queries (Abiteboul et al. 1991). We complete the picture as follows and summarize results in Figure 1.
Theorem 3.2 For FO queries, the problem CertainAnswer$_=$ is DP-complete and the problem CertainAnswer$_\in$ is $P^{NP[\log n]}$-complete in data complexity.

Proof
To prove membership of CertainAnswer$_=$ in DP, notice that for a query Q, this problem is the intersection of two languages: the language of pairs (D, X) such that X ⊆ cert(Q, D), which is in coNP, and the language of pairs (D, X) such that cert(Q, D) ⊆ X, which is in NP. To prove membership of CertainAnswer$_\in$ in $P^{NP[\log n]}$, suppose the query Q is k-ary, and we are given a family of sets of k-ary tuples $\mathcal{X} = \{X_1, \ldots, X_n\}$ and a database D. For each $X_i \in \mathcal{X}$, we use the NP oracle to decide in parallel whether $X_i = \mathrm{cert}(Q, D)$ (for each $X_i$, the two calls to the oracle do not depend on each other and they can also be done in parallel).
For DP-hardness, we reduce from the problem of checking whether χ(G), the chromatic number of an undirected graph G, equals 4 (Rothe 2003) and for $P^{NP[\log n]}$-hardness, we reduce from the related problem of checking whether χ(G) is odd. With such a graph G, we associate the same database $D_G$ as in the proof of Theorem 3.1. Using the exact coloring formula ϕ in the proof of Theorem 3.1, we define a query
a color and there is no i < j such that v represents an exact i-coloring of the graph, which holds exactly whenever $c_j \in \{c_1, \ldots, c_{\chi(G)}\}$.

Query rewritings for tractable fragments
Considering arbitrary FO queries brought us an intrinsic intractability result for all variants of the considered decision problems. This motivates restricting to well-behaved fragments such as CQs and UCQs. Recall that conjunctive queries (CQs) are given by the ∃, ∧-fragment of FO, and their unions (UCQs) by the ∃, ∧, ∨-fragment of FO. We extend them with a mild form of negation (since adding negation leads to coNP-hardness of certain answers). This mild form comes in the shape of Boolean combinations of conjunctive queries (BCCQs), that is, the closure of conjunctive queries under operations q ∩ q′, q ∪ q′, and q − q′.
If there are no constraints in Σ, finding certain answers to BCCQs is known to be tractable (Gheerbrant and Libkin 2015), though by tableau-based techniques that are hard to implement in a database system. We now extend this in two ways. First, we show that tractability is preserved even in the presence of EGDs (and thus functional dependencies and keys). Second, we show that certain answers can be obtained by rewriting into a fragment of Datalog as described in Section 2. In particular, it means that certain answers can be found by a query expressible in recursive SQL (and even in SQL in the absence of constraints).
For BestAnswer$_\Sigma$ a polynomial-time evaluation algorithm (in data complexity) already exists (Libkin 2018). The resolution-based procedure is however in sharp contrast with naïve evaluation, which allows one to compute certain answers to unions of conjunctive queries via usual model checking. We thus show how to apply our query rewriting techniques to the best answers problem.

A normal form for queries: neutralizing variable repetition
Towards our rewritings, we start by putting each conjunctive query in a normal form which eliminates repetition of variables, by introducing new equality atoms.

Definition 4.1 (NRV normal form)
A conjunctive query Q is in non-repeating variable normal form (NRV normal form) whenever it is of the form $Q(\bar x) = \exists \bar w\,(q(\bar w) \wedge e(\bar x, \bar w))$ where variables in $\bar x \bar w$ are pairwise distinct, and:
• $q(\bar w)$ is a conjunction of relational atoms without constants, where each free variable in $\bar w$ has at most one occurrence in q,
• $e(\bar x, \bar w)$ is a conjunction of equality atoms, possibly using constants, where each variable of $\bar x$ is involved in at least one equality.
We say that $q(\bar w)$ is the relational subquery of Q, and $e(\bar x, \bar w)$ is the equality subquery of Q. Clearly every conjunctive query Q is equivalent to a query in NRV normal form; moreover, Q can be easily rewritten in NRV normal form (in linear time in the size of the query). Thus in what follows, we assume w.l.o.g. that conjunctive queries are given in NRV normal form. Intuitively, the NRV normal form allows us to separate the two ingredients of a conjunctive query: the existence of facts in some relations of the database, on the one side, and a set of equality conditions on data values occurring in these facts, on the other side. The existence of facts does not depend on the valuation of nulls and thus can be directly tested on the incomplete database. Instead, equality atoms in an NRV normal form imply conditions that valuations need to satisfy in order for the query to hold. We can thus first concentrate on the support of equality subqueries. This will be encoded in FO and then integrated in the rewriting of the whole conjunctive query.
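One possible linear-time rewriting into NRV normal form is sketched below. It is an illustration under stated assumptions rather than the construction used in the paper: a CQ is given as a list of free variables and a list of atoms, variables are strings, constants are integers, every free variable occurs in some atom, and the fresh names w1, w2, ... are assumed not to clash with the original variable names.

```python
def to_nrv(head, atoms):
    """Rewrite a CQ into (relational subquery q, equality subquery e).

    head:  list of free variable names
    atoms: list of (relation_name, terms), terms being variables (str) or constants (int)
    """
    fresh = 0
    first_occurrence = {}   # original variable -> fresh variable of its first occurrence
    q, e = [], []
    for rel, terms in atoms:
        new_terms = []
        for t in terms:
            fresh += 1
            w = f"w{fresh}"                        # every position gets its own variable
            new_terms.append(w)
            if isinstance(t, int):
                e.append((w, t))                   # constant moved into an equality
            elif t in first_occurrence:
                e.append((w, first_occurrence[t])) # repeated variable becomes an equality
            else:
                first_occurrence[t] = w
        q.append((rel, tuple(new_terms)))
    # Each free variable is tied to the position where it first occurred.
    e.extend((x, first_occurrence[x]) for x in head)
    return q, e

# Q(x) = exists y (R(x, y) and R(y, 1))
q, e = to_nrv(["x"], [("R", ("x", "y")), ("R", ("y", 1))])
```

For this query the relational subquery is R(w1, w2) ∧ R(w3, w4) and the equality subquery is w3 = w2 ∧ w4 = 1 ∧ x = w1.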
We introduce a notion of equivalence of database elements w.r.t. a set of equalities γ. Intuitively, equivalent elements of a tuple $\bar t$ are the ones which should be collapsed into a single value in order for a valuation of $\bar t$ to satisfy all the given equalities. In what follows, we denote by $\sim_\gamma$ the reflexive symmetric transitive closure of {(x, w) | x = w ∈ γ}. Note that this is an equivalence relation among variables and constants of γ. We will provide two syntactic encodings of this relation, one in Datalog and one in FO.
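The relation $\sim_\gamma$ is simply the partition generated by the equality atoms, which can be computed with a union-find structure. The sketch below is independent of the Datalog and FO encodings discussed in the text; the function name and the encoding of γ as a list of pairs are assumptions.

```python
def equivalence_classes(elements, equalities):
    """Classes of the reflexive symmetric transitive closure of `equalities`.

    elements:   variables and constants of gamma
    equalities: pairs (a, b) standing for the atoms a = b of gamma
    """
    parent = {a: a for a in elements}

    def find(a):
        # Follow parent pointers to the representative, with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in equalities:
        parent[find(a)] = find(b)      # merge the two classes

    classes = {}
    for a in elements:
        classes.setdefault(find(a), set()).add(a)
    return {frozenset(c) for c in classes.values()}

gamma = [("x", "w1"), ("w1", "w2"), ("w3", 1)]
classes = equivalence_classes(["x", "w1", "w2", "w3", "w4", 1], gamma)
```

Elements not mentioned in any equality, such as w4 here, form singleton classes.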

Datalog rewriting for certain answers for BCCQs with EGDs
Recall that, given a query Q, a database D, and a tuple ā over adom(D) ∪ adom(Q), we let the support of ā be the set of all valuations that witness it: Supp(Q, D, ā) = {v | v(ā) ∈ Q(v(D))}, where v ranges over consistent valuations of D. In order to look for rewritings of BCCQs, a key observation is that ā is a certain answer to Q iff Supp(¬Q, D, ā) = ∅. When Q is a BCCQ, so is ¬Q; thus we look for ways of expressing (non-)emptiness of the support for BCCQs.
We start by concentrating on the support of equality subqueries. This will be encoded in Datalog and then integrated, as a key ingredient, in the rewriting of the whole query. We let γ(ȳ) be an arbitrary set of equality atoms among variables ȳ and possibly constants. Intuitively, we will be interested in the case where γ(ȳ) is the equality subquery e(x̄, w̄) of a CQ in NRV normal form (thus notice that in the Datalog program below ȳ encompasses the variables x̄w̄ of an equality subquery).
Remark that we can always write an EGD so that no variable in its body occurs more than once; it suffices to add to the body a set of variable equalities. Thus, we assume that EGDs in Σ are of the form ∀ū((ϕ(ū) ∧ ψ) → z = z′), where z, z′ are in ū, the conjunction of atoms ϕ(ū) contains no constants, no variable occurs twice in ϕ(ū), and ψ is a set of equalities among variables of ū. Remark also that membership in the set adom(D) ∪ adom(γ) can be expressed by a UCQ formula that we call Dom(x). We encode equivalence of database elements in adom(D) ∪ adom(γ) w.r.t. a set of equalities γ(ȳ) using the following Datalog program¹: […]
Intuitively, if t̄ is a tuple of database elements assigned to ȳ, equivalent elements of D are the ones which should be collapsed into a single value in order for a valuation of D to satisfy all the equalities γ(t̄) and the EGDs. For fixed γ and t̄, the relation {(s, s′) | D |= equiv_γ(t̄, s, s′)} is an equivalence relation over adom(D) ∪ adom(γ) where each element of adom(D) neither in t̄ nor in adom(γ) forms a singleton equivalence class.
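The fixpoint described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Datalog program itself: the function equiv_classes, the encoding of an EGD as a triple (body atoms, ψ, head), and the toy database are all hypothetical. The sketch starts from the equalities γ(t̄), closes them under symmetry and transitivity with a union-find, and fires an EGD whenever the elements matched by each equality of ψ are already equivalent. It deliberately ignores the distinction between nulls and constants.

```python
from itertools import product


def equiv_classes(db, egds, seed_equalities):
    """Naive fixpoint: least partition of the elements closed under the
    seed equalities and the EGDs. Returns the union-find `find`."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    for a, b in seed_equalities:          # the equalities gamma(t)
        union(a, b)

    changed = True
    while changed:
        changed = False
        for body, psi, (z, z2) in egds:
            # No variable repeats in the body, so a match simply picks
            # one fact per body atom.
            for facts in product(*(db[rel] for rel, _ in body)):
                mu = {}
                for (_, variables), fact in zip(body, facts):
                    mu.update(zip(variables, fact))
                if all(find(mu[w]) == find(mu[w2]) for w, w2 in psi):
                    if union(mu[z], mu[z2]):
                        changed = True
    return find


# Hypothetical data: E(r1,a), E(r2,b), E(a,c), E(b,d), all elements nulls.
db = {"E": {("r1", "a"), ("r2", "b"), ("a", "c"), ("b", "d")}}
# FD "first attribute of E determines the second", written as an EGD:
# E(x, y) AND E(x2, y2) AND x = x2  ->  y = y2
fd = ([("E", ("x", "y")), ("E", ("x2", "y2"))], [("x", "x2")], ("y", "y2"))

find = equiv_classes(db, [fd], [("r1", "r2")])
plain = equiv_classes(db, [fd], [])
```

With the seed equality r1 = r2, the FD forces a and b together, and then c and d; without it, nothing is merged. The loop is quadratic or worse in the size of the relations, so it is meant for reading rather than for use on real data.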
The formula equiv_γ is a key ingredient in our rewriting; as formalized in the following lemma, it selects precisely the pairs of elements that a consistent valuation needs to collapse to satisfy a set of equalities.
Lemma 4.5 Let γ(ȳ) be a conjunction of equality atoms, D a database, and ν(ȳ) = t̄ an assignment over adom(D) ∪ adom(γ). Assume v is a consistent valuation of nulls; then v(D) |= γ(v(t̄)) if and only if v(s) = v(s′) for all s, s′ such that D |= equiv_γ(t̄, s, s′).
Proof
Assume first that v(D) |= γ(v(t̄)), and let s, s′ be such that D |= equiv_γ(t̄, s, s′). We prove v(s) = v(s′). We proceed by induction on the derivation of equiv_γ(t̄, s, s′) by the fixpoint evaluation of the Datalog program. Assume equiv_γ(t̄, s, s′) is derived at the first iteration; then it follows from one of the first two rules. If it is derived by the first rule, then s = s′ and therefore v(s) = v(s′) trivially. If it is derived using the second rule, then there exists (y_k = y_l) ∈ γ with s = t_k and s′ = t_l; now since v(D) |= γ(v(t̄)), we have v(t_k) = v(t_l). Now assume that equiv_γ(t̄, s, s′) is derived at some subsequent step. If it follows from the second rule, then it follows from […] for each (w = w′) ∈ ψ(ū), and s = μ(z) and s′ = μ(z′); so by the induction hypothesis v(μ(w) […]
[…] In fact, for each (y_k = y_l) ∈ γ, we have D |= equiv_γ(t̄, t_k, t_l) (derived by the second rule). By our hypothesis v(t_k) = v(t_l), thus γ(v(t̄)) holds.
¹ Queries we write hereafter can be domain dependent, so it is important to recall that we always use active domain semantics.
Formulas we write in the remainder are over signature σ ∪ Null, where σ is the database schema. In any incomplete database D over σ ∪ Null, Null is always interpreted by the set of nulls occurring in D (in accordance with the semantics of the SQL construct IS NULL). That is, we allow rewritings to test whether a database element is null or not.
For γ(ȳ) a conjunction of equality atoms, using equiv_γ we define a new formula comp_γ(ȳ) stating the existence of a consistent valuation that collapses all equivalent elements of a tuple: […]
Proposition 4.7 Let γ(ȳ) be a conjunction of equality atoms, D a database, and ν(ȳ) = t̄ an assignment over adom(D) ∪ adom(γ); then D |= comp_γ(t̄) if and only if there exists a consistent valuation v of nulls such that v(D) |= γ(v(t̄)). Moreover, if such a valuation exists, there exists one further satisfying v(s) = v(s′) iff D |= equiv_γ(t̄, s, s′), for all s, s′ ∈ adom(D) ∪ adom(γ).
[…] Since {(s, s′) | D |= equiv_γ(t̄, s, s′)} is an equivalence relation over adom(D) ∪ adom(γ), its equivalence classes form a partition of this set. In each equivalence class, there is at most one constant, so we define a valuation v mapping all nulls of a class to the unique constant of that class (or to a new fresh constant if the class does not contain any). Note that v has the property that v(s) = v(s′) iff D |= equiv_γ(t̄, s, s′); this allows us to prove that v is a consistent valuation.
We have proved that v satisfies the characterization of Lemma 4.5, and can conclude that v(D) |= γ(v(t̄)).
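The valuation built in this proof can be sketched in Python as follows, assuming the equivalence classes have already been computed. The class Null, the function collapsing_valuation and the naming scheme for fresh constants are hypothetical choices made for the illustration; in particular, the sketch assumes that no existing constant is called "fresh_0", "fresh_1", and so on.

```python
class Null:
    """A marked null; any element that is not a Null is a constant."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Null({self.name})"


def collapsing_valuation(classes):
    """Map every null to the unique constant of its class, or to a fresh
    constant when the class has none. Returns None when some class holds
    two distinct constants, since no valuation can collapse it."""
    valuation = {}
    fresh = 0
    for cls in classes:
        constants = {e for e in cls if not isinstance(e, Null)}
        if len(constants) > 1:
            return None
        if constants:
            target = next(iter(constants))
        else:
            # Assumption: names of the form fresh_k are unused constants.
            target = f"fresh_{fresh}"
            fresh += 1
        for e in cls:
            if isinstance(e, Null):
                valuation[e] = target
    return valuation


n1, n2, n3 = Null(1), Null(2), Null(3)
v = collapsing_valuation([{n1, 1}, {n2, n3}, {2}])
clash = collapsing_valuation([{1, 2, n1}])
```

Here n1 is sent to the constant 1, while n2 and n3 share one fresh constant; distinct classes receive distinct values, which mirrors the property v(s) = v(s′) iff s and s′ are equivalent. The second call returns None because its only class contains the two constants 1 and 2.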
So far, we have dealt with equality subqueries and we have characterized the emptiness and inclusion of their supports (cf. Propositions 4.7 and 4.8, respectively). We can now use this machinery to characterize the support of a BCCQ. We start by expressing membership in the support of an individual CQ:
Lemma 4.9 Let D be a database, v a consistent valuation of D and Q(x̄) a conjunctive query in NRV normal form, with relational subquery q(w̄) and equality subquery γ(x̄, w̄). Then, v ∈ Supp(Q, D, r̄) if and only if there exists s̄ such that D |= q(s̄) ∧ comp_γ(r̄s̄) and v(D) |= γ(v(r̄s̄)).
https://doi.org/10.1017/S1471068423000364 Published online by Cambridge University Press
Recall that q(w̄) is a conjunction of relational atoms, with no constants and where each of the free variables w̄ has at most one occurrence in q. Thus, there exists a mapping ν of w̄ over adom(D) such that D |= q(ν(w̄)) and v(ν(w̄)) = μ(w̄). We let s̄ = ν(w̄). Recall that x̄ and w̄ do not share variables, so we can extend the mapping ν by setting ν(x̄) = r̄. We thus have that ν(x̄w̄) = r̄s̄ and v(r̄s̄) = μ(x̄w̄). It follows that v is a consistent valuation for which v(D) |= γ(v(r̄s̄)); then by Proposition 4.7, we also have D |= comp_γ(r̄s̄).
In the remainder, we consider BCCQs Q(x̄) := Q_1(x̄) ∨ … ∨ Q_n(x̄) in NRV disjunctive normal form (DNF) where for all 1 ≤ i ≤ n: […] and for all 1 ≤ j ≤ m: […] For convenience, we assume w.l.o.g. that every conjunction of literals has the same length m. We can also assume without loss of generality that for each i we have adom(γ_ij) = adom(γ_i0) for all j. In fact, we can always pad any γ_ij with dummy equalities c = c to extend its active domain.
Given a disjunct Q_i in a BCCQ in DNF, we now define poss_{Q_i}, encoding the set of possible answers to Q_i, and cons_{Q_i}, checking the compatibility of an answer with the negative literals in Q_i.
Moreover, we can prove the following claim:
Claim 4.11 For every conjunction of equalities γ′(ȳ) with adom(γ′) = adom(γ_i0) and all t̄ over adom(D) […]
Theorem 4.12 (Datalog rewriting) Let D be a database whose schema contains a set of equality-generating dependencies Σ, and let Q(x̄) be a BCCQ in NRV normal form.

Corollary 4.13
For each fixed BCCQ Q and set of EGDs Σ, the problem CertainAnswer_Σ(Q) is in PTIME.

Non-rewritability in FO
The starting point for our investigation was the fact that cert_Σ(Q, D) = Q(chase_Σ(D)) for a CQ Q and a set Σ of FDs, for every database D. This remained true for unions of CQs, but failed for BCCQs, forcing us to produce a Datalog rewriting to obtain certain answers. But can a first-order rewriting be obtained instead? This would make it possible to produce certain answers using the core of SQL as opposed to its recursive features, which do not always perform as well in practice.
In this section, we show that the answer, in general, is negative even for CQs (and thus for BCCQs). In the next section, however, we show that such rewritings can be obtained in FO for BCCQs whenever Σ is empty.
The main result of this section is the following.
Theorem 4.14 There exist a Boolean CQ Q and a single FD Σ over a relational schema of binary and unary relations such that cert_Σ(Q, D) is not expressible as an FO query.

Proof
Consider a schema with one binary relation E and two unary relations A and B. […] To prove inexpressibility of cert_Σ(Q, ·) in FO, for each n > 0 we create two databases D_n and D′_n. In both of them, E is interpreted as a disjoint union T_1 ∪ T_2, where T_1 and T_2 are balanced binary trees of depth n in which all nodes are distinct nulls. In both, A and B are singleton sets. In D_n, the set A contains a leaf of T_1 and B contains a leaf of T_2. In D′_n, both A and B contain leaves of T_1 such that their only common ancestor in the tree is the root (in other words, they are leaves of subtrees rooted at different children of the root of T_1).
Because of the constraint Σ, for every valuation v such that the resulting database satisfies it, we have that both v(T_1) and v(T_2) are chains. Indeed, consider any node ⊥ with children ⊥_1 and ⊥_2: unless v(⊥_1) = v(⊥_2), the tuples (v(⊥), v(⊥_1)) and (v(⊥), v(⊥_2)) violate the constraint. Thus, v(⊥_1) = v(⊥_2), and applying this construction inductively we see that v(T_i) is a chain. Hence, it has a single leaf, and thus cert_Σ(Q, D′_n) is true, since A and B must be interpreted as that leaf. On the other hand, cert_Σ(Q, D_n) is false, since there is a valuation v that sends T_1 and T_2 into two disjoint chains, and thus A and B are interpreted as two distinct elements.
Assume now that cert_Σ(Q, ·) is rewritable as an FO sentence φ. Then, for every n > 0, we have D′_n |= φ and D_n |= ¬φ. We next show that such a sentence cannot exist, thereby proving non-FO-rewritability.
Recall that in a database (with one binary relation, as considered here), the radius r neighborhood of an element a is the restriction of the database to the set of all elements reachable from a by a path of length at most r, where the path does not take into account the orientation of edges of E (e.g. if we have E(a, b) and E(c, a), then both b and c are in the radius 1 neighborhood of a). When we say that two neighborhoods, of elements a and b, are isomorphic, we mean that there is an isomorphism between them that sends a to b. In other words, centers of neighborhoods are viewed as distinguished elements. It is known that each first-order sentence ψ is Hanf-local (Fagin et al. 1995): that is, there exists a number r > 0 such that for any two databases D_1 and D_2, if there is a bijection f between D_1 and D_2 such that the radius r neighborhoods of a in D_1 and of f(a) in D_2 are isomorphic for every element a, then D_1 and D_2 agree on ψ, that is, either both satisfy it or both do not.
Now let r be such a number for the sentence φ we assumed exists. Consider D_n and D′_n, and let T_1a, T_1* be the subtrees of the root of T_1 in D_n such that the first contains A while the second contains neither A nor B, and let T_2b, T_2* be defined similarly for the subtrees of the root of T_2 with respect to B. In D′_n we define T′_1a, T′_1b as the subtrees of the root of the tree containing A, B such that the first contains the A leaf and the second contains the B leaf, while T′_2*, T′_2** are the subtrees of the root of the tree having neither A nor B elements. Then, it is easy to see that the following pairs of trees are isomorphic: T_1a and T′_1a, T_2b and T′_1b, T_1* and T′_2*, T_2* and T′_2**.
We now define the bijection f as the union of those isomorphisms, plus the mapping sending the root of each tree T_i in D_n to the root of T_i in D′_n. It is an immediate observation that if n > r + 1 (i.e. leaves are not in the radius r neighborhood of children of roots), then f satisfies the condition that the neighborhoods of a and f(a) of radius r are isomorphic. This would tell us that D_n and D′_n agree on φ, but we know they do not. This contradiction completes the proof.
As a corollary to the proof, we obtain the following result showing that non-recursive SQL is incapable of computing cert_Σ(Q, D) in the setting of Theorem 4.14.

Corollary 4.15
There exist a Boolean CQ Q and a single FD Σ over a relational schema of binary and unary relations such that cert_Σ(Q, D) is not expressible in the basic SELECT-FROM-WHERE-GROUP BY-HAVING fragment of SQL with arbitrary aggregate functions. This is due to the fact that queries in this fragment of SQL with grouping and aggregation can be translated into a logic with aggregate functions (Libkin 2003), which itself is known to be Hanf-local (Hella et al. 2001).

FO rewriting for certain answers
We now focus on the special case where Σ is empty. First notice that the only Datalog component in our rewriting was the equiv_γ formula; moreover, notice that without constraints, equiv_γ simply computes a reflexive symmetric transitive closure. More precisely, for a given t̄ = ν(ȳ), one has that equiv_γ(t̄, s, s′) holds in D iff (s, s′) belongs to the reflexive symmetric transitive closure of {(ν(x), ν(w)) | x = w ∈ γ}. […] As Σ is empty, we can rewrite the equiv_γ formula in FO as follows, where m is the number of equivalence classes of ∼γ: […] u_i ∼γ v_i for all 1 ≤ i ≤ m […]
Proposition 4.17 Given an incomplete database D, a conjunction of equality atoms γ(ȳ) and an assignment ν(ȳ) = t̄ over adom(D) ∪ adom(γ), and given s, s′ in t̄ ∪ adom(γ), we have that D |= equivFO_γ(t̄, s, s′) if and only if s ≡^ν_γ s′.
Intuitively, this holds because each disjunct of equivFO_γ(t̄, s, s′) corresponds to a possible derivation of (s, s′) in the reflexive symmetric transitive closure of {(ν(x), ν(w)) | x = w ∈ γ}, and one can prove that there is a bound, depending only on γ, on the number of steps of this derivation.
In the general inductive case, there exists r such that (r, s′) = (ν(x), ν(w)) for some equality x = w (or w = x) in γ, with s ≡^ν_γ r derived at the previous step. By the induction hypothesis, D |= equivFO_γ(t̄, s, r). We can assume s ≠ r, since otherwise (s, s′) would be in the base relation. Therefore, D satisfies one of the disjuncts of equivFO_γ(t̄, s, r).
Then, there exists a sequence of m + 1 pairs in ȳ ∪ adom(γ) […] We now show that from this sequence of pairs one can construct another one of exactly m pairs, (u_i, v_i), i = 1..m, still connecting s and s′, that is, such that: […] The idea is to first cut the sequence (u_i, v_i), i = 1..m + 1, removing at least one pair, then pad it to size m if necessary.
In order to cut the original sequence, remark that it contains m + 1 pairs, where m is the number of ∼γ equivalence classes. Thus, there exist i < j such that u_i ∼γ u_j. We remove from the sequence all elements between u_i and v_j (excluded); the new sequence is […] Note that this sequence satisfies (a), (b) and (c) above since u_i ∼γ u_j ∼γ v_j. Let the new sequence contain k pairs. We know k ≤ m because we have removed at least one pair from the original sequence (recall i < j). If k < m, we pad the sequence on the right with m − k pairs (v_{m+1}, v_{m+1}). The new sequence still satisfies (a), (b), and (c); therefore, the corresponding disjunct of equivFO_γ(t̄, s, s′) is satisfied by D.

Example 4.18
Let γ := y_1 = y_2 ∧ z = x be the equality subquery of the query Q(x) in Example 4.2. Up to logical equivalence, equivFO_γ(y_1, y_2, z, x, w, w′) contains precisely the disjuncts w = w′, w = y_1 ∧ w′ = y_2, w = z ∧ w′ = x, w = y_1 ∧ w′ = x ∧ y_2 = z, plus all disjuncts obtained from them by applying one or more of the following transformations: switch w and w′, switch y_1 and y_2, switch x and z. Let D be the database from Example 2.1; then we have for instance […] As a consequence of Proposition 4.17 […] As in Section 4.2, equivFO_γ selects precisely the pairs of elements of a tuple that a valuation needs to collapse to satisfy a set of equalities.
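The bounded-derivation idea behind equivFO_γ can be checked on small inputs with the following Python sketch. The shape of the disjuncts used here is an assumption inferred from the example above and from the cut-and-pad argument: either s = s′, or there is a chain of exactly m pairs (u_i, v_i) with u_i ∼γ v_i, ν(u_1) = s, ν(v_i) = ν(u_{i+1}) and ν(v_m) = s′. The function names and the string encoding of values are hypothetical, and γ is assumed to mention variables only.

```python
from itertools import product


def gamma_classes(gamma):
    """Classes of ~gamma over the terms of gamma (small union-find)."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:
            t = parent[t]
        return t

    for a, b in gamma:
        parent[find(a)] = find(b)
    classes = {}
    for t in list(parent):
        classes.setdefault(find(t), []).append(t)
    return list(classes.values())


def equiv_fo(gamma, nu, s, s2):
    """Evaluate the assumed disjunction: s = s2, or some chain of m
    pairs inside ~gamma classes links s to s2 through nu."""
    if s == s2:
        return True
    classes = gamma_classes(gamma)
    m = len(classes)
    if m == 0:
        return False
    pairs = [(u, v) for cls in classes for u in cls for v in cls]
    for chain in product(pairs, repeat=m):
        if (nu[chain[0][0]] == s and nu[chain[-1][1]] == s2
                and all(nu[chain[i][1]] == nu[chain[i + 1][0]]
                        for i in range(m - 1))):
            return True
    return False


def closure(gamma, nu):
    """Reference: reflexive symmetric transitive closure of
    {(nu(x), nu(w)) | x = w in gamma}, by naive iteration."""
    rel = {(a, a) for a in nu.values()}
    rel |= {(nu[a], nu[b]) for a, b in gamma}
    rel |= {(nu[b], nu[a]) for a, b in gamma}
    changed = True
    while changed:
        new = {(a, d) for a, b in rel for c, d in rel if b == c}
        changed = not new <= rel
        rel |= new
    return rel


gamma = [("y1", "y2"), ("z", "x")]
nu_linked = {"y1": "a", "y2": "b", "z": "b", "x": "c"}  # y2, z share a value
nu_split = {"y1": "a", "y2": "b", "z": "d", "x": "c"}   # they do not
```

For nu_linked the chain (y1, y2), (z, x) links a to c, which corresponds to the disjunct w = y_1 ∧ w′ = x ∧ y_2 = z; for nu_split no chain of length m = 2 does. On both assignments the bounded test agrees with the unbounded closure on every pair of values.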
As a consequence, we can rewrite in FO the formula poss_{Q_i} of Subsection 4.2 encoding the set of possible answers to Q_i. It is enough to replace each occurrence of the Datalog program equiv_γ(ȳ, z, z′) in it by equivFO_γ(ȳ, z, z′). We denote by possFO_{Q_i} the rewriting so obtained.

Theorem 4.19 (FO rewriting)
Let D be a database, Σ = ∅ and let Q(x) be a BCCQ in NRV normal form.
Note that tractability of BCCQs was already proved in Gheerbrant and Libkin (2015) using tableau-based methods. We now refine complexity as follows.

FO rewriting for best answers
Considering arbitrary FO queries brought us an intrinsic intractability result for all variants of best answers. This motivates restricting to unions of conjunctive queries, for which a polynomial-time evaluation algorithm (in data complexity) already exists (Libkin 2018). The resolution-based procedure is however in sharp contrast with naïve evaluation, which allows one to compute certain answers to unions of conjunctive queries via usual model checking. We thus initiate a descriptive complexity analysis of the best answers problem, showing that for unions of conjunctive queries it can essentially be reduced, modulo a preprocessing of the query, to (naïve) evaluation of an FO-formula.
Given a union of conjunctive queries Q, our starting point towards an FO rewriting for best answers is finding an FO-formula Q_⊆(x̄, ȳ) encoding the inclusion of supports, that is, selecting tuples s̄, t̄ over adom(D) ∪ adom(Q) iff Supp(Q, D, s̄) ⊆ Supp(Q, D, t̄). From Q_⊆, one can easily define an FO-formula selecting precisely all best answers to Q on D: best_Q(x̄) := […]
As in Section 4.2, we assume all CQs to be in NRV normal form. We can thus first concentrate on the support of equality subqueries. This will be encoded in FO and then integrated in the rewriting of the whole conjunctive query.
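To make the role of support inclusion concrete, here is a minimal Python sketch that selects best answers once supports are given explicitly. It assumes that a tuple is a best answer exactly when no other tuple has a strictly larger support, which is how non-membership in best(Q, D) is tested in the complexity argument of this section. The function best_answers and the toy supports are hypothetical and only illustrate the selection step, not the FO rewriting.

```python
def best_answers(supports):
    """Return the tuples whose support is maximal w.r.t. inclusion.

    `supports` maps each candidate tuple to the set of (identifiers of)
    valuations that witness it."""
    return {
        a for a, supp_a in supports.items()
        # `<` on sets is strict inclusion.
        if not any(supp_a < supp_b for supp_b in supports.values())
    }


# Hypothetical supports over three valuations v1, v2, v3.
supports = {
    "1": {"v1"},            # strictly contained in the support of "n2"
    "n2": {"v1", "v2"},     # maximal
    "2": {"v3"},            # incomparable with the others, hence maximal
}
best = best_answers(supports)
```

In this toy instance "1" is discarded because its support is strictly contained in that of "n2", while "2" is kept even though its support is small, since no support strictly contains it.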
We now go back to an arbitrary union of conjunctive queries of vocabulary σ in NRV normal form: […] where each Q_i is in NRV normal form with relational subquery q_i(ȳ_i, z̄_i) and equality subquery eq_i(x̄, ȳ_i, z̄_i).
Recall the formula comp_γ, defined in Section 4.2, stating the existence of a valuation that collapses all equivalent elements of a tuple: […] We now define a formula capturing the inclusion of supports between two conjunctions of equality atoms, which will be a crucial ingredient in our rewriting.
Combining Lemmas 4.5, 4.9, Propositions 4.7, 4.8, and Corollary 4.23, we get: […] This allows us to derive, for instance, Supp(Q, D, 1) ⊆ Supp(Q, D, ⊥_2) (as observed in Example 2.1). In fact, the subquery R(y_1) ∧ S(y_2, z) ∧ comp_γ(y_1, y_2, z, x) with free variables y_1, y_2, z, x selects on D the tuples (1, ⊥_2, ⊥_2, 1), (⊥_1, ⊥_2, ⊥_2, 1), and no other tuple with last element 1. Moreover, as shown in Example 4.22 […]
As a corollary of Theorem 4.26, for a union of conjunctive queries Q one can compute best(Q, D) by first computing the formula best_Q(x̄) from Q, then evaluating best_Q on D. Since the data complexity of FO query evaluation is in DLogSpace (and in fact in AC^0), this gives the following corollary:
Corollary 4.28 For each fixed union of conjunctive queries Q, the data complexity of BestAnswer_Σ is in DLogSpace.
Note that it was known from Libkin (2018) that the data complexity of computing best answers for unions of conjunctive queries is polynomial time. In terms of combined complexity (i.e. when Q, D and ā are all part of the input), the rewriting approach (i.e. the procedure of computing best_Q from Q and then evaluating best_Q on D) can easily be shown to be in PSPACE. In fact, it is well known that a first-order query φ can be evaluated on a database D in space at most qr(φ) log |D| + log |φ|, where qr(φ) is the quantifier rank of φ. Note that although best_Q has size exponential in Q, the quantifier rank of best_Q is linear in the size of Q. Thus whether ā ∈ best(Q, D) can be checked using space polynomial in |Q| and |D|. Moreover, one can easily check that best_Q can be computed from Q in space polynomial in |Q|. Since space-bounded computations can be composed without storing the intermediate output, computing best_Q from Q and then evaluating best_Q on D can be done overall in space polynomial in |Q| and |D|. The rewriting approach thus implies a PSPACE upper bound for the combined complexity of BestAnswer_Σ for unions of conjunctive queries. However, we can show that the problem actually sits at the third level of the polynomial hierarchy.
For hardness, we reduce from ∀∃∀3DNF, which is known to be Π^p_3-complete (Schaefer and Umans 2002). We take as input a ∀∃∀3DNF-formula of the form F := ∀z_1 … z_l ∃x_1 … x_k ∀y_1 … y_p ⋁_{i=1}^{n} conj_i, where each conj_i is a conjunction of 3 (not necessarily distinct) literals over the variables z_1, …, z_l, x_1, …, x_k, y_1, …, y_p.
D_F is of signature {S^4, C^2, A^2, B^3} as follows:
• The extension of S and A and B is fixed and does not depend on F: […]
Q_F is defined as follows. For each variable w of F, the conjunctive query Q_F will use variables w and w̄ (either quantified or free). For a literal α of F, the corresponding variable of Q_F will be denoted as enc(α). More precisely, if α = w is a positive literal, then enc(α) := w; otherwise, if α = ¬w, then enc(α) := w̄. We can prove that all tuples of the form (t̄, good) (which we refer to as good tuples) have the same support. This is given by the set of all consistent Boolean valuations (i.e. valuations of ⊥_i, ⊥̄_i in {0, 1} such that v(⊥_i) ≠ v(⊥̄_i) for all i). Moreover, we can prove that if there exists a tuple (t̄, bad) whose support contains all consistent Boolean valuations, then the support of (t̄, bad) strictly contains the support of good tuples. Therefore, any good tuple (including (0̄, good)) is a best answer iff for all tuples t̄ there exists a consistent Boolean valuation which is not in the support of (t̄, bad). We can finally show that the latter holds iff F is true.
Therefore, under standard complexity-theoretic assumptions, our rewriting approach is not optimal in terms of combined complexity, as is often the case with generic approaches. However, it has the advantage of exploiting standard FO query evaluation, which despite the PSPACE combined complexity is highly optimized in database systems and works well in practice.

Future work
Our rewriting techniques are closer to a practical implementation than the previous tableau-based method from Gheerbrant and Libkin (2015).This is due to their expressibility in recursive SQL (or even non-recursive in the case of Theorems 4.19 and 4.26).However, while theoretically feasible, an actual implementation will need additional techniques to achieve acceptable performance.To see why, notice that the first rule in the definition of equiv γ creates a cross product over the full active domain, that is, the set of all elements that appeared in the database.This of course will be prohibitively large.While this may appear to be a significant obstacle, a similar situation with computing or approximating certain answers is not new in the literature.For instance, the first approximation scheme for certain answers to SQL queries that appeared in Libkin (2016) has done exactly the same, and generated very large Cartesian products even for simple queries with negation.Nonetheless, an alternative was found quickly (Guagliardo and Libkin 2016) that completely avoided the need for such expensive queries, and it was shown to work well on several TPC-H queries.Thus, looking for a practical and implementable rewriting is one of the possible directions for future work.
As another open problem, we note that the query for which we have shown certain answers to be non-rewritable in FO has DLOGSPACE data complexity. Indeed, the problem is essentially reachability over trees, which can easily be encoded using deterministic transitive closure (Immerman 1987). Expressing DLOGSPACE problems only requires a language weaker than Datalog with negation. Thus, it is natural to ask whether a low-complexity Datalog fragment would be sufficient to express rewritings of BCCQs, or whether a separating example that is PTIME-complete can be found.
Another direction would be to investigate how our techniques can be extended to different semantics of incompleteness. We used here the closed-world semantics (Abiteboul et al. 1995; Imielinski and Lipski 1984; van der Meyden 1998), in which data values are the only missing information, but there are other possible semantics, for example those needed to cope with data inconsistencies (Calì et al. 2003a), where query rewritings could still be found. Quantitative variations of the notion of certainty, as proposed in Libkin (2018), could also be investigated.
Proposition 4.10 Let D be a database and Q(x̄) a DNF BCCQ in NRV normal form; then Supp(Q(x̄), D, r̄) ≠ ∅ if and only if D |= ⋁_{1≤i≤n} ∃w̄ poss_{Q_i}(r̄w̄).
[…], for fixed γ and t̄, the relation {(s, s′) | D |= equivFO_γ(t̄, s, s′)} is an equivalence relation over adom(D) ∪ adom(γ) where each element of adom(D) neither in t̄ nor in adom(γ) forms a singleton equivalence class.
Example 4.22 Let γ and D be as in Example 4.18. Let γ′ := y′_1 = y′_2 ∧ z′ = x′; then it follows from Example 4.18 that D |= imply […]
[…] t̄) and let v ∈ Supp(Q, D, s̄) be a valuation of D. By Lemma 4.9, there exist i, ā, b̄ such that D |= q_i(āb̄) ∧ comp_{eq_i}(s̄āb̄) and v(D) |= eq_i(v(s̄āb̄)). So by our assumption there exist j, ā′, b̄′ with D |= q_j(ā′b̄′) ∧ imply_{eq_i,eq_j}(s̄āb̄, t̄ā′b̄′), and by Corollary 4.23 D |= comp_{eq_j}(t̄ā′b̄′). Now let t_1, t_2 be such that D |= equiv_{eq_j}(t̄ā′b̄′, t_1, t_2). By D |= imply_{eq_i,eq_j}(s̄āb̄, t̄ā′b̄′), we have D |= equiv_{eq_i}(s̄āb̄, t_1, t_2), and by Lemma 4.5, […]
Theorem 4.29 For unions of conjunctive queries, the combined complexity of BestAnswer_Σ is Π^p_3-complete. Hardness already holds for conjunctive queries.
Proof
For membership, first note that one can check in Π^p_2 whether Supp(Q, D, ā) ⊆ Supp(Q, D, b̄) on input given by a database D, a UCQ Q, and tuples ā and b̄. In fact, in order to check Supp(Q, D, ā) ⊄ Supp(Q, D, b̄) one guesses a valuation v of D, then calls an NP oracle to check v(ā) ∈ Q(v(D)) and v(b̄) ∉ Q(v(D)). On input given by a database D, a UCQ Q, and a tuple ā, one can check ā ∉ best(Q, D) as follows. First guess a tuple b̄ over adom(D) of the same arity as ā; then, using two calls to a Σ^p_2 oracle, check that Supp(Q, D, ā) ⊆ Supp(Q, D, b̄) and Supp(Q, D, b̄) ⊄ Supp(Q, D, ā).
[…] then for each s ∈ adom(D) ∪ adom(γ) there exists at most one constant c such that D |= equivFO_γ(t̄, s, c). In fact, if for constants c_1 and c_2, D |= […]