Skip to content

Deduplication in database queries

wricciot edited this page Aug 30, 2021 · 3 revisions

Links implements SQL queries using a normaliser that targets the multiset fragment of SQL. We have now introduced a new, experimental normaliser (the mixing normaliser) which supports queries mixing set and multiset operations, enabling the user to freely use deduplication within a query.

Usage

The mixing normaliser is enabled in one of two ways:

  • globally, by adding a mixing_norm=on statement to the Links configuration file: this will cause all flat queries to be processed by the mixing normaliser (nested queries introduced by the query nested statement are unaffected)

  • on a per query basis, for queries specified in a query mixing { ... } block.

When the mixing normaliser is enabled, the functions dedup and distinct can be used within queries. The former takes as input a list, and returns the same list where duplicate elements have been removed.

links> dedup;
fun : ([a]) -> [a]

The function distinct is similar, except that its input is a table handle (concretely distinct(t) is defined as dedup(asList(t))).

links> distinct;
fun : (TableHandle((|a::Base),(|_::Base),(|_::Base))) {}-> [(|a::Base)]

Lateral query inputs

Unlike homogeneous multiset queries, queries mixing sets and multisets can be defined using 'lateral' dependencies between inputs that cannot be simplified away using standard techniques (see [1]). The mixing normaliser will generate SQL queries using the keyword lateral when this is needed in order for a query to be legal SQL. For example the query

query mixing {
    for (x <-- factorials)
    for (z <- dedup(for (y <- distinct(factorials))
                where (x.i == y.f)
                [(f = x.f)]))
    [(f = z.f)]
}

after normalisation, yields the following SQL:

select z.f as f from 
  factorials as x, 
  lateral (select distinct x.f as f
    from (select distinct * from factorials) as y
    where x.i = y.f) as z

lateral is an SQL:1999 feature and is not supported on older DBMSs; furthermore, in some cases it may be executed inefficiently. The mixing normaliser implements a query transformation ('delateralisation') that can be used to only produce SQL queries that will not use lateral. This transformation is enabled optionally on a per query basis, using a query delat { ... } block. Delateralisation enables the use of older, non SQL:1999 compliant DBMSs, and may in some cases lead to more efficient execution of queries.

Limitations

  • In order for a query to be translatable to SQL, dedup should only be used on expressions whose type is a flat list (e.g. lists of base types, or lists of tuples of base types, but not lists of lists of base types or lists of functions). If dedup is applied to other types, the evaluation will fail even if the type of the whole query is a flat list ([1] and [3] provide some insight on this limitation).

  • Presently, the mixing normaliser provides no support for ranges:

    links> query [2,1] mixing { for (f <-- factorials) [f] };
    ***: Error: Links_core.EvalMixingQuery.EvalMixingUnimplemented("Range is not (yet) supported by the new mixing normaliser")
    
  • The mixing normaliser is also unable to deal with nested queries: query nested will always use the standard normaliser, even when mixing_norm=on is specified in the configuration file.

References

  1. Wilmer Ricciotti, James Cheney: Mixing set and bag semantics. (2019)

  2. Wilmer Ricciotti, James Cheney: Strongly Normalizing Higher-Order Relational Queries. (2020)

  3. Wilmer Ricciotti, James Cheney: Query Lifting: Language-integrated query for heterogeneous nested collections. (2020)