A database system usually consists of several tables. For instance, the following is the table of transactions.

A table usually doesn’t contain all information. For instance, the table above does not contain credit card information and shipping address. This is because a buyer may buy multiple things at once, so we avoid repetition by storing those information in a different table.

At the end of the day, we want to post-process the data. For example, we want to move money from the buyers’ credit cards to the sellers’ bank accounts, so we need a report containing all money-related information from across the tables.

Seller	Good	Buyer
Alice	Apple	Adam
Bob	Banana	Betty
Cindy	Cherry	Charlie

Buyer	Credit Card	Address
Adam	3141-5926	53 Amsterdam Rd
Betty	2718-2818	28 Barcelona Rd
Charlie	1618-0339	62 Copenhagen Rd

Seller	Bank Account	Rating
Alice	1414-2135	6.2 (37)
Bob	1732-0508	0.7 (56)
Cindy	2236-0679	6.7 (49)

Good	Price	Warehouse
Apple	44.16	Aberdeen
Banana	55.25	Boston
Cherry	66.26	Chicago

Seller	To Account	Good	Price	Buyer	From Card
Alice	1414-2135	Apple	44.16	Adam	3141-5926
Bob	1732-0508	Banana	55.25	Betty	2718-2818
Cindy	2236-0679	Cherry	66.26	Charlie	1618-0339

This is called a join operation in database systems: For any seller, say Alice, we pull information from the seller table, and insert it into the transaction table, which gives us the first two columns of the report. We then pull and insert information from the good table, giving us the next two columns, and then pull and insert information from the buyer table, giving us the last two columns.

Mathematically speaking, a table is denoted as $R(A, B)$, where $A$ is the set of possible values of the first column, $B$ is the set of possible values of the second column, and $R$ is the set of rows, understood as a subset of $A \times B$. Then the join of $R(A, B)$ and $S(B, C)$ is a table $V(A, B, C)$ such that $V = \{(a, b, c) : R(a, b) \land S(b, c)\},$ where $R(a, b)$ means $(a, b) \in R$, because we also treat $R$ as a predicate. A more complicated example is a cyclic one $\Delta = \{(a, b, c) : R(a, b) \land S(b, c) \land T(c, a)\}$ suppose that $T(B, C)$ is another table.

The problem with table joining is that the size of the output tables $V$ and $\Delta$, unlike the transaction example above, can be much larger than the input tables $R$, $S$, and $T$. For instance, let $A = B = C = {1, …, 30}$; let $R(a, b)$ be whether $a - b$ is even, $S(b, c)$ be whether $b - c$ is a multiple of $3$, and $T(c, a)$ be whether $c - a$ is a multiple of $5$. Then $|R| = 450$, $|S| = 300$, and $|T| = 180$, but $|V| = 4500$ and $|\Delta| = 900$. This problem does not show up in all tables, so it makes sense to ask if we can tell, before actually performing the join, whether the output table is of linear size or exponential size. This is the problem of pessimistic cardinality estimation, where pessimistic means that we want to upper bound the size of the output table.

On the first glance, an optimistic bound, a lower bound, seems to be on the equal footing with a pessimistic bound, but this is not the case. For instance, we can bound $|V|$ and $|\Delta|$ as $|R| \times \deg_S(B)$, where $\deg_S(B)$ is the maximum of $\deg_S(b)$ and the latter is $|{c : S(b, c)}|$, the number of $c$’s connected to $b$. This bound corresponds to the following algorithm.

SS = {b : [] for b in B}
for (b, c) in S: SS[b].append(c)

V = []
Delta = []
for (a, b) in R:
    for c in SS[b]:
        V.append((a, b, c))
        if (c, a) in T: # let's pretend that T is a set
            Delta.append((a, b, c))

In other words, SS can look up the $c$’s by $b$ and hence we have $\deg_S(b) =$ len(SS[b]) and $\deg_S(B) =$ max(len(SS[b]) for b in B). So the nested for loop will cost $|R| \times \deg_S(B)$ time.

By symmetry, we can also bound $|\Delta|$ as $|S| \times \deg_T(C)$, which is the teim the following algorithm needs/

TT = {c : [] for c in C}
for (c, a) in T: TT[c].append(a)

Delta = []
for (b, c) in S:
    for a in TT[c]:
        if (a, b) in R: # let's pretend that R is a set
            Delta.append((a, b, c))

Mathematically, both algorithms are correct and they will give the same Delta. But the run time varies depending on the values of $|R|$, $|S|$, $\deg_S(B)$, and $\deg_T(C)$. In fact, if we continue with the divisibility example, we notice that $|R| \times \deg_S(B) = 450 \times 10 = 4500$ and $|S| \times \deg_T(C) = 300 \times 6 = 1800$. So the second algorithm gives a better pessimistic bound, even though both algorithms compute the same output table.

The punchline here is that $T$ is a much sparser relation, and in the first algorithm we check $(c, a)$ against $T$, so the pass rate is low and we are wasting time enumerating triples that are not going to pass the test. On the other hand, $R$ is a denser relation and checking $(a, b)$ against $R$ is more likely to pass, meaning that we generate the same $\Delta$ with fewer failures.

The example is all clear and intuitive. But for real world data it is not always this easy to tell. It is therefore crucial to find

Hsin-Po's Website

Pessimistic Cardinality Estimation