Eager Aggregation and Lazy Aggregation
Eager Aggregation and Lazy Aggregation
345
group-by. When the amount of data reduction does
not justify the cost of eager group-by, we should prob- SUMISUM’
R
ably delay group-by until after the join, which we term ‘@&
lazy group-by. Both directions of the transformation
SUM& i
should be considered in query optimization. We call
the technique of performing aggregation before join
t J1=J2G9
eager aggregation, and delaying aggregation until after SUM(S1) AS SW/ \
join lazy aggregation.
Figure 2(a) and (b) show the basic idea of ea-
“‘=a m (ZJ2)
ger/lazy group-by. Eager group-by performs eager ag-
Tl
/‘\ T2
aI ’
gregation on all tables containing aggregation columns. (Gl,Jl.Sl)
Tl
(62952)
Lazy group-by is its reverse transformation. (Gl,Jl,Sl)
The following examples illustrate the basic idea of (a) Lazy grwW (b) Eager group-by
eager group-by and lazy group-by. The examples are
based on a subset of the TPC-D database[Raa95]. The
tables are defined in Appendix A.
Example 1 : Find the total loss of revenue on OT-
ders handled by each clerk due to parts being returned
by customers. Output clerk and loss of revenue.
SELECT O-CLERK,
SDM(L-EXTENDEDPRICE
* (I-L-DISCOUNT))
FROM LINEITEM, ORDERS Tl T2
WHERE O-ORDhKEY = L-ORDERKEY T2
AND LJETURNFLAG= 'R'
(Gl,Jl,Sl) (WJ2) mJJ2)
GROUPBY O-CLERK (c) Lazy Count (d) Eager Count
A
Each order is handled by one clerk so we can first
find the loss of revenue for each order. We then join
the aggregated view with table ORDERSto find the total
loss for each clerk. suM~s’& 51352 a
SELECT O-CLERK, SUM(REVENDE)
FROM (SELECT L-ORDERKEY.SUM(L-EXTENDEDPRICE
t f’\
‘XT
*(l-L-DISCOUNT)) AS REVENUE J,=>&k-& sAs
FROM LINEITEM
WHERE L-RETURNFLAG = 'R' t t
GROUPBY L-ORDERKEY)AS LOSS, ORDERS Tl l2 Tl T2
WHERE O-ORDBRKEY = L-ORDERKEY (Gl,Jl,Sl) WC2) (Gl,Jl,Sl) K-352)
GROUPBY O-CLERK (e) Double Lazy (t) Double Eager
The eager (inner) group-by reduces the number of
input rows to the join. If the LINEITEM table is clus-
tered on L-ORDERKEY,the eager group-by can be done
at almost no additional cost. Experiment on DB2
JlJ2 &
V2 Beta3 confirms that eager group-by reduces the
elapsed time by 16%. The following example shows I \
346
where LOSS-BY-ORDERis an aggregated view defined by six queries. For example, it reduces the elapsed time
of Query 5 by a factor of ten.
CREATEVIEW LOSS-BY-ORDER(L-ORDERKEY,REVENUE)
(SELECT L-ORDERKEY,
~~~~(L-E~TENDEDPRI~E
* (~-L-DISCOUNT)) 1.1 Organization of This Paper
FROM LINEITEM The rest of the paper is organized as follows. Section 2
WHERE LJETURNFLAG= 'R' reviews aggregation functions in SQL2 and introduces
GROUPBY L-ORDERKEY); the concepts of decomposable aggregation functions,
We can merge the view with the query and rewrite the and class C and class D aggregation functions. Sec-
query as tion 3 defines the class of queries that we consider and
introduces notations. Section 4 presents the formalism
SELECT O-CLERK, that our results are based on. Section 5 introduces and
SUM(L-EXTENDEDPRICE*(l-L-DISCOUNT)) proves our main theorem. Sections 6, 7, 8 and 9 in-
FROM LINEITEM. ORDERS troduce corollaries for eager/lazy group-by, eager/lazy
WHERB O-ORDERKEY = L,ORDERKEY
count, double eager/lazy and eager/lazy split trans-
AND LJETURNFLAG= 'R'
AND O-ORDERDATB BETWEEN "1995-05-01" formations. Section 10 proposes algorithms for find-
AND "1995-05-31" ing all possible eager/lazy transformations for a query
GROUPBY O-CLERK and discusses the way to integrate eager/lazy aggre-
gation and group-by push down/pull up into existing
The predicate on 0-ORDERDATEis highly selective. optimizers. In order to simplify the proofs we have not
In this case, we should delay the group-by until after considered HAVING in the theorem and corollaries. Sec-
the join. A nested loop join with LINEITEM as the inner tion 11 considers the case when the HAVING clause is
and ORDERSas the outer looks like a very promising present. Section 12 shows that eager/lazy aggregation
evaluation strategy. Experiment on DB2 V2 Beta3 and group-by push down/pull up is very beneficial for
confirms that lazy group-by reduces the elapsed time TPC-D official queries. Section 13 discusses related
by 60%. work. Section 14 concludes the paper.
These examples show that both directions (eager
group-by and lazy group-by) should be considered in 2 Aggregation Functions
query optimization. There may be several ways of per-
In SQLP, a value expression may include aggregation
forming eager group-by when there are more than two
functions. There are five aggregation functions: SUM,
tables in the FROMclause[Yan95].
AVGs MIN, MAXand COUNT.Consider the query
Figure 2 and 3 show the eager/lazy transformations
introduced in this paper. Eager count transformation SELECT2*SUN(Tl.Cl)/COUNT(DISTINCT T2.C2)*
performs eager aggregation on tables not containing MIN(Tl.C3*T2.C3)
aggregation columns, as shown in Figure 2 (d). It first FROMTl,T2
counts the number of rows in each group in the early We can rewrite this query as
aggregation, then performs the join, and finally ag-
SELECT2*NCl/NC2*NC3
gregates the original aggregation columns. Lasy count FROM(SELECTmM(cl) AS NCI,
transformation is its reverse transformation. COUNT(DISTINCTT2.C2) AS NC2,
Double eager performs eager count on tables not #IN(Tl.C3*T2.C3) AS NC3
containing aggregation columns and eager group-by on FROMTl,T2 ) TMP-VIEW;
the remaining tables which may or may not contain
Any query block that has an arbitrary value ex-
aggregation columns, as shown in Figure 2 (f). The
reverse transformation is double lazy. pression containing more than one aggregation func-
tions, can always be rewritten so that the new query
Eager groupby-count, as shown in Figure 3, per-
block is a SELECT on top of a view that contains value
forms eager aggregation on a subset of tables contain-
ing the aggregation columns. Its reverse transforma- expression having at most one aggregation function.
Therefore, without loss of generality, we assume that
tion is called lazy groupby-count.
our query contains no value expression that has more
Eager split, as shown in Figure 2 (h), performs eager
than one aggregation functions.
groupby-count on both input streams before the join,
when both input streams are involved in aggregation.
2.1 Decomposable Aggregation Functions
Its reverse transformation is called lazy split.
Our experiments show that we can apply group- All sets in this paper are multi sets. Let U, denote set
by push down/pull up and eager/lazy aggregation to union preserving duplicates, and ud denote set union
twelve of the seventeen queries in the TPC-D bench- eliminating duplicates. These operations exist in SQL2
mark. This significantly reduces the elapsed time for as UNION ALL and UNION, respectively.
347
Definition 1 : (Decomposable Aggregation FROMLINBITEH
Function) An aggregation function F is decompos- GROUPBY L-ORDERKEY)AS COUNT-BY-ORDER.
able if there exist aggregation functions F1 and F,?2 ORDERS
such that F(&Ua&) = F2(Fl(Sr),Fl(S2)), where WHERE 0sORDERKEY = L-ORDERKEY
S1 and & are two sets of values. We call S1 and Sz GROUPBY O-CLERK
partial groups.
COUNTEY-ORDERcounts the number of lineitems for
SUM(C) is decomposable since SUM(S~U,S~) = each orders, then joins with table ORDERSto find the
SUM(SUM(Sl), SUM(S2)); count required. We call this transformation eager
COUNT(C) is decomposable since COUNT(SlU,S2) = count, and its corresponding reverse transformation
lazy count. Note that, this time, we are performing
SUM(COUNT(Sl), COUNT(S2));
and MIN(C) is decomposable since MIN(Slu,S2) = eager aggregation on a table which contains no aggre-
gation columns. Eager count performs eager aggrega-
MIN(MIN(Sl), MIN(S2)); and AVG(C) can be han-
dled as SUM(C) and COUNT(NOTNULL C) and thus is tion on tables not containing any aggregation columns.
decomposable’. Experiment on DB2 V2 Beta3 shows that eager count
For aggregation functions like COUMT(DISTINCT for this query reduces the elapsed time by 40%.
Cl), it is not trivial to determine whether it is de- When performing eager count and the original ag-
composable. There may be two rows with the same gregation function is either SUM or COUNT, we need
Cl value in Sl and S2. These two rows would then to count the number of rows in each group produced
contribute 2 instead of 1 in the final count. However, by the inner group-by and multiply the count with
if we know in advance that column Ci cannot contain the result from the later group by. We call aggrega-
duplicate values, then COUMT(DISTINCT Cl) is decom- tion functions satisfying this property class C aggre-
posable. Note that, even though Cl has duplicate val- gation functions(C stands for COUNT), and the count
ues, there may be other conditions which ensure that obtained from the inner group-by duplication factor. If
rows with the same Cl value belong to the same par- the original aggregation function is SUM(DISTINCT),
tial groups(e.g., Cl is a grouping column). Therefore, COUNT(DISTINCT), MIN, MAX, or AVG, we can discard
an aggregation function may or may not be decompos- the count in the subquery block. In other words, we
able. Aggregation functions MIN and MAX are always
can use a DISTINCT in the subquery block. We call
decomposable; SUMand COUNTare decomposable when aggregation functions satisfying this property class D
aggregation functions(r) stands for DISTINCT). And we
they contain no DISTINCT. The issue of determining
whether an aggregation function is decomposable will call this transformation eager distinct, and its corre-
not be discussed further. From now on we will assume sponding reverse transformation lazy distinct. There-
fore, combining this with whether the function is de-
that we have the knowledge about whether an aggre-
composable or not, we can have four types of aggre-
gation function is decomposable.
gation functions. Class D aggregation functions are
2.2 Class C and Class D Aggregation hnc- insensitive to duplication factors.
tions
Find the total number of urgent
3 Class of Queries Considered
Example 3 : OT
high priority lineitems handled by each clerk. Any column occurring as an operand of an aggregation
function (COUNT, MIN, MAX, SUM, AVG) in the SELECT
SELECT O-CLERK,
SUM(CASEWHENO-ORDERFRIORITY='l-URGENT' clause is called an aggregation column. Any column
OR O-ORDERF'RIORITY='2-HIGH' occurring in the SELECTclause which is not an aggrega-
THEN 1 ELSE 0 END) tion column is called a selection column. Aggregation
FROM LINEITEM, ORDERS columns may belong to more than one tables. We par-
WHERE O-ORDERKEY = L-ORDERKEY tition the tables in the FROMclause into two groups:
GROUPBY O-CLERK those tables that contain aggregation columns and
those that may or may not contain any such columns.
It is equivalent to the following query. Technically, each group can be treated as a single ta-
SELECT O-CLERK, ble consisting of the Cartesian product of the member
SUM(CASEWHENO-ORDERPRIORITY='l-URGENT' tables. Therefore, without loss of generality, we can
OR O-ORDERPRIORITY='2-HIGH' assume that the FROMclause contains only two tables,
THEN 1 ELSE 0 END) * CNT & and &. Let Rd denote the table containing aggre-
FROM (SELECTL-ORDERKEY,COUNT(*) AS CNT gation columns and R, the table that may or may not
lspL2 does not support COUMT(IOT IULL Cl) operation, but contain any such columns.
it is fairly easy to implement in any existing systems. The search conditions in the NMEREclause can be
348
expressed as Cd A Co A C,,, where Cd, Co, and C,, are GA: : E Gdd U a(&) - R,, i.e., the columns of &
in COnjUnCtiVe normal form, Cd Only involves columns participating in the join and grouping;
in &, C,, only involves columns in R,, and each dis- GA:: E Gdd U a(Co) - &, i.e., the columns of R,
junctive component in Co involves columns from both participating in the join and grouping
& and R,. Note that subqueries are allowed. FAA: resulting columns of the application of function
The grouping columns mentioned in the GROUP BY array F on AA in the fhst group-by when eager
clause may contain columns from & and R,, de- group-by is performed on the above query.
noted by C& and GA,,, respectively. According
to SQL2[ISO92], the selection columns in the SELECT 4 Formalization
clause must be a subset of the grouping columns. We
denote the selection columns as SC& and SGA, , sub- In this section we define the formal “machinery” we
sets of GAd and GA,,, respectively. For the time being, need for the theorems and proofs to follow.
we assume that the query does not contain a HAVING SQL2[ISO92] represents missing information by a
clause(relaxed in Section 11). The columns of & par- special value NULL. It adopts a three-valued logic in
ticipating in the join and grouping is denoted by GA:, evaluating a conditional expression. We define func-
and the columns of R, participating in the join and tional dependencies using strict SQL2 semantics tak-
grouping is denoted by GA,+. ing into account the effect of NULLS in SQLP. When
In summary, we consider queries of the following NULLS do not occur in the the columns involved in
form: a functional dependency, our definition of functional
dependency is the same as the traditional functional
SELECT [ALL/DISTINCT] SGdd, SGA,, F(AA)
dependency. The detailed definitions are included in
FROl4 &is Ru
WHERE cd A c,, A c,
pL94]. Due to space limitation, they are not included
GROUP BY Gdci,GA, here. Let A and B be two sets of columns, A fimction-
ally determines B is denoted by A-B.
where
Gdd: grouping columns of table Rd; 4.1 An Algebra for Representing SQL Queries
GA,: grouping columns of table R,; GAd and GA, Specifying operations using standard SQL is tedious.
cannot both be empty.
As a shorthand notation, we define an algebra whose
S&id: selection columns, must be a subset of grouping basic operations are defined by simple SQL statements.
cohmns G.&i; Because all operations are defined in terms of SQL,
SGA,,: selection columns, must be a subset of grouping there is no need to prove the semantic equivalence be-
columns GA,; tween the algebra and the SQL statements. Note that
AA: aggregation columns of table & and possible transformation rules for “standard” relational algebra
table R,. When considering eager/lazy group- do not necessarily apply to this new algebra. The op
by, eager/lazy count and double eager/lazy, AA erations are defined as follows.
belong to Rd. When considering eager/lazy
groupby-count and eager/lazy split, AA belong l B[GA] R: Group table R on grouping columns
to & and R, and is denoted by the union of GA = {GAI, GAz, .... GA,}. This operation is
aggregation columns AA,, and A&, where AA, defined by the query 2 SELECT * FROM R ORDER
and AAd belong to & and R, respectively.
BY GA. The result of this operation is a grouped
cd: conjunctive predicates on columns of table &; table.
c,: conjunctive predicates on columns of table R,;
l RI x R2: The Cartesian product of table RI and
co: conjunctive predicates involving columns of both
R2.
tables Rd and R,, e.g., join predicates;
a(C0): columns involved in CO; l a[C]R: Select all rows of table R that satisfy con-
F: array of aggregation functions and/or arithmetic dition C. Duplicate rows are not eliminated. This
aggregation expressions applied on AA (may be operation is defined by the query SELECT * FROM
empty). When considering eager/lazy groupby- R WHERE C.
count and eager/lazy split, F is denoted by the
union of aggregation functions Fa and F,,, where l xd[B]R, where d = A or D: Project table R on
Fd and Fd are applied on AAd and AA, respec- columns B, without eliminating duplicates when
tively.
2Certainly, this query does more than GROUP BY by ordering
F(AA): application of aggregation functions and/or the resulting groups. However, this appears to be the only valid
,arithmetic aggregation expressions F on aggre- SQL query that can represent this operation. It is appropriate
gation columns AA; for our purpose as long as we keep the difference in mind.
349
d = A and with duplicate elimination when d = 2. F contains both class C and class D aggregation
D. This operation is defined by the query SELECT functions. In this case, we need to use a COUNT
CALL /DISTINCT] B FROMR. aggregation function in the SELECTlist of the sub-
query block. The aggregation value of a class C
l $Jr;l~;A”[“=“l yAUd$4, fz@A),.... fn(AA)), aggregation function f is the count multiplied by
1, 2, ....A.,}, and F =
the value resulting from applying f. Therefore,
{fl, f2, . . ..f”}. AA are aggregation columns of
we need to change F into F,, in which every
grouped table R and F are arithmetic aggrega-
class C aggregation function f of F is replaced
tion expressions operating on AA. We must em-
by f * count. For example, if F(C1, C2, C3) is
phasis the requirement that table R is grouped
(SUM(Cl>,COUNT(C2),MIN(C3)), then
by some grouping columns C. All rows of table
F,(Cl, C2, C3, count)
R must agree on the values of all columns except
is (SUM(C1) ,MAX(C2) ,MIN(C3))o(count, 1, I) =
AA columns. Each fi, where i = 1,2, . . . . n, is an
(SUM(Cl)*count,MAX(C2) ,MIN(C3)). The oper-
arithmetic expression(which can simply be an ag-
ator o is vector product. We call F, the duplicated
gregation function) applied to some columns in
aggregation functions of F. As a shorthand nota
AA of each group of R and yields one value. An
tion, we use F(C1, C2, . .. . C,,) * count to repre-
example of fi(AA) is COUNT(A1)+ sUM(A2+ As).
sent F,, while keeping in, mind that we only need
Duplicates in the overall result are not eliminated.
to multiply class C aggregation function by the
This operation is defined by the query SELECT
count. Note that we need an additional argument
GA,A, F(AA) FROMR GROUPBY GA,whereGAis
to F,e
the grouping columns of R, and A is a set of
none grouping columns that are functionally de- Note that, it is not necessary that the functions in
termined by GA and may be empty. Note that F be decomposable.
this is not a syntactically valid SQL2 statement
since the columns A in the SELECT clause are not SUM(SSl), SUM(S2)‘CNT t
mentioned in the GROUPBY clause. However, since WWP SY
GA+ A, from a query processing point of view, SUM(Sl), t SUM(S2) 9
this is semantically sound.
Therefore, the class of query we consider can be
expressed as
nd[SGA,j, SGA,, FAA] F[AA]rA[GAd, GA,, , AA]
G[GAd, G&]+‘d A co A Cu](& x &a)
where d = A or d = D, and FAA are the aggre- Tl T2
gation values after applying F[AA] on each group. (Gl,Jl,Sl) (G2,J2,S2) (Gl ;,Sl)
The last projection simply projects the rows on the
Lazy groupbycount Eager groupby-count
columns wanted, and may eliminate duplicates. If
all the columns wanted are the same as all existing Lazy aggregation Eager aggregation
columns, and the projection does not eliminate dupli-
cates, then we usually omit the last projection in the Figure 3: The Main Theorem
expression. Consider the query to the left of Figure 3. It aggre-
All sets in this paper are multisets which may con- gates columns from both input streams. In the query
tain duplicates. Td, T” denote instances of table & and on the right, we can first perform aggregation on one
R,; T[SJ is used as a shorthand for nA[qT, where S of the input stream. We need to not only find the sum
is a set of columns and T is a grouped or ungrouped of partial groups, but also keep track of the number
table, or a row. of rows in each partial group for the aggregation on
the table(T2) that are aggregated only after the join.
5 Main Theorem This is the basic idea of eager groupby-count.
In the following theorem, let (1) NGAd denote a set
When performing eager count, we need to consider two of columns in Rd; (2) CNT the column produced by
cases: COUNT(*) after grouping a[Cd]& on NG&; (3) FA&
1. F contains only class D aggregation functions. We the rest of the columns produced by Fd in the first
can simply add a DISTINCT to the SELECTlist of group-by of table U[Cd]Td on NG&; and (4) F,, the
the subquery block and no modification to the duplicated aggregation function of F,,. Also assume
original aggregation functions is needed. that (1) AA = AAd ud AA, where AAd contains only
350
columns in &, and AA,, contains only columns in R,; S,. Note that the above statements hold for all joins,
(2) F = Fd ud F, where Fd applies to AAd and F,, not just equijoins.
applies to AA,,. Since S,, depends on Gd, we denote the set of rows
joining with Gd as S, (Gd). The set resulting from
Theorem 1 (Eager/Lazy Groupby-Count(Main
the join of Gd and S, is Gd x S,(Gd), i.e., a Carte-
Theorem)): The expcpressions sian product. (Fdl[A&], COUNT~)Gd denotes the
El : F[A&, A&]n[G&, GA,, AAd, AA,] row resulting from applying Fdr and COUNT on AAd of
the group Gd.
G[G&, G&]@d A co A cu](& x R,) Let Gdi, Gds be two (partial groups) produced by
and B[NGA,+[Cd]rd. We have two cases to consider.
Case 1: Gdi[G&] = Gdz[G&] and S, (Gdl)[GA,] =
E2 : ?Td[G&, GA,, FAA] S,,(Gds)[GA,]. In Es, after the join, all rows in
(Fua[AAu, CNT], FddFA&])
((Fdl [A&], COUNTIJ)
flA [G& GA,,, AA,,, PAAd, CNll]
(“[NGAd, GA,+, A&]‘&) x St, (Gdl)
G[GAd, GAMCo, ‘%](((Fdl[-‘h], C0UNTl-j)
and
c#=‘b, GA,+, A&]~[NGAd]+‘d]Rd) x R,)
((Fdl[A&], COUNTU)
are equivalent if (1) aggregation functions Fd contain
only decomposable aggregation functions and can be de- TINGAd, GA,+, AA&h) x su (Gd2)
composed into Fdl and Fd2; (2) F,, contain class C OT are merged into the same group by the second group-
D aggregation functions and (3) NGAd+ GA: hold by(after the join).
in U[cd]&. In El, each row in Gdi and Gd2 joins with each row
in &(Gdl) and &(Gd2), respectively. Therefore, all
The main theorem is illustrated in Figure 3. The ag-
rows in Gdr x S, (Gdl) and Gds x S,(Gds) are merged
gregation columns are split into two sets, which belong
into the same group by the group-by. Since every ag-
to & and R,, tables respectively. For the transforma-
gregation function in Fd can be decomposed as Fdi
tion from El to E2 (eager aggregation), we push down
and Fds, the aggregation values in the row produced
the & tables and perform eager aggregation on AAd
and obtain the count before the join. After the join, we by
then perform aggregation on FA& and AA,. There- Fd[A-‘ti]~A[G&, GA,, A&]
fore, we basically split the aggregation into two parts, ((Gdl X &(Gdl))Ua(Gd2 X &(&a)))
one is pre-evaluated before the join and one is evalu-
ated after the join. We call the transformation from in El are equal to the aggregation values produced by
El to E2 eager groupby-count and its reverse transfor- Fd2 [FA&]m [G-b, G-L, FA&]
mation lazy gmupby-count.
The requirement NG&+ GA: is not a neces- (((Fdl[AA&bdNG&, GA,+, A&]‘%) x s&h))
sary conditions. If NG& ti G& in some instance Ua((Fdl[A&]m[NG&, GA,+, A&]Gda) x su(Gd2)))
of a[Cd]&, then the first group-by of E2 may group in E2*
rows together when they do not belong to the same Since every aggregation function in F,, is either class
group in El. However, incorrectly assigned rows may C or D, the aggregation values in the row produced by
be eliminated by the join and we may still get the cor-
rect result. If NG& does not functionally determine E&%&r&&, GA,, AAul((Gdl x & (Gdl))
the join columns of table &, the join in E2 is un- UaGdz( Xsu (Gdz)))
defined since a group may contain different values on
in El are equal to the aggregation values produced by
the join columns. To obtain necessary and sufficient
condition, we need to extend the meaning of F[AA], J’u,[AA,, CNT]ci[G&, GA,, AA,, CNT]
which is beyond the scope of this paper.
(((COUNTbA[N’=d, GAd+]Gdl) x & (‘%l))Uo
Proof:
Consider a group Gd in g[NG&]u[cd]Td for some
((COUNTbA[NG&, GA,+]Gdz) x St,(h)))
instance Td of Rd. Since NG&+ GA:, all rows in in E2.
G,j have the same G& value and have the same value Case 2: Gdl [G&l # G-&G&] 01 & (Gdl)[G&] #
for the join columns of Rd. Therefore, if one row of S,, (Gd2) [GA,]. In Ez? the rows in
G,j qualifies in the join of b[Cd A Co A C&l (Td X T,),
all rows of Gd qualify. If one row of Gd joins with a ((Fdl[AAd], COUWI)
set of rows S, from ~[C,,]T,,, all rows of Gd join with n[NG&, GA,+, A&]&l) x St, (Gdl)
351
and and
((Fdl[AAd], CouNTO) E2 : F2 [FAA&A [G&, GA,, FAA&(G& GA,]
4NG&, GA:, AA&da) x &(&a) xA[G&, GA,, FAAd]@o A Ct,]
are not merged into the same group by the second ( (S [A&A [NG& GA,f 7 AAl
group-by(after the join). In Er, each row in Gdr and ~[NG&]~[Cd]Rd) X Ru)
Gds joins with each row in & (Gdi) and S, (Gdz), re-
spectively. However, the rows in Gdr x S,, (Gdr) and are equivalent if NG&-+ GA: holds in b[cd]Rd and
Gdz x S,,(Gdz) are not merged into the same group by all aggregation functions in F[AA] are decomposable
the group-by. Since F is decomposable, the aggrega- and can be decomposed into FI and F2.
tion values in Eager group-by transformation introduces a new
group-by, and lazy group-by transformation eliminates
Fd [A&]nA [G&, GA,, A&]
a group-by.
(Gdl X su (‘&I) The proof of the corollary is straightforward. Since
in Er are equal to the aggregation values in AA,, is empty, Fua[AAU, CNTJ is empty. Deleting all
terms relating to AA,, in E2 of the Main Theorem gives
Fd@‘&]u[GAd, GA”, F&z] Ez of the corollary.
((h[A&]m[NG&, GA,+, AA&al) X sty (‘&l))
7 Eager/Lazy Count and Eager/Lazy
in Ez.
Also, e aggregation values in the row produced by Distinct
In the Main Theorem, if we let GA, contain all the
u,CNT]%t[Gfti,
+L.l
Fu[A GA,,AAl aggregation columns, that is, all aggregation columns
(GdlX$Gdl)) belong to R, tables, then we obtain the following corol-
in El are equal ‘to the aggregation values produced by lary. In the following corollary, NGAd denotes a set
of grouping columns belonging to &, and CNT the
column produced by COUNT(*)after grouping o[c,j]&
on NGAa.
352
in the subquery block. We then call the transformation Corollary 4
from El to Ez eager distinct and from Ea to El lazy (Double Group-by Push-Down/Double Group-
distinct. Note that in this case, F, is the same as F. by Pull-Up): A ssume that the conditions in Co~ol-
lary 3 holds. If, in addition, (1) GA;---, NGAd holds
in u[Cd]Rd, (2) GA,++ NGA, holds in u[C,,] R,,
8 Double Eager and Double Lazy
and (3) (GA,, Gdd) functionally determines the join
Now we are ready to tackle the double eager and dou- columns in r[Cd A Co A CU] (Rd x R,), then the es
ble lazy transformations. Consider the query in Fig- pressions
ure 2(e). It aggregates the columns belonging to one
El : F[A+A[G&, GA,, AA]B[GAd, GA,]
input stream (Tl). In the query in Figure 2(f), we per-
form eager group-by on the stream (Tl) containing c[cd A co A Cu](Rd X Ru)
353
(4) FAAd the columns produced by Fd in the first ag- columns(NC&). According to Corollary 1, we can
gregation of table 6[Cd]Rd on NGAd; (5) FAA, the add more Rd columns to NGAd without changing the
columns produced by F,, in the first aggregation of result of the query. Normally we want to choose a
table o[C,]R, on NGA,; and (6) Fd,, and F,,, the new set of grouping columns only if the new set has
duplicated aggregation function of Fd and F,, respec- some ordering properties that save sorting time. For
tively. Also assume that (1) AA = AAd ud AA,, where example, if the ordering property on a new column
AAd contains only columns in &, and AA,, contains is supported by a clustering index, then after the new
only columns in R,; (2) F = Fd ud F, where Fd ap- column is added into NC&, it can be used as the ma-
plies to AAd and F,, applies to AA,,. jor of the sorting columns(assuming sorting is used for
GROUPBY). The subsequent sort may be faster since
Corollary 5 (Eager Split and Lazy Split:) The the minor columns are sorted in a smaller range, plus
expressions the advantage of sequential fetching of data rows. In
this case, even if one of the GA$ columns has an index,
El : F[A&, ~A,]~A[G& GA,, AAd, AA,]
since the index is not clustered, it may be more expen-
g[GAd, G&]&‘d A co A Cu](Rd x Ru) sive to perform the grouping using GA: as the group-
and ing columns than using the clustering index column
and GA: as the grouping columns. Therefore, we want
E2 : %[G&, GA,, FAA] to consider possible beneficial addition of columns to
(L#‘uz[FA&], CNrr,], Fda[Fdz[FA&],CNTz])GA,+as eager grouping columns. We call such columns
promising columns. Since adding new columns is of-
nA[GA,j, GA,, FAA,, FAAd, CNTI, CNTz]
ten not beneficial, a good heuristic might be not to
8[GAd, G&l+‘o, cu]((((Fdl[AAd], COUNT[) add grouping columns beyond GA$.
r.dN’=d, GA:, AA&[NG&]u[‘%]Rd) When performing eager aggregation, our objective
x ((Ful[AAwl, COUNTU) is to achieve data reduction before the join, so we want
each partial group to contain as many rows as possible.
AA[NG&, GA:, AAMNGA&[G]~u))
Therefore, if NC& contains a unique key of a[Cd]&,
are eqUiVaknt if (I) aggregation function3 Fd contain we should immediately abandon using this set for eager
only decomposable aggregation functions that can be group-by.
decomposed into Fdl and Fd2; (2) aggregation fwac-
tions F,, contain only decomposable aggregation junc- 10.1.2 Table Partitioning
tions that can be decomposed into Ful and F,,z; (3) When the query contains more than two tables, there
F, and Fd contain class C OT D aggregation func- may be several ways of performing eager aggregation.
tions; (4) NG&+ GA,+ holds in 6[Cd]Rd; (5) The question is how to partition the tables in the FROM
NGA,+ GA: holds in u[C&] R, . clause into Rd tables and R, tables. Section 10.3 dis-
The proof of this corollary is also straightforward. It cusses the way to partition tables to obtain all pos-
can be done by first performing an eager/lazy groupby- sible transformations. We assume that table parti-
tioning has been done before calling the algorithm in
count on Rd, and then an eager/lazy groupby-count on
Section 10.1.3.
&a.
10.1.3 The Algorithm
10 Algorithms and Implementation
Assuming table partitioning is done, we have the fol-
10.1 Algorithm for Eager Aggregation
lowing algorithm for finding valid eager aggregation.
In this section, we present a practical algorithm for In this algorithm, we choose not to add new columns
recognizing all valid eager transformations for a given to either NC& or NGA,. In the following algorithm,
query. We assume that & tables contain aggregation & tables must contain aggregation COhnUS.
columns and R, tables do not. That is, all queries
belong to the class of queries specified in Section 3. Algorithm 1 Eager Aggregation
Inputs: input query, &, R,, AA
10.1.1 Finding the Eager Grouping Columns Output: all possible rewritten queries
for Eager Aggregation
1 NG& := GA,+ and NGA, := GA:
Given two sets of tables, Rd and R,, with Rd ta 2 eagerd = false, eageru = false
bles containing aggregation columns and R, tables 3 if NC& is not a unique key of U[Cd]Rd
not, let’s first consider eager group-by. We can 4 eUgeTd = true
start with NC& using GA: as the eager grouping 5 end if
354
6 if NGA, is not a unique key of (r[C,,]R, an optimizer, we can first perform group-by pull up
7 eager, = true and lazy aggregation to obtain a canonical form in
8 end if which all group-bys are delayed as late as possible.
9 if eager, and eageTd Then, during dynamic programming process, when-
10 if no aggregation cohmns in R, ever a table access plan or join plan is constructed, we
11 Apply double eager on & end R, can consider adding a group-by on top of the plan. All
12 Output the reuritten query
tables in the query are then partitioned into two sets,
13 else
the set containing all tables in the current join plan,
14 Apply eager split on both .& and R,
15 Output the rearitten query and the set containing the remaining tables. We can
16 Apply eager groupby-count on Rd then apply Algorithm Eager Aggregation to find all
17 Output the reuritten query possible eager aggregations. There can be several pos-
18 end if sible ways for adding an aggregation on top of a plan.
19 else if eageTd and not eager, The optimizer may want to choose the cheapest way
20 if no aggregation columns in R, for each plan to reduce optimization cost. Then, for
21 Apply eager group-by on & each original join plan, there is at most one additional
22 else plan that performs a group-by at the top. On the
23 Apply eager groupby-count on Rd other hand, when considering join plan for two input
24 end if streams, the optimizer can consider the alternatives of
25 Output the reuritten query taking the streams with or without aggregation. If the
26 else if not eagerd and eager, optimizer employs an exhaust search and considers all
27 if no aggregation cohmns in R,
possible join plans in the dynamic programming pro-
28 Apply eager count on &,
cess(e.g., Starburst), all possible transformations can
29 else
be found in this process. This approach is also suitable
30 Apply eager groupby-count on &
31 end if
for dynamic programming process that generates only
32 Output the rearitten query left deep trees or right deep trees. However, it might
33 else overlook some possible rewrites.
34 Output “No transformation”
35 end if 11 Queries Including HAVING
END Algorithm 1
A query with a HAVING clause can always be trans-
10.2 Algorithm for Lazy Aggregation formed into one without. This technique is well known
and is used in existing database systems. For exam-
Now consider lazy aggregation. Whenever a query ple, the Starburst optimizer always transforms a query
matches the form in any one of our theorems and sat- with a HAVING into one without at the beginning of
isfies their conditions, we can perform a lazy aggre- the query rewrite phase[PHH92]. After the HAVING is
gation to eliminate one GROUPBY(or DISTINCT), and eliminated, we can perform eager aggregation trans-
delay grouping until after the join. Lazy aggregation is formation on the view created.
especially useful when the join is highly selective. The Now consider lazy aggregation. When a HAVING is
algorithm to find all valid lazy aggregation transforma- eliminated in a subquery block with an aggregation
tion for a given query is to iterate through each avail- (either groupby or distinct), and the HAVING clause
able transformation and output the rewritten forms. contains no aggregations, then the predicate in the
Please refer to [Yan95] for a detailed description for HAVING clause can be moved to the WHEREclause and
the algorithm. we can then try to apply one of our lazy aggregation
theorems. If the HAVING clause contains aggregations,
10.3 Implementation we usually give up performing lazy aggregations be
We need to find a way to efficiently integrate ea- cause the HAVING predicates have to be evaluated be
ger/lazy aggregation and group-by push down/pull fore the join. However, it is possible to perform lazy
up into existing optimizers. The standard technique aggregation when the HAVING clause of a query con-
for determinating join order in a cost-based opti- tains aggregation.
mizer is dynamic programming in a bottom up (e.g., We formally proved our theorem for the conditions
Starburst[LohM]) fashion. During the dynamic pro- of groupby push down transformation for queries con-
gramming process, plans for table accesses, two-table taining a HAVING clause in [yL95]. The process to
joins, three-table joins and joins involving more tables prove conditions of eager aggregation for queries with
are constructed and kept until the final query plan is a HAVING clause is completely analog to our previous
obtained. To integrate the transformations into such effort. Due to space limitation we shall not present the
355
conditions and proof here.
Table 2: Ratio Between Best And Worst Elapsed Time
12 TPC-D Queries For All TPC-D Queries That Can Be Transformed
We can apply group-by push down/pull up and ea- 1 Q ) # of I Worst 1 Best 1 W/B t
ger/lazy aggregation to twelve of the seventeen queries rewrites Formulation Formulation Ratio
in the TPC-D benchmark and significantly reduce the 3 3 PD on L/O Oligbl 3.93
elapsed time of six queries on DB2 V2 Beta 3, as shown 5 7 EG on EG on 43.71
in Table 1. For example, it improves the elapsed time w/w L/O/C
7 7 Origilld EGonL 2.58
of Query 5 by a factor of ten. Table 2 shows the ratio
8 14 EG on L/S EG on 501.02
between best and worst elapsed time for all TPC-D
official queries that can be transformed3. The perfor-
mance difference between a badly formed query and a
better formed query can be very significant. Partic-
ularly, in applications where queries are generated by
tools or inexperienced users, automatic transformation I I 1 for both 1 I I
of queries is indeed very important. aggregations
In both Table 1 and 2, each table is represented by 12 2 EConL Otigilld 1.02
the first letter of its name, except that table PART- 13 2 EGonL Olighl 16.59
SUPP is represented by PS. Also, we use PD, PU, EG, 14 2 Origin&l EGonL 1.07
EC and DC to represent group-by push down, group 15 2 PU Ori& 2.51
by pull up, eager group-by, eager count and query de-
correlation transformations respectively.
original query. The access plan must maintain a count
Table 1: TPC-D Queries With Reduced Elapsed Time of the number of duplicates being removed. Then, af-
(Compared With Original Formulation) ter or during the join, the access plan must restore the
duplicates. Chaudhuri and Shim[CS95] also general-
ized group-by pull up to handle the case when the join
is a many to many join.
14 Conclusion
Group-by push down and group-by pull up interchange
the order of join and group-by. The number of group
by’s is unchanged. Eager aggregation introduces an
additional group-by before a join, and lazy aggregation
eliminates a group-by before a join. Groupby push
13 Related Work
down and eager aggregation reduces the number of
We proposed the idea of eager aggregation and lazy rows participating in a join, groupby pull up and lazy
aggregation in [Yan94]. Chaudhuri and Shim[CS94] aggregation reduces the number of input rows to the
also independently discovered eager group-by and ea- groupby. Both directions of transformation should be
ger count. Their simple coalescing grouping and gen- considered during query optimization.
eralized coalescing grouping correspond to our eager We classify eager aggregation into five different
group-by and eager count transformation, respectively. types: eager group-by, eager count, double eager, ea-
They also proposed an algorithm to integrate group- ger groupby-count and eager split. Eager groupby
by push down, eager groupby and eager count into a partially pushs down a groupby on the tables that
greedy join enumeration algorithm which produces left contain all aggregation columns; eager count partially
deep trees in a cost based optimizer. However, they pushs down a groupby on the tables that do not con-
did not discuss lazy aggregation transformation in the tain any aggregation columns; double eager partially
paper. pushs down a groupby on both types of tables; eager
Gupta, Harinarayan and Quass[GHQ95] general- groupby-count partially pushs a groupby into a sub-
ized group-by push down in another fashion. They set of tables containing the aggregation columns; eager
showed that it is possible to perform early duplicate re- split splits a group-by into two group-bys and partially
moval before a join when there is no aggregation in the pushs the groupbys down the two input streams of the
3The ratio marked as ‘infinity’ means that the query with the join. As a special case of double eager, we can com-
worst formulation ran out of system space and did not finish. pletely push down group-by into two input streams,
356
CUSTOMERS(C_) PARTSUPP(PSJ SUPPLIERS ORDERS(O-) LINEITEM (L_)
15K 1K 1K 160K 6ooK
tTAX
Legend:
l Scale factor 1
l The highlighted column names in each table form its primary key
l The number below a table name shows the number of rows of the table.
1 COMMENT
357