Friday, September 28, 2007

Visualizing rollups


Post 1 of a 3 post article.


Many code examples, found on the web, focus on a single feature of the programming language. There is a good reason for this which is that examples incorporating many features are less generic and therefore less likely to be of use to a large portion of the audience. The down side of this 'keep it simple' approach is that the audience is less often exposed to ways to combine features. The ability to combine the features of SQL is what makes it an amazingly powerful language.

Before you begin, you should understand the Oracle SQL reporting functions such as ROLLUP, GROUP_ID, GROUPING_SET and GROUPING_ID. If you are not familiar with these, you might still learn something interesting, but you will not understand everything that is going on.

When I first encountered Oracle's reporting functions I was somewhat overwhelmed by the complexity. What I needed was a way to visualize the process of constructing non trivial rollups. I tend to be a visual learner, and if you show me a picture, I'll probably understand instantly.
There are plenty of examples that show rollup statements and the resulting output and that's fine, if you need to do exactly what the example shows. However, I've not seen anybody outline a methodology for constructing new rollup patterns. So, there are 2 things I am attempting to do in this article

  1. Develope a way to think about rollups

  2. Lay the groundwork for future articles that will build upon the examples in this article.


This first article will be divided into two examples. First, a simple example then a second, more complex example.

EXAMPLE 1. a simple rollup

Table 1

CREATE TABLE TEST_MM
(
CITY
CHAR(9),
SHIPPER
CHAR(4),
DELIVERER
CHAR(3),
CUSTOMER
CHAR(8)
);

Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'fedx', 'Sam', 'PJ Shoes');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'fedx', NULL, 'PJ Shoes');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'fedx', 'Sam', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'ups ', 'Ted', 'PJ Shoes');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'ups ', 'Ted', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'con ', 'Sam', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'ups ', 'Ted', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'con ', 'Ted', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'fedx', 'Sue', 'PJ Shoes');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'fedx', NULL, 'PJ Shoes');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'fedx', 'Sam', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'ups ', 'Ted', 'PJ Shoes');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'ups ', 'Sue', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'con ', 'Sam', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'ups ', 'Sue', 'TeaVana ');
Insert into TEST_MM (CITY, SHIPPER, DELIVERER, CUSTOMER)
Values ('Sunnyvale', 'ups ', 'Ted', 'TeaVana ');
COMMIT;



Each record in Table 2 represents a delivery of some item to a customer.

Table 2

CITY SHIP DEL CUSTOMER
------------ ---- --- -----
Sunnyvale con
Sam TeaVana
Sunnyvale con
Sam TeaVana
Sunnyvale con
Ted TeaVana
Sunnyvale fedx Sam PJ Shoe
Sunnyvale fedx Sam PJ Shoe
Sunnyvale fedx Sam PJ Shoe
Sunnyvale fedx Sam TeaVana
Sunnyvale fedx Sam TeaVana
Sunnyvale fedx Sue PJ Shoe
Sunnyvale ups
Sue TeaVana
Sunnyvale ups
Sue TeaVana
Sunnyvale ups
Ted PJ Shoe
Sunnyvale ups
Ted PJ Shoe
Sunnyvale ups
Ted TeaVana
Sunnyvale ups
Ted TeaVana
Sunnyvale ups
Ted TeaVana.


The measure, for this example, is the count(*) of deliveries. The rollups of interest are
number of deliveries per city
number of deliveries per city per shipper
number of deliveries per city per shipper per deliverer



Table 3

Table 2A
CITY
SHIP DEL CNT
--------- ---- ------
Sunnyvale con
Sam 2
Sunnyvale con Ted 1
Sunnyvale con 3
Sunnyvale fedx Sam
5
Sunnyvale fedx Sue 1
Sunnyvale fedx 6
Sunnyvale ups
Sue 2
Sunnyvale ups
Ted 5
Sunnyvale ups 7
Sunnyvale 16

Table 2B
CITY
SHIP DEL CUSTOMER CNT
--------- ---- --- -------- ----
Sunnyvale fedx Sam PJ Shoes
3
Sunnyvale fedx Sue PJ Shoes 1
Sunnyvale fedx PJ Shoes 4
Sunnyvale ups
Ted PJ Shoes 2
Sunnyvale ups PJ Shoes 2
Sunnyvale PJ Shoes 6
Sunnyvale con
Sam TeaVana 2
Sunnyvale con Ted TeaVana 1
Sunnyvale con
TeaVana 3
Sunnyvale fedx Sam TeaVana
2
Sunnyvale fedx
TeaVana 2
Sunnyvale ups
Sue TeaVana 2
Sunnyvale ups
Ted TeaVana 3
Sunnyvale ups TeaVana 5
Sunnyvale TeaVana 10




Table 3
A shows the result set of ...
SELECT city, shipper, deliverer, COUNT(*) cnt FROM test_mm
GROUP
BY city, ROLLUP(shipper, deliverer);



Table 3Bshows the result set of ...
SELECT city, shipper, deliverer, customer, COUNT(*) cnt FROM test_mm
GROUP
BY city, ROLLUP(shipper, deliverer),(customer);


Table 4 shows the result of joining the non-grey rows from table3 A and table3 B.I will come back
to this join later. For now, just observe.

Table 4

Sunnyvale con Sam [TeaVana 2]
Sunnyvale con
[TeaVana 3
Sunnyvale fedx Sam
[PJ Shoes 3, TeaVana 2]
Sunnyvale fedx [PJ Shoes 4, TeaVana 2]
Sunnyvale ups Sue
[TeaVana 2]
Sunnyvale ups Ted [TeaVana 3]
Sunnyvale [PJ Shoes 6, TeaVana 10]


In Table 4 I've shown the JOIN result is in a denormalized form. This is done to emphasize the rollup levels
( the first three columns) as unique, non-repeating, entities.

Again, I'll come back to this example from time to time and use it to explain certain concepts.


EXAMPLE 2 - a complex rollup


Table 5 Table and Data Load Script

CREATE TABLE TESTDATA
(
F1
NUMBER, F2 NUMBER, F3 NUMBER, F4 NUMBER, F5 NUMBER,
F6
NUMBER, F7 NUMBER, F8 NUMBER, F9 NUMBER, F10 NUMBER,
F11
NUMBER, F12 NUMBER, F13 NUMBER, F14 NUMBER);

Insert into TESTDATA
(F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14)
Values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, null, 12, 13, 14);
Insert
into TESTDATA
(F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14)
Values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, null, 16, 13, 14);
Insert into TESTDATA
(F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14)
Values (1, 2, 3, 4, 5, 6, 7, 8, 9, null, 11, 12, 13, 14);
Insert into TESTDATA
(F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14)
Values (1, 2, 3, 4, 5, 6, 33, 8, 9, null, 11, 20, 20, 20);
Insert into TESTDATA
(F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14)
Values (1, 2, 3, 4, 5, 6, 4, 8, 9, null, 11, 22, 22, 22);

COMMIT ;




We'll start with a SELECT which we will build upon later Eventually, this will become part of a
SUBQUERY in the final code example.
Don't be intimidated by the seeming complexity of this SELECT statement. I plan on walking through
it one small section at a time.


Table 6 in the beginning there was SELECT, and God saw that it was good

Select X.F1, X.F2, X.F3, X.F4, X.F5, X.F6, X.F7, X.F8, X.F9, X.F10, X.F11,
X.F12, X.F13, X.F14, Count(*) Cnt,
Grouping(X.F1) G1, Grouping(X.F2) G2,
Grouping(X.F2) G3, Grouping(X.F4) G4, Grouping(X.F5) G5,
Grouping(X.F6) G6, Grouping(X.F7) G7, Grouping(X.F8) G8,
Grouping(X.F9) G9, Grouping(X.F10) G10, Grouping(X.F11) G11,
Grouping(X.F12) G12, Grouping(X.F13) G13, Grouping(X.F14) G14
From Testdata X
Group By F1,
F2,
Grouping Sets(Rollup(F3, F4, F5, F6)),
Grouping Sets(Rollup(F7)),
Grouping Sets(Rollup(F8)),
Grouping Sets(Rollup(F9, F10), Rollup(F9, F11)),
Grouping Sets(Rollup((F12, F13, F14)))
Having GROUP_ID() = 0
And Not( F10 Is Null
And Grouping(X.F10) = 0)
And Not( F11 Is Null
And Grouping(X.F11) = 0);



In this example, f1 thru f11 are the dimensions and are analogous to the city/shipper/deliveryperson hierarchy shown in the first example. f12, f13 and f14 represent the entity for which we may or may not produce rollups depending on the aggregated counts of detail records in f1 and f7. f12, f13 and f14 are analogous to ''customer'' from the first example. COUNT(*) is our measure. Having minimums on F1 and F7 was an arbitrary choice for the sake of this example. A real-world example might have minimums applied to any or all levels of the rollup hierarchy



GROUP BY

Lets start by deconstructing the GROUP BY clause.

Table 7

GROUP BY f1, f2,
GROUPING SETS(ROLLUP(f3, f4, f5, f6)),
GROUPING SETS(ROLLUP(f7)),
GROUPING SETS(ROLLUP(f8)),
GROUPING SETS(ROLLUP(f9, f10), ROLLUP(f9, f11))


GROUP BY f1, f2


All f1's will be followed by an f2 value. aggregate values will be given for the f1,f2 combination, but not for f1 alone.
[f1,f2, aggregates] is a valid combination.

Figure 1




GROUPING SETS(ROLLUP(f3, f4, f5, f6))

f3, f4, f5and f6 act as a group, therefor the following combinations are valid

Figure 2


F6 must always be preceded by f5 which must in turn be preceded by f4, which must in turn be preceded by f3.
Combinations like, [f1, f2, f3, f5, f6, aggregates] will not be produced because, in this case, f5 is not preceded by f4.


GROUPING SETS(ROLLUP(f7))

F7 will be appended to all of the previous combinations which adds these new combinations to those previously mentioned.

Figure 3




GROUPING SETS(ROLLUP(f8))

F8 will be appended to all of the previous combinations which adds these new combinations to those previously mentioned

Figure 4




GROUPING SETS(ROLLUP(f9, f10), ROLLUP(f9, f11))

This grouping set says three things:
First, make additional combinations by appending f9 to each of the previously mentioned combinations.

Figure 5



Second, make additional combinations by appending f10 to any sequence ending with an f9, giving ...

Figure 6



Third, make additional combinations by appending f11 to any sequence ending with an f9 as shown below.
Note that f10 is never followed by f11. This "GROUPING SETS(ROLLUP(f9, f10), ROLLUP(f9, f11))" structure is useful for handling mutually exclusive columns of data. For example you might have [State + MetroArea] or [State + City] but never both MetroArea and City.


Figure 7







Figure 8




Having

Having GROUP_ID() = 0
And Not( F10 Is Null
And Grouping(X.F10) = 0)
And Not( F11 Is Null
And Grouping(X.F11) = 0);


The HAVING clause is doing several things. It might look complicated, but we'll address it one piece at a time starting with the first line. GROUP_ID() = 0 prevents duplicate rollup rows. I am not going to talk about why there could be duplicate rollups. If you are interested, a simple web search will get that information. I can't think of any reason why anybody would ever want duplicate rollups.

The next few predicates dealing with nulls are there to clean up a problem that was created by the bifurcated treatment of f10 and f11. It might help to recall what was said earlier:

Note that f10 is never followed by f11. This "GROUPING SETS(ROLLUP(f9, f10), ROLLUP(f9, f11))" structure is useful for handling mutually exclusive columns of data. For example you might have [State + MetroArea] or [State + City] but never both MetroArea and City.



So, when f10 has a value, f11 will be null, and visa-versa. These nulls in the input data will cause rollups that we are not interested in. The following SELECT statement demonstrates this point. If you are not familiar with the GROUPING function, simply put, it produces a 1 or a 0 which indicates if the function argument column is a data level or a summary level for the current row of the result set. For example, "GROUPING(x.f10)" tells if x.f10 is a data level or a summary level for the current row. 1 means summary level, 0 means data level.


Select X.F9,
X.F10,
X.F11,
Count(*) Cnt,
Grouping(X.F9) G9,
Grouping(X.F10) G10,
Grouping(X.F11) G11
From Testdata X
Group By F1,
Grouping Sets(Rollup(F9, F10), Rollup(F9, F11))
Having GROUP_ID() = 0;

Figure 9 SELECT showing only bifurcated part of rollup

In the resulting table below, we can see that there appears to be three rows that rollup to the f9 level, rows 1, 3 and 5. However, when we look at the GROUPING for the first row, we see that it is really a (f9, f10) level rollup where f10 is null. Likewise, row 3 is really a (f9, f11) level rollup where f11 is null. If the data corresponded to our [State + MetroArea] or [State + City]

Table 8 Bifurcated rollup result set using table testdata

f9

f10

f11

cnt

g9

g10

g11

9

3

0

0

1

9

10

2

0

0

1

9


2

0

1

0

9

11

3

0

1

0

9

5

0

1

1

5

1

1

1


If the data corresponded to our [State + MetroArea] or [State + City] example, the data might look something like this:
Consider [CA,null]. This would be an aggregate level that represented all Metro Areas for CA which are unknown. Since all of our City records have 'unknown' as the Metro, the cnt = 3 on row 1 is the count (or rollup) of all records which contain valid City data! Clearly, this is not the right way to go about getting "count(*) for all cities". If you did want that information, the proper way to get it would be to add a LocationType column so that the rollup might look like [City,LocationType, Metro,City].. The question at hand is how to get rid of rows 1 and 3 in Table 8 which for our application are nonsense. So, how might we state the problem in English? How about:
"I want to drop any row that looks like a Metro level rollup but has a null for the metro." You might be temped to try a HAVING clause that looks like:
"HAVING METRO IS NOT NULL" , but oops, that would take out rows 3, 4, 5 and 6 as well. How about

"HAVING METRO IS NOT NULL AND CITY IS NOT NULL" . Almost, it will get rid of rows 1 and 3 as desired; unfortunately it will also get rid of rows 5 and 6, which we'd kind of like to keep. What we would really like to do is to get rid of [State=CA,Metro=null] records ONLY when Metro is not supposed to be null. And how do we know if Metro is supposed to be null or not? If you haven't guess by now, we'll have to look at the GROUPING columns. Remember, a 0 means that we should expect to find data, not a NULL. Right there in the gMetro column of row 1 it tells us that Metro is not a rollup level.

Table 9 bifurcated rollup substituting example names

State

Metro

City

cnt

gState

gMetro

gCity

CA

3

0

0

1

CA

SF Bay Area

2

0

0

1

CA


2

0

1

0

CA

Baker

3

0

1

0

CA

5

0

1

1

5

1

1

1


Figure 10 Solution for removinf NULL rollup rows

Select X.F9,
X.F10,
X.F11,
Count(*) Cnt,
Grouping(X.F9) G9,
Grouping(X.F10) G10,
Grouping(X.F11) G11
From Testdata X
Group By F1,
Grouping Sets(Rollup(F9, F10), Rollup(F9, F11))
Having GROUP_ID() = 0
And Not( F10 Is Null And Grouping(X.F10) = 0)
And Not( F11 Is Null And Grouping(X.F11) = 0);


One last thing: The 3 "Grouping" statements in the select list are not required. They are only there as a visual aid.
to use during the debugging phase of developing the rollup. When you know that you are getting the desired results
you can remove them as follows:



Table 10

Select X.F9,
X.F10,
X.F11,
Count(*) Cnt
From Testdata X
Group By F1,
Grouping Sets(Rollup(F9, F10), Rollup(F9, F11))
Having GROUP_ID() = 0
And Not( F10 Is Null And Grouping(X.F10) = 0)
And Not( F11 Is Null And Grouping(X.F11) = 0);


In the next post, part 2 of this article, we will add on more bells and whistles to this basic rollup. That's where "combine the features of SQL" part comes in. This is only the beginning of what will become a much larger, much more complex SQL statement.

No comments:

Labels