Almost any big data pipeline has an aggregation step for analyzing multidimensional data. For example, a sales dataset might contain tuples of the form
<year, month, day, hour, minute, second, city, state, country, sales>
where <year, month, day> are attributes of the date dimension and <hour, minute, second> are attributes of the time dimension; together these are the so-called temporal dimensions.
<city, state, country> are attributes of the location dimension.
In OLAP data cube terminology, a cube group denotes an actual group belonging to a cube region defined by certain dimension attributes. E.g., the cube region <search_topic, device_type, *, *> has corresponding cube groups <technology, iPhone, *, *>, <technology, android, *, *>, <news, iPhone, *, *>, etc.
Note that * represents the all category, which includes every attribute value in that dimension.
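To make the cube group idea concrete, here is a minimal sketch that enumerates every cube group a single record belongs to: each dimension is either kept at its actual value or rolled up to the all category, giving 2^d groups for d dimensions. The record and dimension names are hypothetical, modeled on the search example above.

```python
from itertools import product

def cube_groups(record, dims):
    """Enumerate every cube group a record belongs to: each dimension
    is either its actual value or '*' (the all category), so a record
    with d dimensions falls into 2^d cube groups."""
    choices = [(record[d], "*") for d in dims]
    return [group for group in product(*choices)]

# Hypothetical search-log record matching the <search_topic, device_type> example.
record = {"search_topic": "technology", "device_type": "iPhone"}
groups = cube_groups(record, ["search_topic", "device_type"])
# groups contains ('technology', 'iPhone'), ('technology', '*'),
# ('*', 'iPhone'), and ('*', '*').
```

Each of those four tuples is a cube group in one of the four cube regions <search_topic, device_type>, <search_topic, *>, <*, device_type>, and <*, *>.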
Many pipelines need to compute aggregate measures such as SUM, REACH, or TOP K over all possible aggregate/cube groups defined by the above dimensions, e.g. total sales for San Francisco, CA, USA in February 2015, or the number of unique buyers in San Francisco, CA, USA over the last quarter.
Nowadays, accumulating terabytes of data per day is not uncommon in many companies, which makes the distinction between the two types of aggregate measures, so-called algebraic and holistic measures, important.
For an algebraic measure, e.g. SUM or COUNT, we can compute the result distributively, part by part, from subgroups:
SUM( sales day 1, sales day 2, sales day 3) = SUM(sales day 1, sales day 2) + SUM(sales day 3)
For a holistic measure, e.g. DISTINCT_COUNT or TOP_K, we cannot compute the result in a distributive fashion, which limits how the aggregation job can be parallelized.
DISTINCT_COUNT(users in day 1 and day 2) is NOT equal to DISTINCT_COUNT(users in day 1) + DISTINCT_COUNT(users in day 2)
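A minimal sketch of why merging per-day counts fails, using hypothetical user sets (users active on both days get double-counted):

```python
# Hypothetical per-day active-user sets.
day1_users = {"alice", "bob", "carol"}
day2_users = {"bob", "dave"}

# Adding per-day distinct counts double-counts "bob".
naive = len(day1_users) + len(day2_users)   # 5, wrong
# An exact answer requires keeping the raw sets and taking their union.
exact = len(day1_users | day2_users)        # 4, correct
```

In practice, approximate mergeable sketches such as HyperLogLog are often used to regain distributive-style merging for DISTINCT_COUNT, at the cost of a small counting error.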
- Size of intermediate data
The size of the intermediate data emitted during the map phase depends on the number of cube regions and the size of the input data. As we increase the number and depth of the dimensions being aggregated, a naive approach emits an exponentially growing amount of data during the map phase, which causes disk-space problems and incurs additional shuffling cost.
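The blowup is easy to quantify: a naive mapper emits one (cube group, measure) pair per group, i.e. roughly input_records * 2^d pairs for d roll-up dimensions. A back-of-the-envelope sketch with a hypothetical input volume:

```python
# Intermediate pairs emitted by a naive map phase: one pair per cube
# group, i.e. input_records * 2^d for d dimensions (assuming each
# dimension can be rolled up to '*').
input_records = 1_000_000_000  # 1B input rows, a hypothetical daily volume
emitted = {d: input_records * 2 ** d for d in (3, 6, 10)}
# 3 dims -> 8B pairs, 6 dims -> 64B pairs, 10 dims -> ~1T pairs.
```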
- Size of large groups