Almost every big data pipeline has an aggregation step for analyzing multidimensional data. For example, a sales dataset may have the following schema:
<year, month, day, hour, minute, second, city, state, country, sales>
where <year, month, day> are attributes of the date dimension and <hour, minute, second> are attributes of the time dimension. These are the so-called temporal dimensions.
<city, state, country> are attributes of the location dimension.
In OLAP data cube terminology, a cube group denotes an actual group belonging to the cube region defined by a particular set of dimension attributes. E.g., the cube region <search_topic, device_type, *, *> contains the actual cube groups <technology, iPhone, *, *>, <technology, android, *, *>, <news, iPhone, *, *>, etc.
Note that * represents the "all" category, which includes every attribute value in that dimension.
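As a small sketch of this idea, the cube groups that a single record belongs to can be enumerated by replacing each subset of its dimension values with *. The record and dimension values below are made up for illustration:

```python
from itertools import product

def cube_groups(record):
    """Enumerate all 2^d cube groups a record belongs to: each
    dimension keeps its actual value or collapses to '*'."""
    choices = [(value, "*") for value in record]
    return list(product(*choices))

record = ("technology", "iPhone", "2015-02", "USA")
groups = cube_groups(record)
# Each of the 2^4 = 16 groups receives this record during aggregation,
# from the fully specific <technology, iPhone, 2015-02, USA>
# down to the fully aggregated <*, *, *, *>.
```

This also makes it clear why the number of groups grows exponentially with the number of dimensions, which matters for the intermediate-data issue discussed below.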
Many pipelines need to compute aggregate measures such as SUM, REACH, or TOP_K over all possible aggregate/cube groups defined by the above dimensions. E.g., total sales for San Francisco, CA, USA in February 2015, or the number of unique buyers in San Francisco, CA, USA over the last quarter.
Nowadays, accumulating terabytes of data per day is not uncommon in many companies. Aggregate measures come in two flavors, the so-called algebraic and holistic measures.
For an algebraic measure, e.g. SUM or COUNT, we can compute the result distributively, part by part, from subgroups:
SUM( sales day 1, sales day 2, sales day 3) = SUM(sales day 1, sales day 2) + SUM(sales day 3)
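This distributive property means any partitioning of the input yields the same answer, so partial sums can be computed in parallel and combined. A minimal sketch with made-up numbers:

```python
def partial_sums(partitions):
    # Each mapper/combiner emits a partial sum for its partition...
    return [sum(p) for p in partitions]

def total(partitions):
    # ...and the reducer adds the partials; how the input was
    # partitioned does not affect the result.
    return sum(partial_sums(partitions))

day1, day2, day3 = [10, 20], [5], [7, 8]
# Grouping (day1, day2) together or keeping all three apart is equivalent:
assert total([day1, day2, day3]) == total([day1 + day2, day3]) == 50
```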
For a holistic measure, e.g. DISTINCT_COUNT or TOP_K, we cannot compute the result in a distributive fashion, which limits how the aggregation job can be parallelized.
DISTINCT_COUNT(users in day 1 and day 2) != DISTINCT_COUNT(users in day 1) + DISTINCT_COUNT(users in day 2)
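The non-distributive behavior is easy to demonstrate with sets (user names made up): the per-day counts cannot simply be added, because users active on both days get double-counted; only the underlying sets can be combined.

```python
day1_users = {"alice", "bob", "carol"}
day2_users = {"bob", "dave"}

# Summing per-day distinct counts double-counts "bob":
naive = len(day1_users) + len(day2_users)   # 5
# The true distinct count needs the underlying sets:
exact = len(day1_users | day2_users)        # 4
assert naive != exact
```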
Beyond the holistic measure challenge above, other issues in big data aggregation are:
- Size of intermediate data
The size of the intermediate data emitted during the map phase depends on the number of cube regions and the size of the input. As we increase the number and depth of the dimensions, a naive approach emits an exponentially growing volume of data during the map phase, which exhausts disk space and incurs additional shuffle cost.
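To make the blowup concrete: a naive cubing job emits one intermediate record per cube group per input record, i.e. 2^d copies for d flat dimensions (more when dimensions have hierarchies such as year/month/day). A back-of-the-envelope sketch with assumed numbers:

```python
def naive_map_output(num_records, num_dimensions):
    # One emitted record per (input record, cube group) pair;
    # each record belongs to 2^d cube groups.
    return num_records * 2 ** num_dimensions

# 1 billion input records over 8 flat dimensions:
emitted = naive_map_output(10**9, 8)  # 256 billion intermediate records
```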
- Reducer-unfriendly groups
During the reduce phase there will likely be some very large cube groups, e.g. <*, *> or <*, *, *, USA>. With a naive approach, the reducers assigned to these groups take far longer to finish than the others. Most of the time, cube regions near the bottom of the cube lattice contain these large groups; we can call them reducer-unfriendly groups, e.g. <*, *>. One could also sample the input to identify the potentially large groups ahead of time.
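One hedged sketch of the sampling idea: aggregate a small random sample, scale the sampled counts back up, and flag any group whose estimated size crosses a reducer-friendliness threshold. The sample rate and threshold below are illustrative assumptions, not recommendations:

```python
import random
from collections import Counter
from itertools import product

def groups_of(record):
    # Every combination of (value, '*') per dimension.
    return product(*[(v, "*") for v in record])

def find_large_groups(records, sample_rate=0.01, threshold=1000, seed=42):
    """Estimate cube-group sizes from a random sample and flag
    likely reducer-unfriendly groups."""
    rng = random.Random(seed)
    counts = Counter()
    for record in records:
        if rng.random() < sample_rate:
            for g in groups_of(record):
                counts[g] += 1
    # Scale sampled counts back to full-data estimates.
    return {g for g, c in counts.items() if c / sample_rate > threshold}
```

On a skewed dataset this flags groups like <US, *> and <*, *> while leaving the many small, fully specified groups alone; the flagged groups can then be handled specially (e.g. split across reducers).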
We can convert some holistic measures, e.g. DISTINCT_COUNT, into a partially algebraic form so that they can be computed from subgroups in parallel. If an exact value is not required, we can also use approximate distinct counting (e.g. sketch-based algorithms such as HyperLogLog).
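A minimal sketch of one such conversion, assuming a two-phase MapReduce-style job: phase one deduplicates (cube group, user) pairs, which parallelizes by partitioning on the pair; after that, DISTINCT_COUNT per group is just a plain COUNT, which is algebraic:

```python
def phase1_dedupe(pairs):
    """Phase 1: reduce (group, user) pairs to distinct pairs.
    Parallelizes by partitioning on the (group, user) key."""
    return set(pairs)

def phase2_count(distinct_pairs):
    """Phase 2: an ordinary COUNT per group -- now algebraic,
    so it can be summed distributively from partitions."""
    counts = {}
    for group, _user in distinct_pairs:
        counts[group] = counts.get(group, 0) + 1
    return counts

pairs = [("USA", "alice"), ("USA", "alice"), ("USA", "bob"), ("CA", "alice")]
assert phase2_count(phase1_dedupe(pairs)) == {"USA": 2, "CA": 1}
```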
To be continued…