HyperLogLog | Wei Shung Chung

To understand HyperLogLog (HLL), one should read the LogLog and HyperLogLog papers.

http://algo.inria.fr/flajolet/Publications/DuFl03-LNCS.pdf (LogLog)

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf (HyperLogLog)

The major difference between the two is one uses regular mean and the other (HLL) uses harmonic mean to average the estimate cardinality calculated by different m buckets. I would view these buckets as different experiments.

The accuracy (standard error) is controlled by the number of buckets, m. The more buckets the better is the estimation. But, HLL has better accuracy compared to LogLog given the same number of buckets.

The standard error for HLL is

The standard error for LogLog is

The following is the LogLog algorithm taken from the paper

Basically, there are m buckets, each one is keeping track of the run of consecutive zeros in the bit string. We then figure out the max run of consecutive zeros (position of leftmost 1) in all the buckets.

eg. 1 => 1, 001 =>3

Then using the cardinality estimate formula shown in the last line, calculate the estimate unique counts. As you see from the formula, it involves taking the average of the run of zeros observed in these buckets. (Note: the alpha is the bias correction applied to the cardinality estimate)

For HyperLogLog, instead of using the average, harmonic mean (stochastic averaging) is used. Please refer to the HLL paper for more detailed description of the formula.

Taken from the HLL paper,

In MapReduce, distinct/unique count of large data set is very common but unfortunately it is not scalable because it requires one reducer. It gets worse when you have to perform unique count across different aggregation/segment groups. Again, unique counting across different aggregation granularities whether in terms of time dimension or in combination with other demographic attributes is a common practice in big data analytics. Eg. In your massive data sets, let’s find unique counts of users for every hour, every day, every week, every month and etc.

HyperLogLog approximate unique counting can be used to provide a scalable solution. Keep in mind, the unique count here is a probabilistic approximate counts.You can implement either a map side hyperloglog counting or reduce side hyperloglog counting. However, map side counting requires more memory and beware of out of memory issue if you have many segments to perform unique count on.

For more information, you can check out the following pages

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

http://stefanheule.com/papers/edbt2013-hyperloglog.pdf

http://tech.adroll.com/media/hllminhash.pdf

http://www.looker.com/news/blog/practical-data-science-amazon-announces-hyperloglog