Thursday, April 26, 2012

Visualizing too much data

This week, I got a chance to study a dataset of 13 million rows. Luckily, the file contains no strings, so my (local) R reads it just fine, although it takes a while. The original goal was to find a model, or at least a systematic pattern, describing the relationship between two fields in the dataset. Most of the time it's a good idea to take a peek at the data before rolling up your sleeves, so I ended up facing the problem of visualizing too much data.
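For the record, loading the file is just plain read.csv. A minimal sketch (the file name is hypothetical, and the assumption that every column is numeric is mine):

    # hypothetical file name; passing colClasses up front skips
    # read.csv's type-guessing pass, which helps on a file this size
    dat <- read.csv("two_fields.csv", colClasses = "numeric")
    nrow(dat)  # about 13 million rows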

Of course, brute-force plotting of every point won't work. The points pile on top of each other, the density of the data gets lost in the sea of points, and the command takes forever to run. A better solution is a hexbin plot, which can handle millions of data points. The data points are first assigned to hexagons that cover the plotting area, then the points falling in each cell are counted, and finally the hexagons are plotted on a color ramp. R has a hexbin package that draws hexbin plots and provides a few more interesting functions, and the ggplot2 package has a stat_binhex function. A sketch of both routes is shown below.
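Here is a minimal sketch of both, using simulated x and y in place of the two real fields (their actual names and values are not shown here):

    library(hexbin)
    library(ggplot2)

    # simulated data standing in for the real two fields
    n <- 1e6
    x <- rnorm(n)
    y <- x + rnorm(n)

    # hexbin package: assign points to hexagonal cells, count per cell,
    # then plot the counts on a color ramp
    hb <- hexbin(x, y, xbins = 50)
    plot(hb, colramp = colorRampPalette(c("grey90", "steelblue", "darkblue")))

    # ggplot2: the same idea via stat_binhex
    ggplot(data.frame(x, y), aes(x, y)) + stat_binhex(bins = 50)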

[Figure: hexbinning those 13 million data points]
Quite surprisingly, I did not find much literature online about this binning technique, but a few things worth noting are:

  • Enrico Bertini has a post on things that can be done to deal with visualizing a lot of data.
  • Zachary Forest Johnson has a post devoted to hexbins exclusively, which is very helpful.
  • Last but not least, the hexbin package documentation explains why hexagons rather than squares, and describes the algorithms used to generate those hexagons.
