Thursday, April 26, 2012

Visualizing too much data

This week, I got a chance to study a dataset of 13 million rows. Luckily, the file has no strings in it, so my (local) R reads it just fine, only taking a bit longer. The original idea is to find a model/systematic pattern that describes the relationship between two fields in the dataset. Most of the time, it's a good idea to take a peek at things before rolling up my sleeves. So now I ended up facing the problem of visualizing too much data.

Of course, brute-force plotting of every point won't work. The points will pile on top of each other, the density of the data gets lost in the sea of points, and the command takes forever to run. A better solution is a hexbin plot, which can handle millions of data points. The data points are first assigned to hexagons that cover the plotting area, then a head count is done for each cell, and finally the hexagons are plotted on a color ramp. R has a hexbin package that draws hexbin plots and provides a few more interesting functions. The R ggplot2 package also has a stat_binhex function.

[Figure: hexbinning that 13 million data points]
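As a rough sketch of the binning idea described above (here in Python with matplotlib's hexbin rather than the R packages mentioned earlier; the arrays are made-up stand-ins for the two fields):

import numpy as np
import matplotlib.pyplot as plt

# Made-up data standing in for the two fields of the 13-million-row dataset
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = 2.0 * x + rng.normal(size=1_000_000)

# Assign points to hexagonal cells, count per cell, and map the counts onto a color ramp;
# a log scale keeps a few very dense cells from washing out everything else
plt.hexbin(x, y, gridsize=50, cmap="viridis", bins="log")
plt.colorbar(label="log10(count)")
plt.xlabel("field 1")
plt.ylabel("field 2")
plt.show()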
Quite surprisingly, I did not find a lot of literature online regarding this binning technique. But some things worth noting are:

  • Enrico Bertini has a post regarding things that can be done when visualizing a lot of data.
  • Zachary Forest Johnson has a post devoted to hexbins exclusively, which is very helpful.
  • Last but not least, the hexbin package documentation talks about why hexagons rather than squares, and the algorithm used to generate those hexagons.

Tuesday, April 17, 2012

Some Python

The other day, I was trying to flag a recommendation that had already been posted (to the Redis server for quick lookup) in a dictionary of recommendations. Then in the next round, I could do some weighted random sampling among the unposted ones, without actually going through the entire recommendation calculation again. Anyway, it's a little bit tricky to neatly flag a 'used' recommendation in a Python dictionary. Fortunately, someone has already provided a solution, which I'll borrow here for quick reference.
>>> x = {'a':1, 'b': 2}
>>> y = {'b':10, 'c': 11}
>>> z = dict(x.items() + y.items())
>>> z
{'a': 1, 'c': 11, 'b': 10}

b's value is properly overwritten by the value in the second dictionary. In Python 3, this is the suggested way:
>>> z = dict(list(x.items()) + list(y.items()))
>>> z
{'a': 1, 'c': 11, 'b': 10}
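Tying this back to the recommendation problem above, here is a minimal sketch of how the merge idiom could be used to flag a posted item (the dict layout and names are made up for illustration):

# A made-up recommendations dict; the keys and structure are illustrative only
recs = {
    'item_1': {'score': 0.9, 'used': False},
    'item_2': {'score': 0.7, 'used': False},
    'item_3': {'score': 0.4, 'used': False},
}

def mark_posted(recs, rec_id):
    # Flag one recommendation as posted by merging an override dict into its entry;
    # the merged value wins, just like 'b' above
    recs[rec_id] = dict(list(recs[rec_id].items()) + list({'used': True}.items()))

mark_posted(recs, 'item_2')

# Next round: only the unposted ones are candidates for weighted random sampling
unposted = dict((k, v) for k, v in recs.items() if not v['used'])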


And I saw a cool post about using Python's map, reduce, and filter functions on a dictionary. Once again, something that can be accomplished cleanly and flexibly. Python is such a beautiful language.
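A minimal sketch of that idea, applying map, filter, and reduce to a dictionary's items (the data is made up for illustration):

from functools import reduce  # in Python 3, reduce lives in functools

scores = {'item_1': 3, 'item_2': 8, 'item_3': 5}

# filter: keep only the high-scoring entries
high = dict(filter(lambda kv: kv[1] >= 5, scores.items()))           # {'item_2': 8, 'item_3': 5}

# map: double every score, rebuilding the (key, value) pairs
doubled = dict(map(lambda kv: (kv[0], kv[1] * 2), scores.items()))   # {'item_1': 6, 'item_2': 16, 'item_3': 10}

# reduce: sum all the scores
total = reduce(lambda acc, kv: acc + kv[1], scores.items(), 0)       # 16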