Showing posts with label data visualization. Show all posts

Thursday, April 26, 2012

Visualizing too much data

This week, I got a chance to study a dataset of 13 million rows. Luckily, the file has no strings in it, so my (local) R reads it just fine, though it takes a while. The original idea was to find a model or systematic pattern that describes the relationship between two fields in the dataset. Most of the time, it's a good idea to take a peek at things before rolling up your sleeves. So I ended up facing the problem of visualizing too much data.

Of course, brute-force plotting of every point won't work. The points step on top of each other, the density of the data gets lost in the sea of points, and the command takes forever to run. A better solution is a hexbin plot, which can handle a million-plus data points. The data points are first assigned to hexagons that cover the plotting area. Then the points in each cell are counted. Finally, the hexagons are plotted on a color ramp according to their counts. R has a hexbin package that draws hexbin plots and offers a few more interesting functions. The ggplot2 package also has a stat_binhex function.
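A minimal sketch of the hexbin approach described above, assuming the hexbin package is installed. The simulated `x` and `y` are stand-ins for the two fields of the real dataset (which is not shown here), and the number of bins and the colors are arbitrary choices.

```r
library(hexbin)

# Simulated stand-in data: one million points with a linear relationship plus noise
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Assign every point to a hexagonal cell; xbins is the number of cells along the x range
hb <- hexbin(x, y, xbins = 60)

# Draw the cells, mapping the count in each cell to a color ramp
plot(hb, colramp = colorRampPalette(c("lightyellow", "darkred")))
```

Every point lands in exactly one cell, so the cell counts stored in `hb@count` add up to `n`. Only the non-empty cells are drawn, which is why this stays fast when the raw scatterplot does not.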

Hexbinning those 13 million data points
Quite surprisingly, I did not find much literature online about this binning technique. But a few things worth noting are:

  • Enrico Bertini has a post about things that can be done to deal with visualizing a lot of data.
  • Zachary Forest Johnson has a post devoted to hexbins exclusively, which is very helpful.
  • Last but not least, the hexbin package documentation explains why it uses hexagons rather than squares, and describes the algorithm that generates the hexagons.

Monday, February 13, 2012

Funnel plot, bar plot and R

I just finished my Omniture SiteCatalyst training in McLean, VA a few days ago. It was OK (somewhat boring): we only went through how to click buttons inside SiteCatalyst to generate reports, not how to implement it and make it track the information we want to track.

I got two impressions out of the class: one is that Omniture is a great and powerful web analytics tool; the other is that funnel plots can be misleading from a data visualization perspective. For example, regardless of why the second event, 'Reg Form Viewed', has a higher frequency than the first event, 'Web Landing Viewed', the funnel bar for the second event is still narrower than the one for the first, simply because it is designed to be the second stage in the funnel report.


This is a typical example of visualization components not matching the numbers. Other types of funnel plots can be misleading as well, as Jon Peltier points out in his blog article. I totally agree with him on using a simple bar plot as an alternative to the funnel plot. I also like his idea of adding another plot to visualize a small yet important metric, like purchases in his example.

Then I turned to R to see if I could do some quick poking around and turn the misleading funnel I have here into something meaningful and, hopefully, beautiful. Since I always feel like I don't have a good grasp of barplots in R, this is going to be a good exercise for me.

As always, figuring out the three-letter parameters of the base package plot functions is painful. I also had to set margins of an appropriate size so that my category names wouldn't be cut off.

Then I drew the same plot using ggplot2. All the command names make sense, and the plot is built up layer by layer. However, I did not manage to get the x-axis to the top of the plot, which, as far as I could tell, would involve creating a new customized geom.


There are some nice R barchart tips on the web, for example on learning_r, stackoverflow, and the ggplot2 site. Anyway, this is what I used:

##### barchart

dd = data.frame(cbind(234, 334, 82, 208, 68))
colnames(dd) = c('web_landing_viewed', 'reg_form_viewed', 'registration_complete', 'download_viewed', 'download_clicked')
dd_pct = round(unlist(c(1, dd[,2:5]/dd[,1:4]))*100, digits=0)

# plain horizontal barchart
# control the outer margins (in inches) so the category names can be squeezed into the plot
par(omi=c(0.1,1.4,0.1,0.12))
# las sets the direction of the tick labels on the x and y axes; it ranges over 0-3, so 4 combinations
mp<-barplot(as.matrix(rev(dd)), horiz=TRUE, col='gray70', las=1, xaxt='n')
tot = paste(rev(dd_pct), '%')
# add percentage numbers just past the end of each bar
text(unlist(rev(dd))+17, mp, format(tot), xpd=TRUE, col='blue', cex=.65)
# axis on top (side=3); 'at' gives the tick locations; las: labels parallel or perpendicular to the axis
axis(side=3, at=seq(from=0, to=30, by=5)*10, las=0)

# with ggplot2
library(ggplot2)
stages = c('web_landing_viewed', 'reg_form_viewed', 'registration_complete', 'download_viewed', 'download_clicked')
# reversed factor levels keep the funnel order from top to bottom after coord_flip(); a plain character column would be sorted alphabetically
dd2=data.frame(metric=factor(stages, levels=rev(stages)), value=c(234, 334, 82, 208, 68))

ggplot(dd2, aes(metric, value)) + geom_bar(stat='identity', fill=I('grey50')) + coord_flip() + ylab('') + xlab('') + geom_errorbar(aes(ymin = value+10, ymax = value+10), size = 1) + geom_text(aes(y = value+20, label = paste(dd_pct, '%', sep=' ')), vjust = 0.5, size = 3.5)
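On the x-axis problem mentioned earlier: newer ggplot2 releases (2.2.0 and later, as far as I know) let a continuous scale choose its side through a `position` argument, so a custom geom may no longer be needed. The sketch below is an assumption about how that would look for this chart, not what was used for the plot in this post. The names `stages`, `dd3` and `p` are made up for the example.

```r
library(ggplot2)

stages <- c('web_landing_viewed', 'reg_form_viewed', 'registration_complete',
            'download_viewed', 'download_clicked')
# Reversed factor levels so the first funnel stage ends up at the top after coord_flip()
dd3 <- data.frame(metric = factor(stages, levels = rev(stages)),
                  value  = c(234, 334, 82, 208, 68))

p <- ggplot(dd3, aes(metric, value)) +
  geom_col(fill = 'grey50') +
  coord_flip() +
  # After coord_flip() the value (y) scale is drawn horizontally,
  # so position = "right" should place it along the top edge
  scale_y_continuous(position = "right") +
  xlab('') + ylab('')
print(p)
```

Because the coordinates are flipped, it is the y scale that has to be moved, not the x scale, which is easy to get backwards.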

Tuesday, January 18, 2011

Visual That


Yesterday I opened my Businessweek magazine and saw this picture in an article about this year's CES in Las Vegas. The picture compares the size and weight of Panasonic's new garage-size 3D TV to things people are very familiar with, like Kobe Bryant's height, the length of a Mini Cooper, and the weight of a cow, so that readers get a rough idea of how big the TV is.


Precision of measurement is not very important in this case; familiarity and approximation are the key. And this is an idea that has sat in my mind for a month now. How desperately I want to realize that idea by building my own website! I actually already have a name for the website. It's called "visualthat.com"; isn't it a cool name? Maybe I don't have the ability to build it at the moment. At the very least, I can tape my thoughts here. :)

What I have been thinking of is a website that lets users input the metric they are interested in (for example height, weight, volume, currency, etc.) and the amount as a number. The website then spits out a visualization of that amount compared against something the average person is familiar with. For example, say I am interested in 1 yard. The website spits out a picture of an SUV and visualizes how much a yard is compared to its height. The visualization could chop off part of the SUV (supposing the SUV is taller than 1 yard), or put a ruler next to the picture indicating the approximate position of 1 yard. Another example: I am interested in how much 1 dollar is worth in Chinese currency. I input 1 dollar, and it shows me a burger from McDonald’s dollar menu, and probably 3/4 of the same thing over a map of China, or 2 apples. To make it more fun, users can choose the baseline they’d like to compare to. An app for mobile devices could be created too.
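The lookup behind such a site could be sketched roughly as follows, leaving the pictures out. Everything here is hypothetical: the `baselines` table, the `compare_to_baseline` function, and especially the reference sizes, which are rough placeholders and not measured data. Only the unit conversion factors are exact definitions.

```r
# Hypothetical reference objects; the sizes are rough placeholders, not measured data
baselines <- data.frame(
  name   = c("SUV height", "Mini Cooper length"),
  metres = c(1.7, 3.7),
  stringsAsFactors = FALSE
)

# Conversion factors to metres (exact by definition)
to_metres <- c(yard = 0.9144, foot = 0.3048, metre = 1)

# Express a user-supplied length as a share of a chosen baseline object
compare_to_baseline <- function(amount, unit, baseline_name) {
  ref <- baselines[baselines$name == baseline_name, ]
  stopifnot(nrow(ref) == 1, unit %in% names(to_metres))
  ratio <- amount * to_metres[[unit]] / ref$metres
  sprintf("%g %s is about %.0f%% of %s", amount, unit, 100 * ratio, ref$name)
}

compare_to_baseline(1, "yard", "SUV height")
```

The ratio is what a drawing would use, for example to decide where to place the ruler mark next to the SUV picture. Letting users pick the baseline then comes down to choosing a different row of the table.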

The closest thing I can find on the internet is Wolfram Alpha. However, I don’t quite like that (1) it has no pictures and (2) too many query results are crowded onto one page. This Scale of Universe from Primax Studio is also cool, but it does not (1) allow user input or (2) let users choose a baseline.

Anyway, this is going to be a very fun project: doing measurement conversion on a user's query and visualizing the result. And it’s going to be a challenging one! Hmmm, maybe I should start learning some HTML and JavaScript.