Friday, July 15, 2011

Overlay two heatmaps in R

To-do: two heatmaps on two data sets with same dimension, overlay one over the other. For example, you have now heatmap 1 and 2 like these (generated in R, dataset is 4 by 2). How to overlay red one on top of the green one?


Well, I really hope this problem is as easy as overlay two pieces of paper. Unfortunately there are a few extra miles to go. The steps are clear though:
  1. get the rgb number on the corresponding position for paint and background heatmap, regardless of the values in the dataset
  2. merge two rgbs cell by cell
  3. plot the result rgb color matrix.
Before that, understanding how R assign colors to data from a color list, is needed. I actually played with some toy datasets and color list. Then I found that R will arrange data from smallest to biggest, and then assign colors according to the rank, so that the smallest and biggest will get the beginning and end of the color list. Say you have color c1 to color c8, and your ranked data set is (1,2,3,4). R will assign color c8 to 4, color c1 to 1. Color b and c will get in this case get color c3 (ceiling ((2-1)/(4-1)*8) and color c6. Thus given the color list a heatmap uses and the dataset that's been plotted, one can calculate which color is used to plot which data point. Well, I wrote a R function to accomplish this

heatmap_col <- function(val_array, col_list) {

minv = min(val_array)
maxv = max(val_array)

out = val_array
for( i in 1:nrow(val_array))
for( j in 1:ncol(val_array))
# take max of 1 or the other value in case 0 is the ceiling result
out[i,j] = col_list[max(1, ceiling((val_array[i,j] - minv) / (maxv - minv) * length(col_list)))]

out
}

With this then, I can easily get the rgb used on each cell of the heatmap. However, the returns are simply rgb code, like '#FF000045'(red,green,blue,alpha). We need to know the exact numbers of red, green, blue and even alpha, which are critical for the blending. After some serious googling, I found this colorful presentation very helpful. It contains a hex constant table, which could be used to break down rgb color into 4 numbers between 0 and 1, corresponding to red, green, blue and alpha. R code is here

hex_constant = data.frame(hex=c(seq(from=0, to=9, by=1), toupper(letters[1:6])), decimal=seq(from=0, to=15, by=1), stringsAsFactors=F)


rgb2hex_number <- function(x, hex_constant){

num_collector = hex_constant$decimal[hex_constant$hex == toupper(substr(x, start=2, stop=2))]
for(i in 3:nchar(x))
num_collector = c(num_collector, hex_constant$decimal[hex_constant$hex == toupper(substr(x, start=i, stop=i)]))

R = (num_collector[1]*16+num_collector[2])/255
G = (num_collector[3]*16+num_collector[4])/255
B = (num_collector[5]*16+num_collector[6])/255

with_alpha = ifelse(nchar(x) > 7, TRUE, FALSE)

if(with_alpha) {
A = (num_collector[7]*16+num_collector[8])/255
return(c(R, G, B, A))
} else { # fill in alpha of 0
return(c(R, G, B, 0))
}
}


The next step is to merge two rgb numbers on the same position. Two stackoverflow posts (here and there) talked about how to do it in python. If one happens to look up in wikipedia, one has to be skeptical with it.
Wiki is wrong about the operation could be associative. It's actually not, meaning switching background and paint colors while blending can yield different color. A quick algebra is enough. The R code for this step is trivial.

mix_2rgb <- function(x, y){
# x is the paint color, y is the background color
# NOTICE: this operation is not associative. Switching x and y may yield different result.

# lookup table of hex and decimal
hex_constant=data.frame(hex=c(seq(from=0, to=9, by=1), toupper(letters[1:6])), decimal=seq(from=0, to=15, by=1), stringsAsFactors=F)

# belending two rgb colors evenly
x_out = rgb2hex_number(x, hex_constant)
y_out = rgb2hex_number(y, hex_constant)

# with alpha channel considered
ax = 1- (1 - x_out[4]) * (1 - y_out[4])

if( ax > 0 ) {

rx = x_out[1] * x_out[4] / ax + y_out[1] * y_out[4] * (1 - x_out[4]) / ax
gx = x_out[2] * x_out[4] / ax + y_out[2] * y_out[4] * (1 - x_out[4]) / ax
bx = x_out[3] * x_out[4] / ax + y_out[3] * y_out[4] * (1 - x_out[4]) / ax

return(rgb(rx, gx, bx, ax))

} else {

rx = (x_out[1] + y_out[1]) / 2
gx = (x_out[2] + y_out[2]) / 2
bx = (x_out[3] + y_out[3]) / 2

return(rgb(rx, gx, bx))
}
}


The last step is a little bit tricky. Since all I have for now is the colors I want to plot at each cell, but I don't have a data matrix to pass to "heatmap" function. So I have to do a hack using "rect" function. Anyway, This is the end result. Not bad, huh?

Thursday, July 7, 2011

A Few Things Learned on Hadoop Streaming

In the past couple of days, I tried to run some map-reduce jobs on EMR through python streaming. The API I used is boto. It's really basic and not very documented. I was only able to find one example. One thing I learned the hard way is about the data coming out of hive. Surprising, no matter what input format (in terms of separators), the data out of hive is always 'ctrl-A' separated. Check this.