Monday, December 17, 2012

End of year data thinking

Recently, Big Analytics 2012 in New York hosted an interesting live panel discussion. The title was quite catchy: "Do you believe in Santa? How about Data Scientist?" Four guests were invited to the panel, including Geoffrey Guerdat, Director of the Data Engineering Group at Gilt.
The guests and the audience had a heated discussion around the topic. Personally, I enjoyed almost all of Geof's comments and opinions. He described precisely what I have been observing and thinking this year. Below, I highlight a few points that really resonated with me.
  • Team building

The Data Engineering group has 12 members covering 3 main areas: Business Intelligence, Data Engineering and Data Science. This is the mix I'd like to see and be involved with as well. Data Engineers are extremely important, in my opinion sometimes even more important than both BI and DS analysts. They are in charge of the plumbing (ETL) and of making sure everything works. Organizations normally start their data team with BI, so BI has a longer history and more "credibility and reputation" than the fancy DS. Having a good BI sub-team ensures that the company has access to vital measurements and can make smarter decisions. The DS sub-team is crucial as well. As Geof pointed out, given the amount of data and the time it takes to process it, DS analysts bridge the gap between BI and DE. They are also aware of more techniques and tools than traditional BI analysts.

Ideally, I'd like to see the mix of those 3 functions change over time. At the beginning, one might want more BI and DE people and far fewer DS people (though definitely not none: DS people need time to get trained on the company's data). This mix focuses on sorting things out and serving other departments inside the organization. As things get more stable, one would bring in more DS people and fewer BI people, so that the team can work closely with a few teams to solve harder problems.

Geof constructs his team around two Data Scientists, one strong in Statistics and one strong in Computer Science. They provide guidance and act as quarterbacks. The solution sounds very clever to me. My ideal team would include 1 director, who is very good at working inside an organization (aka politics, as some call it), and 2 tech leads (1 stat and 1 CS). All other team members are recruited around this golden triangle. However, I see many companies hire "managers" who have never written a single line of code to manage data teams. They have a hard time identifying problems and bottlenecks; they even have a hard time recognizing or accepting suggestions from the data scientists inside the group. All of this is because they often don't know the data as well as the people who work with it 40 hours a week.

I'd like data managers to have good "listening" and "summarizing" skills, and data scientists to be curious by nature and able to prove or implement their own ideas.


  • Is it good or bad that more people become data scientists by clicking buttons inside "tools"?

Some companies in the tool business have actually made it their goal to "let everyone be a data scientist". Please allow me to frown at such claims. It would be very dangerous if everybody were a data scientist. "Data people" are armed with more and more powerful data mining weapons, so they are capable of doing more harm. Not only do they need to understand the underlying models well enough to explain them to others, they also need to know them well enough to recognize where to reconstruct and optimize them. People need to be trained to understand the input data (meaning being familiar with the business and knowing what's available) and the output data (to identify how to act upon the insights).

  • How can someone tie a dollar amount to Data teams?

All the panelists agreed that it's hard to do so. Well, my response is "don't even go there". In both of my jobs, the companies tried to tie revenue goals or gains to data teams. Both failed. Whenever I see companies try to put a price tag on data teams, it only tells me that they haven't realized the value of their data, or truly recognized that the guidance and advice a data team provides is helpful and important. They probably still think of data as an accessory, something supplementary rather than necessary. However, in my opinion, data should be treated as one of the organization's product lines, as important as all other products. With the amount of data we have on our users, and the insights we have about them, we have just started the data journey.

  • Tools data scientists use

Geof mentioned R/SQL/vi/emacs/shell/Java. These seem rather primitive, but they are really powerful. I hate it when teams become tool-dependent, which naturally creates bottlenecks, because it's hard for others to maintain the system and make changes, particularly when the tool experts are not around.

  • What makes good data scientists?

"Moving the info around, reconstructing info in some other way, and making use out of it ...", Geof summarized. This truly describes what I have been working on in the past few months: consolidating data in a way that is easy to consume and makes sense to both analysts and the entire company. I believe that without solid foundations, no building on top should be called a "success". So data plumbing and pumping is really the key to everything.

One audience member raised an interesting point: the shortage of data scientists is just a gap in education. Right now some schools are teaching statistics to elementary school students. It's going to be fun to teach my kindergartener "averages" and tell him that the "average American kindergartener" actually doesn't exist!
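
For the curious, here is a tiny sketch of that last point in Python. The sibling counts are made up purely for illustration (they are not real data about any kindergarteners); the only thing it shows is that an average can be a value that no individual in the group actually has.

```python
from statistics import mean

# Hypothetical, made-up data: number of siblings for five kindergarteners.
siblings = [0, 1, 1, 2, 3]

# The "average kindergartener" in this toy group has 7 / 5 siblings.
average = mean(siblings)

# Look for any real child whose sibling count equals the average.
matches = [s for s in siblings if s == average]

print(f"average number of siblings: {average}")
print(f"children who match the average exactly: {len(matches)}")
```

Nobody in the toy group has the average number of siblings, which is exactly the sense in which the "average" child does not exist.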