December 18, 2012 By Tanya Roscorla
An elective class arms students with big data analysis skills that are in high demand.
In general, people with big data analysis skills and real experience are hard to find, said Gilad Mishne, engineering manager of search at Twitter. And competition between companies for these skilled people is hot, especially in Silicon Valley.
Because users send 400 million tweets a day, Twitter needs complex algorithms to analyze all that data. And it needs skilled engineers to do the job. So it partnered with UC Berkeley on a class that teaches students how to analyze data using real tweets.
The "Analyzing Big Data with Twitter" class included lectures from UC Berkeley professors and 15 volunteers from Twitter on the technology behind the social network. These lectures were videotaped and are available online for anyone to watch.
Along with the lecturers, 12 volunteer mentors from Twitter worked with 40 undergraduate and graduate students as they built data-driven applications. For example, one team used tweets to find funky restaurants around campus. This application was something they could use the next day, and both the engineer supervising the team and the students had fun working on it.
"This is priceless," Mishne said. "The first thing I actually look at when I see a CV (curriculum vitae) is, 'Did this student do something real? Did he build something or did she build something with real data?' And this is exactly what I would look for — this kind of experience."
In one project, students analyzed Twitter interest graphs (who links to whom) and conversation graphs (who refers to whom), said Marti A. Hearst, professor in the UC Berkeley School of Information. Students made interesting visualizations for this assignment through simple graphing algorithms that showed hundreds of thousands of interests that people discussed online.
"To me, what was interesting was how much you can see about the different topics that do arise from who links to whom, especially if you're looking at more well-known people, not necessarily celebrities, but people who have a lot of influence in the twitter sphere," Hearst said.
These students could use their newly acquired skills in applications including public health, business and city planning. As more types of data and real-time data are collected, analysts can see accurate trends in the spread of disease. They can understand what customers think about a product and trigger a response. And they can see where new fire stations or social services should be located based on where people are living and moving to.
But more students need to be trained to do this kind of work. And to help train these students, Mishne said he would like to repeat a class like this with UC Berkeley and other places that express interest.
This story was originally published at the Center for Digital Education website.
You may use or reference this story with attribution and a link to
http://www.govtech.com/education/Twitter-Addresses-Data-Analysis-Skill-Shortage-with-UC-Berkeley-Class.html
Strong large data set analytical skills are not as difficult to find as some might guess. Rather, IT trends have created a buzz phrase that actually complicates the matter. For example, a well-trained Six Sigma Black Belt with Design of Experiments and hypothesis testing skill sets can do big data set analytics. Any aerospace engineer with rocket telemetry analysis skills can do it. Any staff trained in medical or insurance actuarial work can do it. These groups simply do not call statistical, hypothesis and trend analysis skills "Big Data" analytical skills. The problem is an artifact of jargon.
I have seen the same aversion to the "big data" buzz from other big names in analytics. I agree that there are a lot of other strong skill sets that provide a good foundation, but I also think the new challenges shouldn't be underestimated. My understanding is that the term Big Data isn't just about statistics on large, static (or frozen), or designed data sets. I think it is intended to bring attention to the specific challenges associated with analyzing what's currently happening with data sets that are growing and changing at an unprecedented rate. Most query training I've seen doesn't explain why one query is faster than another, for example - but that difference changes what's possible with big data. There are also challenges associated with working with emergent, opportunistic and unstructured data sets (especially comment text), rather than data coming from designed experiments. Even rocket scientists haven't had to analyze 400 million new data points a day that say things like "NYE we streaming live on http://www.xumanii.com at 10:30 CST!!!! https://socialcam.com/videos/ea7ke1Vu?type=email … #CHAPTERVTOUR".