The current crop of scientists working in the predictive analytics field faces two fundamental challenges in analyzing so-called "Big Data": finding adequately trained personnel and locating the right tools. So says Norman Nie, a pioneer of computer-based statistical analysis.
The ever-increasing avalanche of human activity data generated by online clicks, social networks, and telecommunications is ratcheting up demand for combinations of skills that are very difficult to find.
And for many firms, finding appropriate and affordable hardware and software tools to routinely store and profitably analyze unprecedented amounts of data is also a major problem.
Few data scientists can match Professor Nie's knowledge and experience in the predictive analytics field. He is professor emeritus at both Stanford University and the University of Chicago. And at 67, he is a pioneer and continuing presence in the business of collecting, cleaning, and statistically modeling large amounts of raw data.
As a graduate student at Stanford in the late 1960s, he invented a computer software package called the "Statistical Package for the Social Sciences," or SPSS. He served as the founding CEO of the SPSS company from 1975 to 1992 and was chairman through 2007. In 2009, IBM (this site's sponsor) bought SPSS for $1.2 billion.
Since October 2009, Nie has served as the president and CEO of Revolution Analytics, a Palo Alto, Calif., company built to foster the growth of the R statistical language, an open-source programming language used for computational statistics and predictive analytics.
Making sense of big data
Nie believes that making sense of big data and utilizing that information to modify enterprise practices represents a massive challenge.
"The work done by many in the data mining field has to do with things like: Should we give a loan to this guy, given what we know about him? Not why," he says.
That kind of static data is good for real-time risk analysis but not informative for building substantive models of how the world actually works. Social network data defies many canned model formulations.
In data mining there are at least two kinds of data: attributive and behavioral. Attributive data is slow-moving or even static -- a person's name, address, credit score, family size, and so on. Behavioral data captures actions such as the page visits tracked by Web analytics, as well as a person's purchasing decisions -- the kind of information Amazon.com uses when it suggests new products to visitors.
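A rough sketch of the distinction in code may help; the field names and records below are hypothetical, chosen only to illustrate the two categories Nie describes:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AttributiveRecord:
    """Slow-moving or static facts about a person."""
    name: str
    address: str
    credit_score: int
    family_size: int

@dataclass
class BehavioralEvent:
    """A single observed action, such as a page visit or a purchase."""
    timestamp: datetime
    kind: str    # e.g. "page_visit" or "purchase"
    detail: str  # e.g. a URL visited or a product ID bought

# One customer is described by both kinds of data.
profile = AttributiveRecord("A. Shopper", "123 Main St", 710, 3)
history = [
    BehavioralEvent(datetime(2011, 3, 1, 9, 15), "page_visit", "/books/statistics"),
    BehavioralEvent(datetime(2011, 3, 1, 9, 20), "purchase", "sku-4821"),
]
```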
Data mining involves predicting human behavior from disparate past behaviors, some online, some offline. To truly exploit this kind of data, scientists build models. Modeling is the scientific way of asking "why?": it examines how seemingly unrelated variables are related. It is data analytics with a long view.
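As a loose illustration of that kind of modeling (not Nie's actual system; the feature names and toy data here are invented for the example), one might fit a simple logistic regression relating attributive and behavioral variables to a later purchase, then inspect the coefficients to ask how the variables are related rather than just scoring one individual:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy, invented training data: one row per customer.
# Columns: credit_score, family_size (attributive); page_visits_last_30d (behavioral).
X = np.array([
    [640, 1,  2],
    [700, 3, 15],
    [580, 2,  1],
    [720, 4, 22],
    [690, 2,  9],
    [610, 1,  0],
])
y = np.array([0, 1, 0, 1, 1, 0])  # bought within the following month?

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# The coefficients hint at *how* the variables relate to the outcome -- the
# "why" question -- while predict_proba answers the one-off
# "should we make this person an offer?" question.
coefs = model.named_steps["logisticregression"].coef_[0]
print(dict(zip(["credit_score", "family_size", "page_visits"], coefs)))
print(model.predict_proba([[670, 2, 12]])[0, 1])
```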
To fill the personnel gaps created by the emergence of big data, the BI community is looking for trained candidates, and even undergraduates, in certain scientific fields, Nie says, but they are hard to find. The US is currently experiencing an acute shortage of mathematicians and of others trained in related fields such as statistics.
The problem, Nie believes, is that American students can opt out of their majors early, and many mathematics majors do just that. They tend to move into fields that involve less academic work and more money in the labor market, often heading off to the financial markets to create ever more exotic investment derivatives.
Data analytics requires knowledge in multiple fields. A math major, for instance, might need some familiarity with fields such as sociology, psychology, or biology. And candidates with degrees in the social sciences often lack sufficient math training.
"A lot of what we're doing out there is taking people with mathematical skills, giving them canned statistical algorithms, and blasting data at them."
Big data is best defined as data that is too big for current hardware and software products to handle, Nie says. "The key is software parallelization on multiprocessor commodity hardware. That is the future of big data storage and analysis."
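As a minimal sketch of that split-apply-combine idea on a multicore machine (generic Python, not Revolution Analytics' software or Nie's actual approach), a computation can be farmed out to worker processes in chunks and the partial results combined:

```python
from multiprocessing import Pool
import os

def chunk_stats(chunk):
    """Partial result computed independently on one worker process."""
    return sum(chunk), len(chunk)

def parallel_mean(values, workers=None):
    workers = workers or os.cpu_count()
    # Split the data into roughly equal chunks, one per worker.
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with Pool(workers) as pool:
        partials = pool.map(chunk_stats, chunks)   # map: compute in parallel
    total = sum(t for t, _ in partials)            # reduce: combine partial sums
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    print(parallel_mean(list(range(1_000_000))))   # 499999.5
```

The same map-and-combine pattern is what lets the work spread across many commodity processors instead of a single expensive serial machine.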
Many companies, including his own, are creating databases and tools for accessing, managing, and analyzing big data. Multiprocessing hardware and software, whether in the cloud or in the closet, will replace older serial processing systems that can handle big data only by throwing millions of dollars of hardware at the problem.
Nie believes that parallelized software running on inexpensive multiprocessor computers is the wave of the future for all types of big data computing. But the transition will be slow.
"Enterprises, private or public, have tremendous resistance to change, even in the face of mounting costs and sub-optimal solutions," he says.