December 22, 2012 by Krisi
In the final days of 2012 I decided to turn my attention to one of the hottest subjects of late: Big Data. Rarely has a day passed this year without someone writing about it. If you look the term up on Google Trends, you can see how steadily interest in it has grown.
What does Big Data actually mean?
Currently the term is used mainly to describe the exponential growth of information and its availability in both structured and unstructured form. Furthermore, big data encompasses a wide variety of formats: information from traditional relational databases, documents, email, video, audio, financial transactions, metadata of many kinds, data from social media channels, sensor data, mobile data, and so on. We are able to store these huge data volumes because storage costs keep falling while capacity keeps growing. After all, Moore's law still holds, and storage density has followed a similar curve.
Data does not equal knowledge
But when we talk about big data, we need to take into consideration that the variety and volume of information add to the complexity of linking, cleaning and transforming the data. This is not an easy task, and it has driven many innovations (with Google among the innovators): new types of databases, wider use of knowledge graphs (an idea closely related to the semantic web), and many other exciting things. But all of this concerns the technical side.
Lately some pundits have been stating that with so much data, and yet more to come, we will be able to extract a lot of information and arrive at valuable insights. The problem with this statement is that many people equate data with information, and the two are not the same. More data does not mean more information. Furthermore, they believe they can gain insight just by looking through the data, without considering that we first need to understand the intricacies of the systems we are tracking in order to build a model, a hypothesis that we can test against the data. Otherwise we might mistake irrelevant data for important information, rendering our predictions unreliable.
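To make that last risk concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names `pearson` and `strongest_chance_correlation`, the seed, and the sample sizes are all made up for this example, and every series is pure random noise. It generates one random "target" and checks how strongly the best of many unrelated series appears to correlate with it.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_correlation(target, candidates):
    """Largest absolute correlation between the target and any candidate."""
    return max(abs(pearson(target, series)) for series in candidates)


# Hypothetical setup: 20 observations of a target, plus 1000 series that
# are unrelated to it by construction (everything is independent noise).
rng = random.Random(2012)
n_obs = 20
target = [rng.gauss(0, 1) for _ in range(n_obs)]
noise = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(1000)]

# The more unrelated series we search through, the better the "best" one looks.
for k in (10, 100, 1000):
    best = strongest_chance_correlation(target, noise[:k])
    print(f"{k:>5} unrelated series -> strongest |r| = {best:.2f}")
```

Because each larger pool contains the smaller one, the strongest correlation can only stay the same or grow as more series are added, even though none of them carries any information about the target. Without a model of the system, nothing in the numbers themselves tells us that the "best" series is meaningless.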
Some of these concerns are also raised by Nate Silver in his book “The Signal and the Noise”. In particular, he states that one of the things people agree on nowadays is that the amount of information is increasing exponentially, much faster than we can process it and understand what the data might mean. Additionally, it is getting harder for us to distinguish valid data from noise, and we sometimes mistake the latter for information useful for making predictions. This often leads to overconfident statements that usually turn out to be incorrect.
Mr. Silver also offers a thought that everyone who talks about “big data” should put on a bumper sticker: “The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning…”
And this is my second point: you first need to understand the intricacies of the system, and then bring context to the analysis of the data. Regardless of the amount of data you collect, if you can’t give the data context and meaning, more often than not you will not be able to construct a valid forecast. Through context we transform data into information, and information into knowledge. Without it we can’t separate the signal from the noise.
And since we are the ones who give context to data, we need to be aware of the biases we introduce while developing our forecasts and analyses, and account for them in our hypotheses. We also need to accept that we are intrinsically unable to make fully objective predictions, because the beliefs we already hold inevitably affect our interpretation of the data.
Finally, the Big Data industry is expected to account for approximately $40 billion in the coming years, fostering innovation in many industries and providing opportunities to those who can use the data creatively. In this blog, I plan to post every so often about big data applications, uses and some research I will be doing, hoping that the context I provide along with the data analysis will make for some interesting reading in 2013.