Kitchin and Anderson Question Response

“Google conquered the advertising world with nothing more than applied mathematics. It didn’t pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right” (Chris Anderson). This quote from the Anderson article best explains where my beliefs fall. We have entered the petabyte age, where “correlation is enough” to suggest causation. The age of designing a hypothesis and then testing it is coming to an end. Instead, a researcher starts with the data and lets the computer analyze it to find connections between data points; those connections become the hypothesis. For example, J. Craig Venter used big data and supercomputers to discover thousands of previously unknown life-forms by sequencing genes found in the ocean and the air. The evidence for these new life-forms is simply a “statistical blip”; however, this correlation is enough. One problem with the traditional scientific method is that some hypotheses cannot be tested because the available data is too large to process. This is currently a significant problem, though quantum computing may resolve it in the not-too-distant future.

Kitchin listed a variety of problems that arise when using big data to test a hypothesis: it is high in velocity, diverse in variety, and exhaustive in scope, covering entire populations. It is also collected continuously, as with phone records and website clicks. Kitchin also provided one bewildering statistic: in 2012, Walmart collected 2.5 petabytes of data every hour. To put things in perspective: one petabyte is 1024 terabytes, or roughly 1 x 10^6 gigabytes (1,048,576 to be exact); meanwhile, most laptops can store between 500 gigabytes and one terabyte of data. The beauty of this new method is that a computer can comb through data of this size and make connections that a human observer would have trouble detecting. Another upside is that scientists no longer need to devise a hypothesis before collecting data; the data itself tends to reveal new information, often information the scientists did not even know they were looking for.
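
To make the storage comparison above concrete, here is a minimal Python sketch of the arithmetic. It assumes binary units (1 petabyte = 1024 terabytes, 1 terabyte = 1024 gigabytes) and takes the 2.5 petabytes per hour figure at face value. The function names and the laptop capacities passed in are my own illustrative choices, not anything taken from Kitchin.

```python
# Binary unit sizes, following the convention of 1 PB = 1024 TB (an assumption).
GB_PER_TB = 1024
TB_PER_PB = 1024


def petabytes_to_gigabytes(petabytes):
    """Convert petabytes to gigabytes using binary units."""
    return petabytes * TB_PER_PB * GB_PER_TB


def laptops_needed(petabytes, laptop_capacity_gb):
    """How many laptops of a given capacity would be needed to hold the data."""
    return petabytes_to_gigabytes(petabytes) / laptop_capacity_gb


hourly_pb = 2.5  # the hourly Walmart figure quoted above

print(petabytes_to_gigabytes(1))        # 1048576
print(laptops_needed(hourly_pb, 1024))  # 2560.0 laptops at one terabyte each
print(laptops_needed(hourly_pb, 500))   # 5242.88 laptops at 500 gigabytes each
```

In other words, under these assumptions a single hour of that data would fill somewhere between roughly 2,500 and 5,200 typical laptops.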