Cukier, Kenneth & Viktor Mayer-Schönberger (2013). Big Data: A Revolution That Will Transform How We Live, Work and Think. London: John Murray. ISBN 9781848547933. 257 pages. €12.99

Everyone talks about nothing but big data. And, as often happens with fads, most of the talk is off the mark. This book is a fairly serious attempt to examine several aspects of the big data phenomenon without excess in either direction, enthusiasm or alarmism.
It is a very important book, especially for anyone who works professionally in statistics, economics, sociology, and quantitative analysis, because, hype aside, big data radically changes the world of data analysis. Either statisticians (and public statistical offices) radically change the way they think and work, or they will be condemned to irrelevance. Understanding how big data works is the key to understanding the world we live in today. Nothing more, nothing less.
The book is clear and well structured, in 10 chapters. In the end, I liked the early chapters, which cover several relevant aspects of big data, more than the last ones, in which a somewhat self-serving aversion to novelty surfaces. But let’s take things in order, chapter by chapter (references are to Kindle locations).
1. Now. Which takes stock of what big data is and why it matters.
One way to think about the issue today—and the way we do in the book—is this: big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more. [114]
There was a shift in mindset about how data could be used.
Data was no longer regarded as static or stale, whose usefulness was finished once the purpose for which it was collected was achieved, such as after the plane landed (or in Google’s case, once a search query had been processed). Rather, data became a raw material of business, a vital economic input, used to create a new form of economic value. [96]
The sciences like astronomy and genomics, which first experienced the explosion in the 2000s, coined the term “big data.” The concept is now migrating to all areas of human endeavor. [105]
Sometimes the constraints that we live with, and presume are the same for everything, are really only functions of the scale in which we operate. [181]
Yet the need for sampling is an artifact of a period of information scarcity, a product of the natural constraints on interacting with information in an analog era.
[…]
Big data gives us an especially clear view of the granular: subcategories and submarkets that samples can’t assess.
[…]
It’s a tradeoff: with less error from sampling we can accept more measurement error. [214-219]
Society has millennia of experience in understanding and overseeing human behavior. But how do you regulate an algorithm? Early on in computing, policymakers recognized how the technology could be used to undermine privacy. Since then society has built up a body of rules to protect personal information. But in an age of big data, those laws constitute a largely useless Maginot Line. People willingly share information online—a central feature of the services, not a vulnerability to prevent. [303]
2. More. Using all the data, not just a sample.
As noted in Chapter One, big data is about three major shifts of mindset that are interlinked and hence reinforce one another. The first is the ability to analyze vast amounts of data about a topic rather than be forced to settle for smaller sets. The second is a willingness to embrace data’s real-world messiness rather than privilege exactitude. The third is a growing respect for correlations rather than a continuing quest for elusive causality. This chapter looks at the first of these shifts: using all the data at hand instead of just a small portion of it. [309]
The very word “census” comes from the Latin term “censere,” which means “to estimate.” [337]
Sampling was a solution to the problem of information overload in an earlier age, when the collection and analysis of data was very hard to do. [372]
Using all the data makes it possible to spot connections and details that are otherwise cloaked in the vastness of the information. For instance, the detection of credit card fraud works by looking for anomalies, and the best way to find them is to crunch all the data rather than a sample. The outliers are the most interesting information, and you can only identify them in comparison to the mass of normal transactions. It is a big-data problem. [439]
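The fraud example can be tried out in a few lines. What follows is only a sketch on synthetic data, not how any real card issuer works: the amounts, the number of fraudulent transactions, the seeds, and the function names are all my own assumptions. It flags as outliers the transactions that lie far from the median, first on all the data and then on a 1% random sample.

```python
import random
import statistics

def make_transactions(n=100_000, n_fraud=20, seed=42):
    """Synthetic card transactions: ordinary amounts plus a few extreme ones (all invented)."""
    rng = random.Random(seed)
    amounts = [rng.gauss(50, 15) for _ in range(n)]
    fraud_ids = set(rng.sample(range(n), n_fraud))
    for i in fraud_ids:
        amounts[i] = rng.uniform(2_000, 5_000)
    return amounts, fraud_ids

def flag_outliers(ids, amounts, threshold=6.0):
    """Flag the ids whose amount lies more than `threshold` robust deviations from the median."""
    values = [amounts[i] for i in ids]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 * MAD approximates the standard deviation for normally distributed data.
    return {i for i in ids if abs(amounts[i] - med) > threshold * 1.4826 * mad}

amounts, fraud_ids = make_transactions()
all_ids = list(range(len(amounts)))
sample_ids = random.Random(7).sample(all_ids, 1_000)  # a 1% random sample

found_in_all = flag_outliers(all_ids, amounts) & fraud_ids
found_in_sample = flag_outliers(sample_ids, amounts) & fraud_ids
print(f"fraud found using all the data: {len(found_in_all)} of {len(fraud_ids)}")
print(f"fraud found using a 1% sample:  {len(found_in_sample)} of {len(fraud_ids)}")
```

The sample describes the typical transaction perfectly well. What it cannot do is catch the rare cases that were never drawn into it, which is the point the authors are making.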
Reaching for a random sample in the age of big data is like clutching at a horse whip in the era of the motor car. [505]
3. Messy. Coming to terms with imprecision.
Big data transforms figures into something more probabilistic than precise. [557]
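The trade-off between sampling error and measurement error can be put in numbers. This is a minimal sketch under an assumption of mine, not the authors': the measurement error is zero-mean and independent across records, so it averages out as the data grows. A systematic bias would not shrink this way. The function name and the figures are purely illustrative.

```python
from math import sqrt

def standard_error(n, spread, noise=0.0):
    """Standard error of a mean computed over n records.

    spread: standard deviation of the true values in the population
    noise:  standard deviation of the measurement error
            (assumed zero-mean and independent of the true values)
    """
    return sqrt((spread**2 + noise**2) / n)

# A careful sample measured exactly versus "all" the data measured sloppily.
small_exact = standard_error(n=1_000, spread=15)
large_messy = standard_error(n=1_000_000, spread=15, noise=30)
print(f"small exact sample: {small_exact:.3f}   large messy dataset: {large_messy:.3f}")
```

Under these assumptions the large, noisy dataset gives the more precise estimate, even though each single record is measured much worse.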
“Simple models and a lot of data trump more elaborate models based on less data,” wrote Google’s artificial-intelligence guru Peter Norvig and colleagues in a paper entitled “The Unreasonable Effectiveness of Data.” [623]
It bears noting that messiness is not inherent to big data. [623]
The Billion Prices Project
[…] two economists at the Massachusetts Institute of Technology, Alberto Cavallo and Roberto Rigobon […]
Price-Stats [655-668: web scraping to produce the US CPI]
4. Correlation. Correlations, predictions, and predilections.
Knowing what, not why, is good enough.
[…]
Correlations are useful in a small-data world, but in the context of big data they really shine.
In 1998 Linden and his colleagues applied for a patent on “item-to-item” collaborative filtering, as the technique is known. The shift in approach made a big difference. [800]
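To give an idea of what “item-to-item” means, here is a toy sketch. It is not Amazon's patented method, only a common textbook variant that I am assuming for illustration: two items are similar when the same customers bought both, measured with cosine similarity. The purchase data and the function names are invented.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "ann": {"statistics textbook", "r cookbook", "notebook"},
    "bob": {"statistics textbook", "r cookbook"},
    "cat": {"statistics textbook", "novel"},
    "dan": {"novel", "notebook"},
}

def item_similarities(purchases):
    """Cosine similarity between items, based on the customers who bought both."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    for a, b in combinations(sorted(buyers), 2):
        common = len(buyers[a] & buyers[b])
        if common:
            score = common / sqrt(len(buyers[a]) * len(buyers[b]))
            sims[a][b] = sims[b][a] = score
    return sims

def recommend(item, sims, k=2):
    """The k items most similar to `item`, best first (ties broken alphabetically)."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]

sims = item_similarities(purchases)
print(recommend("r cookbook", sims))  # ['statistics textbook', 'notebook']
```

Note that nothing here asks why the items go together: the recommendation rests on co-occurrence alone, which is the “what, not why” of this chapter.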
Because data was scarce and collecting it expensive, statisticians often chose a proxy, then collected the relevant data and ran the correlation analysis to find out how good that proxy was. But how to select the right proxy?
To guide them, experts used hypotheses driven by theories—abstract ideas about how something works. Based on such hypotheses, they collected data and used correlation analysis to verify whether the proxies were suitable. If they weren’t, then the researchers often tried again, stubbornly, in case the data had been collected wrongly, before finally conceding that the hypothesis they had started with, or even the theory it was based on, was flawed and required amendment. Knowledge progressed through this hypothesis-driven trial and error. And it did so slowly, as our individual and collective biases clouded what hypotheses we developed, how we applied them, and thus what proxies we picked. It was a cumbersome process, but workable in a small-data world. [857]
Our “fast thinking” mode is in for an extensive and lasting reality check. [1024]
In 2008 Wired magazine’s editor-in-chief Chris Anderson trumpeted that “the data deluge makes the scientific method obsolete.” In a cover story called “The Petabyte Age,” he proclaimed that it amounted to nothing short of “the end of theory.” The traditional process of scientific discovery—of a hypothesis that is tested against reality using a model of underlying causalities—is on its way out, Anderson argued, replaced by statistical analysis of pure correlations that is devoid of theory. [1147]
5. Datafication. Turning phenomena into data.
The word “data” means “given” in Latin, in the sense of a “fact.” It became the title of a classic work by Euclid, in which he explains geometry from what is known or can be shown to be known. Today data refers to a description of something that allows it to be recorded, analyzed, and reorganized. There is no good term yet for the sorts of transformations produced by Commodore Maury and Professor Koshimizu. So let’s call them datafication. To datafy a phenomenon is to put it in a quantified format so it can be tabulated and analyzed. [1223]
His story highlights the degree to which the use of data predates digitization. [1206: the story of Matthew Fontaine Maury]
It enabled information to be recorded in the form of “categories” that linked accounts. It worked by means of a set of rules about how to record data—one of the earliest examples of standardized recording of information. One accountant could look at another’s books and understand them. It was organized to make a particular type of data query—calculating profits or losses for each account—quick and straightforward. And it provided an audit trail of transactions so that the data was more easily retraceable. Technology geeks can appreciate it today: it had “error correction” built in as a design feature. If one side of the ledger looked amiss, one could check the corresponding entry. [1278: the subject here is double-entry bookkeeping]
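The “error correction” of double-entry bookkeeping is easy to show. A minimal sketch, assuming a simplified ledger in which every transaction debits one account and credits another by the same amount. The `Entry` class, the account names, and the figures are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    """One transaction recorded twice: debited to one account, credited to another."""
    description: str
    debit: str
    credit: str
    amount: int  # in cents, to avoid floating-point rounding

def balances(journal):
    """Net balance per account (debits count as positive, credits as negative)."""
    totals = defaultdict(int)
    for e in journal:
        totals[e.debit] += e.amount
        totals[e.credit] -= e.amount
    return dict(totals)

def is_balanced(account_totals):
    """The built-in check: across all accounts, debits and credits must cancel out."""
    return sum(account_totals.values()) == 0

journal = [
    Entry("sell cloth for cash", debit="cash", credit="sales", amount=12_000),
    Entry("buy wool on credit", debit="inventory", credit="payables", amount=7_500),
]
totals = balances(journal)
print(totals, is_balanced(totals))

# Simulate a copying error on one side of the books: the totals no longer cancel.
totals["cash"] += 100
print(is_balanced(totals))
```

Because each amount is written twice, a slip on one side leaves the books unbalanced and so reveals itself, without saying which entry is wrong. That last step is what checking “the corresponding entry” is for.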
Over the following decades the material on bookkeeping was separately published in six languages, and it remained the standard reference on the subject for centuries. [1288: on the double-entry bookkeeping chapter of Luca Pacioli’s mathematics textbook; a pity the authors write that the Medici were famous merchants and patrons of the arts from Venice!]
The standardization of longitude and latitude took a long time. It was finally enshrined in 1884 at the International Meridian Conference in Washington, D.C., where 25 nations chose Greenwich, England, as the prime meridian and zero-point of longitude (with the French, who considered themselves the leaders in international standards, abstaining). [1376]
The company AirSage crunches 15 billion geo-loco records daily from the travels of millions of cellphone subscribers to create real-time traffic reports in over 100 cities across America. Two other geo-loco companies, Sense Networks and Skyhook, can use location data to tell which areas of a city have the most bustling nightlife, or to estimate how many protesters turned up at a demonstration. [1426]
Twitter messages are limited to a sparse 140 characters, but the metadata—that is, the “information about information”—associated with each tweet is rich. It includes 33 discrete items. [1472]
It could tell if someone fell and did not get back up, an important feature for the elderly. [1493]
For well over a century, physicists have suggested that this is the case—that not atoms but information is the basis of all that is. This, admittedly, may sound esoteric. Through datafication, however, in many instances we can now capture and calculate at a much more comprehensive scale the physical and intangible aspects of existence and act on them. [1529]
6. Value. How the value of data changes.
Data’s value shifts from its primary use to its potential future uses. [1565]
[…] data is starting to look like a new resource or factor of production. [1591]
Unlike material things—the food we eat, a candle that burns—data’s value does not diminish when it is used; it can be processed again and again. Information is what economists call a “non-rivalrous” good: one person’s use of it does not impede another’s. And information doesn’t wear out with use the way material goods do. Hence Amazon can use data from past transactions when making recommendations to its customers—and use it repeatedly, not only for the customer who generated the data but for many others as well. [1597]
In the end, the group didn’t detect any increase in the risk of cancer associated with use of mobile phones. For that reason, its findings hardly made a splash in the media when they were published in October 2011 in the British medical journal BMJ. [1709]
Google’s spell-checking system shows that “bad,” “incorrect,” or “defective” data can still be very useful. [1771]
A term of art has emerged to describe the digital trail that people leave in their wake: “data exhaust.” [1785]
[…] “learning from the data” […] [1789]
“We like learning from large, ‘noisy’ datasets,” chirps one Googler. [1796]
“Data is a platform,” in the words of Tim O’Reilly, a technology publisher and savant of Silicon Valley, since it is a building block for new goods and business models. [1930]
7. Implications. Doing business with data.
In the previous chapter we noted that data is becoming a new source of value in large part because of what we termed its option value, as it’s put to novel purposes. The emphasis was on firms that collect data. Now our regard shifts to the companies that use data, and how they fit into the information value chain. We’ll consider what this means for organizations and for individuals, both in their careers and in their everyday lives.
Three types of big-data companies have cropped up, which can be differentiated by the value they offer. Think of it as the data, the skills, and the ideas.
Hal Varian, Google’s chief economist, famously calls statistician the “sexiest” job around. “If you want to be successful, you want to be complementary and scarce to something that is ubiquitous and cheap,” he says. “Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it. That is why statisticians, and database managers and machine learning people, are really going to be in a fantastic position.” [1976]
Mathematics and statistics, perhaps with a sprinkle of programming and network science, will be as foundational to the modern workplace as numeracy was a century ago and literacy before that. [2261]
Rolls-Royce sells the engines but also offers to monitor them, charging customers based on usage time (and repairs or replaces them in case of problems). Services now account for around 70 percent of the civil-aircraft engine division’s annual revenue. [2315]
8. Risks. The dark side.
The dataset, of 20 million search queries from 657,000 users between March 1 and May 31 of that year, had been carefully anonymized. Personal information like user name and IP address were erased and replaced by unique numeric identifiers. The idea was that researchers could link together search queries from the same person, but had no identifying information.
Still, within days, the New York Times cobbled together searches like “60 single men” and “tea for good health” and “landscapers in Lilburn, Ga” to successfully identify user number 4417749 as Thelma Arnold, a 62-year-old widow from Lilburn, Georgia. “My goodness, it’s my whole personal life,” she told the Times reporter when he came knocking. “I had no idea somebody was looking over my shoulder.” [2438]
“It isn’t the consumers’ job to know what they want,” he famously said, when telling a reporter that Apple did no market research before releasing the iPad. [2654]
“It is true enough that not every conceivable complex human situation can be fully reduced to the lines on a graph, or to percentage points on a chart, or to figures on a balance sheet,” said McNamara in a speech in 1967, as domestic protests were growing. “But all reality can be reasoned about. And not to quantify what can be quantified is only to be content with something less than the full range of reason.” [2662]
9. Control. New rules are needed.
Changes in the way we produce and interact with information lead to changes in the rules we use to govern ourselves, and in the values society needs to protect. [2695]
Rather than a parametric change, the situation calls for a paradigmatic one. [2717]
With such an alternative privacy framework, data users will no longer be legally required to delete personal information once it has served its primary purpose, as most privacy laws currently demand. This is an important change, since, as we’ve seen, only by tapping the latent value of data can latter-day Maurys flourish by wringing the most value out of it for their own—and society’s—benefit. [2746]
Without guilt, there can be no innocence. [2804]
10. Next. The future of big data.
Just a summary of the previous chapters. But it had to be done.
Tuesday, 3 December 2013 at 23:46
I am still stuck on the first chapter (I read several books in parallel…) and I find a very Anglo-Saxon approach that is common to this kind of book. I must say, though, that they have a good way of drawing out the basic ideas.
Tuesday, 23 September 2014 at 17:53
[…] statistician. [80: I find irresistible the thought that by now, after Hal Varian and all the big data hype, being a statistician by profession is […]