Does the advent of big data mark a break with previous ways of handling information, in particular personal information? Does it represent a true scientific revolution? These questions have been debated in scientific, philosophical and intellectual circles ever since Chris Anderson’s thought-provoking 2008 article in the American magazine Wired. In the article, he proclaimed “the end of theory,” which he argued had been made “obsolete” by the “data deluge,” and concluded with this intentionally provocative statement: “It’s time to ask: What can science learn from Google?”
Just over a decade after its publication, at a time when what we now know as “big data,” combined with “deep learning,” is used on a massive scale, the journal Le Débat chose to explore the topic by devoting a special report in its November-December 2019 issue to “the consequences of big data for science.” To this end, it called on philosophers from a variety of backgrounds (Daniel Andler, Emeritus Professor in the philosophy of science at Université Paris-Sorbonne; Valérie Charolles, philosophy researcher at Institut Mines-Télécom Business School; and Jean-Gabriel Ganascia, professor at Université Paris-Sorbonne and Chairman of the CNRS Ethics Committee) as well as a physicist (Marc Mézard, who is also the director of ENS), asking them to assess Chris Anderson’s thesis. Engineer and philosopher Jean-Pierre Dupuy had shared his thoughts on the subject in May, in the journal Esprit.
Big data and scientific models
The authors of these articles acknowledge the scientific contributions of big data processing (although Jean-Pierre Dupuy and Jean-Gabriel Ganascia express a certain skepticism in this regard). This sort of processing makes it possible to develop scientific models that are more open and that, by successively aggregating layers of correlated information, may bring new connections and links to light. Although machine learning based on what are referred to as deep networks has existed in principle for over 70 years, its implementation is still relatively recent. It has been made possible by the large amount of information now collected and the computing power of today’s computers. This represents a paradigm shift in computer science. Deep learning clearly provides scientists with a powerful tool, but, unlike Chris Anderson, none of the above authors sees it as a way to replace scientific models developed from theories and hypotheses.
There are many reasons for this. Since they predict the future based on the past, machine learning models are not made for extreme situations and can make mistakes or produce false correlations. In 2009, the journal Nature featured an article on Google Flu Trends, which, by combining search engine query data, was able to predict the peak of the flu epidemic two weeks before the national public health agency. But in 2011, Google’s algorithm performed less well than the agency’s model, which relied on human expertise and collected data. The relationships revealed by algorithms are correlations rather than causal links, and the phenomena they reveal must still be explained using a scientific approach. Furthermore, the algorithms themselves rest on hypotheses (part of their building blocks) supplied by those who develop them, and other algorithms applied to the same data set would produce different results.
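The risk of false correlations can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not something taken from the articles discussed: it draws one random "target" series and many unrelated random "candidate" series, then reports the strongest correlation found purely by chance. The function names, the series length (20 points), the number of candidates (2,000) and the seed are all hypothetical choices made for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_points=20, n_candidates=2000, seed=0):
    """Largest |correlation| between a random target and unrelated random series.

    Every series is independent Gaussian noise, so any correlation found
    here is spurious by construction (illustrative assumption).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print(f"strongest chance correlation: {strongest_chance_correlation():.2f}")
```

With enough candidate series and only a few data points per series, at least one candidate will look strongly correlated with the target even though no relationship exists. Searching a very large data set for patterns carries the same risk, which is why a correlation that turns up still has to be explained and tested.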
Algorithmic processing of personal data
In any case, even if it does not represent a paradigm shift for science itself, the use of big data attests to a new, more inductive scientific style, in which data plays an increasingly important role (we often hear the term “data-driven” science). Yet raw data that is ready to be analyzed does not exist. Daniel Andler elaborates extensively on this point, which the other authors also raise. The information fed to computers must be verified and annotated before it becomes data that algorithms can use in a meaningful way. And these algorithms do not work by themselves, without any human intervention.
When personal data is involved, this point is especially important, as underscored by Valérie Charolles. To begin with, the limitations cited above regarding the results provided by algorithms clearly also apply to personal data processing. Furthermore, individuals cannot be reduced to the information they provide about themselves using digital tools, even if a considerable amount of information is provided. What’s more, the quantity of information does not guarantee its quality or relevance, as evidenced by Amazon’s hiring algorithm, which systematically discriminated against women simply because they were underrepresented in the database. As Marc Mézard concludes, “we must therefore be vigilant and act now to impose a regulatory framework and essential ethical considerations.”
Valérie Charolles, philosophy researcher at Institut Mines-Télécom Business School, member of IMT’s Values and Policies of Personal Information Chair, and associate researcher at the Interdisciplinary Institute of Contemporary Anthropology (EHESS/CNRS)