
STS and data science: Making a data scientist?

STS perspectives on the unfolding data revolution

Society finds itself at the beginning of a digital era where every device is online and sensors create continuous streams of data. The increased volume, velocity, and variety of this data are captured by the concept of “big data”. The rise of big data has gone hand in hand with an ongoing increase in computational power, which allows for the development of ever more sophisticated data analysis techniques, models, and algorithms. This broad collection of data-centric methodological innovations is referred to as “data science” (Hey, 2006). Although the concepts of big data and data science are loosely defined and sometimes used interchangeably, in this essay I adopt the distinction as outlined above.

Data science has quickly proliferated outside academia and has attracted interest – and substantial investment – in the public and private sectors. Data science is applied in a diversity of substantive areas, including smart cities, smart maintenance, e-health, and e-commerce. Over the years, quantifications in a general sense have earned a reputation in some fields for outperforming human decision makers (Dawes, 1979). Achievements of data science, such as the victory of AlphaGo – a deep learning algorithm – over professional Go player Lee Sedol, have attracted widespread media attention.

While much effort is devoted toward advancing technical data science capability, our understanding of the non-technical side of data science has lagged behind. Here, I use ‘technical’ broadly, to distinguish the quantitative from the non-quantitative elements of data science. This gap has caught the attention of several STS scholars; 4S/EASST featured tracks such as “The Potential Futures of Data Science: A Roundtable Intervention” and “Critical data studies”, amongst others. This demonstrates the growing interest from the STS community in data science. In this essay, I reflect on my visit to 4S/EASST Barcelona by summarizing my fieldnotes and providing a short-form digital ethnography.

Through the process of rearranging my 4S/EASST notes – and hastily captured photos of slides – different themes emerged. As a recent sociology PhD graduate, I found that data science brings into focus new challenges (e.g. data-ownership, transparency of artificial neural networks) as well as existing ones (e.g. biases inherent to quantifications). It also draws our attention towards some practical issues for conducting research (e.g. how to study a deep-learning algorithm?).

It is well beyond the scope of this text to discuss all, or indeed any, of these topics in detail. Instead, I focus on a challenge that is also relevant to practitioners: How are data scientists coming to terms with their vaguely delineated, yet increasingly topical field? In this context, what does it mean to be a data scientist? Being a practitioner myself, how do I know if I am genuinely a data scientist?

These questions are of interest to STS precisely because data science is an emerging field. To illustrate this point, I draw on material that I have come across in my work as a data scientist. I will discuss the differences and similarities between ‘genres’ that express some definition of data science or data scientists. Perhaps the most salient example of this is the multitude of Venn diagrams that are disseminated online. These diagrams aim to describe what skills or areas of expertise are covered by data science and which ones are not. Figure 1 juxtaposes two such Venn diagrams. Although there are overlaps between the two (e.g. ‘subject matter expertise’ and ‘domain expertise’), there are also differences. For example, the diagram on the left does not include the sphere of ‘Social Sciences’. The diagram on the right also marks some areas as ‘danger zones’. These zones are not just placed outside of data science as a field; they are also presented as combinations of skills that can be risky. The diagram on the left takes a different approach and gives the honorary title of ‘unicorn’ to the data scientist possessing all required skills.

 

Fig. 1: Two examples of a Venn diagram that offers one delineation of data science. The first is taken from Taylor (2016) and the second from Malak (2014).

 

The material on definitions of data science is not limited to Venn diagrams. Another genre that can be identified is that of infographics (see Figure 2). These images differ from the Venn diagrams in that they do not represent the overlap between different areas. Nor do they explicitly state what combinations of skills can be considered dangerous. Rather, these combinations of text and art offer a list of skills that data scientists are expected to have or attain. Some of the skills listed were also present in the Venn diagrams. For example, ‘math and statistics’ can be seen in all of the images and ‘programming’ or ‘hacking’ in three out of four. The infographics seem to put more emphasis on ‘soft skills’ such as communication and project management.

 

Fig. 2: Two examples of infographics that list the set of skills that data scientists (should) have by Optimus Machine Learning (2016) and Zawadzki (2014).

 

Online vacancies for data scientists are a third genre that deals with the definition of data science and data scientists. As with the previous two genres, there are substantial differences between the two examples shown in Figure 3. The required skills in the left advert include programming languages and experience in bash. These skills are absent from the second advert. Instead, it asks for experience in spreadsheet software and work experience at one of the big consultancies. There are also similarities: both adverts ask for skills in working with databases and experience with a set of technologies, albeit a different one in each case. Yet the successful applicant to either vacancy can update his or her job title to “data scientist”.

 

Fig. 3: Two examples of job adverts that list skills required of data scientists, taken from Godatadriven.com (2016) and has-jobs.com (2016).

 

The three genres outlined above are different ways in which data scientists come to terms with their emerging field. Each offers its own style of definition of data science and delineates the profession of data scientist differently. Although cross-cutting skills can be identified, it would seem there is a wide diversity in what is currently understood as data science, and consequently there is little consensus on what it means to be a data scientist. To practitioners, it remains unclear on what grounds one can use the job title of ‘data scientist’, as the required skillset and experience are divergent. As a data science professional, I am cautious of using the term data scientist. When I introduce myself to a peer, I try to first establish a working consensus of the term by explaining what I do. It is perhaps not surprising that new classifications are starting to emerge under the umbrella of ‘data professions’. For example, some are now distinguishing between data engineers, data analysts, data solution consultants and data regulatory officers, to name a few.

This essay outlined several challenges and questions which emerged from the material presented at the 4S/EASST conference. I proceeded by illustrating one of these challenges – the definition of data science – by presenting some online material. The essay demonstrates that there exists no consensus amongst practitioners of data science regarding the boundaries of their field or the skillset that is associated with ‘data scientists’. This is just one of the non-technical aspects of data science. With the abundance of funding that is allocated towards data science initiatives, it seems both opportune and important that we move to develop directions for research on data science in STS. Surely, data science will prove an interesting subject for STS scholars for years to come.

Data practice, data science

In my account of this year’s EASST conference in Barcelona I would like to focus on STS studies of data practices, and the different perspectives I encountered at the conference with respect to how STS may engage with the professional worlds of digital data. I obtained my PhD in Human-Centered Computing in 2014 in the US, where I studied professional knowledge in the making of software. After my PhD, I returned to Europe, and I have been thinking of EASST conferences as opportunities for finding my way into the academic community in Europe. This time I came to the conference from Hungary with the financial support kindly provided by EASST, for which I feel honored and grateful. I was presenting my postdoctoral research at ITU about digital methods. My recent academic path has involved a lot of wayfinding and criss-crossing between places, countries, social worlds and their concerns, and the issue of finding my way into the professional worlds of digital data as a social scientist was most acute for me as I arrived at the conference.

With all the talk and interest in big data and data science, there is a growing sense of social build-up, and I feel that I share the sentiment with other STS scholars that it would be hard to circumvent all this commotion without intellectual curiosity and a sense of hope for exciting research. The social sciences have been taken up in a movement where objective accounts by impartial onlookers at the sidelines have been giving way to the involved and perspectival accounts of the participant, and I could sense a corresponding eagerness to be part of the digital data game. At the same time, the discussions also made it clear that these positions are in the midst of being explored by STS practitioners. If digital data presents itself as an opportunity (to play on a different metaphorical register which is more akin to the field itself), it is equally a challenge to find out how we can dwell in social science and digital data at the same time. This challenge has a reflexive edge to it insofar as our understanding of the constitution of these new domains plays into the STS position that we seek to outline from within. Big data and data science are emerging at the confluence of the knowledge work of data analysis and digital technology, and I would like to argue that significantly different epistemic positions are outlined depending on whether the digital character of data practices is given emphasis.

 

Fig. 1: A critical making hackathon by Gabby Resch at the University of Toronto exploring the quantification of toilets by means of behavioral and residue data
Courtesy of University of Toronto, Faculty of Information

 

My discussion draws from two panels: a roundtable session on ‘The Potential Futures of Data Science’ taking place at the very beginning of the conference, and the three-part track entitled ‘Critical data studies’ on the last day. The data science roundtable was hosted by Brian Beaton from CalPoly, and repeated a similar arrangement held with the same scholars at the 4S conference in Denver last year. It attracted a surprisingly large audience, who were also willing to cheerfully chip in with their considered opinions despite the early morning hour. The three-part track broadened the theme from data science to computational data practices at large, while big data was casting its shadow over both of the venues. The contributions on the last day were for the most part case studies of professional work practices around digital data, which provided the empirical fodder for a slower-paced discussion.

Overall, the discussions and presentations made a convincing case that there is a broad sweep of STS research about new professional practices around data. The empirical work presented on the last day was especially diverse, looking at, among others, visualization practices in elementary particle physics, modeling practices for informing policy among economists, algorithmic sense-making among data scientists, the use of data as evidence in health care, and the curation of large-scale databases across cultural institutions. Diversity within the field was discussed by several contributors, who pointed to a divide between academia and industry (David Ribes), a distinction between emerging practices of social data and the historical continuities in the natural sciences (Paul Edwards), and differences between large and small scale data practices within the latter (Irene Pasquetto and Ashley E. Sands). It is also clear that our research implies partaking in different professional settings and communities beyond the fields we study, for example in STS, policy and in education.

In the face of this diversity, my own question of wayfinding became translated to the problem of unity and relevance: what brings us together and with whom when we apply the STS lens to professional data practices?

I would like to start with the hype that characterizes big data and data science. These labels were adopted as unifying themes for the track and the roundtable, respectively, while participants also acknowledged that in talking about these areas, we are dealing with moving targets, open-ended signifiers which are driven by evangelism, boosterism or veiled financial and political interests. One approach was to render STS itself into a formative agent within this arena. Brian Beaton proposed, somewhat provocatively, to think about what a takeover of data science by STS would be like. He used the witty argument that (I paraphrase) we have been here for longer, and we have all the right tools for making sense of social practice. I understood him to mean that big data and data science are surprisingly new developments, which are seeking to make sense of their own position in the scientific arena. STS has been working on making sense of exactly these kinds of situations, and we have developed considerable expertise in this. While the fantasy of such a takeover deeply resonates with some part of my intellectual self, I jotted down the immediate reaction in my notes that this would not be possible because digital data is already entangled in large-scale institutional contexts, which, together with technologies like databases and tools of analysis, create a powerful regime of practices. While gaining professional agency has enormous appeal, and it resonates with the call for doing STS by other means, we should be wary of a wholesale adoption of these open signifiers as the heuristic framing of research. In this regard, I particularly appreciated Andrew Clement’s short intervention that (and I am paraphrasing again) the emergence of data science is driven by those who seek control without a clear idea of how control may be achieved, and they are soliciting the help of a new cast of professionals, the data scientists, to make sense of data for this purpose.

Meanwhile, I also encountered examples of doing STS by other means which were exploring new avenues for understanding the role of STS within the digital data domain. I like to think of these approaches as qualified versions of insiderism, because they share with digital professionals the orientation to making, but this is pursued within an STS framing. Another way of characterizing them is to say that they appropriate the nitty-gritty of technological work practices around digital data for an STS agenda, engaging in some sort of a take-over of digital practices. An emphasis on the digital character of data practices comes to the fore, and this lends these positions a distinct epistemic character. I would like to report on two approaches which made a strong impression on me on account of practicing this silent, everyday form of take-over from within: the critical information practice of Yanni Loukissas, Matt Ratto and Gabby Resch, and the STS take on digital data analysis that was brought to this conference by Tommaso Venturini, Anders Kristian Munk and Mathieu Jacomy. It was Resch and Venturini who talked about the respective approaches.

Critical data practice is a curriculum that has been developed to engage students in practice-based reflection around data. Paraphrasing Gabby Resch, critical data practice means that participants do actual data science with current digital tools, such as MapReduce and Pandas, but they also do Derrida and think about Derrida’s discussion of the archive. Data often comes to data science as a given, in the form of a database, and the authors have organized digital workshops which tackle this assumption and put in focus the making of data and databases. In these workshops, students are called on to invent their own apparatuses for data collection, they clean and aggregate the data and they are invited to reflect on the tactics they use in this process for making data regular.

 

Fig. 2: Working with network visualizations at a data sprint in Oxford
Courtesy of Tommaso Venturini

 

Venturini talked about how researchers in STS picked up the method of social network analysis and came to grapple with its limitations for pursuing STS questions. ANT proposes, for example, that networks become actors, and this would require a mode of analysis where node and network are reversible. Network analysis has no ready-made models and tools that could support such a reversible approach. In the face of this and other limitations, Venturini and his colleagues have outlined a research agenda for visual network analysis, which appropriates the computational apparatus for visualizing networks towards STS ends. One example is the ForceAtlas2 algorithm and its implementation in the open source network visualization tool Gephi. This algorithm makes social features like clustering and density more salient in network visualizations. In visual network analysis, advancing the STS agenda becomes possible through partnering with computers and engaging in the nitty-gritty of software development.

Venturini and Resch have shown a path where STS appropriates digital data practice for its own theoretical and critical agenda. It is a path for doing STS by other means. This is in stark contrast with the approach which would bring the empirical and theoretical STS toolkit to enlighten or critique the agenda of data science. In fact, critical data practice and visual network analysis participate in figuring out digital data and giving a face and a name to it, each in its own way. In this, they are similar to the scientists and professionals in the STS case studies presented at the conference. Their data practices are in sync with their work practices, which are varied and local. If we can talk about unity, it is at the level of digital practice.

I find that there is something powerful in the proposition to embrace digital practice for doing STS. It feels like a much-awaited opportunity to do social science by other means, and it appeals to the ethnographer’s mandate to turn into an insider without entirely going over to the other side.