Contribution Details

Type Dissertation
Scope Discipline-based scholarship
Title Crowdsourcing data analysis: empowering non-experts to conduct data analysis
Organization Unit
  • Michael Feldman
  • Abraham Bernstein
  • Kevin Crowston
  • English
Institution University of Zurich
Faculty Faculty of Business, Economics and Informatics
Number of Pages 170
Date 2018
Date Annual Report 2018
Abstract Text The development of the Internet-based ecosystem has led to the emergence of alternative recruitment models that are facilitated exclusively through the Internet. With Online Labor Markets (OLMs) and crowdsourcing platforms, it is possible to hire individuals online to carry out tasks and projects of varying size and complexity. Crowdsourcing platforms are well suited to simple micro-tasks that take seconds or minutes and can be completed by a large number of participants working in parallel. OLMs, on the other hand, usually allow experts to be hired flexibly for more advanced projects that can take days, weeks, or even months. Owing to the flexibility of such employment models, various experts can be found on OLMs, such as designers, lawyers, developers, and engineers. However, it is relatively rare to find data scientists: experts able to preprocess, analyze, and make sense of data. This is not surprising given the general shortage of data science experts. Moreover, for reasons such as extensive education and training requirements and soaring demand, the shortage of such experts is projected to grow over the next few years. In this dissertation we explored how the crowdsourcing approach could be leveraged to support data science projects. In particular, we presented three use cases in which crowds and freelancers with different levels of expertise could be involved in such projects. We broadly classified crowds into low, intermediate, and high levels of expertise in data analysis and proposed use cases in which each group might contribute in a crowdsourcing setting. In the first case study we presented an approach to engaging crowds in reviewing the statistical assumptions in scientific publications. When researchers use statistical methods in scientific manuscripts, these methods are often valid only if their underlying assumptions are met. 
If these assumptions are violated, the validity of the results is questionable. We presented an approach based on micro-tasking with lay crowds that reaches quality similar to that of expert-based review. We then conducted a longitudinal analysis of CHI conference proceedings to evaluate how standards of statistical reporting have changed over the years. Finally, we compared the CHI proceedings with five top journals in the fields of medicine, management, and psychology to assess the reporting of statistical assumptions across disciplines. Our second case study addressed freelancers with intermediate expertise in data analysis. To better understand what skills intermediate experts possess, we interviewed data science experts, asking them what kinds of tasks could be outsourced to non-experts. Additionally, we conducted a survey on the most prominent OLMs to better understand the skills of freelancers active in data analysis. The conclusions of this study were twofold: 1) by a conservative estimate, individuals with some coding skills could be helpful in data science projects if integrated properly, and 2) data preprocessing tasks are by far the biggest bottleneck activity that could be outsourced, provided that coordination between the involved parties is managed properly. Building on these results, we conducted a study in which we designed a proof-of-concept platform that facilitated a number of experiments in which non-experts collaborated with experts, who offloaded data preprocessing activities to them. Our results suggest that the outcomes achieved by mixed-expertise teams are similar in quality to, and cheaper than, the work of experts. Our last use case was directed less at alleviating the shortage of data scientists than at taking advantage of the crowdsourcing setting to address an inherent vulnerability of data-driven analysis. 
Recently, there has been discussion among data analysis experts and researchers regarding the subjectivity of data-driven analysis outputs. Namely, it has been shown that when data analysts are given the same data and the same hypothesis within an NHST (Null Hypothesis Significance Testing) approach, they often reach radically different results. Therefore, we conducted a study in which we provided 47 experts with the same data and hypotheses to test. Through a specially designed platform we were able to elicit the rationale for every decision made throughout the data analysis. This fine-grained data allowed us to conduct a qualitative analysis in which we explored the underlying factors leading to the variability of data analysis results. Taken together, the case studies provide an overview of how the discipline of data science could benefit from the crowdsourcing approach. We hope that the solutions proposed in this dissertation will contribute to the discussion on how to lower the entry barrier for laymen to participate in data-driven research, as well as how to improve transparency about how results were reached.
Other Identification Number merlin-id:17332