Impact of Data Quality and Provenance

How do limitations and a lack of transparency in data quality and data provenance bias research outcomes, and how can we detect and mitigate these effects?

Collecting and analyzing social interaction data involves a plethora of choices that can significantly bias research outcomes and the implications derived from them (Diesner, 2015). These choices concern the gathering, representation, and provenance of data, as well as the configuration of algorithms, methods, and tools. Increasingly, such decisions are embedded in datasets and technologies, yet we have a poor understanding of their impact on research outcomes and lack best practices and norms for documenting and communicating them. In my lab, we address this issue by studying the impact of data quality issues and pre-processing techniques on analysis results. This work contributes to the accuracy and reliability of data science.

For example, we have been investigating the impact of entity resolution errors on network analysis results. We found that commonly reported network metrics, and the implications derived from them, can deviate strongly from the truth (as established on gold-standard data or approximations thereof) depending on the effort dedicated to entity resolution (Diesner, Evans, & Kim, 2015). Insufficiently splitting nodes in co-publishing networks can make scientific communities look denser and more cohesive than they truly are, and make individual authors appear more productive, collaborative, and diversified; all of which potentially downplays the need for (interdisciplinary) collaboration and funding. Incorrect entity resolution can even lead to misidentifying applicable network topologies, e.g., detecting power-law distributions of node degree and assuming an underlying preferential attachment process without actual empirical support for this claim (Kim & Diesner, 2016). We found this distortive impact to increase over time, mainly due to the compound effect of the growing number of scholars from Asia and South America, regions where many people share common names (Kim & Diesner, 2015, 2016). Our research in this area has mainly been funded by KISTI (Korea Institute of Science and Technology Information).
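The effect of insufficient node splitting can be illustrated with a minimal sketch (all author names here are hypothetical, and the toy network is an assumption, not data from our studies): when two distinct researchers who share a name are collapsed into a single node, the network has fewer nodes but the same number of edges, so its density rises, and the merged author's degree is the sum of two people's collaborations.

```python
def density(edges):
    """Undirected graph density: 2E / (N * (N - 1))."""
    nodes = {n for e in edges for n in e}
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0

# Gold-standard co-authorship network: "Kim-1" and "Kim-2" are two
# distinct researchers who both publish under the name "J. Kim".
gold = {
    frozenset(("Kim-1", "Lee")), frozenset(("Kim-1", "Park")),
    frozenset(("Kim-2", "Chen")), frozenset(("Kim-2", "Silva")),
}

# Initial-based name disambiguation collapses both into one "J. Kim".
merge = {"Kim-1": "J. Kim", "Kim-2": "J. Kim"}
merged = {frozenset(merge.get(a, a) for a in e) for e in gold}

# Same 4 edges, but 5 nodes instead of 6: the network looks denser,
# and "J. Kim" now appears to have 4 collaborators instead of 2.
print(density(gold))    # 4 edges over 6 nodes -> 0.266...
print(density(merged))  # 4 edges over 5 nodes -> 0.4
```

The same mechanism scales up: the more same-name authors a field contains, the more such spurious merges occur, which is why the distortion compounds as the share of scholars with common names grows.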

References:

  1. Diesner, J. (2015). Small decisions with big impact on data analytics. Big Data & Society, 2(2).
  2. Diesner, J., Evans, C. S., & Kim, J. (2015). Impact of entity disambiguation errors on social network properties. Proceedings of International AAAI Conference on Web and Social Media (ICWSM), Oxford, UK.
  3. Kim, J., & Diesner, J. (2015). The effect of data pre-processing on understanding the evolution of collaboration networks. Journal of Informetrics, 9(1), 226-236.
  4. Kim, J., & Diesner, J. (2016). Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. Journal of the Association for Information Science and Technology, 67(6), 1446-1461. doi:10.1002/asi.23489

Funder

KISTI (Korea Institute of Science and Technology Information)
KISTI: Selection versus Homophily in Scientific Collaboration Networks
KISTI C3240: Modeling of nation-scale scientific collaboration networks and impact of entity disambiguation on big network data
KISTI P14033: Authority data based scientist network analysis