
Contributions published at Data Analytics (Ingo Scholtes)

Contribution
Jürgen Hackl, Ingo Scholtes, Luka V. Petrović, Vincenzo Perri, Luca Verginer, Christoph Gote, Analysis and Visualisation of Time Series Data on Networks with Pathpy, In: Companion Proceedings of the Web Conference 2021, Association for Computing Machinery, New York, NY, USA, 2021. (Conference or Workshop Paper) The Open Source software package pathpy, available at https://www.pathpy.net, implements statistical techniques to learn optimal graphical models for the causal topology generated by paths in time-series data. Operationalizing Occam’s razor, these models balance model complexity with explanatory power for empirically observed paths in relational time series. Standard network analysis is justified if the inferred optimal model is a first-order network model. Optimal models with orders larger than one indicate higher-order dependencies and can be used to improve the analysis of dynamical processes, node centralities and clusters.
Christoph Gote, Ingo Scholtes, Frank Schweitzer, Analysing Time-Stamped Co-Editing Networks in Software Development Teams using git2net, Empir. Softw. Eng., Vol. 26 (4), 2021. (Journal Article)
Yan Zhang, Antonios Garas, Ingo Scholtes, Higher-order models capture changes in controllability of temporal networks, Journal of Physics: Complexity, Vol. 2 (1), 2021. (Journal Article) In many complex systems, elements interact via time-varying network topologies. Recent research shows that temporal correlations in the chronological ordering of interactions crucially influence network properties and dynamical processes. How these correlations affect our ability to control systems with time-varying interactions remains unclear. In this work, we use higher-order network models to extend the framework of structural controllability to temporal networks, where the chronological ordering of interactions gives rise to time-respecting paths with non-Markovian characteristics. We study six empirical data sets and show that non-Markovian characteristics of real systems can either increase or decrease the minimum time needed to control the whole system. With both empirical data and synthetic models, we further show that spectral properties of generalisations of graph Laplacians to higher-order networks can be used to analytically capture the effect of temporal correlations on controllability. Our work highlights that (i) correlations in the chronological ordering of interactions are an important source of complexity that significantly influences the controllability of temporal networks, and (ii) higher-order network models are a powerful tool to understand the temporal-topological characteristics of empirical systems.
Dominik Arni, A data-driven approach to study the impact of ESG ratings on S&P 500 stock returns, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Bachelor's Thesis) In today's world, sustainability concerns regarding Environmental, Social and corporate Governance (ESG) aspects are increasingly gaining society's attention. These ESG considerations have by now found their way into every part of everyday life. It is therefore no surprise that the world of investing is required to account for ESG topics as well. Such ESG reflections are integrated into the investment process as ESG ratings, which assess the sustainability risks of a company in the form of a score. The popularity of investment approaches that incorporate ESG ratings, and consequently the total capital invested in them, is steadily increasing. Hence, academia has for some time tried to evaluate the relationship between ESG ratings and the performance of various investment instruments. The nature of this relationship is, however, widely debated. Thus, the goal of this thesis is to contribute to the academic literature on this topic and to examine whether and how ESG ratings are correlated with stock returns. Additionally, variations in the importance of ESG ratings depending on a stock's industry affiliation are assessed. A linear regression analysis of the impact of ESG ratings on the stock returns of the S&P 500 from the beginning of 2008 to the end of 2019 reveals an inverse relationship between the overall ESG rating, as well as its sub-ratings, and the performance of the index's constituents. Furthermore, the existence of impact differences for specific industries is found to be highly probable.
Michael Markus Studer, Application of Higher-Order Network Models to Representation Learning in Sequential Data, University of Zurich, Faculty of Business, Economics and Informatics, 2020. (Master's Thesis) Representation learning provides crucial input for machine learning algorithms. Learning these representations instead of manually engineering them accelerates the development of ML applications. Various such methods exist for networks, but they mostly rely on first-order Markov chains to generate random walks to explore the network. However, higher-order Markov chains are often better at modeling real-world spreading processes, and this model change may also affect community detection. We review well-established methods and explore different approaches to upgrade them from first order to higher orders. We experiment with multi-class classification and visualization tasks to compare the original and upgraded methods, using an illustrative synthetic grid and real data on social interactions. The Python source code of the methods and experiments is publicly available on GitHub.
Vincenzo Perri, Ingo Scholtes, HOTVis: Higher-Order Time-Aware Visualisation of Dynamic Graphs, In: Graph Drawing and Network Visualization - 28th International Symposium, 2020. (Conference or Workshop Paper published in Proceedings)
Timothy LaRock, Vahan Nanumyan, Ingo Scholtes, Giona Casiraghi, Tina Eliassi-Rad, Detecting path anomalies in time series data on networks, In: Proceedings of SIAM International Conference on Data Mining (SDM20), Philadelphia, PA, USA, 2020. (Conference or Workshop Paper published in Proceedings)
Christoph Gote, Giona Casiraghi, Frank Schweitzer, Ingo Scholtes, Predicting Sequences of Traversed Nodes in Graphs using Network Models with Multiple Higher Orders, 2020. (Other Publication)
Luka V. Petrović, Ingo Scholtes, Learning the Markov order of paths in a network, 2020. (Other Publication)
Luca Weibel, Automated Extraction of Knowledge Graphs from Scholarly Publications using Data Mining and Machine Learning, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis) Extracting structured data from research papers provided in PDF form has been an ongoing endeavor. This thesis proposes a pipeline capable of taking research paper PDFs and inserting them into a knowledge graph. Developed to be modular and extensible, this pipeline enables new capabilities for scholarly information systems by including both the full-text of research papers as well as metadata provided from CrossRef. We also propose two research paper recommendation algorithms that leverage the proposed pipeline and a method of evaluating research paper recommendation algorithms via a hindcasting experiment that aims to predict the citations in a paper given the first few as input.
Vincenzo Perri, Ingo Scholtes, Higher-Order Visualization of Causal Structures in Dynamic Graphs, In: ArXiv.org, No. 1908.05976, 2019. (Working Paper)
Remo Hertig, Implementation of an Information-theoretic Graph Clustering Algorithm for Pathway Data, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis) Standard network science methods assume a first-order Markov network, where edges are independent samples. In pathway data this assumption is no longer warranted, because there are explicit higher-order dependencies between edges. It is plausible that methods which account for these higher-order dependencies could achieve better performance than first-order methods. This thesis investigates the issue for the problem of community detection in networks, for which Infomap is a well-known method. The method is modified and implemented in Python to make use of pathway data. Because pathway samples are finite, the use of entropy correction methods is also investigated.
Luka Petrović, Ingo Scholtes, Counting Causal Paths in Big Time Series Data on Networks, In: ArXiv.org, No. 1905.11287, 2019. (Working Paper) Graph or network representations are an important foundation for data mining and machine learning tasks in relational data. Many tools of network analysis, like centrality measures, information ranking, or cluster detection rest on the assumption that links capture direct influence, and that paths represent possible indirect influence. This assumption is invalidated in time-stamped network data capturing, e.g., dynamic social networks, biological sequences or financial transactions. In such data, for two time-stamped links (A,B) and (B,C) the chronological ordering and timing determine whether a causal path from node A via B to C exists. A number of works have shown that, for this reason, network analysis cannot be directly applied to time-stamped network data. Existing methods to address this issue require statistics on causal paths, which are computationally challenging to obtain for big data sets. Addressing this problem, we develop an efficient algorithm to count causal paths in time-stamped network data. Applying it to empirical data, we show that our method is more efficient than a baseline method implemented in an Open Source data analytics package. Our method works efficiently for different values of the maximum time difference between consecutive links of a causal path and supports streaming scenarios. With it, we are closing a gap that hinders an efficient analysis of big time series data on complex networks.
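The notion of a causal path in the abstract above can be made concrete with a short sketch. The Python below is a naive illustration, not the efficient algorithm the paper develops: it counts causal paths of length two only, and it assumes that "causal" means the second link occurs strictly after the first and at most `delta` time units later. The function name `count_causal_paths_len2`, the tuple layout `(source, target, timestamp)` and the toy links are hypothetical.

```python
from collections import defaultdict


def count_causal_paths_len2(links, delta):
    """Count causal paths a -> b -> c built from two time-stamped links.

    links: iterable of (source, target, timestamp) tuples (assumed layout).
    delta: maximum time difference allowed between consecutive links.
    """
    links = list(links)

    # Index links by source node so that continuations of a link are easy to find.
    out_links = defaultdict(list)
    for src, dst, t in links:
        out_links[src].append((dst, t))

    counts = defaultdict(int)
    for a, b, t1 in links:
        for c, t2 in out_links[b]:
            # Assumed causality condition: the second link follows the first
            # strictly later in time, and by no more than delta.
            if 0 < t2 - t1 <= delta:
                counts[(a, b, c)] += 1
    return dict(counts)


# Toy data, made up for illustration only.
toy_links = [("A", "B", 1), ("B", "C", 2), ("B", "D", 5), ("C", "A", 0)]
paths = count_causal_paths_len2(toy_links, delta=2)
```

With `delta=2`, the link (B,D) at time 5 comes too late to continue (A,B) at time 1, so A → B → D is not counted even though both links exist in the time-aggregated network. This is the sense in which ordering and timing, not topology alone, decide which paths exist.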
Timothy LaRock, Vahan Nanumyan, Ingo Scholtes, Giona Casiraghi, Tina Eliassi-Rad, Detecting Path Anomalies in Time Series Data on Networks, In: ArXiv.org, No. 1905.10580, 2019. (Working Paper) The unsupervised detection of anomalies in time series data has important applications, e.g., in user behavioural modelling, fraud detection, and cybersecurity. Anomaly detection has been extensively studied in categorical sequences; however, we often have access to time series data that contain paths through networks. Examples include transaction sequences in financial networks, click streams of users in networks of cross-referenced documents, or travel itineraries in transportation networks. To reliably detect anomalies, we must account for the fact that such data contain a large number of independent observations of short paths constrained by a graph topology. Moreover, the heterogeneity of real systems rules out frequency-based anomaly detection techniques, which do not account for highly skewed edge and degree statistics. To address this problem, we introduce a novel framework for the unsupervised detection of anomalies in large corpora of variable-length temporal paths in a graph, which provides an efficient analytical method to detect paths with anomalous frequencies that result from nodes being traversed in unexpected chronological order.
Renaud Lambiotte, Martin Rosvall, Ingo Scholtes, From networks to optimal higher-order models of complex systems, Nature Physics, Vol. 15 (4), 2019. (Journal Article) Rich data are revealing that complex dependencies between the nodes of a network may not be captured by models based on pairwise interactions. Higher-order network models go beyond these limitations, offering new perspectives for understanding complex systems.
Renaud Lambiotte, Martin Rosvall, Michael Schaub, Ingo Scholtes, Jian Xu, Beyond Graph Mining: Higher-Order Data Analytics for Temporal Network Data, In: KDD'18 - 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2018. (Conference Presentation) Network-based data mining techniques such as graph mining, (social) network analysis, link prediction and graph clustering form an important foundation for data science applications in computer science, computational social science, and the life sciences. They help to detect patterns in large data sets that capture dyadic relations between pairs of genes, species, humans, or documents and they have improved our understanding of complex networks. While the potential of analysing graph or network representations of relational data is undisputed, we increasingly have access to data on networks that contain more than just dyadic relations. Consider, e.g., data on user click streams in the Web, time-stamped social networks, gene regulatory pathways, or time-stamped financial transactions. These are examples of time-resolved or sequential data that not only tell us who is related to whom but also when and in which order relations occur. Recent works have shown that the timing and ordering of relations in such data can introduce higher-order, non-dyadic dependencies that are not captured by state-of-the-art graph representations. This oversimplification calls into question the validity of graph mining techniques in time series data and poses a threat to interdisciplinary applications of network analytics. To address this challenge, researchers have developed advanced graph modelling and representation techniques based on higher- and variable-order Markov models, which enable us to model non-Markovian characteristics in time series data on networks. Introducing this exciting research field, the goal of this tutorial is to give an overview of cutting-edge higher-order data analytics techniques.
Key takeaways for attendees will be (i) a solid understanding of higher-order network modelling and representation learning techniques, (ii) hands-on experience with state-of-the-art higher-order network analytics and visualisation packages, and (iii) a clear demonstration of the benefits of higher-order data analytics in real-world time series data on technical, social, and ecological systems.
Ingo Scholtes, When is a Network a Network? Multi-Order Graphical Model Selection in Pathways and Temporal Networks, In: the 23rd ACM SIGKDD International Conference, ACM Press, New York, New York, USA, 2017. (Conference or Workshop Paper published in Proceedings) We introduce a framework for the modeling of sequential data capturing pathways of varying lengths observed in a network. Such data are important, e.g., when studying click streams in the Web, travel patterns in transportation systems, information cascades in social networks, biological pathways, or time-stamped social interactions. While it is common to apply graph analytics and network analysis to such data, recent works have shown that temporal correlations can invalidate the results of such methods. This raises a fundamental question: When is a network abstraction of sequential data justified? Addressing this open question, we propose a framework that combines Markov chains of multiple, higher orders into a multi-layer graphical model that captures temporal correlations in pathways at multiple length scales simultaneously. We develop a model selection technique to infer the optimal number of layers of such a model and show that it outperforms baseline Markov order detection techniques. An application to eight real-world data sets on pathways and temporal networks shows that it allows us to infer graphical models that capture both topological and temporal characteristics of such data. Our work highlights fallacies of network abstractions and provides a principled answer to the open question of when they are justified. Generalizing network representations to multi-order graphical models, it opens perspectives for new data mining and knowledge discovery algorithms.
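A small sketch may help show what a temporal correlation is and why a first-order network can miss it. The Python below is not the paper's multi-order model or its model selection technique. It only estimates maximum-likelihood transition probabilities of a Markov chain of a given order from observed paths, so that the first-order and second-order views of the same data can be compared. The function name `transition_probabilities` and the toy paths are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(paths, order):
    """Maximum-likelihood transition probabilities of an order-k Markov chain.

    paths: iterable of node sequences (tuples or lists).
    order: number of previously visited nodes used as history (k >= 1).
    Returns {history_tuple: {next_node: probability}}.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(order, len(path)):
            history = tuple(path[i - order:i])
            counts[history][path[i]] += 1
    return {
        history: {node: n / sum(nexts.values()) for node, n in nexts.items()}
        for history, nexts in counts.items()
    }


# Toy paths, made up for illustration: walkers coming from A always continue
# to D, walkers coming from B always continue to E.
toy_paths = [("A", "C", "D"), ("B", "C", "E")] * 5

first_order = transition_probabilities(toy_paths, order=1)
second_order = transition_probabilities(toy_paths, order=2)
```

In this toy data the first-order model sees C → D and C → E as equally likely, which would suggest that a path from A to E exists. The second-order model shows that the step after C is fully determined by the node visited before C. Dependencies of this kind are what a plain network abstraction discards.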
Ingo Scholtes, Pavlin Mavrodiev, Frank Schweitzer, From Aristotle to Ringelmann: a large-scale analysis of team productivity and coordination in Open Source Software projects, Empirical Software Engineering, Vol. 21 (2), 2016. (Journal Article)
Ingo Scholtes, Nicolas Wider, René Pfitzner, Antonios Garas, Claudio Tessone, Frank Schweitzer, Causality-driven slow-down and speed-up of diffusion in non-Markovian temporal networks, Nature Communications, Vol. 5 (1), 2015. (Journal Article) Recent research has highlighted limitations of studying complex systems with time-varying topologies from the perspective of static, time-aggregated networks. Non-Markovian characteristics resulting from the ordering of interactions in temporal networks were identified as one important mechanism that alters causality and affects dynamical processes. So far, an analytical explanation for this phenomenon and for the significant variations observed across different systems is missing. Here we introduce a methodology that allows us to analytically predict causality-driven changes of diffusion speed in non-Markovian temporal networks. Validating our predictions in six data sets, we show that, compared with the time-aggregated network, non-Markovian characteristics can lead to either a slow-down or a speed-up of diffusion, which can even outweigh the decelerating effect of community structures in the static topology. Thus, non-Markovian properties of temporal networks constitute an important additional dimension of complexity in time-varying complex systems.
