Dhivyabharathi Ramasamy, Automatic Annotation of Data Science Notebooks: A Machine Learning Approach, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis)
Data science notebooks are notebooks developed for data science activities such as exploration, collaboration, and visualization. Traditionally used as a tool to provide reproducible results and to document research, they have become prominent in recent years due to the enormous traction of the machine learning field. Interactive notebook platforms such as Jupyter, Zeppelin, and Kaggle are among the primary environments used to implement data science tasks. Notebooks have thus become an important source of data for understanding and analysing data science pipelines as implemented in practice. Each data science pipeline contains many data science activities, and in order to analyse them, it is necessary to identify where in a given notebook each activity takes place. Having experts label the data science activities in notebooks is a time-consuming and expensive process. In this master's thesis, I automatically classify the cells of data science notebooks and assign data science activities to them using supervised machine learning. I identify a set of common high-level data science activities as labels and assign each notebook cell labels based on the activities it performs. Multiple activity labels are allowed per cell to account for the different coding styles of notebook users, overlapping activities, etc. An annotation experiment was designed and conducted to obtain expert-labelled data, and a set of 100 expert-annotated Jupyter notebooks is used as the dataset in the experiments. Python classes were developed to extract various features from the Jupyter notebooks for the classification task. Multiple supervised classifiers (K-Nearest Neighbors, Support Vector Machines, Multi-Layer Perceptron, Gradient Boosting, Random Forest, Decision Tree, Naive Bayes, Logistic Regression) were evaluated using both single-label and multi-label classification methods. The Logistic Regression classifier achieves higher precision with multi-label classification than with single-label classification. The research shows that ensemble methods and logistic regression are more suitable for the classification of source code written in notebooks. The feature importances discussed in the research questions provide insights into which features are informative for code classification. The comparison of the two classification paradigms, and the better precision of multi-label classification, leads to the conclusion that data science pipelines as found in notebooks are not always sequential and often overlap heavily, in contrast to the theoretical design of data science pipelines. I have also developed an ontology for notebooks and data science activities and use it to provide the annotations in Semantic Web style, serialized in the Resource Description Framework (RDF) format, for further analysis. In addition, I present and discuss the results of an exploratory data analysis and the performance of unsupervised classification on the dataset, as well as an analysis of inter-annotator agreement. The features generated using the system can also be used in analyses set in other contexts.
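As an illustration of the multi-label setup described above, the following is a minimal sketch assuming scikit-learn; the example cells, activity labels, and TF-IDF features are illustrative stand-ins, not the thesis's actual feature extraction or dataset.

```python
# Minimal sketch of multi-label cell classification, assuming
# scikit-learn; toy cells and labels, not the thesis's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

cells = [
    "import pandas as pd\ndf = pd.read_csv('data.csv')",
    "df.describe()",
    "model.fit(X_train, y_train)",
]
activities = [["data_loading"], ["exploration"], ["modelling"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(activities)       # binary indicator matrix

# One binary logistic-regression classifier per activity label.
clf = make_pipeline(
    TfidfVectorizer(token_pattern=r"[A-Za-z_]+"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(cells, Y)

pred = clf.predict(["df.head()"])
print(mlb.inverse_transform(pred))
```

With OneVsRestClassifier, one binary model is trained per activity, so a cell can receive zero, one, or several labels, matching the overlapping activities noted above.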
|
|
Samuel Meuli, Modelling and Importing Dynamic Data into Wikibase: A Case Study of the Swiss Transportation System, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
The Swiss Federal Railways (SBB) publish datasets on their transportation network in the GTFS format. The company is now looking to integrate this information into the Wikidata ecosystem. The datasets are updated every week with possible changes to the network. The goal of this thesis is to provide users with a way to get an impression of the network's evolution over time. For this purpose, a software tool is developed that maps GTFS data to Wikibase entities and imports and updates these in a Wikibase instance. To make such graph dynamics understandable for humans and machines, an RDF ontology for modelling changes is defined and a statistical analysis of the SBB's datasets is performed.
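As a rough illustration of the GTFS-to-Wikibase mapping step, here is a minimal sketch; the property IDs and the `stop_to_item` helper are hypothetical, and the actual tool's entity model may differ.

```python
# Illustrative sketch of mapping GTFS records to Wikibase-style items,
# assuming standard GTFS CSV files; property IDs (P1, P2) are
# placeholders for whatever the target Wikibase instance defines.
import csv

def stop_to_item(row):
    """Map one GTFS stops.txt record to a Wikibase-style item payload."""
    return {
        "labels": {"en": row["stop_name"]},
        "claims": {
            "P1": row["stop_id"],                  # GTFS stop ID
            "P2": (float(row["stop_lat"]),         # coordinates
                   float(row["stop_lon"])),
        },
    }

with open("stops.txt", newline="", encoding="utf-8") as f:
    items = [stop_to_item(row) for row in csv.DictReader(f)]
print(f"prepared {len(items)} candidate items")
```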
|
|
Luca Rossetto, Mahnaz Amiri Parian, Ralph Gasser, Ivan Giangreco, Silvan Heller, Heiko Schuldt, Deep Learning-based Concept Detection in vitrivr at the Video Browser Showdown 2019 - Final Notes, arXiv preprint arXiv:1902.10647, 2019. (Journal Article)
|
|
Lei Han, Kevin Roitero, Ujwal Gadiraju, Cristina Sarasua, Alessandro Checco, Eddy Maddalena, Gianluca Demartini, All Those Wasted Hours: On Task Abandonment in Crowdsourcing, In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, ACM, 2019-02-11. (Conference or Workshop Paper published in Proceedings)
|
|
Ralph Gasser, Luca Rossetto, Heiko Schuldt, Towards an All-Purpose Content-Based Multimedia Information Retrieval System, arXiv preprint arXiv:1902.03878, 2019. (Journal Article)
|
|
Wen Zhang, Bibek Paudel, Wei Zhang, Abraham Bernstein, Huajun Chen, Interaction Embeddings for Prediction and Explanation in Knowledge Graphs, In: International Conference on Web Search and Data Mining (WSDM), Association of Computing Machinery (ACM), New York, NY, 2019-02-11. (Conference or Workshop Paper published in Proceedings)
Knowledge graph embedding aims to learn distributed representations for entities and relations, and such representations have proven effective in many applications. Crossover interactions --- bi-directional effects between entities and relations --- help select related information when predicting a new triple, but they have not been formally discussed before.
In this paper, we propose CrossE, a novel knowledge graph embedding method that explicitly simulates crossover interactions. It not only learns one general embedding for each entity and relation, as in most previous methods, but also generates multiple triple-specific embeddings for both of them, named interaction embeddings.
We evaluate the embeddings on the typical link prediction task and find that CrossE achieves state-of-the-art results on complex and more challenging datasets.
Furthermore, we evaluate the embeddings from a new perspective --- giving explanations for predicted triples, which is important for real applications.
In this work, explanations for a triple are regarded as reliable closed paths between the head and tail entities. Compared to other baselines, we show experimentally that CrossE is more capable of generating reliable explanations to support its predictions, benefiting from interaction embeddings.
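The following NumPy sketch paraphrases the crossover-interaction idea, with a relation-specific interaction vector modulating the general embeddings; it is an illustrative reading of the mechanism, not the paper's exact CrossE scoring function.

```python
# Sketch of the crossover-interaction idea in NumPy: a relation-specific
# interaction vector c_r produces triple-specific embeddings from the
# general ones. Consult the paper for the exact CrossE formulation.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension

h = rng.normal(size=d)                  # general head-entity embedding
r = rng.normal(size=d)                  # general relation embedding
t = rng.normal(size=d)                  # candidate tail embedding
c_r = rng.normal(size=d)                # interaction vector for relation r
b = np.zeros(d)                         # bias term

h_i = c_r * h                           # triple-specific head embedding
r_i = h_i * r                           # triple-specific relation embedding

# Plausibility that (h, r, t) is a valid triple.
score = 1.0 / (1.0 + np.exp(-np.tanh(h_i + r_i + b) @ t))
print(f"score: {score:.3f}")
```
|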
|
Te Tan, Online Optimization of Job Parallelization in Apache GearPump, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis)
Parameter tuning in the realm of distributed (streaming) systems is a popular research area, and many solutions have been proposed by the research community. Bayesian Optimization (BO) is one of them and has proven to be powerful. While the existing way to conduct the BO process is 'offline', involving a system shutdown as well as many inefficient manual steps, in this work we implement an optimizer that performs 'online' BO. The optimizer is implemented within Apache Gearpump, a message-driven streaming engine. As runtime DAG manipulation is a prerequisite for 'online' optimization, we inspect the existing features of Apache Gearpump and propose an improved approach, named Restart, for runtime DAG operations. Building on Restart, we then design and implement JobOptimizer, which enables 'online' BO. The evaluation results show that, under a constraint on the maximum number of trials, JobOptimizer cannot explore the parameter space exhaustively but still finds better parameter sets than random exploration. It also outperforms the Linear Ascent Optimizer in terms of throughput for comparatively larger DAG applications.
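For a sense of what such a BO tuning loop looks like, here is a minimal sketch assuming scikit-optimize; `measure_throughput` is a hypothetical stand-in for reconfiguring the running job and sampling its metrics.

```python
# Minimal sketch of a BO tuning loop, assuming scikit-optimize.
from skopt import gp_minimize
from skopt.space import Integer

def measure_throughput(parallelism, buffer_size):
    # Hypothetical stand-in: a real implementation would reconfigure the
    # running job (e.g. via a Restart-style DAG operation) and sample
    # its throughput from the engine's metrics.
    return 1e5 - (parallelism - 4) ** 2 * 1e3 - (buffer_size - 256) ** 2

def objective(params):
    parallelism, buffer_size = params
    return -measure_throughput(parallelism, buffer_size)  # BO minimizes

result = gp_minimize(
    objective,
    dimensions=[Integer(1, 16, name="parallelism"),
                Integer(32, 1024, name="buffer_size")],
    n_calls=15,          # the constrained trial budget discussed above
    random_state=0,
)
print("best parameters found:", result.x)
```
|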
|
Luca Rossetto, Mahnaz Amiri Parian, Ralph Gasser, Ivan Giangreco, Silvan Heller, Heiko Schuldt, Deep Learning-Based Concept Detection in vitrivr, In: MultiMedia Modeling, Springer, Heidelberg, p. 616 - 621, 2019-01-11. (Book Chapter)
This paper presents the most recent additions to the vitrivr retrieval stack, which will be put to the test in the context of the 2019 Video Browser Showdown (VBS). The vitrivr stack has been extended by approaches for detecting, localizing, or describing concepts and actions in video scenes using various convolutional neural networks. Leveraging those additions, we have added support for searching the video collection based on semantic sketches. Furthermore, vitrivr offers new types of labels for text-based retrieval. In the same vein, we have also improved upon vitrivr’s pre-existing capabilities for extracting text from video through scene text recognition. Moreover, the user interface has received a major overhaul so as to make it more accessible to novice users, especially for query formulation and result exploration.
|
|
Luca Rossetto, Heiko Schuldt, George Awad, Asad A Butt, V3C - A Research Video Collection, In: International Conference on Multimedia Modeling, Springer, Heidelberg, Germany, 2019-01-08. (Conference or Workshop Paper published in Proceedings)
|
|
Ausgezeichnete Informatikdissertationen 2018, Edited by: Steffen Hölldobler, Sven Apel, Abraham Bernstein, Felix Freiling, Hans-Peter Lenhof, Paul Molitor, Gustaf Neumann, Rüdiger Reischuk, Björn Scheuermann, Nicole Schweikardt, Myra Spiliopoulou, Sabine Süsstrunk, Klaus Wehrle, Gesellschaft für Informatik, Bonn, 2019. (Edited Scientific Work)
|
|
KV Rosni, Manish Shukla, Vijayanand Banahatti, Sachin Lodha, Consent Recommender System: A Case Study on LinkedIn Settings, PAL: Privacy-Enhancing Artificial Intelligence and Language Technologies as part of the AAAI Spring Symposium Series (AAAI-SSS 2019), 2019. (Journal Article)
|
|
Oana Inel, Lora Aroyo, Validation methodology for expert-annotated datasets: Event annotation case study, In: 2nd Conference on Language, Data and Knowledge (LDK 2019), 2019. (Conference or Workshop Paper published in Proceedings)
|
|
Daniel Spicar, Efficient spectral link prediction on graphs: approximation of graph algorithms via coarse-graining, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Dissertation)
Spectral graph theory studies the properties of graphs in relation to their eigenpairs, that is, the eigenvalues and eigenvectors of associated graph matrices. Successful applications of spectral graph theory include the ranking of web search results and link prediction in graphs. The latter is used to predict the evolution of graphs and to discover previously unobserved edges. However, the computation of eigenpairs is computationally very demanding: the eigenvalue-eigenvector decomposition of graph matrices has cubic time complexity. As graphs or networks become large, this makes the computation of a full eigendecomposition infeasible. This thesis addresses the complexity problem for one of the most accurate state-of-the-art spectral link prediction methods, which requires several eigenvalue-eigenvector decompositions and is therefore applicable to small graphs only.
Previous work on similar complexity bottlenecks has approached the problem by computing only a subset of the eigenpairs in order to obtain an approximation of the original method at lower computational cost. This thesis takes the same approach but, instead of modifying the original link prediction algorithm, uses the eigenpair subset to approximate the graph, leaving the link prediction algorithm essentially unchanged. The graph is approximated by spectral coarse-graining, a method that shrinks graphs while preserving their dominant spectral properties. This approach is motivated by the hypothesis that results computed on a coarse-grained graph approximate the original link prediction results.
The main contribution presented in this dissertation is a new, coarse-grained spectral link prediction approach. In a first part, the state-of-the-art link prediction method is combined with spectral coarse-graining, and the computational cost, complexity, and link prediction accuracy are evaluated. Theoretical analysis and experiments show that the coarse-grained approach produces accurate approximations of the original method with a significantly reduced time complexity. Thereafter, the spectral coarse-graining method is extended to make the complexity reduction more controllable and to avoid the computation of the eigenvalue-eigenvector decomposition. This dramatically increases the efficiency of the proposed approach and allows more accurate graph approximations to be computed. As a result, the link prediction accuracy can be significantly improved while maintaining the reduced time complexity of the coarse-grained approach.
Furthermore, the proposed approach produces a valid graph of the same structure and type as the original graph. In principle, it can therefore be used with many other graph applications without major adaptations, making the approach a step towards a more general approximation framework for spectral graph algorithms.
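The cost-saving idea behind the approach can be illustrated with a generic spectral link prediction sketch: only the k dominant eigenpairs are computed, and node pairs are scored with a spectral kernel. This is a minimal sketch of the general technique, not the specific method or coarse-graining procedure of the thesis.

```python
# Generic spectral link prediction with a truncated eigendecomposition:
# compute only the k dominant eigenpairs (instead of a full O(n^3)
# decomposition) and score node pairs with an exponential spectral
# kernel. Illustrative only; not the thesis's exact method.
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

n, k = 500, 20
A = sparse_random(n, n, density=0.02, random_state=0)
A = A + A.T                              # symmetric adjacency-like matrix

# k eigenpairs via an iterative solver instead of a full decomposition.
vals, vecs = eigsh(A, k=k, which="LA")

# Rank-k kernel scores S = U f(Lambda) U^T with f = exp.
S = (vecs * np.exp(vals)) @ vecs.T

i, j = 3, 42
print(f"link score for ({i}, {j}): {S[i, j]:.4f}")
```
|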
|
Proceedings of the ISWC 2019 Satellite Tracks (Posters & Demonstrations, Industry, and Outrageous Ideas) co-located with 18th International Semantic Web Conference (ISWC 2019), Edited by: Mari Carmen Suárez-Figueroa, Gong Cheng, Anna Lisa Gentile, Christophe Guéret, Maria Keet, Abraham Bernstein, CEUR Workshop Proceedings, CEUR, 2019. (Edited Scientific Work)
|
|
Laurenz Shi, Differential Privacy Algorithms for Data Streams on Top of Apache Flink, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Bachelor's Thesis)
With the advent of computing systems that process large amounts of data, securing sensitive data has become a lasting concern. Research has shown that, with the help of other data sets, anonymized data sets can be reconstructed. Differential privacy is a construct for securing sensitive data: the data is perturbed with noise to ensure privacy. It has been shown to be very effective against attempts to reconstruct anonymized data sets in combination with other data sets. Originally used on static data sets, differential privacy was later extended to data streams. Four papers that introduce various such algorithms are examined in this thesis. As the algorithms are often of a theoretical nature, they are implemented and compared to each other in order to evaluate their real effects on the results of data streams. The focus is on the accuracy that the algorithms deliver.
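The basic building block such algorithms share can be sketched as follows: each released value is perturbed with Laplace noise calibrated to sensitivity/epsilon. This is a minimal illustration only; the papers examined in the thesis manage the privacy budget across releases more carefully (e.g., with tree-based aggregation).

```python
# Minimal sketch of Laplace perturbation on a stream of running counts.
# Each release adds fresh noise with scale sensitivity/epsilon; real
# streaming DP algorithms additionally account for composition across
# the many releases (e.g., via binary-tree aggregation).
import numpy as np

rng = np.random.default_rng(0)

def private_running_counts(stream, epsilon, sensitivity=1.0):
    """Yield a noisy running count after every stream element."""
    scale = sensitivity / epsilon        # Laplace scale b
    count = 0
    for bit in stream:
        count += bit
        yield count + rng.laplace(0.0, scale)

events = rng.integers(0, 2, size=10)     # toy 0/1 event stream
for true, noisy in zip(np.cumsum(events),
                       private_running_counts(events, epsilon=1.0)):
    print(f"true={true:2d}  noisy={noisy:6.2f}")
```
|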
|
Bibek Paudel, Suzanne Tolmeijer, Abraham Bernstein, Bringing Diversity in News Recommender Algorithms, In: ECREA 2018 - pre-conference workshop on Information, Diversity and Media Pluralism in the Age of Algorithms. 2018. (Conference Presentation)
In major political events and discussions, recommender algorithms play a large role in shaping opinions and setting the agenda for more traditional news media. These algorithms are used pervasively in social networks, media platforms and search engines. They determine what information is shown to the user and in which order.
Existing recommender systems focus on improving accuracy based on historic interaction-data, which has received criticism for being detrimental to the goals of improving user experience and information diversity. Their accuracy is measured in terms of predicting future user behavior based on past observation. The most popular algorithms suggest items to users based on the choices of similar users.
However, these systems are observed to promote a narrow set of already-popular items that are similar to past choices of the user, or are liked by many users. This limits users' exposure to diverse viewpoints and potentially increases polarization.
One reason for the lack of diversity in existing approaches is that they optimize for average accuracy and click rates. Other aspects of user experience, like new information and differing viewpoints, are not taken into account. Similarly, a high average accuracy does not necessarily mean that the algorithm is similarly accurate for different groups of users and niche items. Besides this, training data may be biased because of previous recommendations or the data collection process.
Although information diversity is important, its definition depends heavily on the specific application, and it is hard to find a general definition in the context of recommender systems. Nevertheless, there are definitions that pertain to some specific aspects of diversity, such as entropy, coverage, personalization, and surprisal.
There is a need to design recommender algorithms that balance accuracy and diversity in order to deal with the aforementioned problems. To approach this problem, we first investigated state-of-the-art algorithms and found that they promote already-popular products. We also found that recommender diversity depends on the optimization parameters of the algorithms, a factor that had not received enough attention before. In this context, we developed a graph-based method that promotes niche items without sacrificing accuracy. We also developed a probabilistic latent-factor model that significantly improves the coverage and long-tail diversity of recommendations. Currently, we are using these insights to develop new machine learning algorithms for diverse news recommendation. We plan to deploy and test them in a large-scale experiment involving multiple news producers and consumers.
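Two of the diversity notions named above, coverage and entropy, can be computed as in the following sketch over toy recommendation lists; the data and the exact metric variants are illustrative.

```python
# Sketch of two diversity metrics over a set of recommendation lists:
# catalogue coverage and the Shannon entropy of item recommendation
# frequencies. Toy data only; metric variants differ across papers.
import math
from collections import Counter

recommendations = {                      # user -> recommended item IDs
    "u1": ["a", "b", "c"],
    "u2": ["a", "b", "d"],
    "u3": ["a", "e", "f"],
}
catalogue = {"a", "b", "c", "d", "e", "f", "g", "h"}

counts = Counter(item for recs in recommendations.values() for item in recs)
total = sum(counts.values())

coverage = len(counts) / len(catalogue)
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"coverage: {coverage:.2f}")       # fraction of catalogue recommended
print(f"entropy:  {entropy:.2f} bits")   # higher = more evenly spread
```
|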
|
Romana Pernisch, Florian Ruosch, Daniele Dell'Aglio, Abraham Bernstein, Stream Processing: The Matrix Revolutions, In: SSWS 2018 Scalable Semantic Web Knowledge Base Systems, CEUR-WS.org, Aachen, 2018-10-09. (Conference or Workshop Paper published in Proceedings)
Analyzing data streams is a vital task in data science. Often, data comes in different shapes such as triples, tuples, relations, or matrices. Traditional stream processing systems, however, only process data in one of these formats.
To enable the processing of streams combining different shapes of data, we developed a system that parses SPARQL queries using the Apache Jena parser and transforms them into Apache Flink topologies. With a custom data type and tailored functions, we enabled the integration of matrices in Jena and thereby allowed graph, relational, and linear algebra to be mixed in an RDF graph. This provides a proof of concept that queries may be written for static data and, using the streaming engine Flink, can easily be run on data streams, even if these contain several of the aforementioned data shapes.
|
|
Riccardo Tommasini, Yehia Abo Sedira, Daniele Dell'Aglio, Marco Balduini, Muhammed Intizar Ali, Danh Le Phuoc, Emanuele Della Valle, Jean-Paul Calbimonte, VoCaLS: Describing Streams on the Web, In: ISWC 2018 Posters & Demonstrations and Industry Tracks, CEUR-WS.org, Aachen, 2018-10-08. (Conference or Workshop Paper)
|
|
Tobias Grubenmann, Daniele Dell'Aglio, Abraham Bernstein, Dmitrii Moor, Sven Seuken, Make restaurants pay your server bills, In: ISWC 2018 Posters & Demonstrations and Industry Tracks, CEUR-WS.org, Aachen, 2018-10-08. (Conference or Workshop Paper)
|
|