Contribution Details

Type Master's Thesis
Scope Discipline-based scholarship
Title Document Embedding Models - A Comparison with Bag-of-Words
Organization Unit
Authors
  • Robin Stohler
Supervisors
  • Abraham Bernstein
Language
  • English
Institution University of Zurich
Faculty Faculty of Business, Economics and Informatics
Date September 2018
Abstract Text Word embeddings have fundamentally changed what is possible in Natural Language Processing and Machine Learning, opening new doors for many applications. One is the creation of document embeddings with the Doc2Vec algorithm, which is based on Word2Vec. These dense, distributed latent vectors make it possible to work with text in a more meaningful way than older text vectorization processes such as Bag-of-Words (BOW). In this thesis, a variety of baseline methods are compared against Doc2Vec in different categories, in order to assess the usefulness of these older approaches after the recent shake-up in the Natural Language Processing field. Empirical results show that BOW used with a strong classifier performs better than Doc2Vec, especially on smaller datasets. Additionally, an approach to reduce the dimensionality of a BOW representation is presented.