Contribution Details

Type Master's Thesis
Scope Discipline-based scholarship
Title Sentence-like Segmentation of Swiss German Audio Transcripts for Dependency Parsing
Organization Unit
Authors
  • Melvin Samson Steiger
Supervisors
  • Martin Volk
  • Tanja Samardzic
Language
  • English
Institution University of Zurich
Faculty Faculty of Business, Economics and Informatics
Date 2023
Abstract Text Dependency parsers tend to struggle with transcribed spoken language because they are trained on properly structured, written text. Spoken language lacks this structure and exhibits typical phenomena such as disfluency, repetition, and truncation of words and sentences. This research examines the problem of parsing spoken language for Swiss German audio transcripts from the ArchiMob corpus. Swiss German, an umbrella term for the German (Alemannic) dialects spoken in Switzerland, lacks orthographic and grammatical standardization, shows a high degree of variation among its dialects, and differs substantially from Standard German. The lack of standardization is due to the situation of diglossia in Switzerland. As Swiss German is mainly an oral language or restricted to informal writing, many resources lack structure and exhibit high variability in morphology, spelling, and vocabulary. The combination of variation in Swiss German, its lack of standardization, and the unstructured nature of spoken language renders parsing transcribed Swiss German challenging. Accordingly, pre-trained (German) dependency parsers struggle with Swiss German audio transcripts, and little data is available to train them. This research tackles the problem of parsing spoken language by re-segmenting Swiss German audio transcripts into sentence-like units (SLUs) and examines the impact of re-segmentation on dependency parser performance. Our experimental setup therefore includes two evaluation steps, one for re-segmentation and one for dependency parsing. We frame the re-segmentation as a binary classification task that predicts which tokens mark an SLU boundary. For this purpose, we fine-tune a pre-trained German BERT model to predict such boundaries. The predicted SLU boundaries are then used to reshape the input for the dependency parser.
We show that re-segmentation into SLUs improves the Labeled Attachment Score (LAS) over a baseline. Moreover, we demonstrate that performance in the SLU-boundary classification task correlates with parser performance. To support this supervised learning setting, a test set of roughly 200 SLUs was manually created and annotated with dependency labels for the two-fold evaluation. With our work, we contribute to the processing of spoken Swiss German by showing a way of inducing more structure.
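
The two evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `resegment` and `las`, the convention that a label of 1 marks the last token of an SLU, and the placeholder tokens are all assumptions made for illustration. The boundary labels would come from the fine-tuned classifier, which is not modelled here.

```python
from typing import List, Tuple

# Hypothetical representation: one (head index, dependency label) pair per token.
Arc = Tuple[int, str]


def resegment(tokens: List[str], boundary: List[int]) -> List[List[str]]:
    """Split a token stream into sentence-like units (SLUs).

    Assumption: boundary[i] == 1 means token i is the last token of an SLU.
    Trailing tokens without a closing boundary form a final unit.
    """
    if len(tokens) != len(boundary):
        raise ValueError("tokens and boundary labels must have equal length")
    units: List[List[str]] = []
    current: List[str] = []
    for token, label in zip(tokens, boundary):
        current.append(token)
        if label == 1:
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


def las(gold: List[Arc], pred: List[Arc]) -> float:
    """Labeled Attachment Score: share of tokens whose predicted head
    and dependency label both match the gold annotation."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted arcs must align token by token")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


if __name__ == "__main__":
    # Placeholder tokens and labels, not data from the ArchiMob corpus.
    tokens = ["a", "b", "c", "d", "e"]
    labels = [0, 1, 0, 0, 1]
    print(resegment(tokens, labels))

    gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
    print(las(gold, pred))
```

In this sketch each unit returned by `resegment` would be passed to the dependency parser as one input sequence, and `las` would be computed over the parser output for the manually annotated test SLUs.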