Contribution Details

Type Journal Article
Scope Discipline-based scholarship
Title FormulaNet: A Benchmark Dataset for Mathematical Formula Detection
Organization Unit
Authors
  • Felix Maximilian Schmitt-Koopmann
  • Elaine May Huang
  • Hans-Peter Hutter
  • Thilo Stadelmann
  • Alireza Darvishy
Item Subtype Original Work
Refereed Yes
Status Published in final form
Language
  • English
Journal Title IEEE Access
Geographical Reach international
Volume 10
Page Range 91588 - 91596
Date 2022
Abstract Text One unsolved sub-task of document analysis is mathematical formula detection (MFD). Research by ourselves and others has shown that existing MFD datasets with inline and display formula labels are small and have insufficient labeling quality. There is therefore an urgent need for datasets with better quality labeling for future research in the MFD field, as they have a high impact on the performance of the models trained on them. We present an advanced labeling pipeline and a new dataset called FormulaNet in this paper. At over 45k pages, we believe that FormulaNet is the largest MFD dataset with inline formula labels. Our experiments demonstrate substantially improved labeling quality for inline and display formulae detection over existing datasets. Additionally, we provide a math formula detection baseline for FormulaNet with an mAP of 0.754. Our dataset is intended to help address the MFD task and may enable the development of new applications, such as making mathematical formulae accessible in PDFs for visually impaired screen reader users.
Free access at DOI
Digital Object Identifier 10.1109/ACCESS.2022.3202639