Semi-Automatic Subject Classification with Basisklassifikation

Principal Investigators:
Dr. José Calvo Tello
Project Manager:
Dr. José Calvo Tello
HPC Platform used:
NHR@Göttingen
Date published:
Introduction:
The goal of this project is to use algorithms to predict classes of the library classification system “Basisklassifikation” (which can be translated as “basic classification”). A library classification system is a taxonomy of predefined classes that represent disciplines, subdisciplines, themes or types of publications. Subject librarians assign one or more of these classes to each publication, so that both end users and retrieval systems can use this annotation to find publications. The input data consists mainly of bibliographic metadata, such as the title, the name of the publisher, the year of publication and the language of the publication. The algorithms should suggest several classes, which are then reviewed by a professional subject librarian, who makes the final decision. In this way both humans and algorithms benefit from the workflow.
Body:

Multilingualism and explainable predictions (ethical AI) are important aspects of this project.
The “Basisklassifikation” is a good candidate for algorithmic training for several reasons. With around 2,000 classes, it is a mid-sized classification system. From a machine learning perspective, however, this number of classes makes it an extreme multi-label classification problem. The “Basisklassifikation” also offers a large training dataset: hundreds of thousands of publications have already been labelled intellectually by subject librarians at the libraries of the German Common Library Network (in German: Gemeinsamer Bibliotheksverbund, or GBV).
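To illustrate what “multi-label” means here, the following minimal sketch turns the classes assigned to each publication into a binary indicator matrix, with one column per class. It is not the project's code: the function names and the three example records are hypothetical, and the notations are used only as sample labels.

```python
from typing import Dict, List, Sequence


def build_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every class notation seen in the data to a column index."""
    classes = sorted({label for labels in label_sets for label in labels})
    return {label: column for column, label in enumerate(classes)}


def to_indicator_matrix(
    label_sets: Sequence[Sequence[str]], label_index: Dict[str, int]
) -> List[List[int]]:
    """Build one row per publication; a 1 marks an assigned class."""
    matrix = []
    for labels in label_sets:
        row = [0] * len(label_index)
        for label in labels:
            row[label_index[label]] = 1
        matrix.append(row)
    return matrix


# Hypothetical records: each publication carries one or more class notations.
records = [
    ["17.97", "18.10"],
    ["54.72"],
    ["06.70", "54.72"],
]

label_index = build_label_index(records)
matrix = to_indicator_matrix(records, label_index)

for labels, row in zip(records, matrix):
    print(labels, "->", row)
```

With around 2,000 classes, each row would have around 2,000 columns and only a handful of ones, which is why a sparse representation would normally be preferred in practice.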
So far, a subset of publications has been extracted from the library database K10plus and preprocessed with several techniques using Python scripts (partly as Jupyter Notebooks). Traditional machine learning algorithms, such as support-vector machines and logistic regression, have been used. Newer language models such as Bidirectional Encoder Representations from Transformers (BERT) have been applied in different steps of the work. The preliminary results show that both traditional and newer approaches outperform a random baseline. The models are also able to handle data in several languages.
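A comparison against a random baseline could be sketched as follows. This is an illustration under stated assumptions, not the project's evaluation: the text does not say which metric or baseline construction was used, so precision at k is assumed here, and the class list, the scores and the assigned classes are invented for the example.

```python
import random
from typing import Dict, List, Sequence, Set


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k class notations with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def precision_at_k(suggested: Sequence[str], gold: Set[str]) -> float:
    """Share of suggested classes that the librarian actually assigned."""
    if not suggested:
        return 0.0
    hits = sum(1 for label in suggested if label in gold)
    return hits / len(suggested)


def random_suggestions(classes: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """Baseline: suggest k distinct classes chosen uniformly at random."""
    return rng.sample(list(classes), k)


# Hypothetical data for a single publication.
classes = ["06.70", "17.97", "18.10", "54.72", "86.00"]
scores = {"54.72": 0.81, "06.70": 0.40, "18.10": 0.05, "17.97": 0.03, "86.00": 0.01}
gold = {"54.72", "06.70"}  # classes assigned by a subject librarian (invented)
k = 3

model_suggestions = top_k(scores, k)
model_precision = precision_at_k(model_suggestions, gold)

# Average the random baseline over many draws to get a stable estimate.
rng = random.Random(0)
trials = 2000
baseline_precision = sum(
    precision_at_k(random_suggestions(classes, k, rng), gold) for _ in range(trials)
) / trials

print("model suggestions:", model_suggestions)
print("model precision@k:", round(model_precision, 3))
print("random baseline precision@k:", round(baseline_precision, 3))
```

Ranking the classes and cutting the list at k also matches the intended workflow, in which the algorithm proposes several classes and a subject librarian makes the final decision.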
Working with High Performance Computing (HPC) poses several challenges for this Digital Humanities project. The workflows used in other contexts (Git repositories and Jupyter Notebooks) are currently not compatible with HPC nodes and need to be adapted. HPC usage in the Digital Humanities requires interactive access to the computing resources, because debugging and trying out new algorithms are key parts of the work. More flexible solutions, closer to Colab, could make these services much more user-friendly for researchers with a Humanities background.
 

Institute / Institutes:
Göttingen State and University Library (SUB Göttingen)
Affiliation:
Universität Göttingen
Image:
Logo Göttingen