SuperGLEBer - The first comprehensive German-language benchmark for LLMs

Principal Investigators:
Prof. Dr. Andreas Hotho
Project Manager:
Jan Pfister
HPC Platform used:
NHR@FAU: Alex GPU cluster
Date published:
Introduction:
Large Language Models (LLMs) are continuously being developed and improved, and there is no shortage of benchmarks quantifying how well they work; benchmarking has long been standard practice in the NLP research community. However, most of these benchmarks are not designed for German-language LLMs. We therefore assembled a broad Natural Language Understanding benchmark suite for the German language and evaluated a wide array of existing German-capable models, allowing us to comprehensively chart the landscape of German LLMs.
Body:

Large Language Models (LLMs) are continuously being developed and improved, and there is no shortage of benchmarks quantifying how well they work; benchmarking has long been standard practice in the NLP research community. However, most of these benchmarks are not designed for German-language LLMs: until now, the few evaluations of German models have commonly relied on machine-translated English datasets, which is less than ideal.

To build a better understanding of the current state of German LLMs, we assembled a broad Natural Language Understanding benchmark suite for the German language and evaluated a wide array of existing German-capable models on it. The benchmark comprises 29 tasks spanning several types, including document classification, sequence tagging, sentence similarity, and question answering, on which we evaluate 10 German-pretrained models. This allowed us to comprehensively chart the landscape of German LLMs.

We found that encoder models are a good choice for most tasks, but also that the largest encoder model does not necessarily perform best on every task. The benchmark suite, code, and a leaderboard are available at https://supergleber.professor-x.de, open to public scrutiny, discussion, and extension.
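To illustrate how per-task results on such a suite can be condensed into a single leaderboard figure, here is a minimal sketch using made-up model names and scores (not real SuperGLEBer results); the actual aggregation used by the benchmark may differ. It averages within each task type first, so that task types with many datasets do not dominate the overall score.

```python
# Hypothetical per-task scores (NOT real SuperGLEBer results) for two
# illustrative models, grouped by the four task types in the benchmark.
scores = {
    "model-a": {
        "classification": [0.81, 0.77],
        "sequence_tagging": [0.85],
        "sentence_similarity": [0.74],
        "question_answering": [0.69],
    },
    "model-b": {
        "classification": [0.79, 0.80],
        "sequence_tagging": [0.88],
        "sentence_similarity": [0.71],
        "question_answering": [0.73],
    },
}

def leaderboard(scores):
    """Rank models by the unweighted mean over task-type averages."""
    rows = []
    for model, per_type in scores.items():
        # Average within each task type, then across types.
        type_means = [sum(v) / len(v) for v in per_type.values()]
        rows.append((model, sum(type_means) / len(type_means)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

for model, avg in leaderboard(scores):
    print(f"{model}: {avg:.3f}")
```

With these toy numbers, the two-stage averaging rewards balanced performance across task types rather than strength on a single well-represented type.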

To obtain the results in our paper, we used the A100 GPUs with 80 GiB of memory in the Alex cluster at NHR@FAU.

Read the full paper: J. Pfister and A. Hotho, *SuperGLEBer: German Language Understanding Evaluation Benchmark*, NAACL 2024. DOI: 10.18653/v1/2024.naacl-long.438

Institute / Institutes:
Data Science Chair, Center for Artificial Intelligence and Data Science (CAIDAS)
Affiliation:
Julius-Maximilians-Universität Würzburg (JMU)