Open GPT-X - Evaluating the Performance of Large Language Models
- Principal Investigators:
- René Jäkel
- Project Manager:
- Nicolas Flores-Herr
- HPC Platform used:
- NHR@TUD Barnard + Alpha + Capella
- Project ID:
- p_gptx
- Date published:
- Researchers:
- Lena Jurkschat, Lalith Manjunath, Klaudia-Doris Thellmann
- Introduction:
-
OpenGPT-X set out to create and train open large language models (LLMs) for European languages. Existing language models focus primarily on English and therefore perform unfavourably on most other commonly spoken European languages.
From large-scale benchmarking of multilingual LLMs to the introduction of the Teuken-7B models, our research shows how tokenization and balanced datasets enhance cross-lingual performance. Join us in exploring transparent and reproducible innovations shaping the future of multilingual AI.
- Body:
-
The evaluation of Large Language Models (LLMs) can be conducted from various perspectives.
Within OpenGPT-X, TUD addressed several performance-relevant directions in its analyses, including pre-training speed and sharding efficiency. This involved determining optimal parameters that adhere to established scaling laws in order to achieve the best possible pre-training loss within the given compute budget. The pre-trained model then underwent a comprehensive qualitative assessment of its generalization capabilities and effectiveness across a wide range of downstream tasks.
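To illustrate the kind of calculation such scaling laws enable, the sketch below applies the widely cited compute-optimal heuristics from [2] (C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens per parameter). This is a minimal illustration of the published rules of thumb, not the project's actual parameter-selection code.

```python
def chinchilla_optimal(compute_budget_flops: float) -> tuple:
    """Compute-optimal parameter count N and token count D under the
    heuristics C ~= 6*N*D and D ~= 20*N from Hoffmann et al. [2].

    Substituting D = 20*N into C = 6*N*D gives C = 120*N**2.
    """
    n_params = (compute_budget_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: the budget under which a ~7B-parameter model is compute-optimal,
# i.e. C = 6 * N * D with N = 7e9 and D = 20 * 7e9 = 1.4e11 tokens.
n, d = chinchilla_optimal(6 * 7e9 * 1.4e11)
print(f"{n:.2e} params, {d:.2e} tokens")  # 7.00e+09 params, 1.40e+11 tokens
```

Fixing the compute budget in this way is what lets the model size and the data volume be traded off against each other before any expensive training run starts.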
In preparation for the model training runs, we investigated the impact of parallelization techniques such as tensor, data, and pipeline parallelism to ensure maximum pre-training throughput. To gain further insight into pre-training bottlenecks, we integrated the Score-P performance measurement infrastructure into our training code base.
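The way the three parallelism degrees combine can be sketched as follows. The function name and the concrete degrees are illustrative, not the configuration used in the project: one model replica is sharded across tensor-parallel and pipeline-parallel GPUs, and data parallelism replicates that group.

```python
def parallel_layout(tensor_parallel: int, pipeline_parallel: int,
                    data_parallel: int) -> dict:
    """Return the GPU layout implied by the three parallelism degrees.

    One full model replica spans tensor_parallel * pipeline_parallel GPUs;
    data parallelism replicates that group data_parallel times, so the
    product of all three degrees must match the GPU allocation.
    """
    gpus_per_replica = tensor_parallel * pipeline_parallel
    return {
        "gpus_per_replica": gpus_per_replica,
        "total_gpus": gpus_per_replica * data_parallel,
    }

# Illustrative degrees only: 2-way tensor, 4-way pipeline, 8-way data.
layout = parallel_layout(tensor_parallel=2, pipeline_parallel=4, data_parallel=8)
print(layout)  # {'gpus_per_replica': 8, 'total_gpus': 64}
```

Throughput tuning then amounts to searching over such degree combinations for a fixed GPU count, since each degree stresses a different communication pattern.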
For later model usage, we also benchmarked several LLM inference frameworks with respect to their text generation speed, aiming to identify the factors that determine LLM inference performance, including model setup time, decoding efficiency, and parallelization efficiency.
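A minimal sketch of such a throughput measurement is shown below. The `generate` callable is a stand-in, since each inference framework exposes generation through its own API; the toy generator exists only so the harness runs anywhere.

```python
import time

def decoding_throughput(generate, prompt: str, max_new_tokens: int) -> dict:
    """Measure end-to-end generation speed in tokens per second.

    `generate` is any callable returning a list of generated tokens;
    it stands in for a framework-specific generation API.
    """
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return {"tokens": len(tokens),
            "tokens_per_second": len(tokens) / elapsed}

# Toy stand-in "model" with a small artificial delay per call.
def toy_generate(prompt, max_new_tokens):
    time.sleep(0.01)
    return [f"tok{i}" for i in range(max_new_tokens)]

stats = decoding_throughput(toy_generate, "Hello", 128)
print(stats["tokens"])  # 128
```

A real comparison would additionally time the first token separately from the steady-state decode, since setup cost and per-token cost are the two factors the frameworks differ on.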
To bridge hardware performance and model quality, we conducted a case study of the scaling behaviour [1, 2] of a German 7-billion-parameter model, aiming for the smallest possible pre-training loss given a fixed amount of pre-training data and compute budget. Investigating the interplay between model size and the amount of pre-training data made the training process more efficient and reduced trial-and-error runs.
In addition to performance benchmarking and the monitoring of model pre-training, we conducted extensive evaluations of LLMs on downstream tasks. We developed evaluation pipelines to assess over 40 state-of-the-art LLMs of various sizes using the newly introduced EU20 benchmarks [3, 4]. These benchmarks enable efficient, scalable, and standardized evaluations across 20 European languages, providing a thorough assessment of cross-lingual performance. Furthermore, we validated their reliability through comparisons with human judgments, demonstrating that our methodology effectively captures meaningful trends in multilingual LLM performance. The results of the multilingual benchmarks are published on a multilingual leaderboard hosted on Hugging Face [5].
In a separate study, we investigated the often-overlooked impact of tokenizer choices on the training and downstream performance of LLMs [6]. While much research has focused on model architecture, data scaling, and pre-training objectives, tokenization strategies remain underexplored. Our study included intrinsic and extrinsic performance evaluations of 24 mono- and multilingual LLMs trained with various tokenizer algorithms and configurations. We found that misalignment between tokenizer design and multilingual datasets increases training costs by up to 68% and degrades performance, particularly in low-resource languages.
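One standard intrinsic metric in such tokenizer evaluations is fertility, the average number of tokens produced per word: higher fertility means longer sequences and therefore higher training cost. The sketch below is illustrative only; the toy character-bigram tokenizer stands in for real subword tokenizers, and the German example merely shows how compound-heavy languages can inflate fertility under a mismatched tokenizer.

```python
def fertility(tokenize, texts) -> float:
    """Average number of tokens per whitespace-separated word.

    `tokenize` is a stand-in for any subword tokenizer that maps a
    string to a list of tokens.
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy "tokenizer": split every word into character bigrams.
def toy_tokenize(text):
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

# Long German compounds yield many tokens per word under this scheme.
print(fertility(toy_tokenize, ["Donaudampfschiff fährt"]))  # 5.5
```

Comparing such per-language fertility values across candidate tokenizers is one way misalignment between tokenizer design and a multilingual corpus becomes measurable before training starts.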
Building on the findings from the tokenizer study and additional research on dataset compilation [7] and fine-tuning strategies [8], we introduced the Teuken-7B-Base and Teuken-7B-Instruct models [9, 10]. Teuken-7B-Base was pre-trained from scratch on a balanced dataset with 60% non-English data, while Teuken-7B-Instruct is an instruction-tuned version optimized for downstream tasks, in which the pre-trained base model is fine-tuned on high-quality instructions. These models demonstrate how balanced language representation improves training efficiency and cross-lingual generalization.
All these studies emphasize open and transparent practices in model development, enabling reproducibility and adaptation by diverse communities. They detail every aspect of the development process, including tokenizer design, dataset curation, model architecture, and evaluation pipelines. Collectively, our work represents significant advancements in evaluation frameworks, tokenization strategies, and fine-tuning methodologies for multilingual LLMs.
References:
[1] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, Scaling Laws for Neural Language Models, 2020, https://arxiv.org/abs/2001.08361
[2] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre, Training Compute-Optimal Large Language Models, 2022, https://arxiv.org/abs/2203.15556
[3] Klaudia Thellmann*, Bernhard Stadler*, Michael Fromm*, Jasper Schulze Buschhoff*, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, Mehdi Ali, Towards Multilingual LLM Evaluation for European Languages, October 2024, https://arxiv.org/abs/2410.08928
[4] OpenGPT-X team publishes its European LLM Leaderboard, July 2024, https://tu-dresden.de/zih/das-department/news/european-llm-leaderboard-of-opengptx
[5] European LLM Leaderboard, https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard
[6] Mehdi Ali*, Michael Fromm*, Klaudia Thellmann*, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze Buschhoff, Charvi Jain, Alexander Arno Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, Nicolas Flores-Herr, Tokenizer Choice For LLM Training: Negligible or Crucial?, June 2024, https://doi.org/10.18653/v1/2024.findings-naacl.247, Association for Computational Linguistics
[7] Nicolo' Brandizzi, Hammam Abdelwahab, Anirban Bhowmick, Lennard Helmer, Benny Jörg Stein, Pavel Denisov, Qasid Saleem, Michael Fromm, Mehdi Ali, Richard Rutmann, Farzad Naderi, Mohamad Saif Agy, Alexander Schwirjow, Fabian Küch, Luzian Hahn, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Dennis Wegener, Nicolas Flores-Herr, Joachim Köhler, Johannes Leveling, Data Processing for the OpenGPT-X Model Family, October 2024, https://arxiv.org/abs/2410.08800
[8] Alexander Arno Weber, Klaudia Thellmann, Jan Ebert, Nicolas Flores-Herr, Jens Lehmann, Michael Fromm, Mehdi Ali, Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand Multilingual Instructions?, November 2024, https://doi.org/10.18653/v1/2024.emnlp-main.1159, Association for Computational Linguistics
[9] Mehdi Ali*, Michael Fromm*, Klaudia Thellmann*, Jan Ebert*, Alexander Arno Weber*, Richard Rutmann, Charvi Jain, Max Lübbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff, Lena Jurkschat, Hammam Abdelwahab, Benny Jörg Stein, Karl-Heinz Sylla, Pavel Denisov, Nicolo' Brandizzi, Qasid Saleem, Anirban Bhowmick, Lennard Helmer, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Alex Jude, Lalith Manjunath, Samuel Weinbach, Carolin Penke, Oleg Filatov, Shima Asaadi, Fabio Barth, Rafet Sifa, Fabian Küch, Andreas Herten, René Jäkel, Georg Rehm, Stefan Kesselheim, Joachim Köhler, Nicolas Flores-Herr, Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs, October 2024, https://arxiv.org/abs/2410.03730
[10] Multilingual and Open Source: OpenGPT-X Research Project Releases Large Language Model, November 2024, https://tu-dresden.de/tu-dresden/newsportal/news/mehrsprachig-und-open-source-forschungsprojekt-opengpt-x-veroeffentlicht-grosses-ki-sprachmodell
- Institute / Institutes:
Technische Universität Dresden
- Affiliation:
Technische Universität Dresden
- Image:
-