
by Professor Jan Hajič, Charles University, Prague
What is HPLT about?
The HPLT (High-Performance Language Technology) project aims to transform the current language research and innovation landscape by providing extensive quantities of web-crawled data. Alongside CommonCrawl, which has already been processed and used to train open-source large language models (LLMs), the HPLT project is gathering unique data by tapping into a previously unused source, the Internet Archive. We will also release the first generation of LLMs and machine translation (MT) models trained on the newly built massive data collection.
The 36-month project has reached its halfway point, with many tangible outputs already delivered, and will run until September 2025.