CESAR Aligned Wikipedia Headwords List

Source name: 
Author: 
University of Zagreb, Faculty of Humanities and Social Sciences
CESAR Project (http://www.cesar-project.net/)
Description: 
The entries of the lexicon are built from the Wikipedia dumps of the six CESAR languages by using article titles and interlingual links to English and the remaining five languages.
Resource type: 
terminology resource
Resource availability: 
available for commercial use
available for research purposes
free
Can the resource be directly downloaded?: 
Yes
Modality: 
text
Size: 
762,662 Entries
Production date: 
2013
Domain: 

Rating and Comments

Rate this resource: 
3

Average: 3 (1 vote)

Unzip with 7-zip.
Two formats are included: TXT (tab-separated) and XLSX; both contain the same data.
Simple one-to-one termlist with frequency counts per target language. 
Not every term is translated in all languages.
762,663 entries
Very easy to use as termlist.
As with any term list, beware that using it for MT training could reduce the fluency of the output.

For Mac users, unzip with "The Unarchiver" tool.
The lexicon contains 762,662 lines (each one represents one entry).
These entries were merged into a single list after building one lexicon for each language.

There are 2 files in the corpus folder, CESAR.wikipedia.lexicon.txt and CESAR.wikipedia.lexicon.xlsx.
Each line of the 2 files is formatted as follows:

cat                                            en    bg    bg_freq    hr    hr_freq    hu    hu_freq    pl    pl_freq    sk    sk_freq    sr    sr_freq

Example:
Large integers; Integers; Units of amount    Googol    Гугол    0    Gugol    0    Googol    1    Googol    0    Googol    0    Гугол    0
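
For anyone loading the TXT file in a script, here is a minimal Python sketch of reading the column layout shown above. It rests on assumptions not confirmed by the resource itself: the file is UTF-8 encoded, untranslated terms appear as empty cells, the frequency columns hold plain integers, and a header line (if present) starts with "cat". The names parse_line, translations and read_lexicon are made up for this illustration.

```python
# Column order as shown in the format line above.
COLUMNS = [
    "cat", "en",
    "bg", "bg_freq", "hr", "hr_freq", "hu", "hu_freq",
    "pl", "pl_freq", "sk", "sk_freq", "sr", "sr_freq",
]
LANGS = ["bg", "hr", "hu", "pl", "sk", "sr"]


def parse_line(line):
    """Split one tab-separated line into a dict keyed by column name."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    entry = dict(zip(COLUMNS, fields))
    # Categories are separated by ";" in the example entry.
    entry["cat"] = [c.strip() for c in entry["cat"].split(";") if c.strip()]
    for lang in LANGS:
        raw = entry[lang + "_freq"].strip()
        # Assumption: an empty frequency cell means "no value".
        entry[lang + "_freq"] = int(raw) if raw else None
    return entry


def translations(entry):
    """Return only the languages that have a term for this entry."""
    # Assumption: a missing translation is an empty cell.
    return {lang: entry[lang] for lang in LANGS if entry[lang].strip()}


def read_lexicon(path):
    """Yield parsed entries from the TXT file (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Skip blank lines and a possible header row.
            if not line.strip() or line.startswith("cat\t"):
                continue
            yield parse_line(line)


sample = "\t".join([
    "Large integers; Integers; Units of amount", "Googol",
    "Гугол", "0", "Gugol", "0", "Googol", "1",
    "Googol", "0", "Googol", "0", "Гугол", "0",
])
entry = parse_line(sample)
print(entry["en"], entry["hu_freq"], translations(entry)["hr"])
# prints: Googol 1 Gugol
```

To build a bilingual term list for one language pair, iterate over read_lexicon("CESAR.wikipedia.lexicon.txt") and keep the entries where translations(entry) contains the target language.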