Current state
1. The number of words (in thousands) per thematic area and language
that make up the IULA corpus is shown in the following table:
| Area | Catalan | Spanish | English | French | German | Total |
| Law | 1518 | 2137 | 548 | 44 | 16 | |
| Economics | 1821 | 1000 | 285 | 78 | 27 | |
| Environment | 1487 | 1000 | 586 | 230 | 429 | 3733 |
| Medicine | 1487 | 1910 | 303 | 27 | 198 | |
| Computer Science | 654 | 1024 | 363 | 194 | 83 | |
| Total | 7071 |
Table 1. Number of words (in thousands) per language and thematic area
2. These words are distributed over the number of documents
shown in Table 2:
| Area | Catalan | Spanish | English | French | German | Total |
| Law | 186 | 254 | 113 | 10 | 60 | 623 |
| Economics | 81 | 49 | 17 | 8 | 1 | 156 |
| Environment | 77 | 46 | 86 | 22 | 61 | 292 |
| Medicine | 101 | 71 | 11 | 3 | 27 | 213 |
| Computer Science | 40 | 87 | 30 | 6 | 8 | 181 |
| Total | 486 | 507 | 257 | 49 | 157 | 1465 |
Table 2. Number of documents per language and thematic area
3. Because building a parallel corpus was one of the initial goals, part of
the corpus consists of parallel texts. At present, the main part of the parallel
corpus consists of Catalan-Spanish, Catalan-English and Spanish-English
text pairs. The amount of data covered by the parallel corpus is shown
in Table 3.
| Area | Catalan-Spanish Doc. | Catalan-Spanish Words | Catalan-English Doc. | Catalan-English Words | Spanish-English Doc. | Spanish-English Words |
| Law | 61 | 460 | 1 | 12 | 2 | 57 |
| Economics | 21 | 600 | 10 | 250 | 13 | 283 |
| Environment | 10 | 214 | 11 | 213 | 13 | 144 |
| Medicine | 4 | 108 | 1 | 40 | 4 | 125 |
| Computer Science | 1 | 28 | - | - | 23 | 300 |
| Total | 97 | 1410 | 23 | 515 | 55 | 909 |
Table 3. Parallel corpus documents and number of words per thematic area
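The totals row of the parallel-corpus table can be recomputed from the per-area figures. The following is a minimal sketch, not part of the corpus tooling: the names `PARALLEL` and `pair_totals` and the language-pair keys are hypothetical, the "-" cells are treated as missing data, and the word counts are assumed to be in thousands, as in the first table.

```python
# Per-area figures of the parallel corpus: (documents, words) per language pair.
# Word counts are assumed to be in thousands; None stands for a "-" cell.
PARALLEL = {
    "Law":              {"ca-es": (61, 460), "ca-en": (1, 12),   "es-en": (2, 57)},
    "Economics":        {"ca-es": (21, 600), "ca-en": (10, 250), "es-en": (13, 283)},
    "Environment":      {"ca-es": (10, 214), "ca-en": (11, 213), "es-en": (13, 144)},
    "Medicine":         {"ca-es": (4, 108),  "ca-en": (1, 40),   "es-en": (4, 125)},
    "Computer Science": {"ca-es": (1, 28),   "ca-en": None,      "es-en": (23, 300)},
}


def pair_totals(table):
    """Sum documents and words per language pair, skipping missing cells."""
    totals = {}
    for pairs in table.values():
        for pair, cell in pairs.items():
            if cell is None:  # no data for this area and language pair
                continue
            docs, words = cell
            prev_docs, prev_words = totals.get(pair, (0, 0))
            totals[pair] = (prev_docs + docs, prev_words + words)
    return totals


TOTALS = pair_totals(PARALLEL)
for pair, (docs, words) in TOTALS.items():
    print(f"{pair}: {docs} documents, {words} words")
```

Treating the "-" cells as missing rather than zero leaves the sums unchanged but keeps the distinction between "no data" and "zero documents" visible in the data structure.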
4. Finally, the corpus contains a subcorpus of general-purpose language
extracted from newspapers. The Catalan data was taken
from the online version of the newspaper L'Avui. For the Spanish
data, the digital edition of the newspaper El País was used
in addition to the texts contained in Corpus-92, which consists of selected
examples. The current number of documents and words in this contrastive corpus is
shown in Table 4:
| Area | Catalan Doc. | Catalan Words | Spanish Doc. | Spanish Words | Total Doc. | Total Words |
| General | 199 | 9434 | 46 | 1977 | 245 | 11411 |