Multilingual Corpus: specialized language

Current state

1. The following table shows the number of words (in thousands) per thematic area and language that make up the IULA corpus:
 
Area                  Catalan   Spanish   English    French    German     Total

Law                      1518      2137       548        44        16
Economics                1821      1000       285        78        27
Environment              1487      1000       586       230       429      3733
Medicine                 1487      1910       303        27       198
Computer Science          654      1024       363       194        83

Total                              7071

Table 1. Number of words (in thousands) per language and thematic area


2. These words are distributed across the numbers of documents shown in Table 2:
 
Area                  Catalan   Spanish   English    French    German     Total

Law                       186       254       113        10        60       623
Economics                  81        49        17         8         1       156
Environment                77        46        86        22        61       292
Medicine                  101        71        11         3        27       213
Computer Science           40        87        30         6         8       181

Total                     486       507       257        49       157      1465

Table 2. Number of documents per language and thematic area


3. Since building a parallel corpus was one of the initial goals, part of the corpus consists of parallel texts. At present, the most important part of the parallel corpus consists of Catalan-Spanish, Catalan-English and Spanish-English counterparts. Table 3 shows the amount of data covered by the parallel corpus.
 
Area                  Catalan-Spanish   Catalan-English   Spanish-English
                        Doc.    Words     Doc.    Words     Doc.    Words

Law                       61      460        1       12        2       57
Economics                 21      600       10      250       13      283
Environment               10      214       11      213       13      144
Medicine                   4      108        1       40        4      125
Computer Science           1       28        -        -       23      300

Total                     97     1410       23      515       55      909

Table 3. Number of parallel corpus documents and words per thematic area
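The totals of the parallel corpus table can be recomputed from the per-area figures. The following is a minimal Python sketch, not part of the corpus tooling: the names `PAIRS`, `TABLE_3` and `column_totals` are hypothetical and introduced only for illustration, and the numbers are copied from the table above, with the empty Catalan-English cell for Computer Science represented as `None`.

```python
# Figures copied from the parallel corpus table; each cell is (documents, words).
# None marks the empty Catalan-English cell for Computer Science.
PAIRS = ("Catalan-Spanish", "Catalan-English", "Spanish-English")

TABLE_3 = {
    "Law": ((61, 460), (1, 12), (2, 57)),
    "Economics": ((21, 600), (10, 250), (13, 283)),
    "Environment": ((10, 214), (11, 213), (13, 144)),
    "Medicine": ((4, 108), (1, 40), (4, 125)),
    "Computer Science": ((1, 28), None, (23, 300)),
}


def column_totals(table):
    """Sum documents and words per language pair, skipping empty cells."""
    totals = {pair: [0, 0] for pair in PAIRS}
    for cells in table.values():
        for pair, cell in zip(PAIRS, cells):
            if cell is None:
                continue
            docs, words = cell
            totals[pair][0] += docs
            totals[pair][1] += words
    return {pair: tuple(values) for pair, values in totals.items()}


if __name__ == "__main__":
    for pair, (docs, words) in column_totals(TABLE_3).items():
        print(f"{pair}: {docs} documents, {words} words")
```

With the figures above, the computed sums agree with the Total row of the table (97 and 1410, 23 and 515, 55 and 909).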


4. Finally, the corpus contains a subcorpus of general-purpose language, which was extracted from printed mass media. The Catalan data was extracted from the online version of the newspaper L'Avui. For the Spanish data, the digital edition of the newspaper El País was used in addition to the texts contained in Corpus-92, which consists of selected examples. The current numbers of documents and words in this contrastive corpus are shown in Table 4:
 
 

Area                          Catalan           Spanish             Total
                        Doc.    Words     Doc.    Words     Doc.    Words

General                  199     9434       46     1977      245    11411

Table 4. Number of documents and words in the general-purpose subcorpus