Corpus Project: textual, plurilingual, specialized corpus
 
Law

Economics

Environmental science

Medicine

Computer science

The main goal of the Corpus project is to build and exploit a textual, plurilingual, specialized corpus. The languages involved are Catalan, Spanish, English, German and French. The areas of interest are economics, law, computer science, medicine and environmental science. The corpus is the main resource supporting teaching and research at our institute. Research activities envisaged for the corpus include terminology detection, alignment of parallel texts, partial parsing, (semi)automatic extraction of several levels of linguistic information for building computational systems (for example, subcategorization patterns), and language variation studies.

Texts are selected and classified according to topics proposed by specialists in each area (Law, Economics, Environmental science, Medicine and Computer science). The texts are then marked up in SGML, following the guidelines of the "Corpus Encoding Standard (CES)" of the EAGLES initiative. (Current state)
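
As a rough illustration of the markup step, the sketch below wraps plain-text paragraphs in a simplified, CES-inspired SGML skeleton. It is not the project's actual tool. The function names, the `id` and `lang` attributes, the paragraph numbering scheme and the comment that stands in for the header contents are all assumptions made for this example; a real CES header carries far more detail.

```python
def sgml_escape(text):
    # Escape the characters that would otherwise be read as markup.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def wrap_document(doc_id, lang, area, paragraphs):
    """Wrap paragraphs in a simplified, CES-inspired skeleton (hypothetical layout)."""
    lines = [
        f'<cesDoc id="{doc_id}" lang="{lang}">',
        "<cesHeader>",
        # Placeholder only: a real header would hold bibliographic details.
        f"<!-- subject area: {area} -->",
        "</cesHeader>",
        "<text>",
        "<body>",
    ]
    # Number paragraphs so later annotation layers can refer back to them.
    for n, para in enumerate(paragraphs, start=1):
        lines.append(f'<p id="p{n}">{sgml_escape(para.strip())}</p>')
    lines += ["</body>", "</text>", "</cesDoc>"]
    return "\n".join(lines)


sample = wrap_document(
    "doc001", "en", "Medicine",
    ["Dose < 5 mg & rest.", "Second paragraph."],
)
print(sample)
```

Escaping `&`, `<` and `>` before wrapping keeps characters in the source text from being mistaken for markup.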

Text processing includes the following steps: 

Some examples of processed corpus documents at different stages

CREL examples

Corpus processing tools

Bwananet. The IULA LSP Corpus Browser

SEXTAN: A neologism detector using the Corpus

Related Publications

Other people and institutions who contributed to the Corpus project

Main researcher: M. Teresa Cabré Castellví 

Coordinator: Jorge Vivaldi Palatresi

© INSTITUT UNIVERSITARI DE LINGÜÍSTICA APLICADA - UNIVERSITAT POMPEU FABRA, Roc Boronat 138, 08018 Barcelona