The main goal of the Corpus project is to build and exploit a specialized, plurilingual text corpus. The languages involved are Catalan, Spanish, English, German and French. The areas of interest are economics, law, computer science, medicine, environmental science and linguistic sciences. This corpus is the main support for teaching and research at our institute. Research activities based on the corpus include terminology detection, parallel text alignment, partial parsing, (semi)automatic extraction of several levels of linguistic information for building computational systems (for example, subcategorization patterns), and language variation studies.
Texts are selected and classified according to topics proposed by specialists in each area (law, economics, environmental science, medicine, computer science and linguistic sciences). The texts are then tagged in SGML, following the guidelines of the "Corpus Encoding Standard (CES)" of the EAGLES initiative.
Within the framework of the METANET4U project (2011-2013): (1) corpus processing was adapted to the new directives of the LAF standard (Language resource management -- Linguistic annotation framework, ISO 24612:2012), that is, XML format and "stand-off" markup, and (2) more than 42,000 Spanish sentences of the corpus were syntactically annotated.
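As a general illustration of what "stand-off" markup means, the following minimal Python sketch keeps the primary text untouched and stores annotations separately as character-offset spans. It is not the project's actual LAF serialization: the `Annotation` class, the `resolve` helper, the sample sentence and the tag labels are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text instead of wrapping it."""
    start: int  # inclusive character offset into the primary text
    end: int    # exclusive character offset
    label: str  # hypothetical tag name, not taken from the corpus tagset


def resolve(text: str, ann: Annotation) -> str:
    """Return the stretch of primary text that an annotation refers to."""
    return text[ann.start:ann.end]


# The primary text is stored once and never modified.
primary_text = "The corpus is plurilingual."

# Annotations live apart from the text and reference it by offsets only.
annotations = [
    Annotation(0, 3, "DET"),
    Annotation(4, 10, "NOUN"),
    Annotation(11, 13, "VERB"),
    Annotation(14, 26, "ADJ"),
]

for ann in annotations:
    print(f"{ann.label}\t{resolve(primary_text, ann)}")
```

Because the annotations only reference offsets, several independent annotation layers (for example, part-of-speech tags and syntactic structure) can be attached to the same unchanged text.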
Text processing includes the following steps:
Access to IULA Corpus tècnic is available online:
Corpus processing tools
SEXTAN: A neologism detector using the Corpus
Main researcher: M. Teresa Cabré Castellví
Coordinator: Jorge Vivaldi Palatresi
© INSTITUT UNIVERSITARI DE LINGÜÍSTICA APLICADA - UNIVERSITAT POMPEU FABRA, Roc Boronat 138, 08018 Barcelona