| Law, Economics, Environmental science, Medicine, Computer science |
The main goal of the Corpus project
is the construction and exploitation of a textual,
plurilingual and specialized corpus. The languages
involved are the following: Catalan, Spanish, English,
German and French. The areas of interest include:
economics, law, computer science, medicine and environmental
science. This corpus is the main support for teaching
and research at our institute. Some of the research
activities planned for this corpus include
the following: terminology detection, parallel
text alignment, partial parsing, (semi)automatic
extraction of several levels of linguistic information
for building computational systems (for example,
subcategorization patterns), and language variation
studies. Texts are selected and classified
according to topics proposed by specialists in
each area (Law, Economics,
Environmental science,
Medicine and Computer
science). The texts are then marked up in SGML,
following the guidelines of the "Corpus
Encoding Standard (CES)"
of the EAGLES
initiative. (Current state)
Text processing includes the following steps:
|