Presentation

TEXTERM3: Basis, strategies and tools for automatic extraction and processing of specialized information

The extraction, retrieval and automatic management of information from textual corpora require multiple natural language processing (NLP) tools. Over recent decades, national and foreign research groups have pursued theoretical and applied research in this field and have developed applications for it. While some of these resources rely on statistical strategies to extract information, others are based on linguistic information. Within the linguistic approach, the better the data descriptions a system is based on, the better the extracted information will be and the more adequate its processing. A system will also be more user-friendly and efficient if it is integrated into the same platform as other complementary tools.
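To illustrate the statistical strategies mentioned above, here is a minimal, hypothetical sketch (not a system produced by this project) of frequency-based term-candidate extraction: recurring n-grams in a corpus are proposed as candidate terms. All names and thresholds are illustrative assumptions.

```python
import re
from collections import Counter

def candidate_terms(text, n=2, min_freq=2):
    """Statistical sketch: propose frequent word n-grams as term candidates.

    A real extractor would add linguistic filtering (POS patterns, stopword
    removal, statistical association measures); this only counts n-grams.
    """
    # Lowercase word tokens (includes Spanish accented characters).
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in ngrams)
    return [t for t, c in counts.items() if c >= min_freq]
```

A purely statistical pass like this over-generates (e.g. frequent but non-terminological word pairs survive), which is why the linguistic approach described above, grounded in richer data descriptions, can yield better results.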

This project situates itself within, and continues, the line of work on NLP and information extraction from scientific-technical corpora pursued since 1994 by the Research Group on Lexicon, Terminology and Specialized Discourse (IULATERM), for which the group has received support from the Plan Nacional (Spanish Government), the Plan de la Comunidad Autónoma (Catalan Government) and the European Union.

The objectives established in this project are divided into theoretical-descriptive and applied-technological ones. From the theoretical-descriptive perspective, oriented towards automatic information extraction, our main objectives are the following: a) to deepen the analysis of the different types of terminological units and of the contextual clues that allow them to be identified; b) to refine and broaden the analysis of the units expressing relations between terminological units; and c) to establish the features that characterize specialized texts and discriminate them from texts not considered specialized. From the applied-technological perspective, aimed at building technological applications on top of already-created resources, our objectives are: a) to use linguistic descriptions and semantic classifications to improve YATE (an automatic terminology extraction system) and to extend its scope to new scientific-technical fields; b) to develop a system for the automatic retrieval and classification of texts from the internet based on criteria of subject pertinence and specialized density (aimed at keeping the corpora of different specialized fields up to date); and c) to improve the system for automatically representing the knowledge structure of texts (automatic construction of ontologies and concept maps).
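The "specialized density" criterion mentioned in objective b) can be sketched as the proportion of a text's tokens that belong to a known terminology list; texts above a threshold would be kept for a specialized corpus. This is a hypothetical illustration of the idea, not the project's actual classifier; the function names, the term list and the threshold are all assumptions.

```python
def specialized_density(tokens, term_set):
    """Fraction of tokens found in a known terminology list (illustrative)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in term_set)
    return hits / len(tokens)

def classify(text, term_set, threshold=0.1):
    """Label a text as specialized or general by its term density.

    The 0.1 threshold is an arbitrary placeholder; a real system would tune
    it per field and combine it with subject-pertinence criteria.
    """
    tokens = text.split()
    if specialized_density(tokens, term_set) >= threshold:
        return "specialized"
    return "general"
```

For example, with a small medical term list such as `{"myocardial", "infarction", "reperfusion"}`, a clinical sentence scores a high density and is labeled specialized, while everyday prose is not.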

It is worth noting that all theoretical and applied studies will be strongly interrelated, creating a synergy that will help to study and solve the questions arising from each application. To emphasize the applied side of the proposed tasks, Law and Economics have been chosen as initial fields of study, in addition to Medicine, on which we are already working in the current project.