Introduction and Background

Information Retrieval (hereafter IR) covers a range of technologies that share the general goal of obtaining refined information by computational means but pursue different specific objectives: searching for subject-relevant files on the web, abstract generation, text mining, automatic enrichment of computational dictionaries, automatic terminology extraction, search engines for document databases, etc.

Combining different techniques has proved to be a very productive strategy in all IR technologies. Among the techniques used, alongside statistical and machine-learning ones, those based on linguistic strategies deserve to be highlighted. Nevertheless, in some IR techniques the interaction between linguistic resources and linguistic analysis tools has been exploited only in a basic way. Specifically, the natural language processing (NLP) chain, through the sequence of lemmatization, morphological tagging, syntactic analysis and disambiguation, can offer good results for IR systems operating on previously delimited data sets (texts, document databases, textual corpora, knowledge banks), in tasks such as abstract generation, automatic enrichment of computational dictionaries or automatic terminology extraction. Once the document source to be used in IR has been delimited, structural tagging of the texts (SGML or XML), morphological tagging and, in recent years and for some languages, syntactic tagging of lexical units and syntagmatic structures make it possible to apply IR strategies that yield more precise output. Moreover, in IR over textual corpora and delimited document databases, the noise level is usually already low, because the sources have been selected according to subject and documentary criteria. Therefore, the NLP chain is progressively being extended in projects on terminology extraction, abstract generation and automatic lexicon acquisition, from syntactic analysis to semantic and pragmatic tagging, which makes it possible to achieve more relevant IR on delimited sources than is available today.

However, in the case of IR oriented to unbounded sources, such as the Internet, extensive linguistic processing seems intractable. It is not possible to process linguistically all the pages published on the net, nor even the broad set of results returned by a search. Moreover, apart from being intractable, such processing does not seem appropriate if what we are looking for is general information on a subject or a set of files illustrating that subject. In this kind of IR, neither morphological analyzers, nor syntactic analyzers, nor computational processing dictionaries can be applied directly and extensively. Linguistic knowledge has been used in Internet IR systems basically for file indexing and query expansion. Relevant to a project like this one is the research being carried out on indexing, and more recently on metadata and the Semantic Web, in relation to the development of lexical hierarchies (e.g. WordNet), conceptual ontologies (e.g. Mikrokosmos), document taxonomies (e.g. Delphy) and document control systems, with a view to indexing files by word senses and unambiguous denominations. NLP tools have also been used in query expansion, mainly to convert a query term into a set of morphologically related terms.

The problem is that when users search the net for information on a subject, they make simple queries and only exceptionally complex or combined ones. The results of these searches, though they reach very high precision levels thanks to search engines based on mathematical strategies and linguistic query expansion, do not achieve the level of pertinence users would wish. For example, users looking for information about the evolution of certain bonds on the main world stock markets over the last five years, even with a fairly complex query (specific stocks, stock evolution, stock market, cities of the most important stock markets, etc.), will obtain at best a list of files in which they will probably, but not surely, find part of the information sought. The only exception would be a web site devoted exclusively to this type of stock-market study and indexed well enough for a search engine to rank it among the first results.

Along these lines, IR research for unbounded sources has recently moved towards the so-called Semantic Web, which aims to index sites and web pages clearly by their content, by means of so-called metadata. This is a line for the future that places the solution in the information source itself and not in the search tools. In fact, it is a process similar to the one that has been applied for many years to document databases (full-text or review indexes), where the sources and their contents are indexed in advance by means of thesauri, keywords, controlled vocabularies, automatic indicators, etc.

Our project proposal brings together the two IR traditions just described. First, the development of linguistic resources is planned, as has been done in IR for delimited corpora; second, emphasis is placed on semantic aspects, which are fundamental for relevant IR. The development of resources has two aims in this project.

On the one hand, we want to re-use or develop textual resources (a multilingual economics corpus) in order to extract specific information about terminological units, the relations among these units, specific phraseology and lexical combinatorics, all of which tells us about the senses and the use of these structures. The result will be an economics corpus in English, Spanish, Catalan, Galician and Basque, structurally marked up in standard formats and linguistically processed. These linguistic resources are intended to yield real and pertinent information about the forms, senses and relevant linguistic relations of economic discourse, so that other, more specific and IR-oriented linguistic resources can be designed and built.
Afterwards, with the extracted information, which is also basic for describing and explaining specialized discourse, specifically that of economics within the social sciences, other linguistic resources common in IR techniques will be built, namely a concept ontology and a multilingual terminology database linked to that ontology. The semantic emphasis lies precisely in this type of resource, because defining the specific senses of lexical units and lexical combinations and establishing multiple semantic relations among units and combinations are fundamental for relevant IR. The result will be an ontology of economics (preferably of a single branch of economics, not yet determined) and a related database containing grammatical information, illustrative contexts, definitions, equivalents, variants and synonyms in all the project languages, and related phraseology.

The design of a Query Reelaborator for Internet Search Engines (RECBI) is based on these resources, which will be developed during the first two years of the project. The idea of this system is to re-use the information validated by the ontology and the terminological database to convert a user's simple query into a complex query to be submitted to an Internet search engine, so that the results achieve better precision. This idea is based on studies of the needs of Internet users and on evaluations of IR systems from the users' point of view, carried out by documentalists experienced in IR, and also on query expansion with semantically related terms.
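The conversion of a simple query into a complex one can be illustrated with a minimal sketch. It is not the RECBI design: the data structures, the names `TermEntry`, `TERM_DB`, `ONTOLOGY` and `reelaborate`, the sample entries and the Boolean `AND`/`OR` output syntax are all assumptions made for illustration. Each query word found in a toy terminological database is replaced by an OR-group of its variants, and the terms related to its concept in a toy ontology are added as a second OR-group that constrains the context.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """One hypothetical record of the terminological database."""
    concept: str                                   # ontology concept linked to the term
    variants: list = field(default_factory=list)   # same-language variants and synonyms


# Toy stand-ins for the project resources; all contents are invented for illustration.
TERM_DB = {
    ("es", "bolsa"): TermEntry("STOCK_MARKET", ["mercado de valores", "mercado bursátil"]),
    ("es", "bono"): TermEntry("BOND", ["obligación"]),
}
# Hypothetical ontology slice: concept -> terms naming semantically related concepts.
ONTOLOGY = {
    "STOCK_MARKET": ["cotización", "índice bursátil"],
    "BOND": ["renta fija"],
}


def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def or_group(terms):
    """Join alternative terms into one parenthesized OR-group."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def reelaborate(query, lang="es"):
    """Turn a simple query into a Boolean query using the toy resources above."""
    parts = []
    for word in query.lower().split():
        entry = TERM_DB.get((lang, word))
        if entry is None:
            parts.append(quote(word))          # unknown words are kept unchanged
            continue
        parts.append(or_group([word] + entry.variants))
        related = ONTOLOGY.get(entry.concept, [])
        if related:
            parts.append(or_group(related))    # context terms narrow the sense
    return " AND ".join(parts)


print(reelaborate("bolsa"))
```

With these toy data, the ambiguous query "bolsa" becomes a query that also requires a stock-market context term, which is one plausible way of steering a search engine away from the everyday sense of the word. Whether related terms should restrict the query (AND) or broaden it (OR) is a design choice this sketch does not settle.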

One of the essential aspects of this project is the idea of re-using resources in all directions. Since the resources to be developed are the basis for applying a query reelaboration system and not a goal in themselves, we plan to locate, adapt and re-use existing resources that can be incorporated into the project. Therefore, for the creation of the initial textual corpora, we take into account that some resources already exist for Spanish and Catalan (Technical Corpus at IULA-UPF), and we aim only at developing similar resources for Galician and Basque. Moreover, in building the Economics corpus we expect to re-use, where possible, samples from existing general-domain textual corpora for these languages (press corpus, lexicographical corpus, digitized corpus, etc.). The group already has tools to process texts in Spanish and Catalan, and it expects to license existing tools for Galician (dictionary, morphological analyzer and linguistically based disambiguator) or else to adapt to Galician the tools already developed for Catalan and Spanish. The group also has a language-independent automatic terminology extractor which, if needed to help enrich the ontology and the terminological database, can be adapted to each of the project languages and, in particular, to the domain of economic discourse. In addition, the outputs of this project (Economics corpus in the four languages of the project, an ontology and a terminological database) can be used in linguistic studies of each of the project languages or in cross-cutting studies of specialized Economics discourse and the specific terminology of the domain. Other uses of the outputs could be dictionary updating or neology monitoring.

Though this is an essentially applied project, it undoubtedly involves basic research in linguistics (discourse analysis, predicate semantics, lexical syntax, lexical semantics, neology). One of these aspects, perhaps the most relevant for this project, is the analysis of the semantic and pragmatic aspects of lexical units that acquire specialized value in Economics discourse. This follows from the way terminology is configured in the social and human sciences. Some scientific fields, such as biology, medicine, chemistry or geology, have a very specific nominal terminology, with many derived and compound forms that are specific to those domains, are used frequently or exclusively in them, and can easily be detected automatically (examples: carbonitrurizar, adenosina trifosfato, mononucleosis). By contrast, terminology in the human and social sciences does not usually show distinctive formal features, but is based on semantic change in words of common use (examples from Economics: bolsa, dinero, valor, tasa, incremento), as well as on some domain-specific terms that are common in communication not marked for subject (inflación, devaluación, costes, beneficio). With this kind of terminological material, detecting and identifying terminological units becomes more difficult, because there are no formal marks (morphological filters), and this difficulty also affects computational tools such as those for automatic terminology extraction or automatic lexicon acquisition. Working on the terminology of the human and social sciences, apart from being an interesting challenge for applications, opens a line of study on the connections between general and specialized discourse, on the polysemy of lexical units, on the metaphors used in lexical creation, on the internal and external variation of the lexicon, on phraseology or lexical combinatorics as a means of detecting units with terminological value, etc.
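The contrast between the two kinds of terminology can be made concrete with a toy morphological filter. This is only a sketch under stated assumptions: the formant lists `PREFIXES` and `SUFFIXES` and the function `has_formal_mark` are hypothetical and deliberately tiny, and they do not describe YATE or any existing extractor. The word lists are the examples given in the paragraph above.

```python
# Hypothetical, deliberately tiny lists of learned formants (assumption for illustration;
# a real morphological filter would rely on a much larger, curated inventory).
PREFIXES = ("mono", "adeno", "carbo")
SUFFIXES = ("osis", "itis", "fosfato")


def has_formal_mark(word):
    """Return True if the word starts or ends with one of the listed formants."""
    w = word.lower()
    return w.startswith(PREFIXES) or w.endswith(SUFFIXES)


# Examples taken from the text: natural-science terms versus Economics terms.
science = ["carbonitrurizar", "adenosina", "trifosfato", "mononucleosis"]
economics = ["bolsa", "dinero", "valor", "tasa", "incremento",
             "inflación", "devaluación", "costes", "beneficio"]

science_hits = [w for w in science if has_formal_mark(w)]
economics_hits = [w for w in economics if has_formal_mark(w)]

print(science_hits)
print(economics_hits)
```

With these lists, every natural-science example is flagged and none of the Economics examples is, which illustrates why form-based filtering alone is not enough for this domain and why semantic and contextual information is needed.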

In short, we consider that the current state of technology in language engineering and IR systems, together with the current state of basic and applied linguistic studies on IR, terminology and semantic representation (lexical databases, ontologies, lexical hierarchies, thesauri), places us in an ideal position to make combined progress that improves the efficiency of Internet search engines. Re-using existing linguistic resources and developing complementary ones will allow us to ensure, at low cost, the design of efficient linguistic strategies for IR and to disseminate publicly a set of multilingual resources for Economics through a web site, conceived as a specialized portal, that gives access to all the resources developed in the project (textual corpora, ontology, multilingual terminological database and the RECBI system for Internet queries).

General Targets

Design of a query reelaboration system for multilingual Internet search engines (RECBI), using semantic and formal information extracted from an ontology and from a terminological database.
Construction of an ontology for the field of Economics, with semantic and pragmatic information derived from consulting real textual corpora, and linked to the terminological database (TDB).
Building of a multilingual terminological database on Economics in English, Spanish, Catalan, Basque and Galician, with definitions, grammatical information, related phraseology, variants and cross-references, linked to the ontology.
Constitution of Economics textual corpora in Galician and Basque, similar to those existing in Catalan, Spanish and English in the Technical Corpus of IULA-UPF. The corpora will be structurally marked up using standards and linguistically processed, with the aim of extracting relevant information for building the ontology and the multilingual database.
Adaptation of existing processing tools for Galician and Basque to process textual corpora.
Adaptation of the terminological extractor YATE to Economics and to the languages of the project.
Basic research on the description of terminology and specialized discourse in Economics, and on a theoretical approach to the discourse of the social sciences and its related semantic aspects.

The seven main targets just listed can be grouped into three blocks:

Development of resources and tools. Adaptation and re-use of existing resources.
IR field work, with the design of a query reelaboration system, based on the idea of query expansion with semantic and formal information extracted from specific economic resources.
Progress in the description and the theory of terminology from a linguistic perspective.

Targets Common to Both Subprojects

Exploitation of the Spanish Economics corpus.
Design of linguistic strategies for query reelaboration based on the interaction of the constructed linguistic resources.
Design of the query reelaboration system.
Constitution of a multilingual web site that includes information on the project, allows public access to the resources built (knowledge bank on Economics) and hosts the query reelaboration system, so that users can send their queries from there to any Internet search engine.

Several coordination mechanisms are foreseen to ensure the viability of the coordinated project and the quality of its results:

Working protocol for building the resources.
Training Sessions by the researchers and participating collaborators.
General meetings between both subprojects in Barcelona and Santiago.
Specific meetings inside the subproject UPF-UPV in Barcelona and in San Sebastián.
Research stay by Dr. Lieve Vangehuchten in Barcelona. Research seminar on economic discourse in Spanish.
Progressive incorporation of the materials and results of the subprojects, in a common format, into the ontology, the database and the web site of the project.
Establishment of external advice for each of the general targets, through contact with similar national and international groups, in particular to reinforce multidisciplinarity with technological and documentary contributions.