Institut de Lingüística Aplicada
 

IULA Resources. Corpus & Tools. MaltParser for Spanish

We present an instance of MaltParser trained for Spanish. The parser was trained on the IULA Spanish LSP Treebank, which contains more than 42,000 sentences and almost 590,000 tokens. To tune performance, MaltOptimizer was used to select the parameters for training MaltParser.

We ran evaluation experiments using 80% of the IULA Spanish LSP Treebank for training and 20% for testing. We also evaluated the model on the Tibidabo corpus, which is a good evaluation set because its sentences come from a very different domain (newspaper text).

The training and test corpora are available for download for use in Machine Learning experiments. This way, different researchers can use the same partitions and compare their results directly.

The following table summarizes the evaluation results over the two corpora:

Corpus                                LAS       LCM
IULA Spanish LSP Treebank Test Set    93.14%    47.60%
Tibidabo Treebank                     88.95%    36.20%

LAS: Labelled Attachment Score
LCM: Labelled Complete Match
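As an illustration of how the two scores in the table can be computed, here is a minimal Python sketch. It is not the evaluation script used to obtain the figures above. It assumes that LAS is the percentage of tokens whose head and function label both match the gold standard, that LCM is the percentage of sentences in which every token matches, and that all tokens (punctuation included) are scored. The function names `read_conll` and `las_and_lcm` are hypothetical, and the "predicted" parse is a hand-made variant of the example sentence used later on this page, with one wrong head.

```python
# Sketch only: LAS and LCM over CoNLL-X data.
# Assumption: every token, punctuation included, counts towards the scores.

GOLD = """\
1 Los el d DA0MP0 _ 2 SPEC _ _
2 niños niño n NCMP000 _ 3 SUBJ _ _
3 leen leer v VMIP3P0 _ 0 ROOT _ _
4 cuentos cuento n NCMP000 _ 3 DO _ _
5 de de s SPS00 _ 4 MOD _ _
6 hadas hada n NCFP000 _ 5 COMP _ _
7 . . f Fp _ 6 punct _ _
"""

# Same sentence, but token 5 is attached to the verb instead of the noun.
PREDICTED = GOLD.replace("5 de de s SPS00 _ 4 MOD", "5 de de s SPS00 _ 3 MOD")


def read_conll(text):
    """Split CoNLL-X text into sentences, each a list of 10-column token rows."""
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()  # works for tab- or space-separated columns
        if fields:
            current.append(fields)
        elif current:  # a blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def las_and_lcm(gold, predicted):
    """Return (LAS, LCM) as percentages for two aligned lists of sentences."""
    total_tokens = correct_tokens = complete_sentences = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        # Column 7 (index 6) is the head, column 8 (index 7) the function label.
        matches = [g[6] == p[6] and g[7] == p[7]
                   for g, p in zip(gold_sent, pred_sent)]
        total_tokens += len(matches)
        correct_tokens += sum(matches)
        complete_sentences += all(matches)  # True counts as 1
    las = 100.0 * correct_tokens / total_tokens
    lcm = 100.0 * complete_sentences / len(gold)
    return las, lcm


las, lcm = las_and_lcm(read_conll(GOLD), read_conll(PREDICTED))
print(f"LAS = {las:.2f}%  LCM = {lcm:.2f}%")
```

In this toy case six of the seven tokens are correct, so a single attachment error lowers LAS only slightly but is enough for the sentence not to count as a complete match.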

The resulting parser can be used in two different ways:

  1. Accessing the malt_parser web service
  2. Running the MaltParser Spanish module espmalt-1.0.mco

 

1. Accessing the malt_parser web service

Access the malt_parser web service
 

MaltParser is expected to work better if the PoS tags and the tokens are as similar as possible to those used to train it, that is, the PoS tags and tokenization of the IULA Spanish LSP Treebank. For that reason we have developed a package that performs the PoS tagging step with FreeLing, configured as it was when the Treebank was built, and then calls the MaltParser model to obtain the dependencies. This package has been deployed as a web service and can be freely used to parse plain text sentences.

  • Input format: plain text sentence
  • Output format: CoNLL format with filled-in relation and function columns

2. Running the MaltParser Spanish module espmalt-1.0.mco

Download the MaltParser Spanish module espmalt-1.0.mco: e-repositori
 

The file espmalt-1.0.mco contains a single MaltParser configuration for parsing Spanish text.

To run the MaltParser Spanish module, follow the steps below:

  • download and install the MaltParser package
  • download the Spanish model espmalt-1.0.mco (available soon) into the MaltParser working directory
  • execute the following command:

    prompt> java -Xmx1024m -jar maltparser-1.7.jar -c espmalt -i infile.conll -o outfile.conll -m parse

    where infile.conll and outfile.conll should be replaced by the names of your input and output files.
  • Input file format: CoNLL format with PoS tags

    You can obtain this input by running the FreeLing PoS tagger and converting its output to the CoNLL-X format. However, since FreeLing is highly configurable, the tagger can produce different outputs, especially with regard to tokenization. Please note that MaltParser is expected to work better if the PoS tags and the tokens are as similar as possible to those used to train it, that is, the PoS tags and tokenization of the IULA Spanish LSP Treebank.

    For example, this is the input needed for the sentence "Los niños leen cuentos de hadas.":

    1 Los el d DA0MP0 _ _ _ _ _
    2 niños niño n NCMP000 _ _ _ _ _
    3 leen leer v VMIP3P0 _ _ _ _ _
    4 cuentos cuento n NCMP000 _ _ _ _ _
    5 de de s SPS00 _ _ _ _ _
    6 hadas hada n NCFP000 _ _ _ _ _
    7 . . f Fp _ _ _ _ _
  • Output file format: CoNLL format with filled-in relation and function columns

    Following the previous example, this is the resulting output:

    1 Los el d DA0MP0 _ 2 SPEC _ _
    2 niños niño n NCMP000 _ 3 SUBJ _ _
    3 leen leer v VMIP3P0 _ 0 ROOT _ _
    4 cuentos cuento n NCMP000 _ 3 DO _ _
    5 de de s SPS00 _ 4 MOD _ _
    6 hadas hada n NCFP000 _ 5 COMP _ _
    7 . . f Fp _ 6 punct _ _
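The conversion from tagger output to the input format shown above can be sketched as follows. This is a minimal, hedged example rather than the converter used to build the Treebank. It assumes that FreeLing is configured to print one token per line as "form lemma tag probability", with a blank line between sentences, and that the coarse PoS column is the lowercased first letter of the tag, as the example above suggests (d for DA0MP0, n for NCMP000, f for Fp). The function name `freeling_to_conll` and the probability values in the sample are illustrative only.

```python
# Sketch only: convert FreeLing-style tagger output to CoNLL-X input rows.
# Assumptions: one "form lemma tag probability" line per token, blank line
# between sentences; coarse PoS = lowercased first letter of the tag.

FREELING_OUTPUT = """\
Los el DA0MP0 1
niños niño NCMP000 1
leen leer VMIP3P0 1
cuentos cuento NCMP000 1
de de SPS00 1
hadas hada NCFP000 1
. . Fp 1
"""


def freeling_to_conll(freeling_text):
    """Return CoNLL-X text with the head and function columns left as '_'."""
    sentences, rows = [], []
    # The extra empty string guarantees that the last sentence is flushed.
    for line in freeling_text.splitlines() + [""]:
        fields = line.split()
        if fields:
            form, lemma, tag = fields[:3]  # the probability is not needed
            columns = [str(len(rows) + 1), form, lemma, tag[0].lower(), tag]
            columns += ["_"] * 5  # FEATS, HEAD, DEPREL, PHEAD, PDEPREL
            rows.append("\t".join(columns))
        elif rows:  # a blank line ends the current sentence
            sentences.append("\n".join(rows) + "\n")
            rows = []
    return "\n".join(sentences)  # sentences separated by one blank line


print(freeling_to_conll(FREELING_OUTPUT), end="")
```

The columns are tab-separated, as in the CoNLL-X format. Token ids restart at 1 for each sentence, so the result can be saved and passed to the parser as the input file.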

3. MaltParser for Spanish training and test corpora

Download the MaltParser for Spanish training and test corpora: e-repositori
 

In this package we offer a partition of the IULA Spanish LSP Treebank into training and test sets. We also deliver the Tibidabo Treebank (Marimon 2010), which contains a set of sentences extracted from the AnCora corpus and annotated in the same way as the IULA Spanish LSP Treebank. The Tibidabo Treebank is a very good test set for models trained with the IULA Spanish LSP Treebank, since its sentences come from a very different domain.

From the IULA Spanish LSP Treebank, we took 80% for training and 20% for testing. The following table summarizes the size of these two partitions plus the Tibidabo Treebank:


Corpus                             Sentences    Tokens
IULA Spanish LSP Treebank Train    33,107       465,460
IULA Spanish LSP Treebank Test     8,125        114,610
Tibidabo Treebank                  3,376        41,620

All these corpora follow the CoNLL-X shared task format, with the same function values as the IULA Spanish LSP Treebank.
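To check a downloaded partition against the sizes listed above, sentences and tokens can be counted directly from the CoNLL-X text. The sketch below assumes one token per non-blank line and a blank line between sentences. The function name `corpus_size` is hypothetical, and the two-sentence sample is illustrative, not taken from the corpora.

```python
# Sketch only: count sentences and tokens in CoNLL-X text.
# Assumption: one token per non-blank line, blank line between sentences.

SAMPLE = """\
1 Los el d DA0MP0 _ 2 SPEC _ _
2 niños niño n NCMP000 _ 3 SUBJ _ _
3 leen leer v VMIP3P0 _ 0 ROOT _ _
4 cuentos cuento n NCMP000 _ 3 DO _ _
5 de de s SPS00 _ 4 MOD _ _
6 hadas hada n NCFP000 _ 5 COMP _ _
7 . . f Fp _ 6 punct _ _

1 Los el d DA0MP0 _ 2 SPEC _ _
2 niños niño n NCMP000 _ 3 SUBJ _ _
3 leen leer v VMIP3P0 _ 0 ROOT _ _
4 . . f Fp _ 3 punct _ _
"""


def corpus_size(conll_text):
    """Return (number of sentences, number of tokens)."""
    sentences = tokens = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:  # first token of a new sentence
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens


print(corpus_size(SAMPLE))

# Share of the IULA sentences that fall in the training partition,
# using the sentence counts from the table above.
train, test = 33107, 8125
print(f"{100.0 * train / (train + test):.1f}% of the sentences are in the training set")
```

For a real file, the text can be read with `open(path, encoding="utf-8").read()`, where `path` is the location of the downloaded corpus.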

4. More information

For more information:

5. Acknowledgments

The development of MaltParser for Spanish has been funded by the PANACEA project (7FP-ITC-248064).

We thank the creators of the IULA Spanish LSP Treebank, used to train this MaltParser model for Spanish. The creation of the Treebank was funded by the METANET4U project (CIP-PSP-270893) and IULA.

We also thank Miguel Ballesteros for his kind help regarding the use of MaltOptimizer.

 

© INSTITUT DE LINGÜÍSTICA APLICADA - UNIVERSITAT POMPEU FABRA