MaltParser for Spanish

We present an instance of MaltParser trained for Spanish. The parser has been trained on the IULA Spanish LSP Treebank, which contains more than 42,000 sentences and almost 590,000 tokens. To improve performance, we used MaltOptimizer to select the parameters for training MaltParser.

We ran evaluation experiments using 80% of the IULA Spanish LSP Treebank for training and 20% for testing. We also evaluated the model on the Tibidabo corpus, which is a good evaluation set because its sentences come from a very different domain (newspaper text).

The training and test corpora are available for download for Machine Learning experiments. This way, different researchers can use the same partitions and compare their results directly.

The following table summarizes the evaluation results over the two corpora:

Corpus	LAS (Labelled Attachment Score)	LCM (Labelled Complete Match)
IULA Spanish LSP Treebank Test Set	93.14%	47.60%
Tibidabo Treebank	88.95%	36.20%
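
The two scores in the table can be computed from a gold-standard file and the parser output along the lines of the following minimal sketch. It assumes that both inputs are in CoNLL-X format with HEAD in column 7 and DEPREL in column 8, that every token (punctuation included) is scored, and that LCM counts a sentence only when all of its tokens have the correct head and label. The function names are illustrative, and this is not the evaluation script used to produce the figures above.

```python
def read_conll(text):
    """Split CoNLL-X text into sentences; each token is a list of its columns."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def las_and_lcm(gold_text, pred_text):
    """Return (LAS, LCM) as percentages over all tokens and all sentences."""
    gold = read_conll(gold_text)
    pred = read_conll(pred_text)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted data differ in sentence count")
    total_tokens = correct_tokens = complete_sentences = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("sentence length mismatch")
        # A token is correct when both HEAD (index 6) and DEPREL (index 7) match
        hits = sum(1 for g, p in zip(g_sent, p_sent)
                   if g[6] == p[6] and g[7] == p[7])
        total_tokens += len(g_sent)
        correct_tokens += hits
        if hits == len(g_sent):
            complete_sentences += 1
    las = 100.0 * correct_tokens / total_tokens
    lcm = 100.0 * complete_sentences / len(gold)
    return las, lcm
```

Calling las_and_lcm with the contents of a gold test file and the corresponding parser output returns the two percentages as a tuple.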

The resulting parser can be used in two different ways:

Accessing the malt_parser web service
Running the MaltParser Spanish module espmalt-1.0.mco

1. Accessing the malt_parser web service

Access malt_parser web service

MaltParser is expected to work better if the PoS tags and the tokens are as similar as possible to those used to train it, that is, the PoS tags and tokenization of the IULA Spanish LSP Treebank. For that reason we have developed a package that performs the PoS tagging step with FreeLing, configured as it was when the Treebank was built, and then calls the MaltParser model to obtain the dependencies. This package has been deployed as a web service and can be freely used to parse plain-text sentences.

Input format: plain text sentence
Output format: CoNLL format with filled-in relation and function columns

2. Running the MaltParser Spanish module espmalt-1.0.mco

Download Malt parser Spanish module espmalt-1.0.mco: e-repositori

The file espmalt-1.0.mco contains a single MaltParser configuration for parsing Spanish text.

To run the MaltParser Spanish module, follow the steps below:

download and install the MaltParser package
download the Spanish model espmalt-1.0.mco (available soon) into the MaltParser working directory
execute the following command:

prompt> java -Xmx1024m -jar maltparser-1.7.jar -c espmalt -i infile.conll -o outfile.conll -m parse

where infile.conll and outfile.conll should be replaced by the names of your input and output files.
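
When several files have to be parsed, the command above can be wrapped in a script. The following is a minimal sketch that only reuses the options shown in the command (-c, -i, -o, -m) and the jar name maltparser-1.7.jar. The function names and the workdir parameter are hypothetical, and the sketch assumes that java is on the PATH and that the jar and the model are in the working directory.

```python
import subprocess


def build_malt_command(infile, outfile, jar="maltparser-1.7.jar",
                       model="espmalt", heap="1024m"):
    """Assemble the java command line shown above as an argument list."""
    return [
        "java", f"-Xmx{heap}", "-jar", jar,
        "-c", model,      # configuration name, as in the command above
        "-i", infile,     # CoNLL input with PoS tags
        "-o", outfile,    # CoNLL output with dependencies
        "-m", "parse",
    ]


def parse_file(infile, outfile, workdir="."):
    """Run MaltParser in workdir (assumed to hold the jar and the model)."""
    subprocess.run(build_malt_command(infile, outfile), cwd=workdir, check=True)
```

With check=True, a non-zero exit status from java raises subprocess.CalledProcessError instead of passing silently.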

Input file format: CoNLL format with PoS tags

You can obtain this input by running the FreeLing PoS tagger and converting its output to the CoNLL-X format. However, since FreeLing is highly configurable, the tagger can produce different outputs, especially with respect to tokenization. Please note that MaltParser is expected to work better if the PoS tags and the tokens are as similar as possible to those used to train it, that is, the PoS tags and tokenization of the IULA Spanish LSP Treebank.

For example, this is the input needed for the sentence "Los niños leen cuentos de hadas.":

1	Los	el	d	DA0MP0	_	_	_	_	_
2	niños	niño	n	NCMP000	_	_	_	_	_
3	leen	leer	v	VMIP3P0	_	_	_	_	_
4	cuentos	cuento	n	NCMP000	_	_	_	_	_
5	de	de	s	SPS00	_	_	_	_	_
6	hadas	hada	n	NCFP000	_	_	_	_	_
7	.	.	f	Fp	_	_	_	_	_
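
The conversion from tagger output to this input format can be sketched as follows. The sketch assumes that the tagger output has one token per line in the form "form lemma tag probability", with a blank line between sentences; this layout is an assumption and depends on how FreeLing is configured. The coarse tag in the fourth column is taken to be the lowercased first letter of the full tag, as in the example above, and the function name is illustrative.

```python
def freeling_to_conll(tagged_text):
    """Convert 'form lemma tag [prob]' lines into 10-column CoNLL-X rows."""
    rows, token_id = [], 0
    for line in tagged_text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line ends the sentence; token numbering restarts at 1
            if rows and rows[-1] != "":
                rows.append("")
            token_id = 0
            continue
        form, lemma, tag = fields[:3]
        token_id += 1
        coarse = tag[0].lower()  # e.g. DA0MP0 -> d, Fp -> f
        # ID, FORM, LEMMA, CPOSTAG, POSTAG, then five columns left empty
        rows.append("\t".join([str(token_id), form, lemma, coarse, tag]
                              + ["_"] * 5))
    return "\n".join(rows).strip("\n") + "\n"
```

The remaining columns are filled with underscores because the parser supplies the head and relation values.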

Output file format: CoNLL format with filled-in relation and function columns

Following the previous example, this is the output obtained:

1	Los	el	d	DA0MP0	_	2	SPEC	_	_
2	niños	niño	n	NCMP000	_	3	SUBJ	_	_
3	leen	leer	v	VMIP3P0	_	0	ROOT	_	_
4	cuentos	cuento	n	NCMP000	_	3	DO	_	_
5	de	de	s	SPS00	_	4	MOD	_	_
6	hadas	hada	n	NCFP000	_	5	COMP	_	_
7	.	.	f	Fp	_	6	punct	_	_
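
To read this output, note that column 7 holds the ID of each token's head (0 for the root) and column 8 holds the relation label. The following minimal sketch turns one parsed sentence into (dependent, relation, head) triples; the function name and the "<root>" placeholder are illustrative choices, not part of the parser.

```python
def read_arcs(conll_text):
    """Return (dependent form, relation, head form) for one parsed sentence."""
    tokens = [line.split("\t") for line in conll_text.splitlines()
              if line.strip()]
    # Map token IDs to word forms; ID 0 is the artificial root
    forms = {tok[0]: tok[1] for tok in tokens}
    forms["0"] = "<root>"
    return [(tok[1], tok[7], forms[tok[6]]) for tok in tokens]
```

For the example above, the first triple is ("Los", "SPEC", "niños") and the triple for the verb is ("leen", "ROOT", "<root>").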

3. MaltParser for Spanish training and test corpora

Download MaltParser for Spanish training and test corpora: e-repositori

This package offers a partition of the IULA Spanish LSP Treebank into training and test sets, together with the Tibidabo Treebank (Marimon 2010), a set of sentences extracted from the AnCora corpus and annotated in the same way as the IULA Treebank. The Tibidabo Treebank is a very good test set for models trained on the IULA Spanish LSP Treebank because its sentences come from a very different domain.

From the IULA Spanish LSP Treebank, we took 80% for training and 20% for testing. The following table summarizes the size of these two partitions and of the Tibidabo Treebank:

Corpus	Sentences	Tokens
IULA Spanish LSP Treebank Train	33,107	465,460
IULA Spanish LSP Treebank Test	8,125	114,610
Tibidabo Treebank	3,376	41,620

All these corpora follow the CoNLL-X shared task format, with the same function values as the IULA Spanish LSP Treebank.
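
After downloading the package, the sizes in the table can be checked with a short script. This is a minimal sketch that assumes sentences are separated by blank lines and that every non-blank line is one token; the function name is illustrative.

```python
def corpus_size(conll_text):
    """Return (sentences, tokens) for CoNLL-X text."""
    sentences = tokens = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:
                # First token after a blank line opens a new sentence
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens
```

With the figures in the table, the training partition holds 33,107 of 41,232 sentences, which is about 80.3% of the total.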

4. More information

For more information about the use of MaltParser, see the MaltParser user guide.

5. Acknowledgments

The development of MaltParser for Spanish has been funded by the PANACEA project (7FP-ITC-248064).

We thank the creators of the IULA Treebank, which was used to train this MaltParser model for Spanish. The creation of the Treebank was funded by the METANET4U project (CIP-PSP-270893) and IULA.

We also thank Miguel Ballesteros for his kind help regarding the use of MaltOptimizer.