An experimental investigation of part-of-speech taggers for Vietnamese

Part-of-speech (POS) tagging plays an important role in Natural Language Processing (NLP). It is used in many other NLP tasks such as named entity recognition, syntactic parsing, dependency parsing and text chunking. In this paper, we use the techniques of two widely used toolkits, ClearNLP and Stanford POS Tagger, to develop two new POS taggers for Vietnamese, then compare them with three well-known Vietnamese taggers, namely JVnTagger, vnTagger and RDRPOSTagger. We make a systematic comparison to find the tagger with the best performance. We also design a new feature set to measure the performance of the statistical taggers. Our new taggers, built from Stanford Tagger and ClearNLP with the new feature set, outperform all other current Vietnamese taggers in terms of tagging accuracy. Moreover, we analyze the effect of some features on the performance of statistical taggers. Lastly, the experimental results reveal that the transformation-based tagger, RDRPOSTagger, runs significantly faster than any statistical tagger.

about 1.5% – 2% of words in the test set
which are unknown in every fold, as shown
in Table 6.
Table 6. The experimental datasets

Fold  Total number of words  Number of unknown words
1     63277                  1164
2     63855                  1203
3     63482                  1247
4     62228                  1168
5     59854                  1056
6     63652                  1216
7     63759                  1146
8     63071                  1224
9     65121                  1242
10    63552                  1288
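The share of unknown words in each fold can be recomputed from Table 6. The short sketch below does only that arithmetic; the counts are copied from the table, while the names `FOLDS` and `unknown_rate` are illustrative and not part of the original experiments.

```python
# Fold -> (total number of words, number of unknown words), copied from Table 6.
FOLDS = {
    1: (63277, 1164),
    2: (63855, 1203),
    3: (63482, 1247),
    4: (62228, 1168),
    5: (59854, 1056),
    6: (63652, 1216),
    7: (63759, 1146),
    8: (63071, 1224),
    9: (65121, 1242),
    10: (63552, 1288),
}


def unknown_rate(total_words, unknown_words):
    """Return the share of unknown words as a percentage."""
    return 100.0 * unknown_words / total_words


# Percentage of unknown words for every fold.
rates = {fold: unknown_rate(*counts) for fold, counts in FOLDS.items()}

for fold, rate in rates.items():
    print(f"fold {fold:2d}: {rate:.2f}% unknown words")
```

Every fold lands between roughly 1.7% and 2.1%, which is in line with the range quoted in the text.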
4.2. Evaluation
In our experiments, we first evaluate the
current Vietnamese POS taggers, namely
vnTagger, JVnTagger and RDRPOSTagger,
with their default settings. Next, we design
a set of features to evaluate the statistical
taggers, including two international ones,
Stanford Tagger and ClearNLP, and a current
Vietnamese one, JVnTagger. We measure two
aspects of the taggers: tagging accuracy
and speed. The accuracy
N.T. Phong et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 3 (2016) 11–25 21
is measured using the 10-fold cross-validation
method on the datasets described above.
The speed test is run on a personal
computer with 4 Intel Core i5-3337U CPUs
@ 1.80GHz and 6GB of memory. The data
used for the speed test is a corpus of 10k
sentences collected from Vietnamese websites.
This corpus was automatically segmented
by UETsegmenter (see footnote 7) and contains
about 250k words. All taggers use their single-threaded
implementation for the speed test. Moreover,
the test is repeated many times and the
average speed of each tagger is reported. We use only the
Java implementation of RDRPOSTagger
in the experiments because its author
claims that this version runs significantly
faster than the other one.
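The speed measurement described above (single-threaded, repeated runs, averaged words per second) can be sketched as follows. This is a minimal illustration rather than the harness used in the paper: `words_per_second`, `dummy_tagger`, the tiny corpus and the default of five repeats are all assumptions made for the example.

```python
import time


def words_per_second(tag_fn, sentences, repeats=5):
    """Average tagging speed (words per second) of tag_fn over several runs.

    tag_fn takes one sentence (a list of word tokens) and returns its tags.
    """
    total_words = sum(len(sentence) for sentence in sentences)
    speeds = []
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            tag_fn(sentence)
        # Guard against a zero reading on very coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        speeds.append(total_words / elapsed)
    return sum(speeds) / len(speeds)


def dummy_tagger(sentence):
    """Hypothetical stand-in for a real tagger: tags every word as a noun."""
    return [(word, "N") for word in sentence]


# A toy corpus of word-segmented sentences (placeholder data).
corpus = [["Hà_Nội", "là", "thủ_đô"]] * 1000
speed = words_per_second(dummy_tagger, corpus)
print(f"{speed:.0f} words per second")
```

Averaging over several runs smooths out one-off effects such as a cold cache, which is presumably why the test is repeated.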
We present the performance of the current
Vietnamese taggers in Table 7. As we can
see, with their default feature sets the
taggers reach very similar overall accuracies.
The most accurate ones are
vnTagger and the MaxEnt model of JVnTagger.
In particular, these two taggers achieve the
highest accuracies on unknown words; their
specialized features for this kind of word seem
to be very effective. Inside the JVnTagger
toolkit, the two models produce different
results: the MaxEnt model is more accurate
than the CRFs one, most clearly on unknown
words (79.60% against 69.51%). Because
these two models use the same feature set,
we suspect that the MaxEnt model is better
suited than the CRFs one to Vietnamese
POS tagging in terms of tagging accuracy.
These two models also have similar
tagging speeds, 50k and 47k words
per second respectively. This is probably
because they share a feature set (the CRFs model has
only one extra feature, so its speed is slightly
lower). vnTagger has some complicated
features, such as the conjunction of two tags,
and uses an outdated version of Stanford
Tagger, so its tagging speed is quite low
(13k words per second).
Meanwhile, the only tagger that does not use
a statistical approach, RDRPOSTagger,
reaches an impressive tagging speed of
161k words per second. The tagging speed
of a transformation-based tagger depends mainly
on the speed of its initial tagger.
RDRPOSTagger uses only a lexicon for the
initial tagger, so it runs very fast.
Nevertheless, its accuracy on unknown words
is poor (66.07%). Its initial tagger just uses some
rules to assign initial tags, and then it traverses
the rule tree to determine the final
tag of each word. Those rules seem
unable to handle unknown words well.
Footnote 7: https://github.com/phongnt570/UETsegmenter
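The two stages of a transformation-based tagger described above (a lexicon-based initial tagger, then a traversal of a rule tree) can be illustrated with a greatly simplified sketch. It is not RDRPOSTagger's actual rule format or API: the `Rule` class, the toy `LEXICON`, the default tag for unknown words and the single exception rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """One node of a simplified ripple-down rule tree."""
    condition: Callable[[dict], bool]
    tag: Optional[str]                    # None keeps the current tag
    except_rule: Optional["Rule"] = None  # tried when the condition holds
    else_rule: Optional["Rule"] = None    # tried when the condition fails


def apply_rules(root, context):
    """Walk the tree; the last rule whose condition held decides the tag."""
    tag = context["tag"]
    node = root
    while node is not None:
        if node.condition(context):
            if node.tag is not None:
                tag = node.tag
            node = node.except_rule
        else:
            node = node.else_rule
    return tag


# Hypothetical toy lexicon: word -> most frequent tag.
LEXICON = {"tôi": "P", "học": "V", "bài": "N"}


def initial_tags(words, default="N"):
    """Initial tagger: lexicon lookup, with a default tag for unknown words."""
    return [LEXICON.get(word, default) for word in words]


def tag_sentence(words, root):
    """Assign initial tags, then refine each one with the rule tree."""
    init = initial_tags(words)
    result = []
    for i, word in enumerate(words):
        context = {
            "word": word,
            "tag": init[i],
            "prev_tag": init[i - 1] if i > 0 else "<s>",
            "next_tag": init[i + 1] if i + 1 < len(words) else "</s>",
        }
        result.append(apply_rules(root, context))
    return result


# Hypothetical exception: a word tagged N right after a pronoun becomes V.
noun_after_pronoun = Rule(
    condition=lambda c: c["tag"] == "N" and c["prev_tag"] == "P",
    tag="V",
)
# The root rule always fires and keeps the initial tag.
root = Rule(condition=lambda c: True, tag=None, except_rule=noun_after_pronoun)

tags = tag_sentence(["tôi", "đọc", "bài"], root)
print(tags)
```

In this toy example the word "đọc" is missing from the lexicon, so it first receives the default tag and is corrected only because a rule happens to match its context. This is the kind of dependence on hand-shaped rules that can hurt accuracy on unknown words.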
The majority of the taggers in our experiments
are statistical taggers. In the next evaluation,
we use a common scheme to evaluate
these taggers, namely Stanford POS Tagger,
ClearNLP and JVnTagger. Although vnTagger
is also a statistical tagger, we do not carry it over to
the second evaluation because it is built on
Stanford Tagger, as mentioned.
It is worth repeating that the performance
of each statistical tagger depends mainly on
its feature set. The feature set we designed for
the second evaluation is presented in Table 8.
First, a simple feature set is applied to
all of the taggers. This set contains only the
1-gram and 2-gram features for words and some
simple features that capture the word shape and the
position of the word in the sentence. Next,
we progressively add more advanced
features to the feature set to discover which
ones have a large impact. The three kinds of
advanced feature are bidirectional-context, affix
and distributional semantic features.
The first and the third kinds are new to
current Vietnamese POS taggers. The second
kind is important for predicting the tags of
unknown words.

Table 7. The accuracy results (%) of current Vietnamese POS taggers with their default settings.
Ovr.: the overall accuracy. Unk.: the unknown words accuracy.
Spd.: the tagging speed (words per second)

             vnTagger              JVn – MaxEnt          RDRPOSTagger          JVn – CRFs
Feature set  Ovr.   Unk.   Spd.    Ovr.   Unk.   Spd.    Ovr.   Unk.   Spd.    Ovr.   Unk.   Spd.
default      93.88  77.70  13k     93.83  79.60  50k     93.68  66.07  161k    93.59  69.51  47k

Table 8. Feature set designed for experiments of
four statistical taggers. Dist. Semantics:
distributional semantics, ds_i is the cluster id of the
word w_i in the Brown cluster set

Feature set      Template
Simple           w_{−2,−1,0,1,2}
                 (w_{−1}, w_0), (w_0, w_1), (w_{−1}, w_1)
                 w_0 has initial uppercase letter?
                 w_0 contains number(s)?
                 w_0 contains punctuation mark(s)?
                 w_0 contains all uppercase letters?
                 w_0 is first or middle or last token?
Bidirectional    (w_0, t_{−1}), (w_0, t_1)
Affix            the first syllable of w_0
                 the last syllable of w_0
Dist. Semantics  ds_{−1}, ds_0, ds_1
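The templates in Table 8 can be made concrete with a small feature extractor. The sketch below is an illustration under stated assumptions, not the feature code of Stanford Tagger, ClearNLP or JVnTagger. It assumes that the syllables of a segmented word are joined by underscores, that neighbouring tags are already available when the bidirectional features are requested, and that `clusters` maps a word to its Brown cluster id. The feature names, padding symbols and `<unk>` fallback are invented for the example.

```python
import string


def word_at(words, i):
    """Return the word at position i, or a padding symbol outside the sentence."""
    if i < 0:
        return "<s>"
    if i >= len(words):
        return "</s>"
    return words[i]


def extract_features(words, i, tags=None, clusters=None):
    """Build the Table 8 features for the word at position i."""
    w0 = words[i]
    feats = {}

    # Simple: word 1-grams and 2-grams in a window around w_0.
    for k in (-2, -1, 0, 1, 2):
        feats[f"w[{k}]"] = word_at(words, i + k)
    feats["w[-1],w[0]"] = (word_at(words, i - 1), w0)
    feats["w[0],w[1]"] = (w0, word_at(words, i + 1))
    feats["w[-1],w[1]"] = (word_at(words, i - 1), word_at(words, i + 1))

    # Simple: word shape and position in the sentence.
    feats["init_upper"] = w0[:1].isupper()
    feats["has_digit"] = any(c.isdigit() for c in w0)
    # The underscore is excluded because it is assumed to join syllables.
    feats["has_punct"] = any(c in string.punctuation and c != "_" for c in w0)
    feats["all_upper"] = w0.isupper()
    if i == 0:
        feats["position"] = "first"
    elif i == len(words) - 1:
        feats["position"] = "last"
    else:
        feats["position"] = "middle"

    # Bidirectional: current word paired with the previous and the next tag.
    if tags is not None:
        feats["w[0],t[-1]"] = (w0, tags[i - 1] if i > 0 else "<s>")
        feats["w[0],t[1]"] = (w0, tags[i + 1] if i + 1 < len(tags) else "</s>")

    # Affix: first and last syllable of w_0.
    syllables = w0.split("_")
    feats["first_syllable"] = syllables[0]
    feats["last_syllable"] = syllables[-1]

    # Distributional semantics: cluster ids of w_{-1}, w_0 and w_1.
    if clusters is not None:
        for k in (-1, 0, 1):
            feats[f"ds[{k}]"] = clusters.get(word_at(words, i + k), "<unk>")

    return feats


sentence = ["Sinh_viên", "học", "bài"]
features = extract_features(
    sentence, 0, tags=["N", "V", "N"], clusters={"học": "0110"}
)
print(features["first_syllable"], features["last_syllable"])
```

Keeping the bidirectional and distributional features optional mirrors the evaluation, where feature groups are added one at a time on top of the simple set.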
The performance of the four statistical taggers
is presented in Table 9. Because JVnTagger
does not support bidirectional-context
features, we do not have results for it
with the feature sets containing this kind of
feature. From Table 9, we can see that with
the same simple feature set, these taggers
run at very similar speeds,
approximately 100k words per second.
However, their accuracies differ. With
the same feature set, the MaxEnt model of
Stanford Tagger significantly outperforms
the MaxEnt model of JVnTagger (93.96%
against 92.53% overall). We
suspect that this is due to the optimization
algorithm and some advanced techniques
used in Stanford Tagger. Moreover, with
this simple feature set, Stanford Tagger
also outperforms every Vietnamese
tagger with its default settings in the first
evaluation in overall accuracy. Stanford Tagger’s
techniques seem to be really effective. Next,
inside JVnTagger, with the same feature set,
the MaxEnt model again performs better than
the CRFs one, as in Table 7.
Bidirectional tagging is one of the
techniques that have not been applied
in current statistical Vietnamese POS
taggers. In this experiment, we add
two bidirectional-context features,
(w_0, t_{−1}) and (w_0, t_1), to the feature set. These
two features capture the information of the
tags near the current word. The results
in Table 9 reveal that bidirectional-context
features increase the overall accuracy
of Stanford Tagger (from 93.96% to 94.24%). However,
they also reduce the tagging speed of this tagger
dramatically, from 105k to 11k words per second.
For ClearNLP, which uses SVMs as its machine
learning method, this kind of feature has only a
small impact on both tagging accuracy and speed.
Table 9. The accuracy results (%) of the four statistical taggers. spl: the simple feature set. bi: the
bidirectional-context feature set. affix: the affix features. ds: the distributional semantic features

                 Stanford              ClearNLP              JVn – MaxEnt          JVn – CRFs
Feature set      Ovr.   Unk.   Spd.    Ovr.   Unk.   Spd.    Ovr.   Unk.   Spd.    Ovr.   Unk.   Spd.
spl              93.96  72.19  105k    92.95  68.36  107k    92.53  67.38  102k    91.57  67.34  99k
spl+bi           94.24  72.40  11k     93.08  68.35  93k     N/A    N/A    N/A     N/A    N/A    N/A
spl+bi+affix     94.42  78.03  10k     93.83  75.89  90k     N/A    N/A    N/A     N/A    N/A    N/A
spl+bi+affix+ds  94.53  81.00  8k      94.19  79.01  64k     N/A    N/A    N/A     N/A    N/A    N/A
Bidirectional-context features do not affect
the accuracy on unknown words. Meanwhile,
affix features play an important role in predicting
Vietnamese part-of-speech tags. In the next
phase of the evaluation, we add features
that capture the first and the last syllable of the
word being tagged to discover their impact
on the tagging accuracy. As shown in
Table 9, affix features increase the
unknown-word accuracy sharply, by about
5.6 points for Stanford Tagger and 7.5 points
for ClearNLP. In particular,
these features bring a large improvement in
the overall accuracy of ClearNLP (from 93.08%
to 93.83%). Moreover,
the tagging speeds of these taggers are only
slightly affected by the added features.
The last kind of advanced feature is the
distributional semantic one. This is a new
technique which has been applied to other
languages successfully. To extract this feature,
we build 1000 clusters of words with the
Brown clustering algorithm [16] using Liang’s
implementation (see footnote 8). The input corpus consists
of 2m articles collected from Vietnamese
websites. The results in Table 9 show that
distributional semantic features also help to
improve the unknown-word accuracy of
the taggers, by approximately 3 points for both
taggers. The overall accuracy also increases,
especially for ClearNLP. The tagging speeds
of the taggers decrease by about 20% to 30%
after adding this kind of feature.
Footnote 8: https://github.com/percyliang/brown-cluster
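To use Brown clusters as features, the clustering output has to be turned into a word-to-cluster lookup. The sketch below assumes the tab-separated `paths` output of the brown-cluster tool, with one line per word holding the bit-string path of its cluster, the word and its count. The function name, the sample lines and the `<unk>` fallback are illustrative, and the paper does not describe how the cluster ids were actually loaded.

```python
import io


def load_brown_clusters(lines):
    """Map each word to its cluster id from 'path<TAB>word<TAB>count' lines."""
    clusters = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        path, word = parts[0], parts[1]
        clusters[word] = path
    return clusters


# Made-up sample in the assumed format; a real file would come from the tool.
sample = io.StringIO("0010\thọc\t120\n0010\tđọc\t95\n1101\tbài\t80\n")
clusters = load_brown_clusters(sample)

# Words missing from the clustering fall back to a placeholder id.
print(clusters.get("đọc", "<unk>"), clusters.get("xyz", "<unk>"))
```

Because words that occur in similar contexts share a cluster, a word unseen in the tagged training data can still receive an informative cluster id, provided it occurs in the large unlabeled corpus.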
Overall, Stanford POS Tagger has the best
accuracy with every feature set, and
ClearNLP also performs well. With the full set of
features (spl+bi+affix+ds), both of these
international taggers outperform the
current Vietnamese ones with their default
settings in terms of overall tagging accuracy. This suggests
that some of the specialized features
in current Vietnamese taggers are not really
useful. The final results of Stanford Tagger
and ClearNLP are also the most accurate ones
for Vietnamese POS tagging known to us.
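The gains and slowdowns discussed in this section can be recomputed from Table 9. The sketch below copies the Stanford Tagger and ClearNLP rows from the table and reports, for each added feature group, the change in overall accuracy and unknown-word accuracy (in percentage points) and the relative change in speed. The names `TABLE9`, `STEPS` and `step_changes` are illustrative.

```python
# (overall accuracy %, unknown-word accuracy %, speed in k words per second),
# copied from Table 9.
TABLE9 = {
    "Stanford": {
        "spl": (93.96, 72.19, 105),
        "spl+bi": (94.24, 72.40, 11),
        "spl+bi+affix": (94.42, 78.03, 10),
        "spl+bi+affix+ds": (94.53, 81.00, 8),
    },
    "ClearNLP": {
        "spl": (92.95, 68.36, 107),
        "spl+bi": (93.08, 68.35, 93),
        "spl+bi+affix": (93.83, 75.89, 90),
        "spl+bi+affix+ds": (94.19, 79.01, 64),
    },
}

STEPS = ["spl", "spl+bi", "spl+bi+affix", "spl+bi+affix+ds"]


def step_changes(rows):
    """Return, per added feature group, (overall gain, unknown gain, speed change %)."""
    changes = {}
    for prev, cur in zip(STEPS, STEPS[1:]):
        ovr0, unk0, spd0 = rows[prev]
        ovr1, unk1, spd1 = rows[cur]
        changes[cur] = (
            round(ovr1 - ovr0, 2),
            round(unk1 - unk0, 2),
            round(100.0 * (spd1 - spd0) / spd0, 1),
        )
    return changes


for tagger, rows in TABLE9.items():
    for step, (d_ovr, d_unk, d_spd) in step_changes(rows).items():
        print(f"{tagger:9s} {step:16s} ovr {d_ovr:+.2f}  unk {d_unk:+.2f}  speed {d_spd:+.1f}%")
```

Read this way, the affix features account for most of the unknown-word gain for both taggers, while the bidirectional-context features account for nearly all of Stanford Tagger's slowdown.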
5. Conclusion
In this paper, we present an experimental
investigation of five part-of-speech taggers
for Vietnamese. In the investigation,
there are four statistical taggers, Stanford
POS Tagger, ClearNLP, vnTagger and
JVnTagger. The other one is RDRPOSTagger,
a transformation-based tagger. In terms of
tagging accuracy, we evaluate the statistical
taggers by progressively adding several kinds
of feature to them. The results reveal
that the bidirectional tagging algorithm, affix
features and distributional semantic features
help to improve the tagging accuracy of
the statistical taggers significantly. With
the full feature set, both Stanford
Tagger and ClearNLP outperform the
current Vietnamese taggers. In the speed
test, RDRPOSTagger achieves an impressive
tagging speed. The experimental results also
show that the tagging speed of any statistical
tagger depends mainly on its feature set. With
a simple feature set, all of the statistical
taggers in our experiments run at
similar speeds. However, giving a
complex feature set to the taggers can reduce
their tagging speeds sharply.
Acknowledgments
This work has been supported by Vietnam
National University, Hanoi (VNU), under
Project No. QG.14.04.
References
[1] P. Le-Hong, A. Roussanaly, T. M. H. Nguyen,
M. Rossignol, An empirical study of maximum
entropy approach for part-of-speech tagging of
Vietnamese texts, in: Traitement Automatique des
Langues Naturelles-TALN 2010, 2010, p. 12.
[2] D. Q. Nguyen, D. Q. Nguyen, D. D. Pham,
S. B. Pham, RDRPOSTagger: A Ripple
Down Rules-based Part-Of-Speech Tagger,
in: Proceedings of the Demonstrations at the
14th Conference of the European Chapter of
the Association for Computational Linguistics,
Association for Computational Linguistics,
Gothenburg, Sweden, 2014, pp. 17–20.
[3] C.-T. Nguyen, X.-H. Phan, T.-T. Nguyen,
JVnTextPro: A tool to process Vietnamese
texts, Tech. rep., Version 2.0,
sourceforge.net (2010).
[4] J. D. Choi, M. Palmer, Fast and robust
part-of-speech tagging using dynamic model
selection, in: Proceedings of the 50th Annual
Meeting of the Association for Computational
Linguistics: Short Papers-Volume 2, Association
for Computational Linguistics, 2012, pp.
363–367.
[5] K. Toutanova, D. Klein, C. D. Manning, Y. Singer,
Feature-rich part-of-speech tagging with a cyclic
dependency network, in: Proceedings of the 2003
Conference of the North American Chapter of
the Association for Computational Linguistics
on Human Language Technology-Volume 1,
Association for Computational Linguistics, 2003,
pp. 173–180.
[6] P.-T. Nguyen, X.-L. Vu, T.-M.-H. Nguyen,
V.-H. Nguyen, H.-P. Le, Building a large
syntactically-annotated corpus of Vietnamese, in:
Proceedings of the Third Linguistic Annotation
Workshop, Association for Computational
Linguistics, 2009, pp. 182–185.
[7] Z. Huang, W. Xu, K. Yu, Bidirectional
LSTM-CRF models for sequence tagging, arXiv
preprint arXiv:1508.01991.
[8] E. Brill, A simple rule-based part of speech tagger,
in: Proceedings of the workshop on Speech and
Natural Language, Association for Computational
Linguistics, 1992, pp. 112–116.
[9] R. Prins, G. Van Noord, Unsupervised
POS-Tagging Improves Parsing Accuracy
and Parsing Efficiency, in: IWPT, 2001.
[10] E. Brill, Unsupervised learning of disambiguation
rules for part of speech tagging, in: Proceedings
of the third workshop on very large corpora,
Vol. 30, Somerset, New Jersey: Association for
Computational Linguistics, 1995, pp. 1–13.
[11] O. T. Tran, C. A. Le, T. Q. Ha, Q. H. Le, An
experimental study on Vietnamese POS tagging,
in: International Conference on Asian Language
Processing, 2009. IALP’09., IEEE, 2009, pp.
23–27.
[12] X.-H. Phan, L.-M. Nguyen, Flexcrfs: Flexible
conditional random field toolkit.
[13] M. P. Marcus, M. A. Marcinkiewicz, B. Santorini,
Building a large annotated corpus of English: The
Penn Treebank, Computational linguistics 19 (2)
(1993) 313–330.
[14] D. Q. Nguyen, D. Q. Nguyen, D. D. Pham,
S. B. Pham, A robust transformation-based
learning approach using ripple down
rules for part-of-speech tagging, AI
Communications (Preprint) (2014) 1–14.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang,
C.-J. Lin, LIBLINEAR: A library for large linear
classification, The Journal of Machine Learning
Research 9 (2008) 1871–1874.
[16] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D.
Pietra, J. C. Lai, Class-based n-gram models of
natural language, Comput. Linguist. 18 (4) (1992)
467–479.
