An experimental investigation of part-of-speech taggers for Vietnamese
Part-of-speech (POS) tagging plays an important role in Natural Language Processing (NLP). Its applications
can be found in many other NLP tasks such as named entity recognition, syntactic parsing, dependency parsing and
text chunking. In the investigation conducted in this paper, we utilize the techniques of two widely-used toolkits,
ClearNLP and Stanford POS Tagger, and develop two new POS taggers for Vietnamese, then compare them to
three well-known Vietnamese taggers, namely JVnTagger, vnTagger and RDRPOSTagger. We make a systematic
comparison to find the tagger with the best performance. We also design a new feature set to measure the
performance of the statistical taggers. Our new taggers, built from Stanford Tagger and ClearNLP with the new
feature set, outperform all other current Vietnamese taggers in terms of tagging accuracy. Moreover, we
analyze the effect of some features on the performance of statistical taggers. Lastly, the experimental results
reveal that the transformation-based tagger, RDRPOSTagger, runs significantly faster than any statistical tagger.
N.T. Phong et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 32, No. 3 (2016) 11–25

about 1.5% – 2% of words in the test set which are unknown in every fold, as shown in Table 6.

Table 6. The experimental datasets

Fold   Total number of words   Number of unknown words
1      63277                   1164
2      63855                   1203
3      63482                   1247
4      62228                   1168
5      59854                   1056
6      63652                   1216
7      63759                   1146
8      63071                   1224
9      65121                   1242
10     63552                   1288

4.2. Evaluation

In our experiments, we first evaluate the current Vietnamese POS taggers, namely vnTagger, JVnTagger and RDRPOSTagger, with their default settings. Next, we design a set of features to evaluate the statistical taggers, including two international ones, Stanford Tagger and ClearNLP, and a current Vietnamese one, JVnTagger. We measure two aspects of the taggers: tagging accuracy and speed. The accuracy is measured using 10-fold cross-validation on the datasets described above. The speed test is run on a personal computer with 4 Intel Core i5-3337U CPUs @ 1.80GHz and 6GB of memory. The data used for the speed test is a corpus of 10k sentences collected from Vietnamese websites. This corpus was automatically segmented by UETsegmenter (see footnote 7) and contains about 250k words. All taggers use their single-threaded implementation for the speed test. The test is repeated many times and the average speed of each tagger is reported. We only use the Java implementation of RDRPOSTagger in the experiments because its author claims that this version runs significantly faster than the other one.

We present the performance of the current Vietnamese taggers in Table 7. As we can see, the overall accuracies of the taggers are quite close to one another with their default feature sets. The most accurate ones are vnTagger and the MaxEnt model of JVnTagger. In particular, these two taggers achieve much higher accuracies for unknown words than the other two. Their specialized features for this kind of word seem to be very effective.
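As a quick check on the unknown-word share reported in Table 6, the per-fold percentages can be recomputed directly from the table. The following is a minimal Python sketch; the numbers are copied from Table 6, while the names `FOLDS` and `unknown_rates` are illustrative and do not come from the paper.

```python
# (total words, unknown words) for folds 1..10, copied from Table 6.
FOLDS = [
    (63277, 1164), (63855, 1203), (63482, 1247), (62228, 1168), (59854, 1056),
    (63652, 1216), (63759, 1146), (63071, 1224), (65121, 1242), (63552, 1288),
]


def unknown_rates(folds):
    """Return the percentage of unknown words in each fold."""
    return [100.0 * unknown / total for total, unknown in folds]


rates = unknown_rates(FOLDS)
for fold, rate in enumerate(rates, start=1):
    print(f"fold {fold:2d}: {rate:.2f}% unknown")
print(f"min {min(rates):.2f}%, max {max(rates):.2f}%")
```

Every fold lands between roughly 1.7% and just over 2%, so the unknown-word accuracy in each fold is estimated from little more than a thousand tokens.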
Inside the JVnTagger toolkit, the two models provide different results. The MaxEnt model of JVnTagger is more accurate than the CRFs one, particularly for unknown words. Because these two models use the same feature set, we suspect that the MaxEnt model is more effective than the CRFs one for Vietnamese POS tagging in terms of tagging accuracy. The two models run at nearly the same tagging speed, 50k and 47k words per second respectively. This is probably because they share the same feature set (the CRFs model has only one extra feature, so its speed is slightly lower). vnTagger has some complicated features, such as the conjunction of two tags, and uses an outdated version of Stanford Tagger; as a result, its tagging speed is quite low. Meanwhile, the only tagger that does not use a statistical approach, RDRPOSTagger, produces an impressive tagging speed of 161k words per second. The tagging speed of a transformation-based tagger depends mainly on the speed of its initial tagger. RDRPOSTagger only uses a lexicon for the initial tagger, so it can run really fast. Nevertheless, its accuracy for unknown words is not good. Its initial tagger just uses some rules to assign initial tags, and then it traverses the rule tree to determine the final result for each word. Those rules seem unable to handle unknown words well.

Footnote 7: https://github.com/phongnt570/UETsegmenter

The majority of the taggers in our experiments are statistical taggers. In the next evaluation, we create a single, uniform scheme to evaluate these taggers, namely Stanford POS Tagger, ClearNLP and JVnTagger. Although vnTagger is also a statistical tagger, we do not carry it over to the second evaluation because it is built on Stanford Tagger, as mentioned. It is worth repeating that the performance of each statistical tagger depends mainly on its feature set. The feature set we designed for the second evaluation is presented in Table 8. Firstly, a simple feature set will be applied to all of the taggers.
This set only contains the 1-gram and 2-gram features for words and some simple ones that capture the word shape and the position of the word in the sentence. Next, we incrementally add more advanced features to the feature set to discover which ones have a large impact. The three kinds of advanced feature are bidirectional-context, affix and distributional semantic ones. The first and the third are new to the current Vietnamese POS taggers. The second one is important for predicting the tags of unknown words.

Table 7. The accuracy results (%) of current Vietnamese POS taggers with their default settings. Ovr.: the overall accuracy. Unk.: the unknown words accuracy. Spd.: the tagging speed (words per second)

Tagger          Ovr.    Unk.    Spd.
vnTagger        93.88   77.70   13k
JVn – MaxEnt    93.83   79.60   50k
RDRPOSTagger    93.68   66.07   161k
JVn – CRFs      93.59   69.51   47k

Table 8. Feature set designed for experiments of four statistical taggers. Dist. Semantics: distributional semantics; ds_i is the cluster id of the word w_i in the Brown cluster set

Feature set        Template
Simple             w_i, i ∈ {−2, −1, 0, 1, 2}
                   (w_{-1}, w_0), (w_0, w_1), (w_{-1}, w_1)
                   w_0 has initial uppercase letter?
                   w_0 contains number(s)?
                   w_0 contains punctuation mark(s)?
                   w_0 contains all uppercase letters?
                   w_0 is first or middle or last token?
Bidirectional      (w_0, t_{-1}), (w_0, t_1)
Affix              the first syllable of w_0
                   the last syllable of w_0
Dist. Semantics    ds_{-1}, ds_0, ds_1

The performance of the four statistical taggers is presented in Table 9. Because JVnTagger does not support bidirectional-context features, we do not have results for it with the feature sets containing this kind of feature. From Table 9, we can see that with the same simple feature set, these taggers run at very similar speeds, approximately 100k words per second.
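The templates in Table 8 can be read as a function from a word position to a set of features. The following is a minimal Python sketch of that reading, not the implementation used by any of the taggers. It assumes that the syllables of a segmented word are joined by underscores, that a `<PAD>` token stands in for positions outside the sentence, and that `clusters` is a hypothetical dictionary from words to Brown cluster ids. All function and feature names are illustrative. The bidirectional templates are left out because they depend on tags predicted during decoding.

```python
import string

# Underscore is excluded: it is assumed to join syllables inside one word.
PUNCT = set(string.punctuation) - {"_"}
PAD = "<PAD>"  # assumed placeholder for positions outside the sentence
UNK = "<UNK>"  # assumed cluster id for words missing from the cluster set


def word_at(words, j):
    """Return the word at index j, or PAD if j is outside the sentence."""
    return words[j] if 0 <= j < len(words) else PAD


def simple_features(words, i):
    """'Simple' templates of Table 8 for the word at position i."""
    w0 = words[i]
    feats = {f"w[{k}]": word_at(words, i + k) for k in (-2, -1, 0, 1, 2)}
    feats["w[-1],w[0]"] = (word_at(words, i - 1), w0)
    feats["w[0],w[1]"] = (w0, word_at(words, i + 1))
    feats["w[-1],w[1]"] = (word_at(words, i - 1), word_at(words, i + 1))
    feats["init_upper"] = w0[:1].isupper()
    feats["has_digit"] = any(c.isdigit() for c in w0)
    feats["has_punct"] = any(c in PUNCT for c in w0)
    feats["all_upper"] = w0.isupper()
    if i == 0:
        feats["position"] = "first"
    elif i == len(words) - 1:
        feats["position"] = "last"
    else:
        feats["position"] = "middle"
    return feats


def affix_features(word):
    """'Affix' templates: first and last syllable of the current word."""
    syllables = word.split("_")
    return {"first_syllable": syllables[0], "last_syllable": syllables[-1]}


def ds_features(words, i, clusters):
    """'Dist. Semantics' templates: cluster ids of w[-1], w[0], w[1]."""
    return {f"ds[{k}]": clusters.get(word_at(words, i + k), UNK)
            for k in (-1, 0, 1)}


if __name__ == "__main__":
    sentence = ["Tôi", "là", "sinh_viên", "."]
    toy_clusters = {"là": "0101", "sinh_viên": "1100"}  # made-up cluster ids
    print(simple_features(sentence, 2))
    print(affix_features(sentence[2]))
    print(ds_features(sentence, 2, toy_clusters))
```

Keeping each template group in its own function mirrors the way the feature set is grown step by step in the evaluation: a tagger's feature dictionary is simply the union of the groups that are switched on.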
However, their accuracies are different. With the same feature set, the MaxEnt model of Stanford Tagger significantly outperforms the MaxEnt model of JVnTagger. We suspect that this is due to the optimization algorithm and some advanced techniques used in Stanford Tagger. Moreover, with this simple feature set, Stanford Tagger also achieves a higher overall accuracy than every Vietnamese tagger with its default settings in the first evaluation presented above. Stanford Tagger's techniques seem to be really effective. Next, inside JVnTagger, with the same feature set, the MaxEnt model again performs better than the CRFs one, just as in the results reported in Table 7.

Bidirectional tagging is one of the techniques that have not been applied in current statistical Vietnamese POS taggers. In this experiment, we add two bidirectional-context features, (w_0, t_{-1}) and (w_0, t_1), to the feature set. These two features capture information about the tags adjacent to the current word. The results in Table 9 reveal that bidirectional-context features significantly increase the overall accuracy of Stanford Tagger. However, they also reduce the tagging speed of this tagger dramatically. For ClearNLP, which uses SVMs for its machine learning process, this kind of feature has only a small impact on both tagging accuracy and speed.

Table 9. The accuracy results (%) of the four statistical taggers. spl: the simple feature set. bi: the bidirectional-context feature set. affix: the affix features. ds: the distributional semantic features

Feature set        Stanford                ClearNLP                JVn – MaxEnt            JVn – CRFs
                   Ovr.    Unk.    Spd.    Ovr.    Unk.    Spd.    Ovr.    Unk.    Spd.    Ovr.    Unk.    Spd.
spl                93.96   72.19   105k    92.95   68.36   107k    92.53   67.38   102k    91.57   67.34   99k
spl+bi             94.24   72.40   11k     93.08   68.35   93k     N/A                     N/A
spl+bi+affix       94.42   78.03   10k     93.83   75.89   90k     N/A                     N/A
spl+bi+affix+ds    94.53   81.00   8k      94.19   79.01   64k     N/A                     N/A

Bidirectional-context features do not affect the accuracy of unknown words. Meanwhile, affix features play an important role in predicting Vietnamese part-of-speech tags. In the next phase of the evaluation, we add features that capture the first and the last syllable of the word being tagged to discover their impact on the tagging accuracy. As revealed in Table 9, affix features sharply increase the unknown words accuracy, by about 5.6 percentage points for Stanford Tagger and 7.5 for ClearNLP. In particular, these features bring a large improvement in the overall accuracy of ClearNLP. Moreover, the tagging speeds of these taggers are only slightly affected by the added features.

The last kind of advanced feature is the distributional semantic one. This is a new technique which has been applied to other languages successfully. To extract this feature, we build 1000 clusters of words with the Brown clustering algorithm [16] using Liang's implementation (see footnote 8). The input corpus consists of 2m articles collected from Vietnamese websites. The results in Table 9 show that distributional semantic features also help to improve the unknown words accuracy of the taggers, by approximately 3 percentage points for both taggers. The overall accuracy also increases, especially for ClearNLP. The tagging speeds of the taggers decrease by about 20% to 30% after adding this kind of feature.

Footnote 8: https://github.com/percyliang/brown-cluster

Overall, Stanford POS Tagger is the one that has the best accuracy with every feature set. ClearNLP also performs well. With the full set of features (spl+bi+affix+ds), both of these international taggers outperform the current Vietnamese ones with their default settings in terms of tagging accuracy.
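The step-by-step effects discussed above can be recomputed from Table 9. The following is a small Python sketch under the assumption that the table values are exact; the names `TABLE9`, `ORDER` and `step_changes` are illustrative. For each added feature group it reports the change in overall accuracy, the change in unknown-word accuracy (both in percentage points) and the relative change in tagging speed.

```python
# (overall accuracy %, unknown-word accuracy %, speed in words/second),
# copied from Table 9 for the two taggers that support every feature set.
TABLE9 = {
    "Stanford": {
        "spl": (93.96, 72.19, 105_000),
        "spl+bi": (94.24, 72.40, 11_000),
        "spl+bi+affix": (94.42, 78.03, 10_000),
        "spl+bi+affix+ds": (94.53, 81.00, 8_000),
    },
    "ClearNLP": {
        "spl": (92.95, 68.36, 107_000),
        "spl+bi": (93.08, 68.35, 93_000),
        "spl+bi+affix": (93.83, 75.89, 90_000),
        "spl+bi+affix+ds": (94.19, 79.01, 64_000),
    },
}
ORDER = ["spl", "spl+bi", "spl+bi+affix", "spl+bi+affix+ds"]


def step_changes(rows):
    """Return (feature set, d_overall, d_unknown, speed change %) per step."""
    changes = []
    for prev, cur in zip(ORDER, ORDER[1:]):
        p_ovr, p_unk, p_spd = rows[prev]
        c_ovr, c_unk, c_spd = rows[cur]
        speed_pct = 100.0 * (c_spd - p_spd) / p_spd
        changes.append((cur, c_ovr - p_ovr, c_unk - p_unk, speed_pct))
    return changes


for tagger, rows in TABLE9.items():
    print(tagger)
    for name, d_ovr, d_unk, d_spd in step_changes(rows):
        print(f"  {name:16s} ovr {d_ovr:+.2f}  unk {d_unk:+.2f}  speed {d_spd:+.1f}%")
```

Laying the numbers out this way makes the trade-off explicit: the bidirectional features cost Stanford Tagger most of its speed for a modest accuracy gain, whereas the affix features buy the largest unknown-word improvement at almost no speed cost.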
This suggests that some of the specialized features in current Vietnamese taggers are not really useful. The final results of Stanford Tagger and ClearNLP are also the most accurate ones for Vietnamese POS tagging known to us.

5. Conclusion

In this paper, we present an experimental investigation of five part-of-speech taggers for Vietnamese. Four of them are statistical taggers: Stanford POS Tagger, ClearNLP, vnTagger and JVnTagger. The other one, RDRPOSTagger, is a transformation-based tagger. In terms of tagging accuracy, we evaluate the statistical taggers by incrementally adding several kinds of feature to them. The results reveal that the bidirectional tagging algorithm, affix features and distributional semantic features help to improve the tagging accuracy of the statistical taggers significantly. With the full provided feature set, both Stanford Tagger and ClearNLP outperform the current Vietnamese taggers. In the speed test, RDRPOSTagger produces an impressive tagging speed. The experimental results also show that the tagging speed of any statistical tagger depends mainly on its feature set. With a simple feature set, all of the statistical taggers in our experiments run at nearly the same speed. However, giving a complex feature set to the taggers can reduce their tagging speeds sharply.

Acknowledgments

This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.14.04.

References

[1] P. Le-Hong, A. Roussanaly, T. M. H. Nguyen, M. Rossignol, An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts, in: Traitement Automatique des Langues Naturelles-TALN 2010, 2010, p. 12.
[2] D. Q. Nguyen, D. Q. Nguyen, D. D. Pham, S. B. Pham, RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger, in: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 17–20.
[3] C.-T. Nguyen, X.-H. Phan, T.-T. Nguyen, JVnTextPro: A tool to process Vietnamese texts, Tech. rep., Version 2.0, sourceforge.net (2010).
[4] J. D. Choi, M. Palmer, Fast and robust part-of-speech tagging using dynamic model selection, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, Association for Computational Linguistics, 2012, pp. 363–367.
[5] K. Toutanova, D. Klein, C. D. Manning, Y. Singer, Feature-rich part-of-speech tagging with a cyclic dependency network, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, Association for Computational Linguistics, 2003, pp. 173–180.
[6] P.-T. Nguyen, X.-L. Vu, T.-M.-H. Nguyen, V.-H. Nguyen, H.-P. Le, Building a large syntactically-annotated corpus of Vietnamese, in: Proceedings of the Third Linguistic Annotation Workshop, Association for Computational Linguistics, 2009, pp. 182–185.
[7] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991.
[8] E. Brill, A simple rule-based part of speech tagger, in: Proceedings of the workshop on Speech and Natural Language, Association for Computational Linguistics, 1992, pp. 112–116.
[9] R. Prins, G. Van Noord, Unsupervised POS-Tagging Improves Parsing Accuracy and Parsing Efficiency, in: IWPT, 2001.
[10] E. Brill, Unsupervised learning of disambiguation rules for part of speech tagging, in: Proceedings of the third workshop on very large corpora, Vol. 30, Somerset, New Jersey: Association for Computational Linguistics, 1995, pp. 1–13.
[11] O. T. Tran, C. A. Le, T. Q. Ha, Q. H. Le, An experimental study on Vietnamese POS tagging, in: International Conference on Asian Language Processing, 2009. IALP'09., IEEE, 2009, pp. 23–27.
[12] X.-H. Phan, L.-M. Nguyen, Flexcrfs: Flexible conditional random field toolkit.
[13] M. P. Marcus, M. A. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Computational linguistics 19 (2) (1993) 313–330.
[14] D. Q. Nguyen, D. Q. Nguyen, D. D. Pham, S. B. Pham, A robust transformation-based learning approach using ripple down rules for part-of-speech tagging, AI Communications (Preprint) (2014) 1–14.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, The Journal of Machine Learning Research 9 (2008) 1871–1874.
[16] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, J. C. Lai, Class-based n-gram models of natural language, Comput. Linguist. 18 (4) (1992) 467–479.