Vietnamese semantic role labelling
In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language
sentences, and its application to the Vietnamese language. We present our effort in building Vietnamese
PropBank, the first Vietnamese SRL corpus, and a software system for labelling semantic roles of Vietnamese
texts. In particular, we present a novel constituent extraction algorithm for the argument candidate identification
step, which is more suitable and more accurate than the common node-mapping method. In the machine learning
part, our system integrates distributed word features produced by two recent unsupervised learning models into
two learned statistical classifiers and uses an integer linear programming inference procedure to improve
the accuracy. The system is evaluated in a series of experiments and achieves an F1 score of
74.77%. Our system, including corpus and software, is available as an open source project for free research use, and
we believe that it is a good baseline for the development of future Vietnamese SRL systems.
Summary of document content: Vietnamese semantic role labelling
L.H. Phuong et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 39-58

The main idea of GloVe is to use word-word co-occurrence counts to estimate the co-occurrence probabilities rather than the probabilities by themselves. Let $P_{ij}$ denote the probability that word $j$ appears in the context of word $i$; $w_i \in \mathbb{R}^d$ and $w_j \in \mathbb{R}^d$ denote the word vectors of word $i$ and word $j$ respectively. It is shown that

$$w_i^\top w_j = \log(P_{ij}) = \log(C_{ij}) - \log(C_i), \tag{12}$$

where $C_{ij}$ is the number of times word $j$ occurs in the context of word $i$, and $C_i = \sum_k C_{ik}$ is the total number of context occurrences for word $i$.

It turns out that GloVe is a global log-bilinear regression model. Finding word vectors is equivalent to solving a weighted least-squares regression model with the cost function:

$$J = \sum_{i,j=1}^{n} f(C_{ij}) \left( w_i^\top w_j + b_i + b_j - \log(C_{ij}) \right)^2, \tag{13}$$

where $n$ is the size of the vocabulary, $b_i$ and $b_j$ are additional bias terms and $f(C_{ij})$ is a weighting function. A class of weighting functions which are found to work well can be parameterized as

$$f(x) = \begin{cases} \left( \dfrac{x}{x_{\max}} \right)^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \tag{14}$$

The training code was obtained from the tool GloVe and we used a word appearance threshold of 2,000. Figure 11 shows the scatter plot of the same words as in Figure 10, but this time their word vectors are produced by the GloVe model.

Figure 11. Some Vietnamese words produced by the GloVe model, projected onto two dimensions.

4.3.3. Text corpus

To create distributed word representations, we use a dataset consisting of 7.3GB of text from 2 million articles collected through a Vietnamese news portal. The text is first normalized to lower case and all special characters are removed except these common symbols: the comma, the semicolon, the colon, the full stop and the percentage sign. All numeral sequences are replaced with a special token, so that correlations between certain words and numbers are correctly recognized by the neural network or the log-bilinear regression model.

Each word in the Vietnamese language may consist of more than one syllable with spaces in between, which could be regarded as multiple words by the unsupervised models. Hence it is necessary to replace the spaces within each word with underscores to create full word tokens. The tokenization process follows the method described in [37]. After removal of special characters and tokenization, the articles add up to 969 million word tokens, spanning a vocabulary of 1.5 million unique tokens. We train the unsupervised models with the full vocabulary to obtain the representation vectors, and then prune the collection of word vectors to the 65,000 most frequent words, excluding special symbols and the token representing numeral sequences.

4.3.4. SRL with distributed word representations

We train the two word embedding models on the same text corpus presented in the previous subsections to produce distributed word representations, where each word is represented by a real-valued vector of 50 dimensions.

In the last experiment, we replace predicate or head word features in our SRL system by their corresponding word vectors. For predicates which are composed of multiple words, we first tokenize them into individual words and then average their vectors to get vector representations. Table 14 and Table 15 show the performance of the Skip-gram and GloVe models for the predicate feature and for the head word feature, respectively.

Table 14. The impact of word embeddings of predicate

     Precision   Recall    F1
A    78.29%      71.48%    74.73%
B    78.37%      71.49%    74.77%
C    78.29%      71.38%    74.67%

A: Predicate word; B: Skip-gram vector; C: GloVe vector

Table 15. The impact of word embeddings of head word

     Precision   Recall    F1
A    78.29%      71.48%    74.73%
B    77.53%      70.76%    73.99%
C    78.12%      71.58%    74.71%

A: Head word; B: Skip-gram vector; C: GloVe vector

We see that neither of the two types of word embeddings noticeably decreases the accuracy of the system: in every configuration the F1 score stays within 0.8 points of the 74.73% obtained with the lexical features, and the Skip-gram predicate vectors give the best score, 74.77%. In other words, their use can help generalize the system to unseen words.

5. Conclusion

We have presented our work on developing a semantic role labelling system for the Vietnamese language. The system comprises two main components, a corpus and a software system. Our system achieves an F1 score of about 74.8%.

We have argued that one cannot assume a good applicability of existing methods and tools developed for English and other occidental languages and that they may not offer cross-language validity. For an isolating language such as Vietnamese, techniques developed for inflectional languages cannot be applied "as is". In particular, we have developed an algorithm for extracting argument candidates which has a better accuracy than the 1-1 node mapping algorithm. We have proposed some novel features which prove to be useful for Vietnamese semantic role labelling, notably function tags and distributed word representations. We have employed integer linear programming, a recent inference technique capable of incorporating a wide variety of linguistic constraints to improve the performance of the system. We have also demonstrated the efficacy of distributed word representations produced by two unsupervised learning models in dealing with unknown words.

In the future, we plan to further improve our system, on the one hand, by enlarging our corpus so as to provide more data for the system. On the other hand, we would like to investigate different models used in SRL, for example joint models [38], where arguments and semantic roles are jointly embedded in a shared vector space for a given predicate. In addition, we would like to explore the possibility of integrating dynamic constraints in the integer linear programming procedure. We expect the overall performance of our SRL system to improve.

Our system, including software and corpus, is available as an open source project for free research purposes (https://github.com/pth1993/vnSRL) and we believe that it is a good baseline for the development and comparison of future Vietnamese SRL systems. We plan to integrate this tool into Vitk (https://github.com/phuonglh/vn.vitk), an open-source toolkit for processing Vietnamese text, which contains fundamental processing tools and is readily scalable for processing very large text data.

References

[1] Shen, D., and Lapata, M. 2007, "Using semantic roles to improve question answering", In Proceedings of Conference on Empirical Methods on Natural Language Processing and Computational Natural Language Learning, Czech Republic: Prague, pp. 12–21.
[2] Lo, C. K., and Wu, D. 2010, "Evaluating machine translation utility via semantic role labels", In Proceedings of The International Conference on Language Resources and Evaluation, Malta: Valletta, pp. 2873–7.
[3] Aksoy, C., Bugdayci, A., Gur, T., Uysal, I., and Can, F. 2009, "Semantic argument frequency-based multi-document summarization", In Proceedings of the 24th of the International Symposium on Computer and Information Sciences, Guzelyurt, Turkey, pp. 460–4.
[4] Christensen, J., Soderland, S., and Etzioni, O. 2010, "Semantic role labeling for open information extraction", In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, USA: Los Angeles, CA, pp. 52–60.
[5] Gildea, D., and Jurafsky, D. 2002, "Automatic labeling of semantic roles", Computational Linguistics, 28(3): 245–88.
[6] Carreras, X., and Màrquez, L. 2004, "Introduction to the CoNLL-2004 shared task: semantic role labeling", In Proceedings of the 8th Conference on Computational Natural Language Learning, USA: Boston, MA, pp. 89–97.
[7] Carreras, X., and Màrquez, L. 2005, "Introduction to the CoNLL-2005 shared task: semantic role labeling", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 152–64.
[8] Xue, N., and Palmer, M. 2005, "Automatic semantic role labeling for Chinese verbs", In Proceedings of International Joint Conferences on Artificial Intelligence, Scotland: Edinburgh, pp. 1160–5.
[9] Tagami, H., Hizuka, S., and Saito, H. 2009, "Automatic semantic role labeling based on Japanese FrameNet–A Progress Report", In Proceedings of Conference of the Pacific Association for Computational Linguistics, Japan: Hokkaido University, Sapporo, pp. 181–6.
[10] Nguyen, T.-L., Ha, M.-L., Nguyen, V.-H., Nguyen, T.-M.-H., Le-Hong, P., and Phan, T.-H. 2014, "Building a semantic role annotated corpus for Vietnamese", In Proceedings of the 17th National Symposium on Information and Communication Technology, Daklak, Vietnam, pp. 409–414.
[11] Pham, T. H., Pham, X. K., and Le-Hong, P. 2015, "Building a semantic role labelling system for Vietnamese", In Proceedings of the 10th International Conference on Digital Information Management, South Korea: Jeju Island, pp. 77–84.
[12] Baker, C. F., Fillmore, C. J., and Cronin, B. 2003, "The structure of the FrameNet database", International Journal of Lexicography, 16(3): 281–96.
[13] Boas, H. C. 2005, "From theory to practice: Frame semantics and the design of FrameNet", Semantisches Wissen im Lexikon: 129–60.
[14] Palmer, M., Kingsbury, P., and Gildea, D. 2005, "The proposition bank: An annotated corpus of semantic roles", Computational Linguistics, 31(1): 71–106.
[15] Schuler, K. K. 2006, "VerbNet: A broad-coverage, comprehensive verb lexicon", PhD Thesis, University of Pennsylvania.
[16] Levin, B. 1993, "English Verb Classes and Alternation: A Preliminary Investigation", Chicago: The University of Chicago Press.
[17] Cao, X. H. 2006, "Tiếng Việt - Sơ thảo ngữ pháp chức năng (Vietnamese - Introduction to Functional Grammar)", Hà Nội: NXB Giáo dục.
[18] Nguyễn, V. H. 2008, "Cơ sở ngữ nghĩa phân tích cú pháp (Semantic Basis of Grammatical Parsing)", Hà Nội: NXB Giáo dục.
[19] Diệp, Q. B. 1998, "Ngữ pháp tiếng Việt, Tập I, II (Vietnamese Grammar, Volume I, II)", Hà Nội: NXB Giáo dục.
[20] Nguyen, P. T., Vu, X. L., Nguyen, T. M. H., Nguyen, V. H., and Le-Hong, P. 2009, "Building a large syntactically-annotated corpus of Vietnamese", In Proceedings of the 3rd Linguistic Annotation Workshop, ACL-IJCNLP, Singapore: Suntec City, pp. 182–5.
[21] Koomen, P., Punyakanok, V., Roth, D., and Yih, W. T. 2005, "Generalized inference with multiple semantic role labeling systems", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 181–4.
[22] Haghighi, A., Toutanova, K., and Manning, C. D. 2005, "A joint model for semantic role labeling", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 173–6.
[23] Surdeanu, M., and Turmo, J. 2005, "Semantic role labeling using complete syntactic analysis", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 221–4.
[24] Màrquez, L., Comas, P., Gimenez, J., and Catala, N. 2005, "Semantic role labeling as sequential tagging", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 193–6.
[25] Pradhan, S., Hacioglu, K., Ward, W., Martin, J. H., and Jurafsky, D. 2005, "Semantic role chunking combining complementary syntactic views", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 217–20.
[26] Le-Hong, P., Roussanaly, A., and Nguyen, T. M. H. 2015, "A syntactic component for Vietnamese language processing", Journal of Language Modelling, 3(1): 145–84.
[27] Xue, N., and Palmer, M. 2004, "Calibrating features for semantic role labeling", In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Spain: Barcelona, pp. 88–94.
[28] Punyakanok, V., Roth, D., Yih, W. T., and Zimak, D. 2004, "Semantic role labeling via integer linear programming inference", In Proceedings of the 20th International Conference on Computational Linguistics, Switzerland: University of Geneva, pp. 1346–52.
[29] Turian, J., Ratinov, L., and Bengio, Y. 2010, "Word representations: A simple and general method for semi-supervised learning", In Proceedings of ACL, Sweden: Uppsala, pp. 384–94.
[30] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. 2003, "A neural probabilistic language model", Journal of Machine Learning Research, 3: 1137–55.
[31] Morin, F., and Bengio, Y. 2005, "Hierarchical probabilistic neural network language model", In Proceedings of AISTATS, Barbados, pp. 246–52.
[32] Collobert, R., and Weston, J. 2008, "A unified architecture for natural language processing: deep neural networks with multitask learning", In Proceedings of ICML, USA: New York, NY, pp. 160–7.
[33] Mnih, A., and Hinton, G. E. 2009, "A scalable hierarchical distributed language model", In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (ed.), Advances in Neural Information Processing Systems 21, Curran Associates, Inc., pp. 1081–8.
[34] Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013, "Efficient estimation of word representations in vector space", In Proceedings of Workshop at ICLR, USA: Scottsdale, AZ, pp. 1–12.
[35] Pennington, J., Socher, R., and Manning, C. D. 2014, "GloVe: Global vectors for word representation", In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Qatar: Doha, pp. 1532–43.
[36] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. 2013, "Distributed representations of words and phrases and their compositionality", In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (ed.), Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pp. 3111–19.
[37] Le-Hong, P., Nguyen, T. M. H., Roussanaly, A., and Ho, T. V. 2008, "A hybrid approach to word segmentation of Vietnamese texts", In Carlos, M-V., Friedrich, O., and Henning, F. (ed.), Language and Automata Theory and Applications, Lecture Notes in Computer Science. Berlin: Springer Berlin Heidelberg, pp. 240–49.
[38] FitzGerald, N., Täckström, O., Ganchev, K., and Das, D. 2015, "Semantic role labeling with neural network factors", In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Portugal: Lisbon, pp. 960–70.
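As an illustration of the GloVe cost in Eq. (13) and the weighting function in Eq. (14), the following is a minimal sketch in plain Python, not the training code used for the experiments. It follows the notation of the text, with a single vector and bias per word. The function names, the dictionary-based data layout, and the default values `x_max=100` and `alpha=0.75` (the values suggested in the GloVe paper [35]) are assumptions for illustration; the text does not state which values were used.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) of Eq. (14): (x / x_max)^alpha below the cutoff, 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def glove_cost(counts, vectors, biases, x_max=100.0, alpha=0.75):
    """Weighted least-squares cost J of Eq. (13).

    counts  : dict mapping (i, j) -> C_ij, the co-occurrence count
    vectors : dict mapping word index -> list of floats (w_i)
    biases  : dict mapping word index -> float (b_i)
    Pairs with a zero count are skipped: log(0) is undefined and f(0) = 0.
    """
    total = 0.0
    for (i, j), c in counts.items():
        if c <= 0:
            continue
        error = dot(vectors[i], vectors[j]) + biases[i] + biases[j] - math.log(c)
        total += glove_weight(c, x_max, alpha) * error ** 2
    return total
```

Training would minimise `glove_cost` over the vectors and biases, for instance by stochastic gradient descent; the sketch only evaluates the objective for given parameters.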
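The corpus preparation described in Section 4.3.3 (lower-casing, removal of special characters, replacement of numeral sequences, and joining of syllables with underscores) can be sketched as follows. This is a hedged illustration only: the token name `<number>` is hypothetical because the text does not show the actual token, the regular expressions are one possible reading of the description, and `join_syllables` is a toy greedy longest-match lookup standing in for the word segmentation method of [37], which it does not reproduce.

```python
import re
import unicodedata

NUMBER_TOKEN = "<number>"  # hypothetical name; the actual token is not given in the text


def normalize(text):
    """Lower-case, strip special characters, and replace numeral sequences."""
    # Compose diacritics so that Vietnamese letters count as word characters.
    text = unicodedata.normalize("NFC", text.lower())
    # Keep letters, digits, whitespace and the five symbols , ; : . %
    text = re.sub(r"[^\w\s,;:.%]", " ", text)
    # Replace each numeral sequence (optionally with , or . separators) by one token.
    text = re.sub(r"\d+(?:[.,]\d+)*", f" {NUMBER_TOKEN} ", text)
    # Make the kept symbols stand-alone tokens.
    text = re.sub(r"([,;:.%])", r" \1 ", text)
    return " ".join(text.split())


def join_syllables(syllables, lexicon, max_len=4):
    """Join the syllables of known multi-syllable words with underscores.

    lexicon is a set of space-separated multi-syllable words (an assumption
    of this sketch); the longest match starting at each position wins.
    """
    tokens, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                tokens.append(candidate.replace(" ", "_"))
                i += n
                break
    return tokens
```

For example, with a lexicon containing "sinh viên" and "tiếng việt", the syllable sequence "sinh viên học tiếng việt" becomes the three tokens `sinh_viên`, `học` and `tiếng_việt`, so that the unsupervised models see each word as a single unit.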
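The averaging step of Section 4.3.4, where a multi-word predicate is represented by the mean of the vectors of its words, can be sketched as below. The function name, the whitespace split, and the zero-vector fallback for predicates with no known word are assumptions of this sketch; the text only states that the individual word vectors are averaged.

```python
def predicate_vector(predicate, embeddings, dim=50):
    """Average the word vectors of a (possibly multi-word) predicate.

    predicate  : string whose words are separated by spaces
    embeddings : dict mapping word -> list of `dim` floats
    Words missing from `embeddings` are ignored; if none is known, a zero
    vector is returned (an assumption, not stated in the text).
    """
    known = [embeddings[w] for w in predicate.split() if w in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions; each column is averaged.
    return [sum(column) / len(known) for column in zip(*known)]
```

The resulting vector has the same dimensionality as the word vectors (50 in the experiments), so it can replace the lexical predicate feature without changing the size of the feature representation.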