Vietnamese semantic role labelling

In this paper, we study semantic role labelling (SRL), a subtask of semantic parsing of natural language
sentences, and its application to the Vietnamese language. We present our effort in building Vietnamese
PropBank, the first Vietnamese SRL corpus, and a software system for labelling semantic roles in Vietnamese
texts. In particular, we present a novel constituent extraction algorithm for the argument candidate identification
step which is more suitable and more accurate than the common node-mapping method. In the machine learning
part, our system integrates distributed word features produced by two recent unsupervised learning models into
two learned statistical classifiers and uses an integer linear programming inference procedure to improve
the accuracy. The system is evaluated in a series of experiments and achieves an F1 score of 74.77%.
Our system, including corpus and software, is available as an open source project for free research, and
we believe that it is a good baseline for the development of future Vietnamese SRL systems.

L.H. Phuong et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 39-58
The main idea of GloVe is to use word-word occurrence counts to estimate the co-occurrence
probabilities rather than the probabilities by themselves. Let $P_{ij}$ denote the probability that
word $j$ appears in the context of word $i$, and let $w_i \in \mathbb{R}^d$ and $w_j \in \mathbb{R}^d$
denote the word vectors of word $i$ and word $j$ respectively. It is shown that

$$w_i^\top w_j = \log(P_{ij}) = \log(C_{ij}) - \log(C_i), \qquad (12)$$

where $C_{ij}$ is the number of times word $j$ occurs in the context of word $i$, and
$C_i = \sum_k C_{ik}$ is the number of times any word occurs in the context of word $i$.

It turns out that GloVe is a global log-bilinear regression model. Finding word vectors is equivalent
to solving a weighted least-squares regression model with the cost function:

$$J = \sum_{i,j=1}^{n} f(C_{ij}) \left( w_i^\top w_j + b_i + b_j - \log(C_{ij}) \right)^2, \qquad (13)$$

where $n$ is the size of the vocabulary, $b_i$ and $b_j$ are additional bias terms and $f(C_{ij})$ is
a weighting function. A class of weighting functions which are found to work well can be
parameterized as

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad (14)$$

The training code was obtained from the GloVe tool and we used a word appearance threshold of 2,000.
Figure 11 shows the scatter plot of the same words as in Figure 10, but this time their word vectors
are produced by the GloVe model.

Figure 11. Some Vietnamese words produced by the GloVe model, projected onto two dimensions.

4.3.3. Text corpus

To create distributed word representations, we use a dataset consisting of 7.3GB of text from
2 million articles collected through a Vietnamese news portal. The text is first normalized to lower
case and all special characters are removed except these common symbols: the comma, the semicolon,
the colon, the full stop and the percentage sign. All numeral sequences are replaced with a special
token, so that correlations between certain words and numbers are correctly recognized by the neural
network or the log-bilinear regression model.

Each word in the Vietnamese language may consist of more than one syllable with spaces in between,
and such a word could be regarded as multiple words by the unsupervised models. Hence it is necessary
to replace the spaces within each word with underscores to create full word tokens. The tokenization
process follows the method described in [37].
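The normalization and token-joining steps described above can be sketched in Python as follows. This is
a minimal illustration under stated assumptions, not the authors' code: the placeholder NUM_TOKEN, the
function names normalize and join_syllables, and the numeral pattern are all hypothetical, and word
segmentation (the method of [37]) is assumed to have been done beforehand by an external tool.

```python
import re
import unicodedata

# Hypothetical placeholder: the literal form of the numeral token is an assumption.
NUM_TOKEN = "<number>"

# Anything that is not a letter, digit, underscore, whitespace or one of the
# five retained symbols (comma, semicolon, colon, full stop, percentage sign).
_SPECIAL = re.compile(r"[^\w\s,;:.%]")

# A numeral sequence, optionally with internal separators such as 2,000 or 7.5.
_NUMERAL = re.compile(r"\d+(?:[.,]\d+)*")


def normalize(text: str) -> str:
    """Lower-case the text, drop special characters, replace numeral sequences."""
    # NFC keeps Vietnamese letters precomposed so that \w matches them as letters.
    text = unicodedata.normalize("NFC", text).lower()
    text = _SPECIAL.sub(" ", text)
    text = _NUMERAL.sub(NUM_TOKEN, text)
    return " ".join(text.split())  # collapse runs of whitespace


def join_syllables(segmented: list[list[str]]) -> list[str]:
    """Join the syllables of each already-segmented word with underscores."""
    return ["_".join(syllables) for syllables in segmented]


if __name__ == "__main__":
    print(normalize("Giá tăng 7,5% (so với 2016)!"))
    # The segmentation itself is assumed to come from an external word segmenter.
    print(join_syllables([["học", "sinh"], ["đi"], ["học"]]))
```

Special characters are removed before numerals are replaced, so the angle brackets of the placeholder
token survive; reversing the two steps would strip them.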
 After removal of special characters and 
 tokenization, the articles add up to 969 million 
word tokens, spanning a vocabulary of 1.5 million unique tokens. We train the unsupervised models
with the full vocabulary to obtain the representation vectors, and then prune the collection of word
vectors to the 65,000 most frequent words, excluding special symbols and the token representing
numeral sequences.

4.3.4. SRL with distributed word representations

We train the two word embedding models on the same text corpus presented in the previous subsections
to produce distributed word representations, where each word is represented by a real-valued vector
of 50 dimensions.

In the last experiment, we replace predicate or head word features in our SRL system by their
corresponding word vectors. For predicates which are composed of multiple words, we first tokenize
them into individual words and then average their vectors to get vector representations. Table 14 and
Table 15 show the performance of the Skip-gram and GloVe models for the predicate feature and for the
head word feature, respectively.

Table 14. The impact of word embeddings of predicate

      Precision   Recall    F1
  A   78.29%      71.48%    74.73%
  B   78.37%      71.49%    74.77%
  C   78.29%      71.38%    74.67%
  A: Predicate word
  B: Skip-gram vector
  C: GloVe vector

Table 15. The impact of word embeddings of head word

      Precision   Recall    F1
  A   78.29%      71.48%    74.73%
  B   77.53%      70.76%    73.99%
  C   78.12%      71.58%    74.71%
  A: Head word
  B: Skip-gram vector
  C: GloVe vector

We see that neither of the two types of word embeddings substantially decreases the accuracy of the
system: the largest drop in F1 is 0.74 points, for the Skip-gram head word vectors. In other words,
their use can help generalize the system to unseen words.

5. Conclusion

We have presented our work on developing a semantic role labelling system for the Vietnamese
language. The system comprises two main components, a corpus and software. Our system achieves an
F1 score of about 74.8%.

We have argued that one cannot assume a good applicability of existing methods and tools developed
for English and other occidental languages and that they may not offer a cross-language validity. For
an isolating language such as Vietnamese, techniques developed for inflectional languages cannot be
applied "as is". In particular, we have developed an algorithm for extracting argument candidates
which has a better accuracy than the 1-1 node mapping algorithm. We have proposed some novel features
which have proved to be useful for Vietnamese semantic role labelling, notably function tags and
distributed word representations. We have employed integer linear programming, a recent inference
technique capable of incorporating a wide variety of linguistic constraints, to improve the
performance of the system. We have also demonstrated the efficacy of distributed word representations
produced by two unsupervised learning models in dealing with unknown words.

In the future, we plan to further improve our system, on the one hand, by enlarging our corpus so as
to provide more data for the system. On the other hand, we would like to investigate different models
used in SRL, for example joint models [38], where arguments and semantic roles are jointly embedded
in a shared vector space for a given predicate. In addition, we would like to explore the possibility
of integrating dynamic constraints in the integer linear programming procedure. We expect the overall
performance of our SRL system to improve.

Our system, including software and corpus, is available as an open source project for free research
purpose and we believe that it is a
good baseline for the development and comparison of future Vietnamese SRL systems¹¹. We plan to
integrate this tool into Vitk, an open-source toolkit for processing Vietnamese text, which contains
fundamental processing tools and is readily scalable for processing very large text data¹².

________
¹¹ https://github.com/pth1993/vnSRL
¹² https://github.com/phuonglh/vn.vitk

References

[1] Shen, D., and Lapata, M. 2007, "Using semantic roles to improve question answering", In Proceedings of Conference on Empirical Methods on Natural Language Processing and Computational Natural Language Learning, Czech Republic: Prague, pp. 12–21.
[2] Lo, C. K., and Wu, D. 2010, "Evaluating machine translation utility via semantic role labels", In Proceedings of The International Conference on Language Resources and Evaluation, Malta: Valletta, pp. 2873–7.
[3] Aksoy, C., Bugdayci, A., Gur, T., Uysal, I., and Can, F. 2009, "Semantic argument frequency-based multi-document summarization", In Proceedings of the 24th International Symposium on Computer and Information Sciences, Guzelyurt, Turkey, pp. 460–4.
[4] Christensen, J., Soderland, S., and Etzioni, O. 2010, "Semantic role labeling for open information extraction", In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, USA: Los Angeles, CA, pp. 52–60.
[5] Gildea, D., and Jurafsky, D. 2002, "Automatic labeling of semantic roles", Computational Linguistics, 28(3): 245–88.
[6] Carreras, X., and Màrquez, L. 2004, "Introduction to the CoNLL-2004 shared task: semantic role labeling", In Proceedings of the 8th Conference on Computational Natural Language Learning, USA: Boston, MA, pp. 89–97.
[7] Carreras, X., and Màrquez, L. 2005, "Introduction to the CoNLL-2005 shared task: semantic role labeling", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 152–64.
[8] Xue, N., and Palmer, M. 2005, "Automatic semantic role labeling for Chinese verbs", In Proceedings of International Joint Conferences on Artificial Intelligence, Scotland: Edinburgh, pp. 1160–5.
[9] Tagami, H., Hizuka, S., and Saito, H. 2009, "Automatic semantic role labeling based on Japanese FrameNet–A Progress Report", In Proceedings of Conference of the Pacific Association for Computational Linguistics, Japan: Hokkaido University, Sapporo, pp. 181–6.
[10] Nguyen, T.-L., Ha, M.-L., Nguyen, V.-H., Nguyen, T.-M.-H., Le-Hong, P. and Phan, T.-H. 2014, "Building a semantic role annotated corpus for Vietnamese", In Proceedings of the 17th National Symposium on Information and Communication Technology, Daklak, Vietnam, pp. 409–414.
[11] Pham, T. H., Pham, X. K., and Le-Hong, P. 2015, "Building a semantic role labelling system for Vietnamese", In Proceedings of the 10th International Conference on Digital Information Management, South Korea: Jeju Island, pp. 77–84.
[12] Baker, C. F., Fillmore, C. J., and Cronin, B. 2003, "The structure of the FrameNet database", International Journal of Lexicography, 16(3): 281–96.
[13] Boas, H. C. 2005, "From theory to practice: Frame semantics and the design of FrameNet", Semantisches Wissen im Lexikon: 129–60.
[14] Palmer, M., Kingsbury, P., and Gildea, D. 2005, "The proposition bank: An annotated corpus of semantic roles", Computational Linguistics, 31(1): 71–106.
[15] Schuler, K. K. 2006, "VerbNet: A broad-coverage, comprehensive verb lexicon", PhD Thesis, University of Pennsylvania.
[16] Levin, B. 1993, "English Verb Classes and Alternation: A Preliminary Investigation", Chicago: The University of Chicago Press.
[17] Cao, X. H. 2006, "Tiếng Việt - Sơ thảo ngữ pháp chức năng (Vietnamese - Introduction to Functional Grammar)", Hà Nội: NXB Giáo dục.
[18] Nguyễn, V. H. 2008, "Cơ sở ngữ nghĩa phân tích cú pháp (Semantic Basis of Grammatical Parsing)", Hà Nội: NXB Giáo dục.
[19] Diệp, Q. B. 1998, "Ngữ pháp tiếng Việt, Tập I, II (Vietnamese Grammar, Volume I, II)", Hà Nội: NXB Giáo dục.
[20] Nguyen, P. T., Vu, X. L., Nguyen, T. M. H., Nguyen, V. H., and Le-Hong, P. 2009, "Building a large syntactically-annotated corpus of Vietnamese", In Proceedings of the 3rd Linguistic Annotation Workshop, ACL-IJCNLP, Singapore: Suntec City, pp. 182–5.
[21] Koomen, P., Punyakanok, V., Roth, D., and Yih, W. T. 2005, "Generalized inference with multiple semantic role labeling systems", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 181–4.
[22] Haghighi, A., Toutanova, K., and Manning, C. D. 2005, "A joint model for semantic role labeling", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 173–6.
[23] Surdeanu, M., and Turmo, J. 2005, "Semantic role labeling using complete syntactic analysis", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 221–4.
[24] Màrquez, L., Comas, P., Gimenez, J., and Catala, N. 2005, "Semantic role labeling as sequential tagging", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 193–6.
[25] Pradhan, S., Hacioglu, K., Ward, W., Martin, J. H., and Jurafsky, D. 2005, "Semantic role chunking combining complementary syntactic views", In Proceedings of the 9th Conference on Computational Natural Language Learning, USA: Ann Arbor, MI, pp. 217–20.
[26] Le-Hong, P., Roussanaly, A., and Nguyen, T. M. H. 2015, "A syntactic component for Vietnamese language processing", Journal of Language Modelling, 3(1): 145–84.
[27] Xue, N., and Palmer, M. 2004, "Calibrating features for semantic role labeling", In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Spain: Barcelona, pp. 88–94.
[28] Punyakanok, V., Roth, D., Yih, W. T., and Zimak, D. 2004, "Semantic role labeling via integer linear programming inference", In Proceedings of the 20th International Conference on Computational Linguistics, Switzerland: University of Geneva, pp. 1346–52.
[29] Turian, J., Ratinov, L., and Bengio, Y. 2010, "Word representations: A simple and general method for semi-supervised learning", In Proceedings of ACL, Sweden: Uppsala, pp. 384–94.
[30] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. 2003, "A neural probabilistic language model", Journal of Machine Learning Research 3: 1137–55.
[31] Morin, F., and Bengio, Y. 2005, "Hierarchical probabilistic neural network language model", In Proceedings of AISTATS, Barbados, pp. 246–52.
[32] Collobert, R., and Weston, J. 2008, "A unified architecture for natural language processing: deep neural networks with multitask learning", In Proceedings of ICML, USA: New York, NY, pp. 160–7.
[33] Mnih, A., and Hinton, G. E. 2009, "A scalable hierarchical distributed language model", In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (ed.), Advances in Neural Information Processing Systems 21, Curran Associates, Inc. pp. 1081–8.
[34] Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013, "Efficient estimation of word representations in vector space", In Proceedings of Workshop at ICLR, USA: Scottsdale, AZ, pp. 1–12.
[35] Pennington, J., Socher, R., and Manning, C. D. 2014, "GloVe: Global vectors for word representation", In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Qatar: Doha, pp. 1532–43.
[36] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. 2013, "Distributed representations of words and phrases and their compositionality", In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (ed.), Advances in Neural Information Processing Systems 26, Curran Associates, Inc. pp. 3111–19.
[37] Le-Hong, P., Nguyen, T. M. H., Roussanaly, A., and Ho, T. V. 2008, "A hybrid approach to word segmentation of Vietnamese texts", In Carlos, M-V., Friedrich, O., and Henning, F. (ed.), Language and Automata Theory and Applications, Lecture Notes in Computer Science. Berlin: Springer Berlin Heidelberg, pp. 240–49.
[38] FitzGerald, N., Täckström, O., Ganchev, K., and Das, D. 2015, "Semantic role labeling with neural network factors", In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Portugal: Lisbon, pp. 960–70.