Dependency-Based Pre-ordering for English-Vietnamese Statistical Machine Translation
Phrase-based statistical machine translation (SMT)
[8] is the state of the art in SMT because of its
power in modelling short-distance reordering and
local context. However, long-distance reordering
remains problematic for phrase-based SMT.
Global reordering is one of the major problems
because different languages have different word
order requirements. In recent years, many methods
have been proposed to tackle long-distance
reordering, such as syntax-based models [15] and
lexicalized reordering [10]. Chiang [15] shows
significant improvements by keeping the
strengths of phrases while incorporating syntax
into SMT. Some approaches are applied at the
word level [3]; they are useful for languages
with rich morphology because they reduce data
sparseness. Other syntactic reordering methods
require parse trees, such as the work in [3].
A parse tree is more powerful in capturing
sentence structure, but it is expensive to create
tree structures and to build a good-quality
parser. All of the above approaches also require
considerable decoding time, which is expensive.
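To make the word order difference concrete, the following is a minimal sketch of rule-based pre-ordering on the source side. It is not the method of any cited work: it assumes hand-assigned Penn Treebank tags and head indices in place of a real parser, and the `Token` type and the `move_adjectives_after_noun` function are hypothetical names introduced only for illustration. The single rule moves an adjective that precedes its head noun to the position directly after it, since Vietnamese typically places adjectives after the noun.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str  # surface word
    tag: str   # Penn Treebank POS tag (hand-assigned, an assumption)
    head: int  # index of the head token, -1 for the root (hand-assigned)


def move_adjectives_after_noun(tokens: List[Token]) -> List[str]:
    """Move each JJ/JJS that precedes its NN/NNS head to just after that head."""
    order = list(range(len(tokens)))
    for i, tok in enumerate(tokens):
        is_adjective = tok.tag in ("JJ", "JJS")
        precedes_head = tok.head > i
        if is_adjective and precedes_head and tokens[tok.head].tag in ("NN", "NNS"):
            order.remove(i)
            order.insert(order.index(tok.head) + 1, i)
    return [tokens[i].form for i in order]


# Toy input: the analysis below is written by hand, not produced by a parser.
phrase = [
    Token("with", "IN", 2),
    Token("great", "JJ", 2),
    Token("potential", "NN", -1),
]
print(" ".join(move_adjectives_after_noun(phrase)))
```

A real system would read tags and heads from a dependency parser and apply many such rules; this sketch only shows the shape of one source-side reordering step.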
... reordering words in English sentences according to Vietnamese word order, with rule types including noun phrases, adjectival and adverbial phrases, and prepositions, as described in table 1.

6.3. Using automatic rules

We present our experiments on translating from English to Vietnamese in a statistical machine translation system; hence, the language pair chosen is English-Vietnamese. We used the Stanford Parser [14] to parse the source (English) sentences.

We used dependency parsing and rules extracted by training feature-rich discriminative classifiers to reorder source-side sentences. The rules are automatically extracted from the English-Vietnamese parallel corpus and the dependency parses of the English examples. Finally, these rules are used to reorder the source sentences. We evaluated our approach on English-Vietnamese machine translation tasks with the systems in table 5, which shows that it can outperform the baseline phrase-based SMT system.

... English sentences according to Vietnamese word order.
• We implemented the preprocessing step during both training and decoding time.
• We used the SMT Moses decoder [7] for decoding.

We give some definitions for our experiments:
• Baseline: the baseline phrase-based SMT system using the lexicalized reordering model in the Moses toolkit.
• Manual Rules: the phrase-based SMT system applying manual rules [23].
• Auto Rules: the phrase-based SMT system applying automatic rules [24].
• Auto Rules + Manual Rules: the phrase-based SMT system applying automatic rules, then applying manual rules.

Table 5. Our experimental systems on the English-Vietnamese parallel corpus

Name                        Description
Baseline                    Phrase-based system
Manual Rules                Phrase-based system with corpus preprocessed using manual rules
Auto Rules                  Phrase-based system with corpus preprocessed using automatically learned rules
Auto Rules + Manual Rules   Phrase-based system with corpus preprocessed using automatically learned rules and manual rules

Table 6. Size of phrase tables

Name                        Size of phrase table
Baseline                    1152216
Manual Rules                1231365
Auto Rules                  1213401
Auto Rules + Manual Rules   1253401

T.H. Viet et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 14-27

Table 7. Translation performance for the English-Vietnamese task

System                      BLEU (%)
Baseline                    36.89
Manual Rules                37.71
Auto Rules                  37.12
Auto Rules + Manual Rules   37.85

6.4. BLEU score

The results in table 6 show the size of the phrase tables built from the translation model based on our method. With this method, more distinct phrases are found in the translation model, which gives the decoder more options for generating the best translation.

Table 7 reports the BLEU scores of our experiments. With preprocessing applied in both training and decoding, the "Auto Rules" system scores 37.12 BLEU, below the 37.71 of the "Manual Rules" system. This is because the manual rules have better quality than the automatic rules. However, the "Auto Rules + Manual Rules" system is the best system, because the combined rules cover more linguistic phenomena.

These results demonstrate the effect of applying transformation rules based on the dependency parse tree.

Table 8. Number of families in the English-Vietnamese corpus

Number    Description
79142     Family has 1 child
40822     Family has 2 children
26008     Family has 3 children
15990     Family has 4 children
7442      Family has 5 children
2728      Family has 6 children
942       Family has 7 children
307       Family has 8 children
83        Family has 9 children

Table 9. Examples of translations produced by our system for input sentences sampled from the English-Vietnamese corpus

Input sentence:          The coat was far too big - it completely enveloped him.
Translation (Baseline):  Chiếc áo khoác là quá lớn - nó hoàn toàn phủ anh ta.
Translation (Auto):      Chiếc áo khoác là quá lớn - nó phủ hoàn toàn anh ta.
Translation (human):     Chiếc áo khoác quá lớn - nó hoàn toàn phủ anh ta.

Input sentence:          Manh Cuong is a young football player with potential great.
Translation (Baseline):  Manh Cuong là một cầu thủ bóng đá với nhiều tiềm năng.
Translation (Auto):      Manh Cuong là một cầu thủ bóng đá trẻ có tiềm năng lớn.
Translation (human):     Mạnh Cường là cầu thủ bóng đá trẻ rất nhiều triển vọng.

7. Analysis and discussion

We have found that the work in our experiments is sufficiently correlated with the translation quality obtained manually. We also found several sources of error, such as the quality of the source-sentence parse trees, the quality of the word alignment and the quality of the corpus. All of these errors can affect the automatic reordering rules. Table 9 shows examples of translation output produced by our system that are better than the baseline, for input sentences from the English-Vietnamese test set. Go here for more examples of translations for input sentences sampled randomly from our corpus. Some phrases in the English source sentences were reordered to correspond to the Vietnamese target sentence order. We focus mainly on some typical relations such as noun phrases, adjectival and adverbial phrases, and prepositions, and created a manually written reordering rule set for the English-Vietnamese language pair. Our study employed dependency syntax and transformation rules to reorder the source sentence and applied them to English-to-Vietnamese translation systems.

For example, in a noun phrase there always exists a head noun with components before and after it. These auxiliary components move to new positions according to the Vietnamese word order. These rules can handle common source-language phenomena that have equivalents in the target language, as follows:
• The phrase-based systems applying rules with category JJ or JJS
• The phrase-based systems applying rules with category NN or NNS
• The phrase-based systems applying rules with category IN or TO

Based on these phenomena, translation quality has significantly improved. We carried out error analysis on sentences and compared them to the golden reordering. Our analysis also shows the benefits of automatic reordering rules for translation quality. In combination with the machine learning method in related work [21], it is shown that the classifier method can be applied to solve reordering problems automatically.

According to the typical differences in word order between English and Vietnamese, we have created a set of automatic rules for reordering words in English sentences according to Vietnamese word order, with rule types including noun phrases, adjectival and adverbial phrases, as well as prepositional phrases. Table 8 gives statistics on the families in our corpus, including those with 4 or more children. The number of children in each family is limited to 4 in our approach, so in the target language (Vietnamese) the number of children in each family is the same.

The manual rules have good quality [27, 18]; the phrase-based SMT system applying manual rules is better than the phrase-based SMT system applying automatic rules. We believe that the quality of the phrase-based SMT system applying automatic rules will improve when we have a better corpus.

8. Conclusion

In this paper, we present a preprocessing approach based on the dependency parser. The proposed approach is applied to an English-Vietnamese translation system. The experimental results show that our approach achieved improvements in BLEU scores over a state-of-the-art phrase-based baseline system. By applying manual rules and automatic rules, the quality of the English-Vietnamese translation system is improved. In our study, our rules cover some linguistic reordering phenomena. These reordering rules benefit the English-Vietnamese language pair.

We will focus much more on word order problems with linguistic reordering phenomena in English-Vietnamese in order to learn better dependency-based reordering rules (manual rules and automatic rules). This is necessary for improving SMT systems and might lead to their wider adoption.

Acknowledgments

The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).

References

[1] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: A method for automatic evaluation of machine translation, in: ACL, 2002.
[2] J. Cai, M. Utiyama, E. Sumita, Y. Zhang, Dependency-based pre-ordering for chinese-english machine translation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
[3] M. Collins, P. Koehn, I. Kucerová, Clause restructuring for statistical machine translation, in: Proc. ACL 2005, Ann Arbor, USA, 2005, pp. 531-540.
[4] C. Ding, K. Sakanushi, H. Touji, M. Yamamoto, Inter-, intra-, and extra-chunk pre-ordering for statistical japanese-to-english machine translation, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15 (3) (2016) 20:1-20:28. doi:10.1145/2818381.
[5] URL
[6] D. Genzel, Automatically learning source-side reordering rules for large scale machine translation, in: Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, 2010, pp. 376-384.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The weka data mining software: An update, SIGKDD Explor. Newsl. 11 (1) (2009) 10-18.
[8] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: Proceedings of ACL, Demonstration Session, 2007.
[9] P. Koehn, F. J. Och, D. Marcu, Statistical phrase-based translation, in: Proceedings of HLT-NAACL 2003, Edmonton, Canada, 2003, pp. 127-133.
[10] T. L. Nguyen, M. L. Ha, V. H. Nguyen, T. M. H. Nguyen, P. Le-Hong, Building a treebank for vietnamese dependency parsing, in: 2013 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, RIVF 2013, Hanoi, Vietnam, November 10-13, 2013, pp. 147-151.
[11] F. J. Och, H. Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29 (1) (2003) 19-51.
[12] M.-C. de Marneffe, B. MacCartney, C. D. Manning, Generating typed dependency parses from phrase structure parses, in: Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.
[13] A. Stolcke, Srilm - an extensible language modeling toolkit, in: Proceedings of International Conference on Spoken Language Processing, Vol. 29, 2002, pp. 901-904.
[14] N. Bach, Q. Gao, S. Vogel, Source-side dependency tree reordering models with subtree movements and constraints, in: Proceedings of the Twelfth Machine Translation Summit (MTSummit-XII), International Association for Machine Translation, Ottawa, Canada, 2009.
[15] D. Cer, M.-C. de Marneffe, D. Jurafsky, C. D. Manning, Parsing to stanford dependencies: Trade-offs between speed and accuracy, in: 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010.
[16] D. Chiang, A hierarchical phrase-based model for statistical machine translation, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 2005, pp. 263-270.
[17] J. Daiber, M. Stanojevic, W. Aziz, K. Sima'an, Examining the relationship between preordering and word order freedom in machine translation, in: Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August. Association for Computational Linguistics, 2016.
[18] I. Goto, M. Utiyama, E. Sumita, S. Kurohashi, Preordering using a target-language parser via cross-language syntactic projection for statistical machine translation, ACM Transactions on Asian and Low-Resource Language Information Processing 14 (3) (2015) 13.
[19] N. Habash, Syntactic preprocessing for statistical machine translation, Proceedings of the 11th MT Summit, 2007.
[20] C. Hadiwinoto, Y. Liu, H. T. Ng, To swap or not to swap? exploiting dependency word pairs for reordering in statistical machine translation, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] C. Hadiwinoto, H. T. Ng, A dependency-based neural reordering model for statistical machine translation, arXiv preprint arXiv:1702.04510, 2017.
[22] U. Lerner, S. Petrov, Source-side classifier preordering for machine translation, in: EMNLP, 2013, pp. 513-523.
[23] H. V. Huy, T.-L. N. Phuong-Thai Nguyen, M. Nguyen, Bootstrapping phrase-based statistical machine translation via wsd integration, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013), 2013, pp. 1042-1046.
[24] V. H. Tran, V. V. Nguyen, M. L. Nguyen, Improving english-vietnamese statistical machine translation using preprocessing dependency syntactic, in: Proceedings of the 2015 Conference of the Pacific Association for Computational Linguistics (Pacling 2015), pp. 115-121.
[25] V. H. Tran, H. T. Vu, V. V. Nguyen, M. L. Nguyen, A classifier-based preordering approach for english-vietnamese statistical machine translation, 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016).
[26] L. Wang, Support Vector Machines: theory and applications, Vol. 177, Springer Science & Business Media, 2005.
[27] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016.
[28] P. Xu, J. Kang, M. Ringgaard, F. Och, Using a dependency parser to improve smt for subject-object-verb languages, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Boulder, Colorado, 2009, pp. 245-253.
[29] N. Yang, M. Li, D. Zhang, N. Yu, A ranking-based approach to word reordering for statistical machine translation, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, Association for Computational Linguistics, 2012, pp. 912-920.
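The comparison reported in tables 6 and 7 can be restated as differences against the baseline. The following is a minimal sketch that assumes only the figures printed in those two tables; the function names `gain_over_baseline` and `relative_growth` are hypothetical helpers introduced for illustration and are not part of the paper or of any toolkit.

```python
from typing import Dict

# Figures copied from table 7 (BLEU, %) and table 6 (phrase-table size).
bleu: Dict[str, float] = {
    "Baseline": 36.89,
    "Manual Rules": 37.71,
    "Auto Rules": 37.12,
    "Auto Rules + Manual Rules": 37.85,
}
phrase_table_size: Dict[str, int] = {
    "Baseline": 1152216,
    "Manual Rules": 1231365,
    "Auto Rules": 1213401,
    "Auto Rules + Manual Rules": 1253401,
}


def gain_over_baseline(scores: Dict[str, float], baseline: str = "Baseline") -> Dict[str, float]:
    """Absolute difference of each system's score from the baseline score."""
    base = scores[baseline]
    return {name: round(value - base, 2) for name, value in scores.items() if name != baseline}


def relative_growth(sizes: Dict[str, int], baseline: str = "Baseline") -> Dict[str, float]:
    """Growth of each system's phrase table relative to the baseline, as a fraction."""
    base = sizes[baseline]
    return {name: (size - base) / base for name, size in sizes.items() if name != baseline}


for name, gain in gain_over_baseline(bleu).items():
    print(f"{name}: +{gain:.2f} BLEU over baseline")
for name, growth in relative_growth(phrase_table_size).items():
    print(f"{name}: phrase table {growth:.1%} larger than baseline")
```

Reading the two outputs side by side shows that every preprocessed system has both a larger phrase table and a higher BLEU score than the baseline, which is the pattern section 6.4 describes.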