Dependency-Based Pre-ordering for English-Vietnamese Statistical Machine Translation

Phrase-based statistical machine translation [8] is the state of the art in SMT because of its power in modelling short-range reordering and local context. However, long-distance reordering remains problematic for phrase-based SMT. Reordering (global reordering) is one of the major problems, since different languages have different word order requirements. In recent years, many methods have been proposed to tackle the long-distance reordering problem, such as syntax-based models [15] and lexicalized reordering [10]. Chiang [15] shows significant improvements by keeping the strengths of phrases while incorporating syntax into SMT. Some approaches were applied at the word level [3]; they are useful for languages with rich morphology, because they reduce data sparseness. Other syntax-based reordering methods require parse trees, such as the work in [3]. A parse tree is more powerful in capturing sentence structure. However, it is expensive to create tree structures and to build a good-quality parser. All of the above approaches also require much decoding time, which is expensive.

[...] English sentences according to Vietnamese word order.
• We implemented the preprocessing step during both training and decoding.
• We used the Moses SMT decoder [7] for decoding.

We give some definitions for our experiments:
• Baseline: the baseline phrase-based SMT system using the lexicalized reordering model in the Moses toolkit.
• Manual Rules: the phrase-based SMT system applying manual rules [23].
• Auto Rules: the phrase-based SMT system applying automatic rules [24].
• Auto Rules + Manual Rules: the phrase-based SMT system applying automatic rules, then manual rules.

Table 5. Our experimental systems on the English-Vietnamese parallel corpus

Name                       Description
Baseline                   Phrase-based system
Manual Rules               Phrase-based system with corpus preprocessed using manual rules
Auto Rules                 Phrase-based system with corpus preprocessed using automatically learned rules
Auto Rules + Manual Rules  Phrase-based system with corpus preprocessed using automatically learned rules and manual rules

[...] reordering words in English sentences according to Vietnamese word order, with types of rules including noun phrase, adjectival and adverbial phrase, and preposition, as described in Table 1.

6.3. Using automatic rules

We present our experiments on translating from English to Vietnamese in a statistical machine translation system; hence, the language pair chosen is English-Vietnamese. We used the Stanford Parser [14] to parse the source (English) sentences.

We used dependency parsing and rules extracted by training feature-rich discriminative classifiers to reorder source-side sentences. The rules are automatically extracted from the English-Vietnamese parallel corpus and the dependency parses of the English examples. Finally, we used these rules to reorder the source sentences. We evaluated our approach on English-Vietnamese machine translation tasks with the systems in Table 5; the evaluation shows that it can outperform the baseline phrase-based SMT system.

Table 6. Size of phrase tables

Name                       Size of phrase table
Baseline                   1152216
Manual Rules               1231365
Auto Rules                 1213401
Auto Rules + Manual Rules  1253401
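
The source-side reordering step described in Section 6.3 can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Token` format, the `MOVE_AFTER_HEAD` table and the `preorder` function are hypothetical and are not the rule sets of [23] or [24]. The example moves an adjectival modifier (Stanford relation `amod`) from before its head noun to directly after it, which is the kind of noun-phrase reordering that Vietnamese word order calls for. In a real system the heads and relation labels would come from a dependency parser rather than being written by hand.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the source sentence
    word: str
    head: int    # index of the head token, 0 for the root
    deprel: str  # dependency relation to the head


# Hypothetical rule table (an assumption for illustration): dependents with
# these relations are moved from before the head to right after it.
MOVE_AFTER_HEAD = {"amod"}


def preorder(tokens):
    """Return the words of the sentence in pre-ordered (target-like) order."""
    children = {}
    for t in tokens:
        children.setdefault(t.head, []).append(t)

    def linearize(node):
        kids = sorted(children.get(node.idx, []), key=lambda k: k.idx)
        left = [k for k in kids if k.idx < node.idx]
        stay = [k for k in left if k.deprel not in MOVE_AFTER_HEAD]
        moved = [k for k in left if k.deprel in MOVE_AFTER_HEAD]
        right = [k for k in kids if k.idx > node.idx]

        out = []
        for k in stay:
            out.extend(linearize(k))
        out.append(node.word)
        # Assumption: moved modifiers are mirrored, closest to the head first.
        for k in reversed(moved):
            out.extend(linearize(k))
        for k in right:
            out.extend(linearize(k))
        return out

    root = children[0][0]
    return linearize(root)


sentence = [
    Token(1, "He", 2, "nsubj"),
    Token(2, "bought", 0, "root"),
    Token(3, "a", 5, "det"),
    Token(4, "big", 5, "amod"),
    Token(5, "coat", 2, "dobj"),
]
print(" ".join(preorder(sentence)))  # prints: He bought a coat big
```

Applying the same function to the training corpus and to the test input keeps both sides of the pipeline consistent, which is why the preprocessing is run during both training and decoding.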
24 T.H. Viet et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 14-27 
Table 7. Translation performance for the English-Vietnamese task

System                     BLEU (%)
Baseline                   36.89
Manual Rules               37.71
Auto Rules                 37.12
Auto Rules + Manual Rules  37.85

6.4. BLEU score

The results in Table 6 show the sizes of the phrase tables built from the translation models based on our method. With this method, more distinct phrases are found in the translation model, which gives the decoder more options for generating the best translation.

Table 7 gives the BLEU scores of our experiments. With preprocessing applied in both training and decoding, the BLEU score of the "Auto Rules" system is 0.59 points lower than that of the "Manual Rules" system (37.12 vs. 37.71), because the manual rules are of better quality than the automatic rules. However, the "Auto Rules + Manual Rules" system is the best system, because the combined rules cover more linguistic phenomena.

These results demonstrate the effect of applying transformation rules based on the dependency parse tree.

Table 8. Number of families by number of children in the English-Vietnamese corpus

Number of children  Number of heads  Description
1                   79142            Family has 1 child
2                   40822            Family has 2 children
3                   26008            Family has 3 children
4                   15990            Family has 4 children
5                   7442             Family has 5 children
6                   2728             Family has 6 children
7                   942              Family has 7 children
8                   307              Family has 8 children
9                   83               Family has 9 children

Table 9. An example of a translation produced by our system for an input sentence sampled from the English-Vietnamese corpus

Input sentence: The coat was far too big - it completely enveloped him.
Translation (Baseline): Chiếc áo khoác là quá lớn - nó hoàn toàn phủ anh ta.
Translation (Auto): Chiếc áo khoác là quá lớn - nó phủ hoàn toàn anh ta.
Translation (human): Chiếc áo khoác quá lớn - nó hoàn toàn phủ anh ta.

Input sentence: Manh Cuong is a young football player with potential great.
Translation (Baseline): Manh Cuong là một cầu thủ bóng đá với nhiều tiềm năng.
Translation (Auto): Manh Cuong là một cầu thủ bóng đá trẻ có tiềm năng lớn.
Translation (human): Mạnh Cường là cầu thủ bóng đá trẻ rất nhiều triển vọng.
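
The figures reported in Tables 7 and 8 can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original experiments; the dictionaries simply copy the table values, and the helper names `gain_over` and `coverage` are hypothetical. It computes each system's BLEU difference from another system and the share of families that have at most a given number of children.

```python
# BLEU scores (%) copied from Table 7.
bleu = {
    "Baseline": 36.89,
    "Manual Rules": 37.71,
    "Auto Rules": 37.12,
    "Auto Rules + Manual Rules": 37.85,
}

# Number of families per number of children, copied from Table 8.
families = {1: 79142, 2: 40822, 3: 26008, 4: 15990, 5: 7442,
            6: 2728, 7: 942, 8: 307, 9: 83}


def gain_over(system, reference="Baseline"):
    """BLEU difference between two systems, rounded to two decimals."""
    return round(bleu[system] - bleu[reference], 2)


def coverage(max_children):
    """Share of families with at most `max_children` children."""
    total = sum(families.values())
    covered = sum(n for c, n in families.items() if c <= max_children)
    return covered / total


for name in bleu:
    print(f"{name}: {gain_over(name):+.2f} BLEU vs. Baseline")
print(f"Manual vs. Auto: {gain_over('Manual Rules', 'Auto Rules'):+.2f} BLEU")
print(f"Families with at most 4 children: {coverage(4):.1%}")
```

On these numbers, families with at most four children account for roughly 93% of all families in Table 8, which gives some context for restricting the rules to families of that size.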
7. Analysis and discussion

We found that the results of our experiments correlate sufficiently with the quality of manual translation. We also identified several sources of error, such as the quality of the source-sentence parse trees, the quality of the word alignment and the quality of the corpus. All of these errors can affect the automatic reordering rules. Table 9 shows example translations produced by our system that are better than those of the baseline system, for input sentences from the English-Vietnamese test set. More examples of translations for input sentences sampled randomly from our corpus are available online. Some phrases in the English source sentences were reordered to match the order of the Vietnamese target sentences. We focus mainly on some typical relations, such as noun phrase, adjectival and adverbial phrase, and preposition, and created a manually written set of reordering rules for the English-Vietnamese language pair. Our study employed dependency syntax and transformation rules to reorder the source sentences and applied them to English-to-Vietnamese translation systems.

For example, in a noun phrase there always exists a head noun, with components before and after it. These auxiliary components move to new positions according to Vietnamese word order. These rules can cover common source-language linguistic phenomena and their target-language equivalents, as follows:
• The phrase-based systems applying rules with category JJ or JJS
• The phrase-based systems applying rules with category NN or NNS
• The phrase-based systems applying rules with category IN or TO

Based on these phenomena, translation quality has improved significantly. We carried out an error analysis of the sentences and compared them with the gold reordering. Our analysis also shows the benefits of automatic reordering rules for translation quality. In combination with the machine learning method of related work [21], this shows that a classifier-based method can be applied to solve reordering problems automatically.

Based on the typical differences in word order between English and Vietnamese, we have created a set of automatic rules for reordering words in English sentences according to Vietnamese word order, with types of rules including noun phrase, adjectival and adverbial phrase, and prepositional phrase. Table 8 gives statistics on the families in our corpus, including those with four or more children. In our approach, the number of children in each family is limited to four, so in the target language (Vietnamese) the number of children in each family is the same.

The manual rules are of good quality [27, 18], and the phrase-based SMT system applying manual rules is better than the phrase-based SMT system applying automatic rules. We believe that the quality of the phrase-based SMT system applying automatic rules will improve when a better corpus is available.

8. Conclusion

In this paper, we present a preprocessing approach based on the dependency parser. The proposed approach is applied to an English-Vietnamese translation system. The experimental results show that our approach achieved improvements in BLEU scores over a state-of-the-art phrase-based baseline system. By applying manual rules and automatic rules, the quality of the English-Vietnamese translation system is improved. In our study, the rules cover some linguistic reordering phenomena. These reordering rules benefit the English-Vietnamese language pair.

In future work, we will focus more on word order problems and linguistic reordering phenomena in English-Vietnamese in order to learn better dependency-based reordering rules (manual rules and automatic rules). This is necessary for improving SMT systems and might lead to their wider adoption.
Acknowledgments

The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).

References

[1] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: A method for automatic evaluation of machine translation, in: ACL, 2002.
[2] J. Cai, M. Utiyama, E. Sumita, Y. Zhang, Dependency-based pre-ordering for Chinese-English machine translation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
[3] M. Collins, P. Koehn, I. Kucerová, Clause restructuring for statistical machine translation, in: Proc. ACL 2005, Ann Arbor, USA, 2005, pp. 531-540.
[4] C. Ding, K. Sakanushi, H. Touji, M. Yamamoto, Inter-, intra-, and extra-chunk pre-ordering for statistical Japanese-to-English machine translation, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15 (3) (2016) 20:1-20:28. doi:10.1145/2818381.
[5] URL
[6] D. Genzel, Automatically learning source-side reordering rules for large scale machine translation, in: Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, 2010, pp. 376-384.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: An update, SIGKDD Explor. Newsl. 11 (1) (2009) 10-18.
[8] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: Proceedings of ACL, Demonstration Session, 2007.
[9] P. Koehn, F. J. Och, D. Marcu, Statistical phrase-based translation, in: Proceedings of HLT-NAACL 2003, Edmonton, Canada, 2003, pp. 127-133.
[10] T. L. Nguyen, M. L. Ha, V. H. Nguyen, T. M. H. Nguyen, P. Le-Hong, Building a treebank for Vietnamese dependency parsing, in: 2013 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, RIVF 2013, Hanoi, Vietnam, November 10-13, 2013, pp. 147-151.
[11] F. J. Och, H. Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29 (1) (2003) 19-51.
[12] M.-C. de Marneffe, B. MacCartney, C. D. Manning, Generating typed dependency parses from phrase structure parses, in: Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.
[13] A. Stolcke, SRILM - an extensible language modeling toolkit, in: Proceedings of International Conference on Spoken Language Processing, Vol. 29, 2002, pp. 901-904.
[14] N. Bach, Q. Gao, S. Vogel, Source-side dependency tree reordering models with subtree movements and constraints, in: Proceedings of the Twelfth Machine Translation Summit (MTSummit-XII), International Association for Machine Translation, Ottawa, Canada, 2009.
[15] D. Cer, M.-C. de Marneffe, D. Jurafsky, C. D. Manning, Parsing to Stanford dependencies: Trade-offs between speed and accuracy, in: 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010.
[16] D. Chiang, A hierarchical phrase-based model for statistical machine translation, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 2005, pp. 263-270.
[17] J. Daiber, M. Stanojevic, W. Aziz, K. Sima'an, Examining the relationship between preordering and word order freedom in machine translation, in: Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August, Association for Computational Linguistics, 2016.
[18] I. Goto, M. Utiyama, E. Sumita, S. Kurohashi, Preordering using a target-language parser via cross-language syntactic projection for statistical machine translation, ACM Transactions on Asian and Low-Resource Language Information Processing 14 (3) (2015) 13.
[19] N. Habash, Syntactic preprocessing for statistical machine translation, in: Proceedings of the 11th MT Summit, 2007.
[20] C. Hadiwinoto, Y. Liu, H. T. Ng, To swap or not to swap? Exploiting dependency word pairs for reordering in statistical machine translation, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] C. Hadiwinoto, H. T. Ng, A dependency-based neural reordering model for statistical machine translation, arXiv preprint arXiv:1702.04510, 2017.
[22] U. Lerner, S. Petrov, Source-side classifier preordering for machine translation, in: EMNLP, 2013, pp. 513-523.
[23] H. V. Huy, T.-L. N. Phuong-Thai Nguyen, M. Nguyen, Bootstrapping phrase-based statistical machine translation via WSD integration, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013), 2013, pp. 1042-1046.
[24] V. H. Tran, V. V. Nguyen, M. L. Nguyen, Improving English-Vietnamese statistical machine translation using preprocessing dependency syntactic, in: Proceedings of the 2015 Conference of the Pacific Association for Computational Linguistics (PACLING 2015), pp. 115-121.
[25] V. H. Tran, H. T. Vu, V. V. Nguyen, M. L. Nguyen, A classifier-based preordering approach for English-Vietnamese statistical machine translation, in: 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016).
[26] L. Wang, Support Vector Machines: Theory and Applications, Vol. 177, Springer Science & Business Media, 2005.
[27] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016.
[28] P. Xu, J. Kang, M. Ringgaard, F. Och, Using a dependency parser to improve SMT for subject-object-verb languages, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Boulder, Colorado, 2009, pp. 245-253.
[29] N. Yang, M. Li, D. Zhang, N. Yu, A ranking-based approach to word reordering for statistical machine translation, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1, Association for Computational Linguistics, 2012, pp. 912-920.