Dependency-Based Pre-ordering for English-Vietnamese Statistical Machine Translation

Phrase-based statistical machine translation (SMT) [8] is the state of the art in SMT because of its power in modelling short-range reordering and local context. However, long-distance reordering is still problematic for phrase-based SMT. The reordering problem (global reordering) is one of the major problems, since different languages have different word order requirements. In recent years, many methods have been proposed to tackle the long-distance reordering problem, such as syntax-based models [15] and lexicalized reordering [10]. Chiang [15] shows significant improvements by keeping the strengths of phrases while incorporating syntax into SMT. Some approaches were applied at the word level [3]; they are useful for languages with rich morphology, where they reduce data sparseness. Other syntactic reordering methods require parse trees, such as the work in [3]. A parse tree is more powerful in capturing sentence structure. However, it is expensive to create tree structures and to build a good-quality parser. All the above approaches also require much decoding time, which is expensive.


... reordering words in English sentences according to Vietnamese word order.
• We implemented a preprocessing step during both training and decoding time.
• We used the SMT Moses decoder [7] for decoding.

We give some definitions for our experiments:
• Baseline: the baseline phrase-based SMT system using the lexicalized reordering model in the Moses toolkit.
• Manual Rules: the phrase-based SMT system applying manual rules [23].
• Auto Rules: the phrase-based SMT system applying automatic rules [24].
• Auto Rules + Manual Rules: the phrase-based SMT system applying automatic rules, then applying manual rules.

Table 5. Our experimental systems on the English-Vietnamese parallel corpus

Name                        Description
Baseline                    Phrase-based system
Manual Rules                Phrase-based system with a corpus preprocessed using manual rules
Auto Rules                  Phrase-based system with a corpus preprocessed using automatically learned rules
Auto Rules + Manual Rules   Phrase-based system with a corpus preprocessed using automatically learned rules and manual rules

... English sentence according to Vietnamese word order and types of rules including noun phrase, adjectival and adverbial phrase, and preposition, which are described in Table 1.

6.3. Using automatic rules

We present our experiments on translating from English to Vietnamese in a statistical machine translation system; hence, the language pair chosen is English-Vietnamese. We used the Stanford Parser [14] to parse the source (English) sentences.

We used dependency parsing and rules extracted by training feature-rich discriminative classifiers to reorder source-side sentences. The rules are automatically extracted from the English-Vietnamese parallel corpus and the dependency parses of the English examples. Finally, we used these rules to reorder the source sentences. We evaluated our approach on English-Vietnamese machine translation tasks with the systems in Table 5, and the results show that it can outperform the baseline phrase-based SMT system.

Table 6. Size of phrase tables

Name                        Size of phrase table
Baseline                    1152216
Manual Rules                1231365
Auto Rules                  1213401
Auto Rules + Manual Rules   1253401
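
The source-side preprocessing described above (parse the English sentence, then reorder each head together with its direct children before training and decoding) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Token` structure, the `MOVE_AFTER_HEAD` rule table and the single adjective rule are hypothetical and are not the rule set or the classifier used in the paper. The sketch only shows how one rule of the form (head POS, child POS, dependency label) can move a pre-head modifier to a post-head position.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int   # 0-based position in the source sentence
    word: str
    pos: str     # Penn Treebank POS tag
    head: int    # index of the head token, or -1 for the root
    deprel: str  # dependency label


# Hypothetical rule table (illustration only): a child that matches
# (head POS, child POS, dependency label) and stands before its head
# is moved to directly follow the head.
MOVE_AFTER_HEAD = {
    ("NN", "JJ", "amod"),
    ("NNS", "JJ", "amod"),
}


def preorder(tokens):
    """Return the words reordered family by family (a head plus its children)."""
    children = {t.index: [] for t in tokens}
    root = None
    for t in tokens:
        if t.head < 0:
            root = t
        else:
            children[t.head].append(t)

    def linearize(head):
        before, moved, after = [], [], []
        for child in children[head.index]:
            rule = (head.pos, child.pos, child.deprel)
            if child.index < head.index and rule in MOVE_AFTER_HEAD:
                moved.append(child)      # pre-head modifier that the rule moves
            elif child.index < head.index:
                before.append(child)     # stays in front of the head
            else:
                after.append(child)      # already follows the head
        words = []
        for node in before:
            words += linearize(node)
        words.append(head.word)
        for node in moved + after:
            words += linearize(node)
        return words

    return linearize(root)


sentence = [
    Token(0, "he", "PRP", 1, "nsubj"),
    Token(1, "has", "VBZ", -1, "root"),
    Token(2, "great", "JJ", 3, "amod"),
    Token(3, "potential", "NN", 1, "dobj"),
]
print(" ".join(preorder(sentence)))  # he has potential great
```

In a pipeline like the one defined above, the same function would be applied to the source side of the training corpus and to every input sentence at decoding time, so that the translation model and the decoder see the same word order.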
24 T.H. Viet et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 2 (2017) 14-27
Table 7. Translation performance for the English-Vietnamese task

System                      BLEU (%)
Baseline                    36.89
Manual Rules                37.71
Auto Rules                  37.12
Auto Rules + Manual Rules   37.85

6.4. BLEU score

The results of our experiments in Table 6 show the size of the phrase tables built from the translation models based on our method. With this method, the translation model contains a wider variety of phrases, which gives the decoder more options for generating the best translation.

Table 7 describes the BLEU scores of our experiments. As we can see, with preprocessing applied in both training and decoding, the BLEU score of the "Auto Rules" system is 0.59 points lower than that of the "Manual Rules" system (37.12 versus 37.71). This result is due to the fact that the manual rules have better quality than the automatic rules. However, the "Auto Rules + Manual Rules" system is the best system, because applying the combined rules covers more linguistic phenomena.

The above results demonstrate the effect of applying transformation rules based on the dependency parse tree.

Table 8. Number of families in the English-Vietnamese corpus, by number of children of the head

Number of children   Number of heads   Description
1                    79142             Family has 1 child
2                    40822             Family has 2 children
3                    26008             Family has 3 children
4                    15990             Family has 4 children
5                    7442              Family has 5 children
6                    2728              Family has 6 children
7                    942               Family has 7 children
8                    307               Family has 8 children
9                    83                Family has 9 children

Table 9. Examples of translations produced by our system for input sentences sampled from the English-Vietnamese corpus

Example 1
Input sentence:          The coat was far too big - it completely enveloped him.
Translation (Baseline):  Chiếc áo khoác là quá lớn - nó hoàn toàn phủ anh ta.
Translation (Auto):      Chiếc áo khoác là quá lớn - nó phủ hoàn toàn anh ta.
Translation (human):     Chiếc áo khoác quá lớn - nó hoàn toàn phủ anh ta.

Example 2
Input sentence:          Manh Cuong is a young football player with potential great.
Translation (Baseline):  Manh Cuong là một cầu thủ bóng đá với nhiều tiềm năng.
Translation (Auto):      Manh Cuong là một cầu thủ bóng đá trẻ có tiềm năng lớn.
Translation (human):     Mạnh Cường là cầu thủ bóng đá trẻ rất nhiều triển vọng.
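
As a quick check of the figures reported in Tables 6, 7 and 8, the short sketch below recomputes the BLEU gain of each system over the baseline, the relative growth of each phrase table, and the share of families that have at most four children. The numbers are copied from the tables; the function names `gains_over_baseline`, `relative_growth` and `coverage` are hypothetical helpers introduced only for this illustration.

```python
# Values copied from Tables 6, 7 and 8.
BLEU = {
    "Baseline": 36.89,
    "Manual Rules": 37.71,
    "Auto Rules": 37.12,
    "Auto Rules + Manual Rules": 37.85,
}
PHRASE_TABLE_SIZE = {
    "Baseline": 1152216,
    "Manual Rules": 1231365,
    "Auto Rules": 1213401,
    "Auto Rules + Manual Rules": 1253401,
}
FAMILIES = {1: 79142, 2: 40822, 3: 26008, 4: 15990, 5: 7442,
            6: 2728, 7: 942, 8: 307, 9: 83}


def gains_over_baseline(scores, baseline="Baseline"):
    """BLEU difference of every non-baseline system relative to the baseline."""
    return {name: round(score - scores[baseline], 2)
            for name, score in scores.items() if name != baseline}


def relative_growth(sizes, baseline="Baseline"):
    """Phrase-table growth of every non-baseline system, as a fraction."""
    return {name: (size - sizes[baseline]) / sizes[baseline]
            for name, size in sizes.items() if name != baseline}


def coverage(family_counts, max_children):
    """Share of families whose head has at most max_children children."""
    total = sum(family_counts.values())
    kept = sum(n for k, n in family_counts.items() if k <= max_children)
    return kept / total


for name, gain in gains_over_baseline(BLEU).items():
    print(f"{name}: {gain:+.2f} BLEU")
for name, growth in relative_growth(PHRASE_TABLE_SIZE).items():
    print(f"{name}: {growth:.2%} more phrase pairs")
print(f"Families with at most 4 children: {coverage(FAMILIES, 4):.1%}")
```

Under these figures, families with at most four children account for roughly 93% of all families counted in Table 8.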
7. Analysis and discussion

We have found that the results of our experiments correlate sufficiently with manually assessed translation quality. Besides, we have also found some causes of errors, such as the quality of the source-sentence parse trees, the quality of the word alignment and the quality of the corpus. All of these errors can affect the automatic reordering rules. Table 9 shows examples, for input sentences from the English-Vietnamese test set, in which the translations produced by our system are better than those of the baseline system. Further examples of translations for input sentences sampled randomly from our corpus are available online. Some phrases in the English source sentences were reordered to correspond to the Vietnamese target sentence order. We focus mainly on some typical relations, such as noun phrases, adjectival and adverbial phrases, and prepositions, and created a manually written reordering rule set for the English-Vietnamese language pair. Our study employed dependency syntax and transformation rules to reorder the source sentence and applied them to English-to-Vietnamese translation systems.

For example, in a noun phrase there always exists a head noun, with components before and after it. These auxiliary components move to new positions according to the Vietnamese word order. These rules can cover popular source-language linguistic phenomena and their target-language equivalents, as follows:
• The phrase-based systems applying rules with category JJ or JJS
• The phrase-based systems applying rules with category NN or NNS
• The phrase-based systems applying rules with category IN or TO

Based on these phenomena, translation quality has significantly improved. We carried out an error analysis of the sentences and compared them with the golden reordering. Our analysis also shows the benefits of automatic reordering rules for translation quality. In combination with the machine learning method in related work [21], it shows that a classifier-based method can solve reordering problems automatically.

According to the typical differences in word order between English and Vietnamese, we have created a set of automatic rules for reordering words in English sentences according to Vietnamese word order, with types of rules including noun phrases, adjectival and adverbial phrases, as well as prepositional phrases. Table 8 gives statistics on the families in our corpus, including those with four or more children. In our approach, the number of children in each family is limited to 4, so the number of children in each family is the same in the target language (Vietnamese).

The manual rules have good quality [27, 18], and the phrase-based SMT system applying manual rules is better than the phrase-based SMT system applying automatic rules. We believe that the quality of the phrase-based SMT system applying automatic rules will improve when we have a better corpus.

8. Conclusion

In this paper, we present a preprocessing approach based on the dependency parser. The proposed approach is applied to an English-Vietnamese translation system. The experimental results show that our approach achieved improvements in BLEU scores over a state-of-the-art phrase-based baseline system. By applying manual rules and automatic rules, the quality of the English-Vietnamese translation system improves. In our study, our rules cover some linguistic reordering phenomena. These reordering rules benefit the English-Vietnamese language pair.

We will focus much more on word order problems involving linguistic reordering phenomena in English-Vietnamese, in order to learn better dependency-based reordering rules (manual rules and automatic rules). This is necessary for improving SMT systems and might lead to their wider adoption.
Acknowledgments

The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).

References

[1] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A method for automatic evaluation of machine translation, in: ACL, 2002.
[2] J. Cai, M. Utiyama, E. Sumita, Y. Zhang, Dependency-based pre-ordering for Chinese-English machine translation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
[3] M. Collins, P. Koehn, I. Kucerová, Clause restructuring for statistical machine translation, in: Proc. ACL 2005, Ann Arbor, USA, 2005, pp. 531-540.
[4] C. Ding, K. Sakanushi, H. Touji, M. Yamamoto, Inter-, intra-, and extra-chunk pre-ordering for statistical Japanese-to-English machine translation, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15 (3) (2016) 20:1-20:28. doi:10.1145/2818381.
[5] URL
[6] D. Genzel, Automatically learning source-side reordering rules for large scale machine translation, in: Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, 2010, pp. 376-384.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: An update, SIGKDD Explor. Newsl. 11 (1) (2009) 10-18.
[8] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: Proceedings of ACL, Demonstration Session, 2007.
[9] P. Koehn, F. J. Och, D. Marcu, Statistical phrase-based translation, in: Proceedings of HLT-NAACL 2003, Edmonton, Canada, 2003, pp. 127-133.
[10] T. L. Nguyen, M. L. Ha, V. H. Nguyen, T. M. H. Nguyen, P. Le-Hong, Building a treebank for Vietnamese dependency parsing, in: 2013 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, RIVF 2013, Hanoi, Vietnam, November 10-13, 2013, pp. 147-151.
[11] F. J. Och, H. Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29 (1) (2003) 19-51.
[12] M.-C. de Marneffe, B. MacCartney, C. D. Manning, Generating typed dependency parses from phrase structure parses, in: Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.
[13] A. Stolcke, SRILM - an extensible language modeling toolkit, in: Proceedings of International Conference on Spoken Language Processing, Vol. 29, 2002, pp. 901-904.
[14] N. Bach, Q. Gao, S. Vogel, Source-side dependency tree reordering models with subtree movements and constraints, in: Proceedings of the Twelfth Machine Translation Summit (MTSummit-XII), International Association for Machine Translation, Ottawa, Canada, 2009.
[15] D. Cer, M.-C. de Marneffe, D. Jurafsky, C. D. Manning, Parsing to Stanford dependencies: Trade-offs between speed and accuracy, in: 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010.
[16] D. Chiang, A hierarchical phrase-based model for statistical machine translation, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 2005, pp. 263-270.
[17] J. Daiber, M. Stanojevic, W. Aziz, K. Sima'an, Examining the relationship between preordering and word order freedom in machine translation, in: Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August. Association for Computational Linguistics, 2016.
[18] I. Goto, M. Utiyama, E. Sumita, S. Kurohashi, Preordering using a target-language parser via cross-language syntactic projection for statistical machine translation, ACM Transactions on Asian and Low-Resource Language Information Processing 14 (3) (2015) 13.
[19] N. Habash, Syntactic preprocessing for statistical machine translation, in: Proceedings of the 11th MT Summit, 2007.
[20] C. Hadiwinoto, Y. Liu, H. T. Ng, To swap or not to swap? Exploiting dependency word pairs for reordering in statistical machine translation, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] C. Hadiwinoto, H. T. Ng, A dependency-based neural reordering model for statistical machine translation, arXiv preprint arXiv:1702.04510, 2017.
[22] U. Lerner, S. Petrov, Source-side classifier preordering for machine translation, in: EMNLP, 2013, pp. 513-523.
[23] H. V. Huy, P.-T. Nguyen, T.-L. Nguyen, M. Nguyen, Bootstrapping phrase-based statistical machine translation via WSD integration, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013), 2013, pp. 1042-1046.
[24] V. H. Tran, V. V. Nguyen, M. L. Nguyen, Improving English-Vietnamese statistical machine translation using preprocessing dependency syntactic, in: Proceedings of the 2015 Conference of the Pacific Association for Computational Linguistics (PACLING 2015), pp. 115-121.
[25] V. H. Tran, H. T. Vu, V. V. Nguyen, M. L. Nguyen, A classifier-based preordering approach for English-Vietnamese statistical machine translation, in: 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016).
[26] L. Wang, Support Vector Machines: Theory and Applications, Vol. 177, Springer Science & Business Media, 2005.
[27] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016.
[28] P. Xu, J. Kang, M. Ringgaard, F. Och, Using a dependency parser to improve SMT for subject-object-verb languages, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Boulder, Colorado, 2009, pp. 245-253.
[29] N. Yang, M. Li, D. Zhang, N. Yu, A ranking-based approach to word reordering for statistical machine translation, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Association for Computational Linguistics, 2012, pp. 912-920.