Single concatenated input is better than independent multiple-input for CNNs to predict chemical-induced disease relation from literature

Chemical compounds (drugs) and diseases are among the top searched keywords on the PubMed database of biomedical literature by biomedical researchers all over the world (according to a study in 2009). Working with PubMed is essential for researchers to get insights into drugs' side effects (chemical-induced disease relations (CDR)), which is essential for drug safety and toxicity. It is, however, a heavy burden for them, as PubMed is a huge database of unstructured texts that grows very fast (~28 million scientific articles currently, approximately two deposited per minute). As a result, biomedical text mining has been empirically demonstrated to have great implications for biomedical research communities. Biomedical text has its own distinct challenging properties, attracting much attention from natural language processing communities. A large-scale study in 2018 showed that incorporating information into independent multiple-input layers outperforms concatenating it into a single input layer (for biLSTM), producing better performance than state-of-the-art CDR classification models. This paper demonstrates that for a CNN the opposite holds: concatenation is better for CDR classification. To this end, we develop a CNN-based model with multiple inputs concatenated for CDR classification. Experimental results on the benchmark dataset demonstrate that it outperforms other recent state-of-the-art CDR classification models.


P.T.Q. Trang, et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 36, No. 1 (2020) 11-16

[...] researchers. It can, however, be accelerated with the application of biomedical text mining, in particular for drug (chemical)-disease relation prediction. Biomedical text mining has been empirically demonstrated to have great implications for biomedical research communities [5-7].

Biomedical text has its own distinct challenging properties, attracting much attention from natural language processing communities [8, 9]. In 2004, an annual challenge called BioCreative (Critical Assessment of Information Extraction systems in Biology) was launched for biomedical text mining researchers. In 2016, researchers from NCBI organized the chemical disease relationship extraction task for the challenge [10].

To date, almost all proposed models only predict relationships between chemicals and diseases that appear within a sentence (intra-sentence relationships) [11]. We note that the models that produce state-of-the-art performance are mainly based on deep neural architectures [12-14], such as recurrent neural networks (RNN) like bi-directional long short-term memory (biLSTM) in [15] and convolutional neural networks (CNN) in [16-18].

Recently, Le et al. developed a biLSTM-based intra-sentence biomedical relation prediction model that incorporates various informative linguistic properties in an independent multiple-layer manner [19]. Their experimental results demonstrate that incorporating information into independent multiple-input layers outperforms concatenating them into a single input layer (for biLSTM) [...]

The rest of this paper is organized as follows. Section 2 describes the proposed method in detail. Experimental results are discussed in Section 3. Finally, Section 4 concludes this paper.

2. Method

Given a preprocessed and tokenized sentence containing two entities of the types of interest (i.e. chemical and disease), our model first extracts the shortest dependency path (SDP) on the dependency tree between the two entities. The SDP contains the tokens (together with the dependency relations between them) that are important for understanding the semantic connection between the two entities (see Figure 1 for an example of the SDP).

Figure 1. Dependency tree for an example sentence. The shortest dependency path between two entities (i.e. depression and methyldopa) goes through the tokens “occurring” and “patients”.

Each token t on an SDP is encoded with the embedding $e_t$ by concatenating three embeddings of equal dimension d (i.e. $e_t = e^{w} \oplus e^{pt} \oplus e^{ps}$), which represent important linguistic information: the token itself ($e^{w}$), its part of speech (POS) ($e^{pt}$) and its position ($e^{ps}$). The first two partial embeddings are fine-tuned during model training. Position embeddings are indexed by distance pairs $[d^{l} \% 5, d^{r} \% 5]$, where $d^{l}$ and $d^{r}$ are the distances from a token to the left and the right entity, respectively.

For each dependency relation r on the SDP, its embedding has dimension 3*d and is randomly initialized and fine-tuned as a model parameter during training.

Thus, each SDP is embedded into the $\mathbb{R}^{N \times D}$ space (see Figure 2), where N is the number of all tokens and dependency relations on the SDP and D = 3*d. The embedded SDP is fed as input into a conventional convolutional neural network (CNN [20]) to classify whether or not a predefined relation (i.e. chemical-induced disease relation) holds between the two entities.

Figure 2. Embedding by concatenation mechanism of the Shortest Dependency Path (SDP) from the example in Figure 1.

2.1. Multiple-channel embedding

For multi-channel embedding, instead of concatenating the three partial embeddings of each token on an SDP, we maintain three independent embedding channels for them. Channels for relations on the SDP are identical embeddings. As a result, SDPs are embedded into $\mathbb{R}^{n \times d \times c}$, where n is the number of all tokens and dependency relations between them, d is the dimension of the embeddings, and c = 3 is the number of embedding channels.

To calculate feature maps for the CNN, we follow the scheme of Kim (2014) [21]. Each CNN filter $f_i$ is slid along each embedding channel c independently, creating a corresponding feature map $\mathcal{F}_{ic}$. The max pooling operator is then applied to the feature maps on all channels (three in our case) to create a feature value for filter $f_i$ (Figure 3).

2.2. Hyper-parameters

The model’s hyper-parameters are empirically set as follows:
● Filter size: n x d, where d is the embedding dimension (300 in our experiments) and n is the number of consecutive elements (tokens/POS tags, relations) on SDPs (Figure 3).
● Number of filters: 32 filters of size 2 x 300, 128 of 3 x 300, 32 of 4 x 300, 96 of 5 x 300.
● Number of hidden layers: 2.
● Number of units at each layer: 128.
- Number of training epochs: 100
- Patience for early stopping: 10
- Optimizer: Adam

3. Experimental results

3.1. Dataset

Our experiments are conducted on the BioCreative V data [10]. It is an annotated text corpus that consists of human annotations for chemicals, diseases and their chemical-induced disease (CID) relations at the abstract level. The dataset contains 1500 PubMed articles divided into three subsets for training, development and testing. Of the 1500 articles, most were selected from the CTD data set (1400/1500 articles). The remaining 100 articles, in the test set, are completely different articles that were carefully selected. All of the data is manually curated. Detailed statistics are shown in Table 1.

3.2. Model evaluation

We merge the training and development subsets of the BioCreative V CDR corpus into a single training dataset, which is then divided into new training and validation/development data with a ratio of 85%:15%. To stop the training process at the right time, we apply early stopping on the F1-score on the new validation data.

The entire text is passed through a sentence splitter. Then, based on the disease and chemical names marked in the previous step, we select all sentences containing at least one pair of chemical-disease entities. With all such sentences found, we can classify the relation for each pair of chemical-disease entities. We perform model training and evaluation 15 times on the new training and development sets; the averaged F1 on the test set is chosen as the final evaluation result across the entire dataset, to make sure that the model can work well with unfamiliar samples.

Finally, the models that achieve the best results at the sentence level are applied to the problem at the abstract level for comparison with other very recent state-of-the-art methods.
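The evaluation protocol of Section 3.2 (an 85%:15% split of the merged data, early stopping on validation F1 with a patience of 10) can be sketched as follows. This is a minimal illustration, not the authors' code: `split_train_val`, `train_with_early_stopping` and the `validation_f1` callback are hypothetical names, the toy F1 curve stands in for a real training loop, and splitting over 1000 articles (rather than over sentences or entity pairs) is an assumption, since the text does not state the unit of the split.

```python
import random


def split_train_val(examples, val_ratio=0.15, seed=0):
    """Shuffle the merged training+development examples and split them 85%:15%."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


def train_with_early_stopping(validation_f1, max_epochs=100, patience=10):
    """Stop once validation F1 has not improved for `patience` consecutive epochs.

    `validation_f1(epoch)` is a hypothetical callback that trains one epoch
    and returns the F1-score on the validation data.
    """
    best_f1, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        f1 = validation_f1(epoch)
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_f1


def toy_f1(epoch):
    """Toy validation curve (an assumption for illustration): peaks at epoch 20."""
    return 0.57 - 0.001 * abs(epoch - 20)


# 500 training + 500 development articles merged, then split 85%:15%.
train, val = split_train_val(range(1000))
best_epoch, best_f1 = train_with_early_stopping(toy_f1)
```

Repeating the whole procedure with different seeds and averaging the resulting test F1 would correspond to the 15 repeated runs described above.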
Figure 3. Model architecture with three-channel embedding as an input for an SDP.
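The difference between the single concatenated input (Section 2) and the three-channel input (Section 2.1) can be made concrete with a small shape-level sketch. It is written in plain Python under several simplifying assumptions that are not from the paper: toy dimensions (d = 4 instead of 300), random values instead of trained embeddings, every SDP element treated as a token, and no bias or non-linearity in the convolution. All names (`rand_vec`, `feature_map`, `filt_concat`, `filt_channel`) are illustrative.

```python
import random

random.seed(0)
d = 4   # embedding dimension per information type (300 in the paper)
n = 5   # number of elements (tokens and relations) on a toy SDP


def rand_vec(size):
    return [random.uniform(-1.0, 1.0) for _ in range(size)]


# Three partial embeddings per SDP element: word, POS tag, position.
word, pos, position = ([rand_vec(d) for _ in range(n)] for _ in range(3))

# Single concatenated input: each element becomes one vector of size D = 3*d.
concatenated = [w + p + s for w, p, s in zip(word, pos, position)]

# Independent multiple-channel input: three parallel n x d matrices.
channels = [word, pos, position]


def feature_map(matrix, filt):
    """Slide a k x width filter along the rows of `matrix` (valid convolution)."""
    k = len(filt)
    return [
        sum(f * x for frow, xrow in zip(filt, matrix[i:i + k])
            for f, x in zip(frow, xrow))
        for i in range(len(matrix) - k + 1)
    ]


k = 2  # filter height: number of consecutive SDP elements covered
filt_concat = [rand_vec(3 * d) for _ in range(k)]  # one k x 3d filter
filt_channel = [rand_vec(d) for _ in range(k)]     # one k x d filter used on each channel

# Concatenated input: one feature map, then max pooling over positions.
feature_concat = max(feature_map(concatenated, filt_concat))

# Multi-channel input: one feature map per channel, max pooling over all of them.
maps = [feature_map(channel, filt_channel) for channel in channels]
feature_channels = max(max(m) for m in maps)
```

In both variants each filter yields a single pooled value, so, assuming one value per filter, the filter counts of Section 2.2 would give a pooled feature vector of 32 + 128 + 32 + 96 = 288 entries.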
Table 1. Statistics on BioCreative V CDR dataset [10]

Dataset      Articles  Chemical        Disease         CID
                       Mention  ID     Mention  ID
Training     500       5203     1467   4182     1965   1038
Development  500       5347     1507   4244     1865   1012
Test         500       5385     1435   4424     1988   1066
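The sentence-level filtering described in Section 3.2 turns annotations such as those counted in Table 1 into candidate chemical-disease pairs. A minimal sketch, assuming each sentence comes with a list of `(entity_type, entity_id)` annotations; this data layout, the function name `candidate_pairs` and the toy identifiers are illustrative assumptions, not the format of the BioCreative V files.

```python
from itertools import product


def candidate_pairs(sentences):
    """Yield (sentence_index, chemical_id, disease_id) for every
    chemical-disease pair that co-occurs in a sentence."""
    for index, annotations in enumerate(sentences):
        chemicals = sorted({eid for etype, eid in annotations if etype == "Chemical"})
        diseases = sorted({eid for etype, eid in annotations if etype == "Disease"})
        # Sentences lacking either entity type produce no pairs and are skipped.
        for chemical, disease in product(chemicals, diseases):
            yield index, chemical, disease


# Toy abstract with hypothetical identifiers C1, D1, D2.
sentences = [
    [("Chemical", "C1"), ("Disease", "D1")],
    [("Disease", "D2")],  # no chemical: contributes no candidate pair
    [("Chemical", "C1"), ("Disease", "D1"), ("Disease", "D2")],
]
pairs = list(candidate_pairs(sentences))
```

Each resulting pair would then be classified from the SDP between its two mentions.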
3.3. Results and comparison

Experimental results show that the model achieves an averaged F1 of 57% (precision of 55.6% and recall of 58.6%) at the abstract level. Compared with its variant that does not use dependency relations, we observe a large improvement of about 2.6% in F1, which is very significant (see Table 2). It indicates that dependency relations contain much information for relation extraction. Meanwhile, POS tag and position information are also very useful, contributing 0.9% of F1 improvement to the final performance of the model.

Table 2. Performance of our model with different linguistic information used as input

Information used                 Precision  Recall  F1
Tokens only                      53.7       55.4    54.5
Tokens, Dependency (depRE)       55.7       56.8    56.2
Tokens, depRE and POS tags       55.7       57.5    56.6
Tokens, depRE, POS and Position  55.6       58.6    57.0
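As a sanity check on Table 2, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below only uses the numbers printed in the table; the dictionary layout and the names `f1_score` and `table2` are illustrative. Since the reported figures are averages over repeated runs, the recomputed values need not match the reported F1 exactly, so they are compared with a small tolerance.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) rows copied from Table 2.
table2 = {
    "Tokens only": (53.7, 55.4, 54.5),
    "Tokens, Dependency (depRE)": (55.7, 56.8, 56.2),
    "Tokens, depRE and POS tags": (55.7, 57.5, 56.6),
    "Tokens, depRE, POS and Position": (55.6, 58.6, 57.0),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in table2.items()}

# Largest gap between the recomputed harmonic mean and the reported F1.
max_gap = max(abs(recomputed[name] - reported)
              for name, (_, _, reported) in table2.items())
```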
Compared with recent state-of-the-art models such as MASS [19], ASM [22], and the tree kernel based model [23], our model performs better (Table 3). Ours and MASS only exploit intra-sentence information (namely SDPs, POS and positions), ignoring prediction for cross-sentence relations, while the other two incorporate cross-sentence information. We note that cross-sentence relations account for 30% of all relations in the CDR dataset. This probably explains why ASM could achieve better recall (67.4%) than our model (58.6%).

Table 3. Performance of our model in comparison with other state-of-the-art models

Model                 Relations                  Precision  Recall  F1
Zhou et al., 2016     Intra- and inter-sentence  64.9       49.2    56.0
Panyam et al., 2018   Intra- and inter-sentence  49.0       67.4    56.8
Le et al., 2018       Intra-sentence             58.9       54.9    56.9
Our model             Intra-sentence             55.6       58.6    57.0

4. Conclusion

This paper experimentally demonstrates that CNNs perform better prediction of abstract-level chemical-induced disease relations in biomedical literature when using concatenated input embedding channels rather than independent multiple channels. The opposite holds for biLSTM, where multiple independent channels give better performance, as shown in a recent large-scale related study [Le et al., 2018]. To this end, this paper presents a model for prediction of chemical-induced disease relations in biomedical text based on a CNN with concatenated input embeddings. Experimental results on the benchmark dataset show that our model outperforms three recent state-of-the-art related models.

Acknowledgements

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2016.14.

References

[1] S.M. Paul, D.S. Mytelka, C.T. Dunwiddie, C.C. Persinger, B.H. Munos, S.R. Lindborg, A.L. Schacht, How to improve R&D productivity: The pharmaceutical industry's grand challenge, Nat Rev Drug Discov. 9(3) (2010) 203-14. https://doi.org/10.1038/nrd3078.
[2] J.A. DiMasi, New drug development in the United States from 1963 to 1999, Clinical pharmacology and therapeutics 69 (2001) 286-296. https://doi.org/10.1067/mcp.2001.115132.
[3] C.P. Adams, V. Van Brantner, Estimating the cost of new drug development: Is it really $802 million? Health Affairs 25 (2006) 420-428. https://doi.org/10.1377/hlthaff.25.2.420.
[4] R.I. Doğan, G.C. Murray, A. Névéol et al., Understanding PubMed user search behavior through log analysis, Oxford Database, 2009.
[5] G.K. Savova, J.J. Masanz, P.V. Ogren et al., Mayo clinical text analysis and knowledge extraction system (cTAKES): Architecture, component evaluation and applications, Journal of the American Medical Informatics Association, 2010.
[6] T.C. Wiegers, A.P. Davis, C.J. Mattingly, Collaborative biocuration-text mining development task for document prioritization for curation, Database 22 (2012) pp. bas037.
[7] N. Kang, B. Singh, C. Bui et al., Knowledge-based extraction of adverse drug events from biomedical text, BMC Bioinformatics 15, 2014.
[8] A. Névéol, R.L. Doğan, Z. Lu, Semi-automatic semantic annotation of PubMed queries: A study on quality, efficiency, satisfaction, Journal of Biomedical Informatics 44, 2011.
[9] L. Hirschman, G.A. Burns, M. Krallinger, C. Arighi, K.B. Cohen et al., Text mining for the biocuration workflow, Database Apr 18, 2012, pp. bas020.
[10] Wei et al., Overview of the BioCreative V Chemical Disease Relation (CDR) Task, Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, 2015.
[11] P. Verga, E. Strubell, A. McCallum, Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction, In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1 (2018) 872-884.
[12] Y. Shen, X. Huang, Attention-based convolutional neural network for semantic relation extraction, In: Proceedings of COLING 2016, the Twenty-sixth International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 2526-2536.
[13] Y. Peng, Z. Lu, Deep learning for extracting protein-protein interactions from biomedical literature, In: Proceedings of the BioNLP 2017 Workshop, Association for Computational Linguistics, Vancouver, Canada, 2016, pp. 29-38.
[14] S. Liu, F. Shen, R. Komandur Elayavilli, Y. Wang, M. Rastegar-Mojarad, V. Chaudhary, H. Liu, Extracting chemical-protein relations using attention-based neural networks, Database, 2018.
[15] H. Zhou, H. Deng, L. Chen, Y. Yang, C. Jia, D. Huang, Exploiting syntactic and semantics information for chemical-disease relation extraction, Database, 2016, pp. baw048.
[16] S. Liu, B. Tang, Q. Chen et al., Drug–drug interaction extraction via convolutional neural networks, Comput. Math. Methods Med. (2016) 1-8. https://doi.org/10.1155/2016/6918381.
[17] L. Wang, Z. Cao, G. De Melo et al., Relation classification via multi-level attention CNNs, In: Proceedings of the Fifty-fourth Annual Meeting of the Association for Computational Linguistics 1 (2016) 1298-1307. https://doi.org/10.18653/v1/P16-1123.
[18] J. Gu, F. Sun, L. Qian et al., Chemical-induced disease relation extraction via convolutional neural network, Database (2017) 1-12. https://doi.org/10.1093/database/bax024.
[19] H.Q. Le, D.C. Can, S.T. Vu, T.H. Dang, M.T. Pilehvar, N. Collier, Large-scale Exploration of Neural Relation Classification Architectures, In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2266-2277.
[20] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998) 2278-2324.
[21] Y. Kim, Convolutional neural networks for sentence classification, ArXiv preprint arXiv:1408.5882.
[22] C. Nagesh, Panyam, Karin Verspoor, Trevor Cohn and Kotagiri Ramamohanarao, Exploiting graph kernels for high performance biomedical relation extraction, Journal of biomedical semantics 9(1) (2018) 7.
[23] H. Zhou, H. Deng, L. Chen, Y. Yang, C. Jia, D. Huang, Exploiting syntactic and semantics information for chemical-disease relation extraction, Database, 2016.
