A boosting classification approach based on SOM

Self-organizing map (SOM) is well known for its ability to visualize data and reduce their dimensionality, and it has been a useful unsupervised tool for clustering problems for years. In this paper, a new classification framework based on SOM is introduced. In this approach, SOM is combined with learning vector quantization (LVQ) to form a modified version of the SOM classifier, SOM-LVQ. The classification system is further improved by applying an adaptive boosting algorithm with SOM-LVQ classifiers as base learners. Two decision fusion strategies are adopted in the boosting algorithm: majority voting and weighted voting. Experimental results on a real dataset show that the newly proposed classification approach for SOM outperforms the traditional supervised SOM. The results also suggest that this model is applicable to real classification problems.

…in the middle region of a class that it does not represent, its weight vector may have to travel through a long path to get out of its surrounding region, because the weights of such a neuron will be repulsed by the vectors in the regions it must cross. As a result, that neuron may not be able to reach the region of its correctly labeled data. This problem can be solved by a proper label assignment.

Adaptive boosting combines multiple weak learners to create a strong learner, which eventually improves the prediction power of the model. Originally, the adaptive boosting algorithm was developed for the binary classification problem [14].
However, the classification problem gets more complicated in the multi-class case. One simple solution is to break the problem down into several two-class problems. Zhu et al. [15] introduced an algorithm, called SAMME, that generalizes the original binary Adaboost to the multi-class problem. Motivated by the SAMME algorithm, this research also focuses on the multi-class classification problem, with SOM-LVQ as the base learner.
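The key change SAMME makes to binary Adaboost is an extra log(K - 1) term in the learner weight, which keeps the weight positive whenever a base learner does better than random guessing among K classes [15]:

\[
\alpha^{(m)} = \log\frac{1 - \mathrm{err}^{(m)}}{\mathrm{err}^{(m)}} + \log(K - 1),
\]

where \(\mathrm{err}^{(m)}\) is the weighted training error of the m-th base learner; for K = 2 this reduces to the original binary Adaboost weight.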
Adaptive boosting can be applied to any supervised machine learning algorithm. However, it is pointed out by Hastie et al. [13] that the Adaboost algorithm works well with weak learners, and that the decision tree model is especially suited for boosting. Adaboost mainly focuses on reducing bias, so the base learners often considered for boosting are weak models with low variance but high bias. The most important motivation for using low-variance, high-bias models as weak learners is that such models are in general less computationally expensive to fit. Indeed, since the computations that fit the successive models cannot be done in parallel, it could become too expensive to fit several complex models sequentially.

SOM-LVQ can be considered a weak learner, since it applies a naïve method (usually majority voting) to label its nodes and therefore often misclassifies samples positioned in the border regions between classes. In this research, the supervised SOM, i.e., the SOM-LVQ model, is used as the weak learner for the Adaboost algorithm. In the Adaboost algorithm, multiple SOM-LVQ models are generated sequentially, and their outputs are combined following one of the pre-determined strategies below.
Majority voting strategy

In this strategy, all base learners have equal weights. Given a test sample, the base learners provide multiple classification answers, each based on the label of the BMU in that base learner. These answers are fused, and the final decision is the class label that receives the most votes across all base learners.
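A minimal sketch of this fusion rule, assuming integer class labels 0..K-1 and the per-learner BMU-label predictions already collected (the helper name is illustrative):

```python
import numpy as np

def majority_vote(predictions):
    """Fuse per-learner class predictions by simple majority.

    predictions: (n_learners, n_samples) integer class labels, one row
    per SOM-LVQ base learner (the BMU label for each test sample).
    Ties are broken toward the lower class index by argmax.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For each sample (column), count the votes received by each class.
    counts = np.apply_along_axis(
        lambda votes: np.bincount(votes, minlength=n_classes),
        axis=0, arr=predictions)          # shape: (n_classes, n_samples)
    return counts.argmax(axis=0)          # most-voted class per sample
```

For example, majority_vote([[0, 1], [0, 2], [1, 2]]) returns [0, 2]: the first sample gets two votes for class 0 and the second gets two votes for class 2.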
Weighted voting strategy

Different from the majority voting strategy, in weighted voting each base learner's answer is assigned a weight based on the weight of the BMU in that base learner. Specifically, after training, each node of the SOM-LVQ model is assigned a class label together with a weight determining how confidently that node represents the label of the samples closest to it. This weight is set to the number of times the node is selected as the BMU during the training process; if a node never wins during training, its weight is set to a very small value. At the fusion stage, the weights belonging to each class label are summed, and the class with the highest total weight is chosen.
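A minimal sketch of this fusion rule, assuming the BMU hit counts recorded during training are available per learner (the helper name and the eps default are illustrative; eps plays the role of the "very small value" above):

```python
import numpy as np

def weighted_vote(predictions, bmu_weights, eps=1e-6):
    """Fuse predictions using per-node confidence weights.

    predictions: (n_learners, n_samples) class labels, the label of the
        BMU found in each base learner for each test sample.
    bmu_weights: (n_learners, n_samples) weight of that BMU, i.e. the
        number of times the node won as BMU during training.
    """
    predictions = np.asarray(predictions)
    weights = np.maximum(np.asarray(bmu_weights, dtype=float), eps)
    n_classes = predictions.max() + 1
    n_samples = predictions.shape[1]
    scores = np.zeros((n_classes, n_samples))
    for pred, w in zip(predictions, weights):
        # Add this learner's BMU weight to the class it voted for.
        scores[pred, np.arange(n_samples)] += w
    return scores.argmax(axis=0)  # class with the highest total weight
```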
IV. EXPERIMENTAL RESULTS

Dataset

In this research, the Ecoli dataset collected from the UCI Machine Learning Repository [16] is used. The dataset contains protein localization sites. There are 336 instances with 8 attributes in the dataset, and each sample has a class label representing the localization site of a protein. The attribute information is given as follows.

1. Sequence Name: Accession number for the SWISS-PROT database.
2. mcg: McGeoch's method for signal sequence recognition.
3. gvh: von Heijne's method for signal sequence recognition.
4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
6. aac: Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
7. alm1: Score of the ALOM membrane spanning region prediction program.
8. alm2: Score of the ALOM program after excluding putative cleavable signal regions from the sequence.

There are 8 class labels in the dataset, distributed as shown in Table 1.

Table 1. The distribution of data samples in the dataset

Class code   Class name                                           Number of samples
0            cp (cytoplasm)                                       143
1            im (inner membrane without signal sequence)           77
2            pp (periplasm)                                        52
3            imU (inner membrane, uncleavable signal sequence)     35
4            om (outer membrane)                                   20
5            omL (outer membrane lipoprotein)                       5
6            imL (inner membrane lipoprotein)                       2
7            imS (inner membrane, cleavable signal sequence)        2

As presented in Table 1, the majority of the samples fall into the first 5 classes, so in the experimental results the classification performance of the system for classes 5, 6, and 7 is negligible.

To evaluate the performance of the classification model, the following metrics are used.

Precision is the number of correct positive samples divided by the number of positive results predicted by the model:

\[ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{7} \]

Recall is the number of correct positive samples divided by the total number of actual positive samples:

\[ \text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{8} \]

In any classification model, the accuracy score, i.e., the rate of correct classifications, is an important measure of model quality. However, because this dataset has a significant class imbalance, accuracy is not necessarily a precise indicator of system performance. Instead, the F1-score, the harmonic mean of precision and recall, is used:

\[ F_1 = 2\,\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9} \]
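The averaging used to turn these per-class scores into single numbers is not spelled out; the sketch below assumes macro-averaging (an unweighted mean over classes), one natural choice given the class imbalance just discussed:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, n_classes):
    """Per-class precision, recall and F1 as in Eqs. (7)-(9),
    macro-averaged over classes (assumption, see text)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_all, r_all, f_all = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0      # Eq. (7)
        r = tp / (tp + fn) if tp + fn else 0.0      # Eq. (8)
        f = 2 * p * r / (p + r) if p + r else 0.0   # Eq. (9)
        p_all.append(p); r_all.append(r); f_all.append(f)
    return np.mean(p_all), np.mean(r_all), np.mean(f_all)
```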
Results and discussions

The dataset is randomly divided into training (75% of samples) and testing (25% of samples) subsets. SOM-LVQ base learners are generated with a size of 10 x 10 neurons. Table 2 presents the classification results for different setups.
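A minimal data-preparation sketch matching this setup (pandas and scikit-learn assumed; the UCI file path follows the classic repository layout and may have moved; the class coding follows Table 1):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ecoli data from the UCI repository [16]; the file is whitespace-separated.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data"
cols = ["seq_name", "mcg", "gvh", "lip", "chg", "aac", "alm1", "alm2", "site"]
df = pd.read_csv(URL, sep=r"\s+", header=None, names=cols)

X = df[cols[1:-1]].to_numpy()                      # the 7 numeric attributes
code = {"cp": 0, "im": 1, "pp": 2, "imU": 3,       # class codes as in Table 1
        "om": 4, "omL": 5, "imL": 6, "imS": 7}
y = df["site"].map(code).to_numpy()

# Random 75% / 25% split as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)
```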
Supervised SOM is generally a modified version of the traditional SOM, in which each node is assigned a label corresponding to the class of its closest training sample after the training process. This supervised SOM model is widely used in the literature.

Experimental results show that the proposed SOM-LVQ outperforms the traditional supervised SOM commonly used in the literature in terms of all performance measures. Additionally, the boosting algorithm significantly improves the quality of the SOM-LVQ model. This is due to two main reasons.

First, in the boosting algorithm, multiple base learners are created sequentially based on the misclassified samples from previous models. This helps the base classifiers learn different knowledge from different training subsets, especially from the samples whose input-output relationships caused earlier models to misclassify them.

Second, each base learner is created from a small subset of the training data. This helps each learner capture different natural characteristics of the data. As a result, when the outputs of all base learners are combined, these different angles of information are pooled and provide a better decision than any single classification model.
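To make the sequential re-weighting concrete, below is a minimal SAMME-style boosting loop. fit_base is a hypothetical stand-in for training one SOM-LVQ learner; drawing a small weighted subsample inside fit_base would mirror the per-learner subsets described above, and the paper's own training details may differ:

```python
import numpy as np

def samme_boost(X, y, n_classes, fit_base, n_rounds=10):
    """SAMME-style boosting loop (sketch). fit_base(X, y, w) must return
    a fitted model with a .predict(X) method, e.g. one SOM-LVQ learner
    trained with sample weights w."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # start with uniform weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        model = fit_base(X, y, w)
        miss = model.predict(X) != y
        err = np.sum(w[miss]) / np.sum(w)      # weighted training error
        if err >= 1 - 1.0 / n_classes:         # no better than random: stop
            break
        err = max(err, 1e-10)
        alpha = np.log((1 - err) / err) + np.log(n_classes - 1)  # SAMME weight
        w *= np.exp(alpha * miss)              # up-weight missed samples
        w /= w.sum()
        learners.append(model)
        alphas.append(alpha)
    return learners, alphas
```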
Table 2: Classification results of different model setups

Model            Setup                        Precision (%)   Recall (%)   F1 (%)
SOM-LVQ          Single model                 73.4            79.6         75.3
SOM-LVQ          Boosting (majority voting)   85.7            83.9         83.9
SOM-LVQ          Boosting (weighted voting)   87.4            85.2         86.3
Supervised SOM   Single model                 69.5            62.7         64.6

Regarding the Adaboost algorithm, weighted voting works slightly better than majority voting because it utilizes the relationship between each node and the training data established during the training process. Specifically, if one node is selected as the BMU more frequently during training than other nodes, its weight vector is closely related to the input data, which means it is more relevant for representing the region of its class in the training data. Assigning a classification weight to each node is an effective way to reflect that relationship and helps improve classification performance. Here, each base classifier does not have one fixed weight; instead, it has multiple weights corresponding to the nodes inside it. This dynamic weighting method is designed to adapt to the nature of the data, in which data samples of the same class may have different input distributions.

As expected, the traditional supervised SOM has the worst classification performance, since its nodes are trained only to represent the clusters of the input data and each node is simply assigned the label of its closest training sample. As a result, the labeled nodes do not represent the regions of their respective classes well. SOM-LVQ is much better than supervised SOM because its nodes are first arranged by the SOM training process and then assigned labels before LVQ training, which helps each node better represent the region of its class. However, if only one single SOM-LVQ model is used, its nodes cannot capture all possible natural characteristics of the data.

V. CONCLUSIONS

In this research, a new framework to improve the classification capability of SOM is introduced. The SOM algorithm is good at representing the clusters of the input data, while the LVQ algorithm is good at moving labeled nodes toward their representative class regions. The combination of the SOM and LVQ algorithms in the proposed method is empirically shown to be effective compared with the commonly used supervised SOM. An adaptive boosting algorithm with two different fusion strategies is also proposed in this research. This approach is effective in utilizing the natural information of the data through the sequential creation of multiple base SOM-LVQ models. Weighting each learner by assigning different weight values to the nodes inside it is a flexible way to express how closely each learner is related to the input data. Experimental results show that the new approach significantly improves the classification performance of SOM structures. In future work, more real applications of the proposed classification framework will be investigated on different real datasets.

REFERENCES

[1] T. Kohonen, "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, vol. 43, no. 1, pp. 59-69, 1982.
[2] M.T. Hagan, H.B. Demuth, M.H. Beale and O.D. Jesus, Neural Network Design, 2nd edition, ISBN-13: 978-0-9717321-1-7, 1996.
[3] A. Rauber and D. Merkl, "Automatic Labeling of Self-Organizing Maps: Making a Treasure-Map Reveal Its Secrets," Methodologies for Knowledge Discovery and Data Mining, pp. 228-237, 1999.
[4] P.N. Suganthan, "Hierarchical overlapped SOM's for pattern classification," IEEE Transactions on Neural Networks, vol. 10, pp. 193-196, 1999.
[5] O. Kurasova, "Strategies for Big Data Clustering," 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, 2014.
[6] P. Stefanovič and O. Kurasova, "Outlier detection in self-organizing maps and their quality estimation," Neural Network World, vol. 28, no. 2, pp. 105-117, 2018.
[7] L.A. Silva and E.D.M. Hernandez, "A SOM combined with KNN for classification task," Proceedings of the International Joint Conference on Neural Networks, pp. 2368-2373, 2011.
[8] M. Mishra and H.S. Behera, "Kohonen Self Organizing Map with Modified K-means Clustering for High Dimensional Data Set," International Journal of Applied Information Systems, pp. 34-39, 2012.
[9] T. Kohonen, Self-Organizing Maps, 3rd edition, 2000, pp. X-XI.
[10] M.C. Kind and R.J. Brunner, "SOMz: photometric redshift PDFs with self-organizing maps and random atlas," 2013.
[11] E.D. Bodt, M. Cottrell, P. Letrémy and M. Verleysen, "On the use of self-organizing maps to accelerate vector quantization," Neurocomputing, vol. 56, pp. 187-203, 2004.
[12] Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Journal of Computer and System Sciences, pp. 119-139, 1997.
[13] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, 2nd edition, Stanford, California: Springer, 2008, p. 340.
[14] P. Dangeti, Statistics for Machine Learning, Birmingham: Packt Publishing, July 2017.
[15] J. Zhu, H. Zou, S. Rosset and T. Hastie, "Multi-class AdaBoost," Statistics and Its Interface, vol. 2, pp. 349-360, 2009.
[16] "UCI Machine Learning Repository," [Online]. Available: https://archive.ics.uci.edu/ml/index.php.
[17] T. Kohonen and P. Somervuo, "How to make large self-organizing maps for nonvectorial data," Neural Networks, vol. 15, no. 8-9, pp. 945-952, 2002.
A METHOD TO IMPROVE THE DATA CLASSIFICATION CAPABILITY OF SOM USING A BOOSTING ALGORITHM

Abstract: The self-organizing map (SOM) is known as an effective tool for visualizing and reducing the dimensionality of data. SOM is an unsupervised learning tool that is very useful for clustering problems. This paper presents a new approach to the classification problem based on SOM. In this method, SOM is combined with the learning vector quantization (LVQ) training algorithm to form a new model, SOM-LVQ. The data classification model using SOM-LVQ is further improved by applying the adaptive boosting (Adaboost) algorithm with SOM-LVQ models as the base classifiers. To combine the results from the base classifiers, two techniques are applied: majority voting and weighted voting. Experimental results on a real dataset show that the classification method newly proposed in this research to improve SOM outperforms the traditional SOM model. The results also show that the prospects for practical application of this model are very promising.

Keywords: Self-organizing map, learning vector quantization, boosting algorithm, weighted fusion.
Hoa Dinh Nguyen earned his bachelor's and master's degrees from Hanoi University of Technology in 2000 and 2002, respectively. He received his Ph.D. degree in electrical and computer engineering from Oklahoma State University in 2013. He is now a lecturer in information technology at PTIT. His research interests include dynamic systems, data mining, and machine learning.
