An information-theoretic metric based method for selecting clustering attribute

ABSTRACT—Clustering problem appears in many different fields like Data Mining, Pattern Recognition, Bioinfor-Matics, etc. The basic objective of clustering is to group objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters. Recently, many researchers have contributed to categorical data clustering, where data objects are made up of non-numerical attributes. Especially, rough set theory based attribute selection clustering approaches for categorical data have attracted much attention. The key to these approaches is how to select only one attribute that is the best to cluster the objects at each time from many candidates of attributes. In this paper, we review three rough set based techniques: Total Roughness (TR), Min-Min Roughness (MMR) and Maximum Dependency Attribute (MDA), and propose MAMD (Minimum value of Average Mantaras Distance), an alternative algorithm for hierarchical clustering attribute selection. MAMD uses Mantaras metric which is an information-theoretic metric on the set of partitions of a finite set of objects and seeks to determine a clustering attribute such that the average distance between the partition generated by this attribute and the partitions generated by other attributes of the objects has a minimum value. To evaluate and compare MAMD with three rough set based techniques, we use the concept of average intra-class similarity to measure the clustering quality of selected attribute. The experiment results show that the clustering quality of the attribute selected by our method is higher than that of attributes selected by TR, MMR and MDA methods

An information-theoretic metric based method for selecting clustering attribute trang 1

Trang 1

An information-theoretic metric based method for selecting clustering attribute trang 2

Trang 2

An information-theoretic metric based method for selecting clustering attribute trang 3

Trang 3

An information-theoretic metric based method for selecting clustering attribute trang 4

Trang 4

An information-theoretic metric based method for selecting clustering attribute trang 5

Trang 5

An information-theoretic metric based method for selecting clustering attribute trang 6

Trang 6

An information-theoretic metric based method for selecting clustering attribute trang 7

Trang 7

An information-theoretic metric based method for selecting clustering attribute trang 8

Trang 8

An information-theoretic metric based method for selecting clustering attribute trang 9

Trang 9

An information-theoretic metric based method for selecting clustering attribute trang 10

Trang 10

pdf 10 trang xuanhieu 3720
Bạn đang xem tài liệu "An information-theoretic metric based method for selecting clustering attribute", để tải tài liệu gốc về máy hãy click vào nút Download ở trên

Tóm tắt nội dung tài liệu: An information-theoretic metric based method for selecting clustering attribute

An information-theoretic metric based method for selecting clustering attribute
tion we present MAMD technique, which is based on Minimum Average Mantaras Distance, to select 
clustering attribute. 
Definition 7. Let be an information system, and is an attribute, the Average Mantaras Distance 
of to all , , is defined by the following formula: 
∑ ( )
| |
| | 
where are the partitions of induced by and respectively. 
In the above definitions, ( ) is a measurement to the distance between and From the view of 
clustering, the lower is, the higher the crispness of the clustering. Based on the above definition, we present the 
MAMD algorithm as follows. 
Input: Dataset (information system) without clustering attribute 
Output: Clustering attribute 
Begin 
Step 1. For each attribute compute the equivalence classes of partition 
induced by indiscernibility relation . 
Step 2. For each attribute compute the condition entropy of partition 
with respect to partition , where : 
 | ∑ ( )
∑ ( | ) ( | )
where { }, { } and ( | ) | | | |⁄ , 
and 
Step 3. For each every pair , where , compute the distance between 
two partitions and using Mantaras metric: 
 ( ) ( | ) ( | ) 
Step 4. For each attribute compute the Average Mantaras Distance 
according to (22). 
Step 5. Select the attribute with the lowest as clustering attribute. 
End 
Let us illustrate the MAMD algorithm by an example. 
Example. Table 1 shows Credit dataset as in [7]. There are ten objects with five categorical attributes: Magazine 
Promotion (MP), Watch Promotion (WP), Life Insurance Promotion (LIP), Credit Card Insurance (CCI), and Sex (S). 
First, we deal with attribute MP. The partition of induced by attribute MP is: 
 {{1, 2, 4, 5, 7, 9, 10}, {3, 6, 8}}. 
We have: 
36 AN INFORMATION-THEORETIC METRIC BASED METHOD FOR SELECTING CLUSTERING ATTRIBUTE 
The partition of induced by attribute WP is: 
 {{1, 3, 5, 6, 7, 9}, {2, 4, 8, 10}} 
Table 1. The Credit dataset 
# MP WP LIP CCI S 
1 yes no no no Male 
2 yes yes yes no Female 
3 no no no no Male 
4 yes yes yes yes Male 
5 yes no yes no Female 
6 no no no no Female 
7 yes no yes yes Male 
8 no yes no no Male 
9 yes no no no Male 
10 yes yes yes no Female 
The partition of induced by {MP,WP} is: 
 { } {{1, 5, 7, 9}, {2, 4, 10}, {3, 6}, {8}} 
 ( { }) 
Applying Eq. (10), we get the distance between and as follows: 
 ( { }) 
With the same process, we can get the distance between MP and other attributes: 
The Average Mantaras Distance of attribute MP can be computed by Eq.(22) as: 
With the same process as MP, we can deal with other attributes . The and of all attributes are 
summarized in Table 2. 
Table 2. All values of distance and average distance between attributes in Table 1 
Attribute 
Distance (w.r.t.) 
MP WP LIP CCI S 
MP 0.000 1.840 
WP 0.000 1.722 1.678 1.902 1.786 
LIP 1.722 0.000 1.249 1.446 
CCI 1.678 1.249 0.000 1.351 1.412 
S 1.902 1.722 1.351 0.000 1.704 
From Table 2, we can see that attribute CCI has the smallest ; therefore CCI is selected as clustering 
attribute using MAMD algorithm. 
V. COMPARISON TESTS 
A. clustering quality measure 
The four techniques TR, MMR, MDA and MAMD techniques use different methods for selecting clustering 
attribute. Measuring the clustering quality of selected attribute in a just manner is a non-trivial task. Since the goal of 
cluster analysis is to group data with similar characteristics, we use average intra-class similarity to measure the 
quality. 
Definition 8. Let be an information system and suppose that all attributes in are categorical. Then 
the similarity between two objects and in is defined as: 
Pham Cong Xuyen, Đo Si Truong, Nguyen Thanh Tung 37 
 ( ) 
|{{ }| }|
| |
Definition 9. Let be an information system. Suppose is selected as clustering attribute and the 
clustering (partition) induced by is { } where { | |}, . The average 
similarity ( ) of object with respect to other objects in is defined as: 
 ( ) 
∑ ( )
| |
| | 
Definition 10. Let be an information system. Suppose is selected as clustering attribute and the 
clustering (partition) induced by is { } where { | |} , . The intra-
class similarity ( ) of cluster is defined as: 
∑ ( )
| |
| |
Definition 11. Let be an information system. Suppose is selected as clustering attribute and the 
clustering (partition) induced by is { } where { | |}, . The average 
intra-class similarity ( ) of clustering induced by is defined as: 
 ( ) 
∑ 
The higher the average intra-class similarity is the higher the clustering quality of the selected attribute. 
B. Datasets for testing and results 
The data sets of four test cases, as in [7], are presented in Table 1, Table 3, Table 4 and Table 5. These are Credit, 
Animal world, Parmar, and student‟s enrollment qualification data sets. 
Table 3. The animal world data set 
Animal Hair Teeth Eye Feather Feet Eat Milk Fly Swim 
Tiger 
Cheetah 
Giraffe 
Zebra 
Ostrich 
Penguin 
Albatross 
Eagle 
Viper 
Y 
Y 
Y 
Y 
N 
N 
N 
N 
N 
pointed 
pointed 
blunt 
blunt 
N 
N 
N 
N 
pointed 
forward 
forward 
side 
side 
side 
side 
side 
forward 
forward 
N 
N 
N 
N 
Y 
Y 
Y 
Y 
N 
claw 
claw 
hoof 
hoof 
claw 
web 
craw 
craw 
N 
meat 
meat 
grass 
grass 
grain 
fish 
grain 
meat 
meat 
Y 
Y 
Y 
Y 
N 
N 
N 
N 
N 
N 
N 
N 
N 
N 
N 
Y 
Y 
N 
Y 
Y 
N 
N 
N 
Y 
Y 
N 
N 
Table 4. The Parmar data set 
Rows 
1 Big Blue Hard Indefinite Plastic Negative 
2 Medium Red Moderate Smooth Wood Neutral 
3 Small Yellow Soft Fuzzy Plush Positive 
4 Medium Blue Moderate Fuzzy Plastic Negative 
5 Small Yellow Soft Indefinite Plastic Neutral 
6 Big Green Hard Smooth Wood Positive 
7 Small Yellow Hard Indefinite Metal Positive 
8 Small Yellow Soft Indefinite Plastic Positive 
9 Big Green Hard Smooth Wood Neutral 
10 Medium Green Moderate Smooth Plastic Neutral 
38 AN INFORMATION-THEORETIC METRIC BASED METHOD FOR SELECTING CLUSTERING ATTRIBUTE 
Table 5. The student‟s enrollment qualification data set 
Student Degree English Experience IT Maths Programming Statistics 
1 PhD Good Medium Good Good Good Good 
2 PhD Medium Medium Good Good Good Good 
3 M.Sc Medium Medium Medium Good Good Good 
4 M.Sc Medium Medium Medium Good Good Medium 
5 M.Sc Medium Medium Medium Medium Medium Medium 
6 M.Sc Medium Medium Medium Medium Medium Medium 
7 B.Sc Medium Good Good Medium Medium Medium 
8 B.Sc Bad Good Good Medium Medium Good 
We have installed all the four techniques TR, MMR, MDA and MAMD, in R programming language using 
RoughSets package. 
As the results of computations, Table 6 shows the clustering attributes chosen by the techniques in each data set. 
Table 6. Clustering attributes selected by the techniques in data set 
 Datasets 
Credit Animal Parmar Student 
TR CCI Hair Var1 Experience 
MMR CCI Hair Var1 Experience 
MDA CCI Hair Var1 Experience 
MAMD CCI Teeth Var2 Degree 
From Table 6, we can see that for all the four considered data sets, three techniques TR, MMR and MDA choose 
same attribute as the clustering attribute. Our technique MAMD chooses same attribute CCI in Credit data set, but 
other attributes in Animal world, Parmar and Student‟s enrollment qualification data sets. 
Now, let us measure the clustering quality of attributes chosen by TR, MMR, MDA and MAMD in these three 
data sets. 
We take attribute Hair in Credit data set as an example to calculate the average intra-class similarity. 
The partition of induced by attribute Hair consists of two equivalence classes: 
 = { } 
 { } 
We take animal Tiger in as an example to calculate the similarity, average similarity. Applying Eq. (25), we 
have 
Applying Eq. (26), the average similarity of Tiger with respect to other animals in is calculated as follows: 
With the same process, the similarity and average similarity of other animals in are calculated and summarized 
as in Table 7. 
Table 7. The similarity, and of all animals in induced by Hair 
Animal Tiger Cheetah Giraffe Zebra 
Tiger - 1.000 0.444 0.444 0.630 0.630 
Cheetah 1.000 - 0.444 0.444 0.630 
Giraffe 0.444 0.444 - 1.000 0.630 
Zebra 0.444 0.444 1.000 - 0.630 
Pham Cong Xuyen, Đo Si Truong, Nguyen Thanh Tung 39 
Applying Eq. (27), the intra-class similarity of is calculated below. 
Using the same way, we obtain . 
Lastly, using Eq. (28), the average intra-class similarity is calculated as follows: 
With the same process, we have computed the average intra-class similarity of clustering induced by Teeth in 
“Animal world”, by Var1 and by Var2 in Parmar, by Experience and by Degree in Student dataset. The computation 
results are given in Table 8. 
Table 8. The average intra-class similarity ( ) of clustering 
induced by attributes in the datasets 
 Selected attributes and their values 
Animal Parmar Student 
TR Hair 0.587 Var1 0.536 Experience 0.638 
MMR Hair 0.587 Var1 0.536 Experience 0.638 
MDA Hair 0.587 Var1 0.536 Experience 0.638 
MAMD Teeth 0.784 Var2 0.555 Degree 0.770 
The results in Table 8 show that the clustering quality of attribute selected using MAMD technique is higher than 
that of attribute selected by TR, MMR and MDA techniques. 
VI. CONCLUSION 
In the recent years, some techniques applying rough set theory for selecting clustering attributes have been 
proposed. However, although they look different, there is an inherent similarity among them, and the computational 
complexity is still an issue. 
In this paper, we review three rough set based techniques: Total Roughness (TR), Min-Min Roughness (MMR) 
and Maximum Dependency Attribute (MDA), and propose MAMD (Minimum value of Average Mantaras Distance), 
an alternative algorithm for hierarchical clustering attribute selection. MAMD uses Mantaras metric which is an 
information-theoretic metric on the set of partitions of a finite set of objects and seeks to determine a clustering 
attribute such that the average distance between the partition generated by this attribute and the partitions generated by 
other attributes of the objects has a minimum value. To evaluate and compare MAMD with three rough set based 
techniques, we use the concept of average intra-class similarity to measure the clustering quality of selected attribute. 
The experiment results show that the clustering quality of the attribute selected by our method is higher than that of 
attributes selected by TR, MMR and MDA methods. The proposed approach could be integrated into clustering 
algorithm based on attributes selection for categorical data. 
REFERENCES 
[1] Barbara, D., Li, Y., Couto, J.: COOLCAT: an entropy-based algorithm for categorical clustering. In: Proc. of 
CIKM 2002, 582–589, 2002. 
[2] Cao F. Y., Liang J. Y. , Li D. Y., Bai L., A new initialization method for categorical data clustering, Expert 
Syst. Appl. 36, 10223–10228, 2009. 
[3] Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS – clustering categorical data using summaries. In: Fifth 
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 73–83, 1999. 
[4] Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Information 
Systems, 25(5), 345–366, 2000. 
[5] Huang, Z.: Extensions to the k-averages algorithm for clustering large datasets with categorical values. Data 
Mining and Knowledge Discovery, 2(3), 283–304, 1998. 
[6] Han J., and Kamber M., Data Mining: Concepts and Techniques, Morgan Kanufmann, 2006. 
40 AN INFORMATION-THEORETIC METRIC BASED METHOD FOR SELECTING CLUSTERING ATTRIBUTE 
[7] Herawan, T., Deris, M. M., Abawajy, J. H.: A rough set approach for selecting clustering attribute. Knowledge-
Based Systems 23, 220–231, 2010. 
[8] Jain A. K., Data clustering: 50 years beyond k-averages, Pattern Recogn. Lett., 31(8), 651–666, 2010. 
[9] Jyoti Dr., Clustering categorical data using rough sets: a review. International Journal of Advanced Research 
in IT and Engineering, Vol. 2, No. 12, December, 30-37, 2013. 
[10] Lopez de Mantaras R., A Distance-Based Attribute Selection Measure for Decision Tree Induction. Machine 
Learning, 6, N
o
 1, 81-92, 1991. 
[11] Mazlack, L. J., He, A., Zhu, Y., Coppock, S.: A rough set approach in choosing clustering attributes. In: 
Proceedings of the ISCA 13th International Conference (CAINE 2000), 1–6, 2000. 
[12] Parmar, D., Wu, T., Blackhurst, J.: MMR: an algorithm for clustering categorical data using rough set theory. 
Data and Knowledge Engineering, 63, 879–893, 2007. 
[13] Pawlak, Z.: Rough sets. International Journal of Computer and Information Science, 11, 341–356, 1982. 
[14] Suchita S. Mesakar, M. S. Chaudhari, Review Paper On Data Clustering Of Categorical Data. International 
Journal of Engineering Research & Technology, Vol. 1 Issue 10, December, 2012. 
MỘT PHƯƠNG PHÁP LỰA CHỌN THUỘC TÍNH GOM CỤM 
SỬ DỤNG METRIC LÝ THUYẾT THÔNG TIN 
Phạm Công Xuyên, Đỗ Sĩ Trường, Nguyễn Thanh Tùng
TÓM TẮT — Bài toán gom cụm dữ liệu xuất hiện trong nhiều lĩnh vực khác nhau như Khai thác dữ liệu, Nhận dạng, Tin-sinh học, 
vv. Mục tiêu cơ bản của gom cụm là nhóm đối tượng thành các cụm sao cho các đối tượng trong cùng một cụm thì tương tự như 
nhau hơn là các đối tượng từ các cụm khác nhau. Gần đây, nhiều nhà nghiên cứu quan tâm đến vấn đề gom cụm dữ liệu phạm trù 
(categorical), trong đó các đối tượng dữ liệu được mô tả bởi các thuộc tính không phải thuộc tính số. Đặc biệt, phương pháp tiếp 
cận sử dụng lý thuyết tập thô trong gom cụm phân cấp (hierarchical) dữ liệu phạm trù đã thu hút nhiều sự chú ý. Chìa khóa của các 
phương pháp này là làm thế nào để chọn được một thuộc gom cụm tốt nhất tại mỗi thời điểm trong số nhiều thuộc tính ứng viên. 
Trong bài báo này, chúng tôi xem xét ba kỹ thuật dựa trên lý thuyết tập thô: Total Roughness (TR), Min-Min Roughness (MMR) và 
Maximum Dependency Attribute (MDA), và đề xuất MAMD (Minimum value of Average Mantaras Distance), một thuật toán mới 
cho việc lựa chọn thuộc tính phân cụm theo tiếp cận phân cấp. MAMD sử dụng metric Mantaras, một metric lý thuyết thông tin trên 
tập các phân hoạch của một tập hợp gồm hữu hạn các đối tượng và tìm cách xác định thuộc tính gom cụm sao cho khoảng cách 
trung bình giữa phân hoạch sinh ra bởi thuộc tính này và các phân hoạch sinh ra bởi các thuộc tính khác đạt giá trị nhỏ nhất. Để 
đánh giá và so sánh MAMD với ba kỹ thuật dựa trên lý thuyết tập thô, chúng tôi sử dụng khái niệm “Độ tương tự trung bình bên 
trong cụm” của một phép gom cụm để đo lường chất lượng gom cụm của thuộc tính được chọn. Kết quả thực nghiệm cho thấy chất 
lượng gom cụm của thuộc tính chọn được bằng phương pháp của chúng tôi là cao hơn so với các thuộc tính chọn bởi các phương 
pháp TR, MMR và MDA. Do đó, MAMD có thể được sử dụng như là một kỹ thuật hiệu quả lựa chọn thuộc tính trong phân cụm 
phân cấp. 
Từ khóa — Gom cụm, Lý thuyết tập thô, Lựa chọn thuộc tính, Mantaras metric, Gom cụm phân cấp. 

File đính kèm:

  • pdfan_information_theoretic_metric_based_method_for_selecting_c.pdf