An information-theoretic metric based method for selecting clustering attribute

Phạm Công Xuyên, Đỗ Sĩ Trường, Nguyễn Thanh Tùng
ABSTRACT—The clustering problem appears in many different fields, such as Data Mining, Pattern Recognition and Bioinformatics. The basic objective of clustering is to group objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters. Recently, many researchers have contributed to the clustering of categorical data, where data objects are described by non-numerical attributes. In particular, rough set theory based attribute selection approaches to clustering categorical data have attracted much attention. The key question in these approaches is how to select, at each step, the single attribute that best clusters the objects from among many candidate attributes. In this paper, we review three rough set based techniques: Total Roughness (TR), Min-Min Roughness (MMR) and Maximum Dependency Attribute (MDA), and propose MAMD (Minimum value of Average Mantaras Distance), an alternative algorithm for selecting the clustering attribute in hierarchical clustering. MAMD uses the Mantaras metric, an information-theoretic metric on the set of partitions of a finite set of objects, and seeks the clustering attribute for which the average distance between the partition generated by this attribute and the partitions generated by the other attributes of the objects is minimal. To evaluate MAMD and compare it with the three rough set based techniques, we use the concept of average intra-class similarity to measure the clustering quality of the selected attribute. The experimental results show that the clustering quality of the attribute selected by our method is higher than that of the attributes selected by the TR, MMR and MDA methods.

Keywords: Clustering, Rough set theory, Attribute selection, Mantaras metric, Hierarchical clustering.
In this section we present the MAMD technique, which is based on the Minimum value of the Average Mantaras Distance, to select the clustering attribute.

Definition 7. Let $S = (U, A)$ be an information system and let $a_i \in A$ be an attribute. The Average Mantaras Distance of $a_i$ to all $a_j \in A$, $j \neq i$, is defined by the following formula:

$$AMD(a_i) = \frac{1}{|A| - 1} \sum_{a_j \in A,\, j \neq i} d(P_i, P_j) \qquad (22)$$

where $P_i$ and $P_j$ are the partitions of $U$ induced by $a_i$ and $a_j$, respectively. In the above definition, $d(P_i, P_j)$ is a measurement of the distance between $P_i$ and $P_j$. From the view of clustering, the lower $AMD(a_i)$ is, the higher the crispness of the clustering.

Based on the above definition, we present the MAMD algorithm as follows.

Input: Dataset (information system) $S = (U, A)$ without clustering attribute
Output: Clustering attribute
Begin
Step 1. For each attribute $a_i \in A$, compute the equivalence classes of the partition $P_i$ of $U$ induced by the indiscernibility relation $IND(\{a_i\})$.
Step 2. For each pair of attributes $a_i, a_j \in A$, $i \neq j$, compute the conditional entropy of partition $P_j$ with respect to partition $P_i$:
$$H(P_j \mid P_i) = - \sum_{k=1}^{n} p(X_k) \sum_{l=1}^{m} p(Y_l \mid X_k) \log_2 p(Y_l \mid X_k)$$
where $P_i = \{X_1, \dots, X_n\}$, $P_j = \{Y_1, \dots, Y_m\}$, $p(X_k) = |X_k| / |U|$ and $p(Y_l \mid X_k) = |Y_l \cap X_k| / |X_k|$.
Step 3. For each pair $a_i, a_j \in A$, where $i \neq j$, compute the distance between the two partitions $P_i$ and $P_j$ using the Mantaras metric:
$$d(P_i, P_j) = H(P_i \mid P_j) + H(P_j \mid P_i)$$
Step 4. For each attribute $a_i \in A$, compute the Average Mantaras Distance $AMD(a_i)$ according to (22).
Step 5. Select the attribute with the lowest $AMD$ as the clustering attribute.
End

Let us illustrate the MAMD algorithm by an example.

Example. Table 1 shows the Credit dataset as in [7]. There are ten objects with five categorical attributes: Magazine Promotion (MP), Watch Promotion (WP), Life Insurance Promotion (LIP), Credit Card Insurance (CCI), and Sex (S).

Table 1. The Credit dataset

#    MP   WP   LIP  CCI  S
1    yes  no   no   no   Male
2    yes  yes  yes  no   Female
3    no   no   no   no   Male
4    yes  yes  yes  yes  Male
5    yes  no   yes  no   Female
6    no   no   no   no   Female
7    yes  no   yes  yes  Male
8    no   yes  no   no   Male
9    yes  no   no   no   Male
10   yes  yes  yes  no   Female

First, we deal with attribute MP. The partition of $U$ induced by attribute MP is
$$P_{MP} = \{\{1, 2, 4, 5, 7, 9, 10\}, \{3, 6, 8\}\}.$$
The partition of $U$ induced by attribute WP is
$$P_{WP} = \{\{1, 3, 5, 6, 7, 9\}, \{2, 4, 8, 10\}\}.$$
The partition of $U$ induced by $\{MP, WP\}$ jointly is
$$P_{\{MP,WP\}} = \{\{1, 5, 7, 9\}, \{2, 4, 10\}, \{3, 6\}, \{8\}\}.$$
This joint partition is needed because $H(P_i \mid P_j) = H(P_{\{i,j\}}) - H(P_j)$, where $H(P_{\{i,j\}})$ is the entropy of the partition induced by both attributes together. We have
$$H(P_{MP}) = 0.881, \quad H(P_{WP}) = 0.971, \quad H(P_{\{MP,WP\}}) = 1.846.$$
Applying Eq. (10), we get the distance between $P_{MP}$ and $P_{WP}$ as follows:
$$d(P_{MP}, P_{WP}) = (1.846 - 0.971) + (1.846 - 0.881) = 0.875 + 0.965 = 1.840.$$
With the same process, we can get the distances between MP and the other attributes:
$$d(P_{MP}, P_{LIP}) = 1.090, \quad d(P_{MP}, P_{CCI}) = 1.368, \quad d(P_{MP}, P_{S}) = 1.841.$$
The Average Mantaras Distance of attribute MP can then be computed by Eq. (22) as
$$AMD(MP) = (1.840 + 1.090 + 1.368 + 1.841)/4 = 1.535.$$
With the same process as for MP, we can deal with the other attributes. The distances $d$ and the values $AMD$ of all attributes are summarized in Table 2.

Table 2. All values of the distance $d$ and the average distance $AMD$ between attributes in Table 1

Attribute  MP     WP     LIP    CCI    S      AMD
MP         0.000  1.840  1.090  1.368  1.841  1.535
WP         1.840  0.000  1.722  1.678  1.902  1.786
LIP        1.090  1.722  0.000  1.249  1.722  1.446
CCI        1.368  1.678  1.249  0.000  1.351  1.412
S          1.841  1.902  1.722  1.351  0.000  1.704

From Table 2, we can see that attribute CCI has the smallest $AMD$; therefore CCI is selected as the clustering attribute by the MAMD algorithm.
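To make the procedure concrete, the following is a minimal sketch of the algorithm in base R, the language in which our experiments were run. The helper names cond_entropy, mantaras and mamd are our own illustrative choices, not functions of the RoughSets package; the sketch assumes the data set is a data frame whose columns are the categorical attributes.

```r
# Step 2: conditional entropy H(Pj | Pi) of the partition induced by
# attribute y given the partition induced by attribute x.
cond_entropy <- function(x, y) {
  joint <- table(x, y) / length(x)   # p(Xk intersect Yl)
  px <- rowSums(joint)               # p(Xk)
  h <- 0
  for (k in seq_along(px)) {
    pyx <- joint[k, ] / px[k]        # p(Yl | Xk)
    pyx <- pyx[pyx > 0]              # 0 * log 0 = 0 by convention
    h <- h - px[k] * sum(pyx * log2(pyx))
  }
  h
}

# Step 3: Mantaras distance between the partitions induced by x and y.
mantaras <- function(x, y) cond_entropy(x, y) + cond_entropy(y, x)

# Steps 4-5: average Mantaras distance of each attribute (Eq. (22))
# and the attribute that minimises it.
mamd <- function(df) {
  atts <- names(df)
  amd <- sapply(atts, function(i)
    mean(sapply(setdiff(atts, i), function(j) mantaras(df[[i]], df[[j]]))))
  list(amd = round(amd, 3), attribute = atts[which.min(amd)])
}

# The Credit dataset of Table 1.
credit <- data.frame(
  MP  = c("yes","yes","no","yes","yes","no","yes","no","yes","yes"),
  WP  = c("no","yes","no","yes","no","no","no","yes","no","yes"),
  LIP = c("no","yes","no","yes","yes","no","yes","no","no","yes"),
  CCI = c("no","no","no","yes","no","no","yes","no","no","no"),
  S   = c("Male","Female","Male","Male","Female","Female","Male",
          "Male","Male","Female")
)
mamd(credit)   # selects CCI, whose AMD (about 1.41) is the smallest
```

Since the Mantaras distance is symmetric, a fuller implementation would compute each pairwise distance only once and reuse it; for data sets of the size considered here the duplicated work is negligible.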
V. COMPARISON TESTS

A. Clustering quality measure

The four techniques TR, MMR, MDA and MAMD use different criteria for selecting the clustering attribute, so measuring the clustering quality of the selected attribute in a fair manner is a non-trivial task. Since the goal of cluster analysis is to group data with similar characteristics, we use the average intra-class similarity to measure this quality.

Definition 8. Let $S = (U, A)$ be an information system and suppose that all attributes in $A$ are categorical. The similarity between two objects $x$ and $y$ in $U$ is defined as
$$sim(x, y) = \frac{|\{a \in A : a(x) = a(y)\}|}{|A|} \qquad (25)$$

Definition 9. Let $S = (U, A)$ be an information system, suppose $a_i \in A$ is selected as the clustering attribute, and let the clustering (partition) induced by $a_i$ be $P_i = \{C_1, C_2, \dots, C_k\}$. The average similarity $AS(x)$ of an object $x \in C_r$ with respect to the other objects in $C_r$ is defined as
$$AS(x) = \frac{1}{|C_r| - 1} \sum_{y \in C_r,\, y \neq x} sim(x, y) \qquad (26)$$

Definition 10. With the notation of Definition 9, the intra-class similarity $IS(C_r)$ of a cluster $C_r$ is defined as
$$IS(C_r) = \frac{1}{|C_r|} \sum_{x \in C_r} AS(x) \qquad (27)$$

Definition 11. With the notation of Definition 9, the average intra-class similarity $AIS(a_i)$ of the clustering induced by $a_i$ is defined as
$$AIS(a_i) = \frac{1}{k} \sum_{r=1}^{k} IS(C_r) \qquad (28)$$

The higher the average intra-class similarity is, the higher the clustering quality of the selected attribute.
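The following base-R sketch of Definitions 8-11 complements the MAMD sketch above; sim and avg_intra_class_sim are again our own illustrative names. Note that Eq. (25) counts agreements over all attributes of $A$, including the clustering attribute itself, and that Eq. (26) presupposes clusters containing at least two objects.

```r
# Eq. (25): simple-matching similarity between objects (rows) x and y,
# i.e. the fraction of attributes on which they take the same value.
sim <- function(df, x, y) mean(df[x, ] == df[y, ])

# Eqs. (26)-(28): average intra-class similarity of the clustering
# induced by attribute a (every cluster is assumed to have >= 2 objects).
avg_intra_class_sim <- function(df, a) {
  clusters <- split(seq_len(nrow(df)), df[[a]])
  is_c <- sapply(clusters, function(C) {
    as_x <- sapply(C, function(x)
      mean(sapply(setdiff(C, x), function(y) sim(df, x, y))))  # Eq. (26)
    mean(as_x)                                                 # Eq. (27)
  })
  mean(is_c)                                                   # Eq. (28)
}

# Example: quality of the clustering induced by CCI on the Credit data
# frame defined in the previous sketch.
avg_intra_class_sim(credit, "CCI")
```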
B. Datasets for testing and results

The data sets of the four test cases, as in [7], are presented in Table 1, Table 3, Table 4 and Table 5. These are the Credit, Animal world, Parmar, and Student's enrollment qualification data sets.

Table 3. The Animal world data set

Animal     Hair  Teeth    Eye      Feather  Feet  Eat    Milk  Fly  Swim
Tiger      Y     pointed  forward  N        claw  meat   Y     N    Y
Cheetah    Y     pointed  forward  N        claw  meat   Y     N    Y
Giraffe    Y     blunt    side     N        hoof  grass  Y     N    N
Zebra      Y     blunt    side     N        hoof  grass  Y     N    N
Ostrich    N     N        side     Y        claw  grain  N     N    N
Penguin    N     N        side     Y        web   fish   N     N    Y
Albatross  N     N        side     Y        claw  grain  N     Y    Y
Eagle      N     N        forward  Y        claw  meat   N     Y    N
Viper      N     pointed  forward  N        N     meat   N     N    N

Table 4. The Parmar data set

#    Var1    Var2    Var3      Var4        Var5     Var6
1    Big     Blue    Hard      Indefinite  Plastic  Negative
2    Medium  Red     Moderate  Smooth      Wood     Neutral
3    Small   Yellow  Soft      Fuzzy       Plush    Positive
4    Medium  Blue    Moderate  Fuzzy       Plastic  Negative
5    Small   Yellow  Soft      Indefinite  Plastic  Neutral
6    Big     Green   Hard      Smooth      Wood     Positive
7    Small   Yellow  Hard      Indefinite  Metal    Positive
8    Small   Yellow  Soft      Indefinite  Plastic  Positive
9    Big     Green   Hard      Smooth      Wood     Neutral
10   Medium  Green   Moderate  Smooth      Plastic  Neutral

Table 5. The Student's enrollment qualification data set

Student  Degree  English  Experience  IT      Maths   Programming  Statistics
1        PhD     Good     Medium      Good    Good    Good         Good
2        PhD     Medium   Medium      Good    Good    Good         Good
3        M.Sc    Medium   Medium      Medium  Good    Good         Good
4        M.Sc    Medium   Medium      Medium  Good    Good         Medium
5        M.Sc    Medium   Medium      Medium  Medium  Medium       Medium
6        M.Sc    Medium   Medium      Medium  Medium  Medium       Medium
7        B.Sc    Medium   Good        Good    Medium  Medium       Medium
8        B.Sc    Bad      Good        Good    Medium  Medium       Good

We have implemented all four techniques, TR, MMR, MDA and MAMD, in the R programming language using the RoughSets package. Table 6 shows the clustering attribute chosen by each technique in each data set.

Table 6. Clustering attributes selected by the techniques in each data set

Technique  Credit  Animal  Parmar  Student
TR         CCI     Hair    Var1    Experience
MMR        CCI     Hair    Var1    Experience
MDA        CCI     Hair    Var1    Experience
MAMD       CCI     Teeth   Var2    Degree

From Table 6, we can see that for all four data sets the three techniques TR, MMR and MDA choose the same attribute as the clustering attribute. Our technique MAMD chooses the same attribute, CCI, in the Credit data set, but different attributes in the Animal world, Parmar and Student's enrollment qualification data sets. Let us now measure the clustering quality of the attributes chosen by TR, MMR, MDA and MAMD in these three data sets. We take attribute Hair in the Animal world data set as an example to calculate the average intra-class similarity.

The partition of $U$ induced by attribute Hair consists of two equivalence classes:
$$C_1 = \{Tiger, Cheetah, Giraffe, Zebra\}, \quad C_2 = \{Ostrich, Penguin, Albatross, Eagle, Viper\}.$$
We take the animal Tiger in $C_1$ as an example to calculate the similarity and average similarity. Applying Eq. (25), we have
$$sim(Tiger, Cheetah) = 9/9 = 1.000, \quad sim(Tiger, Giraffe) = 4/9 = 0.444, \quad sim(Tiger, Zebra) = 4/9 = 0.444.$$
Applying Eq. (26), the average similarity of Tiger with respect to the other animals in $C_1$ is calculated as follows:
$$AS(Tiger) = (1.000 + 0.444 + 0.444)/3 = 0.630.$$
With the same process, the similarity and average similarity of the other animals in $C_1$ are calculated and summarized in Table 7.

Table 7. The similarity $sim$ and average similarity $AS$ of all animals in $C_1$ induced by Hair

Animal   Tiger  Cheetah  Giraffe  Zebra  AS
Tiger    -      1.000    0.444    0.444  0.630
Cheetah  1.000  -        0.444    0.444  0.630
Giraffe  0.444  0.444    -        1.000  0.630
Zebra    0.444  0.444    1.000    -      0.630

Applying Eq. (27), the intra-class similarity of $C_1$ is calculated below:
$$IS(C_1) = (0.630 + 0.630 + 0.630 + 0.630)/4 = 0.630.$$
In the same way, we obtain $IS(C_2) = 0.544$. Lastly, using Eq. (28), the average intra-class similarity is calculated as follows:
$$AIS(Hair) = (0.630 + 0.544)/2 = 0.587.$$
With the same process, we have computed the average intra-class similarity of the clusterings induced by Teeth in the Animal world data set, by Var1 and Var2 in the Parmar data set, and by Experience and Degree in the Student data set. The computation results are given in Table 8.

Table 8. The average intra-class similarity $AIS$ of the clusterings induced by the selected attributes in each data set

Technique  Animal       Parmar      Student
TR         Hair  0.587  Var1 0.536  Experience 0.638
MMR        Hair  0.587  Var1 0.536  Experience 0.638
MDA        Hair  0.587  Var1 0.536  Experience 0.638
MAMD       Teeth 0.784  Var2 0.555  Degree     0.770

The results in Table 8 show that the clustering quality of the attribute selected by the MAMD technique is higher than that of the attributes selected by the TR, MMR and MDA techniques.
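As a quick cross-check, the sketch below rebuilds Table 3 as a data frame (under the same assumptions and with the same helper functions as the sketches above) and reproduces the Animal column of Table 8.

```r
# The Animal world data set of Table 3, rows in the same order.
animal <- data.frame(
  Hair    = c("Y","Y","Y","Y","N","N","N","N","N"),
  Teeth   = c("pointed","pointed","blunt","blunt","N","N","N","N","pointed"),
  Eye     = c("forward","forward","side","side","side","side","side",
              "forward","forward"),
  Feather = c("N","N","N","N","Y","Y","Y","Y","N"),
  Feet    = c("claw","claw","hoof","hoof","claw","web","claw","claw","N"),
  Eat     = c("meat","meat","grass","grass","grain","fish","grain",
              "meat","meat"),
  Milk    = c("Y","Y","Y","Y","N","N","N","N","N"),
  Fly     = c("N","N","N","N","N","N","Y","Y","N"),
  Swim    = c("Y","Y","N","N","N","Y","Y","N","N"),
  row.names = c("Tiger","Cheetah","Giraffe","Zebra","Ostrich",
                "Penguin","Albatross","Eagle","Viper")
)
round(avg_intra_class_sim(animal, "Hair"), 3)   # 0.587, the TR/MMR/MDA choice
round(avg_intra_class_sim(animal, "Teeth"), 3)  # 0.784, the MAMD choice
```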
VI. CONCLUSION

In recent years, several techniques applying rough set theory to the selection of clustering attributes have been proposed. Although they look different, there is an inherent similarity among them, and computational complexity remains an issue. In this paper, we reviewed three rough set based techniques: Total Roughness (TR), Min-Min Roughness (MMR) and Maximum Dependency Attribute (MDA), and proposed MAMD (Minimum value of Average Mantaras Distance), an alternative algorithm for selecting the clustering attribute in hierarchical clustering. MAMD uses the Mantaras metric, an information-theoretic metric on the set of partitions of a finite set of objects, and seeks the clustering attribute for which the average distance between the partition generated by this attribute and the partitions generated by the other attributes is minimal. To evaluate MAMD and compare it with the three rough set based techniques, we used the concept of average intra-class similarity to measure the clustering quality of the selected attribute. The experimental results show that the clustering quality of the attribute selected by our method is higher than that of the attributes selected by the TR, MMR and MDA methods. The proposed approach could be integrated into attribute-selection-based clustering algorithms for categorical data.

REFERENCES

[1] Barbara, D., Li, Y., Couto, J.: COOLCAT: an entropy-based algorithm for categorical clustering. In: Proc. of CIKM 2002, 582-589, 2002.
[2] Cao, F. Y., Liang, J. Y., Li, D. Y., Bai, L.: A new initialization method for categorical data clustering. Expert Systems with Applications, 36, 10223-10228, 2009.
[3] Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS - clustering categorical data using summaries. In: Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 73-83, 1999.
[4] Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Information Systems, 25(5), 345-366, 2000.
[5] Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283-304, 1998.
[6] Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[7] Herawan, T., Deris, M. M., Abawajy, J. H.: A rough set approach for selecting clustering attribute. Knowledge-Based Systems, 23, 220-231, 2010.
[8] Jain, A. K.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651-666, 2010.
[9] Jyoti: Clustering categorical data using rough sets: a review. International Journal of Advanced Research in IT and Engineering, 2(12), 30-37, 2013.
[10] Lopez de Mantaras, R.: A distance-based attribute selection measure for decision tree induction. Machine Learning, 6(1), 81-92, 1991.
[11] Mazlack, L. J., He, A., Zhu, Y., Coppock, S.: A rough set approach in choosing clustering attributes. In: Proceedings of the ISCA 13th International Conference (CAINE 2000), 1-6, 2000.
[12] Parmar, D., Wu, T., Blackhurst, J.: MMR: an algorithm for clustering categorical data using rough set theory. Data and Knowledge Engineering, 63, 879-893, 2007.
[13] Pawlak, Z.: Rough sets. International Journal of Computer and Information Science, 11, 341-356, 1982.
[14] Mesakar, S. S., Chaudhari, M. S.: Review paper on data clustering of categorical data. International Journal of Engineering Research & Technology, 1(10), 2012.