Author Profiling of Vietnamese Forum Posts: An Investigation on Content-Based Features

The rapid growth of the World Wide Web has created many online
channels for people to communicate, such as email, blogs, and
social networks. However, the online forum is still one of the
most popular channels for people to share opinions and discuss
topics of common interest. Forum posts created by users can be
considered informal and personal writing. Authors of these posts
can publish their profiles for other people to view, as a
function of the forum. However, not many users reveal their
personal information because of privacy concerns on online
systems. Moreover, personal information of users is not
mandatory to input when they [...] people do not provide their
personal information or input incorrect/unclear data. As a
result, the task of automatically classifying an author’s
properties, such as gender, age, location, and occupation,
becomes important and essential. One application of this task is
in the commercial field, where providers can learn which types
of users like or do not like their products/services (for
targeted marketing and product development). In the social
research domain, researchers also want to know the profile of
people who hold a specific opinion about a social issue (when
conducting a social survey). It can also be used to support the
courts, in terms of identifying whether a text was created by a
criminal or not [1].


ts in words is
107 (the shortest post contains 50 words, the
longest post contains 300 words).
Table 1. Corpus statistics
Trait       Total posts  Class                  Percent in corpus
Gender      4,474        Male                   54%
                         Female                 46%
Age         3,017        < 22                   21%
                         24 to 27               27%
                         > 32                   52%
Location    3,960        North                  57%
                         South                  43%
Occupation  3,453        Business, Sale, Admin  36%
                         Technique, Technology  31%
                         Education, Healthcare  33%
4.2. Results and discussion
We conducted experiments on the four author traits
mentioned earlier using the Weka toolkit. The results
were verified through 10-fold cross-validation, in
which the training set is randomly partitioned into 10
equal-sized subsets; 9 subsets are used as training
data and the remaining subset is retained for testing.
This process is repeated 10 times, with each of the 10
subsets used exactly once as the validation data. Using
grid search for an SVM with PolyKernel over two
parameters, c and exponent, together with some
modifications in the feature extraction step, the
results improved noticeably compared with the results
in [8], especially for the age, location, and
occupation traits (e.g., the best parameters for the
gender trait are c=3.0 and exponent=1.0). Table 2
shows the results of the author profiling experiments
for the 4 traits.
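The validation procedure described above can be sketched in a few lines of plain Python. This is only an illustration of 10-fold partitioning and of a grid search over (c, exponent) pairs, not the Weka implementation used in the experiments. The function names and the `score_fn` callback are hypothetical, and the sketch assumes that a scoring function (for example, mean cross-validated accuracy of the SVM) is supplied by the caller.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly partition range(n_samples) into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for held_out in range(k):
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, folds[held_out]


def grid_search(score_fn, c_values, exponent_values):
    """Return the (c, exponent) pair for which score_fn(c, exponent) is highest.

    score_fn is a hypothetical callback, e.g. the mean accuracy over the
    splits above for an SVM trained with those two parameters.
    """
    return max(product(c_values, exponent_values), key=lambda pair: score_fn(*pair))
```

In practice, `score_fn` would train the classifier on each training split and average the accuracy over the ten held-out folds; that part depends on the toolkit and is not shown here.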
General evaluation. From the results in Table 2, we
can observe that content-based features outperformed
style-based features. Although content-based features
are often considered domain-specific and may be less
accurate when moved to other domains, the results in
this task are still promising. Firstly, the data in
the corpus was collected from various sources, so it
is not very domain-specific. Secondly, even if the
results are domain-specific to some extent, they are
still useful when we conduct research or apply the
results in that domain. Besides, the results of
style-based features are also good, especially for
gender and location. Generally, using content-based
features increases the accuracy by 7 to 8 percentage
points, and the improvement is more than 11
percentage points for the location trait. Therefore,
we may infer that the prediction of location is more
sensitive to content-based features than that of the
other traits. This is reasonable because people from
the north and south of Vietnam often use different
local words in casual communication.
Table 2. Results of the author profiling experiments (accuracy, %)
Feature        Gender  Age    Location  Occupation
All features   90.55   70.70  83.13     61.04
Style-based    83.47   62.76  71.22     52.46
Content-based  90.01   70.05  82.98     60.99
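The size of the improvement discussed in the general evaluation can be checked directly from Table 2. The short sketch below assumes that the gain is measured as the "All features" row minus the "Style-based" row, in percentage points; the text does not state which two rows were compared, so this choice is an assumption.

```python
# Accuracy values (%) copied from Table 2.
results = {
    "All features":  {"Gender": 90.55, "Age": 70.70, "Location": 83.13, "Occupation": 61.04},
    "Style-based":   {"Gender": 83.47, "Age": 62.76, "Location": 71.22, "Occupation": 52.46},
    "Content-based": {"Gender": 90.01, "Age": 70.05, "Location": 82.98, "Occupation": 60.99},
}


def gain(trait, better="All features", baseline="Style-based"):
    """Difference in accuracy between two rows, in percentage points."""
    return round(results[better][trait] - results[baseline][trait], 2)


gains = {trait: gain(trait) for trait in results["All features"]}
for trait, points in gains.items():
    print(f"{trait}: +{points:.2f} percentage points")
```

Under this assumption the gains are about 7.1 points for gender, 7.9 for age, 8.6 for occupation, and 11.9 for location, which is consistent with location being the trait that benefits most from content words.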
D.T. Duc et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 1 (2017) 37-46
Number of content-based features. As mentioned
earlier, to reduce the complexity and improve the
accuracy of the model, we applied a feature selection
method to eliminate irrelevant features. We ran the
classification with different numbers of content
words, chosen by the Information Gain method, ranging
from 100 to 1000. Fig 2 shows the best number of
features for each trait. The figure shows that the
highest score for gender prediction is achieved when
using 600 content words. The best number of words for
the age and location traits is 400, and for the
occupation trait it is 200. The reason for the low
number for occupation is probably the noise in the
occupation data; as a result, not many words can be
used to discriminate between the occupation classes.
Table 3 shows some of the most important content
words with their weights for each trait (the larger
the absolute value of the weight, the more important
the feature).
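To make the selection step concrete, the following sketch computes Information Gain for a binary "word occurs in the post" feature and keeps the top-k words. It uses the standard textbook definition (class entropy minus the entropy remaining after splitting on the feature) and is not the exact procedure or tool used in the experiments. The function names are hypothetical, and the four toy posts are invented for illustration; only the words themselves are taken from Table 3 (c).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Information Gain of the binary feature 'word occurs in the document'."""
    with_word = [y for doc, y in zip(docs, labels) if word in doc]
    without_word = [y for doc, y in zip(docs, labels) if word not in doc]
    n = len(labels)
    # Entropy that remains after splitting the posts on the feature.
    remaining = sum(len(part) / n * entropy(part)
                    for part in (with_word, without_word) if part)
    return entropy(labels) - remaining


def top_k_words(docs, labels, k):
    """Return the k words with the highest Information Gain (ties: alphabetical)."""
    vocabulary = set().union(*docs)
    return sorted(vocabulary,
                  key=lambda w: (-information_gain(docs, labels, w), w))[:k]


# Invented toy posts, each represented as a set of content words.
docs = [{"máy lạnh", "coi"}, {"quẹo", "coi"}, {"buổi", "đỗ"}, {"rẽ", "buổi"}]
labels = ["South", "South", "North", "North"]
selected = top_k_words(docs, labels, k=2)
```

In this toy corpus, "coi" and "buổi" each separate the two classes perfectly, so they receive the maximum gain of 1 bit, while a word that occurs in a single post, such as "quẹo", scores lower.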
Fig 2. Prediction accuracy for different numbers of
content words.
Table 3. The most important content words for each trait
(a) Important words for gender prediction
Male                                        Female
feature       weight  feature     weight    feature    weight  feature  weight
mục tiêu      -1.35   quy định    -1.18     cảm ơn     1.91    hồng     1.46
dữ liệu       -1.34   máy ảnh     -1.09     khách sạn  1.79    bếp      1.43
doanh nghiệp  -1.32   điện tử     -1.07     cưới       1.76    sữa      1.31
kỹ thuật      -1.31   triển khai  -1.03     bác sĩ     1.56    chia sẻ  1.27
xử lý         -1.26   kiểm tra    -1.02     vải        1.51    áp lực   1.18
(b) Important words for age prediction
Younger               Middle                Older
feature    weight     feature     weight    feature   weight
học hỏi    -1.50      nhu cầu     -1.29     xài       1.24
lịch sử    -1.32      triệu       -1.20     luật      1.11
nguyên do  -1.25      khắp nơi    -0.90     quy định  0.66
hành động  -1.05      lang thang  -0.74     chi phí   0.62
thể thao   -0.80      bỏ qua      -1.03     hỗ trợ    0.58
(c) Important words for location prediction
North                                 South
feature  weight  feature   weight     feature   weight  feature  weight
buổi     -1.22   rẽ        -0.78      máy lạnh  1.52    gởi      1.09
đỗ       -1.18   quay      -0.73      coi       1.51    đậu      1.04
mạch     -1.05   sinh      -0.70      gạt       1.48    xài      1.00
liệu     -1.00   ảnh       -0.65      nhơn      1.46    uổng     1.00
nộp      -1.00   chịu khó  -0.53      quẹo      1.35    dơ       0.91
(d) Important words for occupation prediction
Business/Sale/Admin   Technology/Technique   Education/Healthcare
feature     weight    feature     weight     feature   weight
lịch        -1.64     phát triển  1.68       tâm lý    1.61
cuộc        -1.62     cấu hình    1.60       hình ảnh  1.58
lang thang  -1.21     kết hợp     1.53       xã hội    1.43
đến nơi     -0.88     kỹ thuật    1.30       học       1.13
cung cấp    -0.77     tài liệu    1.20       từ thiện  1.09
The words in the tables suggest that men tend to
discuss work, technology, regulations, etc., while
women often talk about life, health, pressure, and so
on. Young people like to discuss learning, action,
etc. Middle-aged people talk about needs and travel,
and older people often exchange views on expenses,
law, etc. There are many local words that northern
and southern people use differently from each other;
in our corpus, we found some of them, as shown in
Table 3 (c). Table 3 (d) shows that people working in
the business and sales fields often use words related
to schedules, appointments, and travel, while people
working in the technology field like to talk about
development, machines, etc., and people who have jobs
in the education/healthcare fields often discuss
social, learning, and charity issues.
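The way the weights in Table 3 are read can be illustrated with a small sketch. It ranks features by the absolute value of their weight and maps the sign of a weight to a class, using the convention visible in Table 3 (a), where words in the Male columns have negative weights and words in the Female columns have positive weights. The helper names are hypothetical, and only the first column pair of each gender is included.

```python
# Weights copied from Table 3 (a); negative = Male side, positive = Female side.
gender_weights = {
    "mục tiêu": -1.35, "dữ liệu": -1.34, "doanh nghiệp": -1.32,
    "kỹ thuật": -1.31, "xử lý": -1.26,
    "cảm ơn": 1.91, "khách sạn": 1.79, "cưới": 1.76,
    "bác sĩ": 1.56, "vải": 1.51,
}


def rank_by_importance(weights):
    """Sort (feature, weight) pairs by absolute weight, largest first."""
    return sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)


def associated_class(weight, negative="Male", positive="Female"):
    """Map the sign of a weight to the class it pushes the prediction towards."""
    return negative if weight < 0 else positive


for feature, weight in rank_by_importance(gender_weights)[:3]:
    print(feature, weight, associated_class(weight))
```

For the age and occupation traits there are three classes, so a single sign convention like this one does not apply directly; the sketch is limited to the two-class gender case.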
Comparison with previous works. Although forum posts
are shorter and noisier than other types of online
messages such as blog posts or emails, our results
can be considered promising in comparison with
previous works, especially for the gender and
location traits. The accuracy of 90.55% for gender
prediction is even better than the results of most
previous works conducted on blogs or emails (which
had a baseline of about 80%). The accuracy of age
prediction (70.70%) is not as good as the results
obtained on blog posts or emails (which had a
baseline of around 77% for blog posts), but is much
better than the result of research on forum posts
conducted by [16], which is only 53%. The same
evaluation applies to the location trait, but the
occupation prediction is not as good. The main reason
is that occupation information is very noisy and
subtle. For example, a person who studied a technical
subject but then works as a salesperson is not an
easy case when predicting his/her job. This needs to
be investigated further in later research.
When comparing with the only previous work on author
profiling in Vietnamese, by [6], for the gender trait
we achieved a better result (90.55% versus 83.3%)
when using content-based features, and a comparable
result (83.47% versus 83.3%) without content-based
features. This shows that adding content-based
features improved the results significantly. The same
can be said when comparing the results for the
location trait. For the other traits, our results are
less accurate, but this is understandable and the
results are still promising, because our experiments
were conducted on a shorter and more informal type of
text than blog posts.
5. Conclusion
In this study, we investigated the author profiling
task on a different language (Vietnamese) and a
different type of text (forum posts) than previous
works. The results show that it is feasible to
classify authorial characteristics of informal online
messages such as forum posts based on linguistic
features, and that using content-based features
improved the results significantly. We also carried
out a thorough analysis of content-based features,
such as the best number of content words and the list
of important words for each trait. The experiments
show promising results, although some aspects still
need to be improved, such as handling the noisy
information in the occupation trait and improving the
results for age prediction.
In the future, this study can be expanded to other
domains, such as social networks or user
comments/product reviews. The data in these domains
is even shorter and noisier than forum posts, so the
task is more challenging. However, the results of
this kind of work have promising applications in
commercial fields, such as analyzing market trends or
predicting user behavior.
We also plan to investigate the use of more
grammar-based features in this kind of task.
Vietnamese has many interesting linguistic features,
such as tones and spelling, and we can exploit these
features to improve the author profiling results.
Acknowledgements
This work has been supported by Vietnam
National University, Hanoi (VNU), under
Project No. QG.16.91
References
[1] Abbasi, A., Chen, H. Applying authorship
analysis to extremist-group Web forum
messages, IEEE Intelligent Systems, 20(5),
pp.67-75 (2005).
[2] Abbasi, A., Chen, H. Writeprints: A Style-based
approach to identity-level identification and
similarity detection in cyberspace. ACM
Transactions on Information Systems, 26(2),
pp. 1-29 (2008).
[3] Argamon, S., Koppel, M., Fine, J. and Shimoni,
A. Gender, Genre, and Writing Style in Formal
Written Texts, Text 23(3), August (2003).
[4] Argamon, S., Koppel, M., Pennebaker, J. and
Schler, J. Automatically Profiling the Author of
an Anonymous Text, Communications of the
ACM , 52(2), pp.119-123 (2008).
[5] Corney, M., DeVel, O., Anderson, A., Mohay,
G. Gender-preferential text mining of e-mail
discourse. In ACSAC’02: Proc. of the 18th
Annual Computer Security Applications
Conference, Washington, DC, pp. 21-27 (2002).
[6] Dang, P., Giang, T., Son, P. Author profiling for
Vietnamese blogs. International Conference on
Asian Language Processing (2009).
[7] De Vel, O., Anderson, A., Corney, M., Mohay,
G. M. Mining e-mail content for author
identification forensics. SIGMOD Record 30(4),
pp. 55-64 (2001).
[8] Duc, D.T., Son, P.B., Hanh, T. Using
Content-based Features for Author Profiling of
Vietnamese Forum Posts. In: Recent
Developments in Intelligent Information and
Database Systems, pp. 287–296. Springer
International Publishing, Berlin (2016).
[9] Goswami, S., Sarkar, S., and Rustagi, M.
Style-based analysis of bloggers’ age and
gender. In Eytan Adar, Matthew Hurst, Tim
Finin, Natalie S. Glance, Nicolas Nicolov, and
Belle L. Tseng, editors, ICWSM. The AAAI
Press (2009).
[10] Gressel, G., Hrudya, P., Surendran, K., Thara,
S., Aravind, A., Prabaharan, P. Ensemble
learning approach for author profiling, Notebook
for PAN at CLEF (2014).
[11] Iqbal, F. Messaging Forensic Framework for
Cybercrime Investigation. A Thesis in the
Department of Computer Science and Software
Engineering - Concordia University Montréal,
Canada (2010).
[12] Koppel, M., Argamon, S., Shimoni, A.R.
Automatically categorizing written texts by
author gender. Literary and Linguistic
Computing, 17(4), pp. 401-412 (2002).
[13] Kucukyilmaz, T., Aykanat, C., Cambazoglu, B.
B., Can, F. Chat mining: predicting user and
message attributes in computer-mediated
communication. Information Processing and
Management, 44(4), pp. 1448-1466 (2008).
[14] Mendenhall, T.C. The characteristic curves of
composition. Science, 11(11), 237–249 (1887).
[15] Mosteller, F., Wallace, D.L. Inference and
disputed authorship: The Federalist. Reading,
MA: Addison-Wesley (1964).
[16] Nguyen, D., Smith, N.A., Rosé, C.P. Author age
prediction from text using linear regression. In
Proceedings of the 5th ACL-HLT Workshop on
Language Technology for Cultural Heritage,
Social Sciences, and Humanities, LaTeCH ’11,
pp. 115-123, Stroudsburg, PA, USA.
Association for Computational Linguistics (2011).
[17] Nguyen, D., Gravel, R., Trieschnigg, D., and
Meder, T. "How old do you think i am?"; a study
of language and age in twitter. Proceedings of
the Seventh International AAAI Conference on
Weblogs and Social Media (2013).
[18] Peersman, C., Daelemans, W., and Vaerenbergh,
L.V. Predicting age and gender in online social
networks. In Proceedings of the 3rd international
workshop on Search and mining user-generated
contents, SMUC ’11, pp. 37–44, New York,
NY, USA. ACM (2011).
[19] Phuong, L.H., Huyen, N.T.M., Rossignol, M.,
Roussanaly, A. An empirical study of maximum
entropy approach for part-of-speech tagging of
Vietnamese texts. In Proceedings of Traitement
Automatique des Langues Naturelles
(TALN-2010), Montreal, Canada (2010).
[20] Rangel, F., Rosso, P. Use of language and author
profiling: Identification of gender and age. In
Natural Language Processing and Cognitive
Science, p. 177 (2013).
[21] Savoy, J. Authorship attribution based on
specific vocabulary. ACM Trans. Inf. Syst. 30,
2 (2012).
[22] Schler, J., Koppel, M., Argamon, S. and
Pennebaker, J. Effects of Age and Gender on
Blogging. In Proceedings of AAAI Spring
Symposium on Computational Approaches for
Analyzing Weblogs (2006).
[23] Stamatatos, E., Fakotakis, N., Kokkinakis, G.
Automatic text categorization in terms of genre
and author, Computational Linguistics 26(4),
pp. 471-495 (2000).
[24] Zhang, C., Zhang, P. Predicting gender from
blog posts. Technical report, University of
Massachusetts Amherst, USA (2010).
[25] Zheng, R., Chen, H., Huang, Z., Qin, Y.
Authorship Analysis in Cybercrime
Investigation (Eds.): ISI 2003, LNCS 2665,
pp. 59-73 (2003).
[26] Zheng, R., Li, J., Chen, H. and Huang, Z. “A
framework for authorship identification of
online messages: Writing-style features and
classification techniques,” Journal of the
American Society for Information Science and
Technology, vol. 57, no. 3, pp. 378–393 (2006).