Author profiling of Vietnamese forum posts: An investigation on content-based features
The rapid growth of the World Wide Web has
created many online channels for people to
communicate, such as email, blogs, and social
networks. However, online forums are still
among the most popular channels for people to
share opinions and discuss topics
of common interest. Forum posts
created by users can be considered informal
and personal writing. Authors of these posts
can publish a profile for other people to
view, as a feature of the forum. However, few
users reveal their personal information because
of privacy concerns on online
systems. Moreover, users are not required to
enter personal information when they
register on a forum. Therefore, most
people either do not provide their personal
information or enter incorrect or unclear data.
As a result, the task of automatically
classifying an author's properties, such as
gender, age, location, and occupation, becomes
important. One application of this
task is in the commercial field, where
providers can learn which types of users like or
dislike their products or services (for targeted
marketing and product development). In the
social research domain, researchers also want to
know the profile of people who hold a specific
opinion about a social issue (for example, when
conducting a social survey). It can also be used to
support courts, in terms of identifying whether a text
was written by a criminal or not [1].
Summary of document content: Author profiling of Vietnamese forum posts: An investigation on content-based features
ts in words is 107 (the shortest post contains 50 words, the longest post contains 300 words).

Table 1. Corpus statistics

Trait        Total posts   Class                    Percent in corpus
Gender       4,474         Male                     54%
                           Female                   46%
Age          3,017         < 22                     21%
                           24 to 27                 27%
                           > 32                     52%
Location     3,960         North                    57%
                           South                    43%
Occupation   3,453         Business, Sale, Admin    36%
                           Technique, Technology    31%
                           Education, Healthcare    33%

4.2. Results and discussion

We conducted experiments on the 4 author traits mentioned earlier using the Weka toolkit. The results were verified through 10-fold cross-validation, in which the training set is randomly partitioned into 10 equal-size subsets; 9 subsets are used as training data and the remaining subset is retained for testing. This process is repeated 10 times, with each of the 10 subsets used exactly once as the validation data. Using grid search for SVM with PolyKernel over two parameters, c and exponent, together with some modifications in the feature extraction step, the results improved noticeably compared with the results in [8], especially on the age, location, and occupation traits (e.g., the best parameters for the gender trait are c=3.0 and exponent=1.0). Table 2 shows the results of the author profiling experiments on the 4 traits.

General evaluation. As the results in Table 2 show, content-based features outperformed style-based features. Although content-based features are often considered domain-specific and may be less accurate when moved to other domains, the results in this task are still promising. Firstly, the data in the corpus was collected from various sources, so it is not very domain-specific. Secondly, even if the results are domain-specific to some extent, they are still useful when we conduct research or apply the results in that domain. Besides, the results of style-based features are also good, especially for gender and location.
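The 10-fold cross-validation and grid search procedure described in Section 4.2 can be sketched in plain Python. This is only an illustration under stated assumptions, not the Weka implementation used in the experiments: the function names (k_fold_splits, cross_validate, grid_search) and the train_fn/predict_fn callbacks are hypothetical, and any classifier, such as an SVM with a polynomial kernel, could be plugged in through them.

```python
import random
from itertools import product


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if n_items < k:
        raise ValueError("need at least k items to build k non-empty folds")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random partition of the data
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(items, labels, train_fn, predict_fn, k=10, seed=0):
    """Return the mean accuracy over k folds.

    train_fn(train_items, train_labels) -> model and
    predict_fn(model, item) -> label are hypothetical caller-supplied hooks.
    """
    accuracies = []
    for train, test in k_fold_splits(len(items), k, seed):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def grid_search(items, labels, make_train_fn, predict_fn, grid, k=10):
    """Try every parameter combination; keep the best cross-validated one.

    grid maps parameter names to candidate values, for example
    {"c": [1.0, 2.0, 3.0], "exponent": [1.0, 2.0]} (illustrative values).
    make_train_fn(**params) must return a train_fn for cross_validate.
    """
    best_score, best_params = None, None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = cross_validate(items, labels, make_train_fn(**params), predict_fn, k)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

Each item is used for testing exactly once, which matches the description above. The grid search simply repeats the whole cross-validation once per parameter combination, so its cost grows with the product of the candidate list sizes.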
Generally, using content-based features increases the accuracy by 7% to 8%, but the improvement is more than 11% for the location trait. Therefore, we may infer that location prediction is more sensitive to content-based features than the other traits. This is reasonable because people from the north and the south of Vietnam often use different local words in casual communication.

Table 2. The results of author profiling experiments

Feature         Gender   Age     Location   Occupation
All features    90.55    70.70   83.13      61.04
Style-based     83.47    62.76   71.22      52.46
Content-based   90.01    70.05   82.98      60.99

Number of content-based features. As mentioned earlier, to reduce the complexity and improve the accuracy of the model, we applied a feature selection method to eliminate irrelevant features. We ran the classification with different numbers of content words chosen by the Information Gain method, ranging from 100 to 1000. Fig. 2 shows the best number of features for each trait. The figure shows that the highest score for gender prediction is achieved when using 600 content words. The best number of words is 400 for the age and location traits and 200 for the occupation trait. The low number for occupation is probably due to noise in the occupation data, which leaves few words that can discriminate between the occupation classes. Table 3 shows some of the most important content words with their weights for each trait (the larger the absolute value of the weight, the more important the feature).

D.T. Duc et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 33, No. 1 (2017) 37-46

Fig. 2. Prediction accuracy for different numbers of content words.

Table 3.
The top important content words for each trait

(a) Important words for gender prediction

Male                                          Female
feature        weight   feature       weight  feature     weight   feature   weight
mục tiêu       -1.35    quy định      -1.18   cảm ơn      1.91     hồng      1.46
dữ liệu        -1.34    máy ảnh       -1.09   khách sạn   1.79     bếp       1.43
doanh nghiệp   -1.32    điện tử       -1.07   cưới        1.76     sữa       1.31
kỹ thuật       -1.31    triển khai    -1.03   bác sĩ      1.56     chia sẻ   1.27
xử lý          -1.26    kiểm tra      -1.02   vải         1.51     áp lực    1.18

(b) Important words for age prediction

Younger                 Middle                   Older
feature      weight     feature      weight      feature    weight
học hỏi      -1.50      nhu cầu      -1.29       xài        1.24
lịch sử      -1.32      triệu        -1.20       luật       1.11
nguyên do    -1.25      khắp nơi     -0.90       quy định   0.66
hành động    -1.05      lang thang   -0.74       chi phí    0.62
thể thao     -0.80      bỏ qua       -1.03       hỗ trợ     0.58

(c) Important words for location prediction

North                                  South
feature   weight   feature    weight   feature    weight   feature   weight
buổi      -1.22    rẽ         -0.78    máy lạnh   1.52     gởi       1.09
đỗ        -1.18    quay       -0.73    coi        1.51     đậu       1.04
mạch      -1.05    sinh       -0.70    gạt        1.48     xài       1.00
liệu      -1.00    ảnh        -0.65    nhơn       1.46     uổng      1.00
nộp       -1.00    chịu khó   -0.53    quẹo       1.35     dơ        0.91

(d) Important words for occupation prediction

Business/Sale/Admin     Technology/Technique     Education/Healthcare
feature      weight     feature       weight     feature     weight
lịch         -1.64      phát triển    1.68       tâm lý      1.61
cuộc         -1.62      cấu hình      1.60       hình ảnh    1.58
lang thang   -1.21      kết hợp       1.53       xã hội      1.43
đến nơi      -0.88      kỹ thuật      1.30       học         1.13
cung cấp     -0.77      tài liệu      1.20       từ thiện    1.09

The words in the tables suggest that men tend to discuss work, technology, regulations, etc., while women often talk about life, health, pressure, and so on. Young people like to discuss learning, action, etc. Middle-aged people talk about needs and travel, and older people often exchange views on expenses, law, etc.
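The Information Gain ranking used to choose the content words can be illustrated with a small sketch. It assumes each post is reduced to the set of words it contains and treats every word as a binary feature (present or absent); the paper does not state how the feature values were encoded for Information Gain, so this encoding and the function names (entropy, information_gain, top_words) are assumptions made only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in the post'.

    docs is a list of sets of words, one set per post (assumed encoding).
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    remainder = 0.0
    for subset in (with_word, without_word):
        if subset:  # an empty side contributes nothing
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def top_words(docs, labels, n):
    """Return the n words with the highest information gain (ties by word)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda w: (-information_gain(docs, labels, w), w))[:n]


if __name__ == "__main__":
    # Toy posts built from words in Table 3 (c); the posts themselves are invented.
    toy_docs = [{"rẽ", "ảnh"}, {"rẽ", "buổi"}, {"quẹo", "ảnh"}, {"quẹo", "coi"}]
    toy_labels = ["North", "North", "South", "South"]
    for w in top_words(toy_docs, toy_labels, 2):
        print(w, information_gain(toy_docs, toy_labels, w))
```

A word that occurs only in posts of one class gets the maximum gain, while a word spread evenly over the classes gets a gain of zero, which is why region-specific words rank highly for the location trait. Calling top_words with n between 100 and 1000 would correspond to the range of content-word counts explored in Fig. 2.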
There are many local words that northern and southern people use differently from each other; in our corpus, we found some of them, as shown in Table 3 (c). Table 3 (d) shows that people working in business and sales often use words related to schedules, appointments, and travel, while people working in technology like to talk about development, machines, etc., and people who have jobs in education or healthcare often discuss social, learning, and charity issues.

Comparison with previous works. Although forum posts are shorter and noisier than other types of online messages such as blog posts or emails, our results can be considered promising in comparison with previous works, especially for the gender and location traits. The accuracy of 90.55% for gender prediction is even better than the results of most previous works conducted on blogs or emails (which had a baseline of about 80%). The accuracy of age prediction (70.70%) is not as good as the results obtained on blog posts or emails (which had a baseline of around 77% for blog posts), but it is much better than the result of the study on forum posts in [16], which is only 53%. The same evaluation applies to the location trait, but the occupation prediction is not as good. The main reason is that occupation information is very noisy and subtle. For example, a person who studied a technical subject but then works as a salesperson is not an easy case when predicting his or her job. This needs to be investigated further in later research.

When comparing with the only previous work on author profiling in Vietnamese [6], for the gender trait we achieved a better result (90.55% versus 83.3%) when using content-based features, and a similar result (83.47% versus 83.3%) without content-based features. This shows that adding content-based features improved the results significantly.
The same holds when comparing the results for the location trait. For the other traits, our results are less accurate, which is understandable and still promising, because our experiments were conducted on a shorter and more informal type of text than blog posts.

5. Conclusion

In this study, we investigated the author profiling task on a different language (Vietnamese) and a different type of text (forum posts) than previous works. The results show that it is feasible to classify authorial characteristics of informal online messages such as forum posts based on linguistic features, and that using content-based features improved the results significantly. We also presented a thorough analysis of content-based features, such as the best number of content words and the list of important words for each trait. The experiments show promising results, although some aspects still need to be improved, such as handling the noisy information in the occupation trait and raising the accuracy of age prediction.

In the future, this study can be extended to other domains, such as social networks or user comments and product reviews. The data in these domains is even shorter and noisier than forum posts, so the task is more challenging. However, the results of such work have promising applications in commercial fields, such as analyzing market trends or predicting user behavior. We also plan to investigate the use of more grammar-based features in this kind of task. Vietnamese has many interesting linguistic features, such as tones and spelling, and we can exploit these features to improve the author profiling results.

Acknowledgements

This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.16.91.

References

[1] Abbasi, A., Chen, H.
Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5), pp. 67-75 (2005).
[2] Abbasi, A., Chen, H. Writeprints: A Style-based approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems, 26(2), pp. 1-29 (2008).
[3] Argamon, S., Koppel, M., Fine, J., Shimoni, A. Gender, Genre, and Writing Style in Formal Written Texts. Text, 23(3), August (2003).
[4] Argamon, S., Koppel, M., Pennebaker, J., Schler, J. Automatically Profiling the Author of an Anonymous Text. Communications of the ACM, 52(2), pp. 119-123 (2008).
[5] Corney, M., De Vel, O., Anderson, A., Mohay, G. Gender-preferential text mining of e-mail discourse. In ACSAC'02: Proc. of the 18th Annual Computer Security Applications Conference, Washington, DC, pp. 21-27 (2002).
[6] Dang, P., Giang, T., Son, P. Author profiling for Vietnamese blogs. International Conference on Asian Language Processing (2009).
[7] De Vel, O., Anderson, A., Corney, M., Mohay, G.M. Mining e-mail content for author identification forensics. SIGMOD Record, 30(4), pp. 55-64 (2001).
[8] Duc, D.T., Son, P.B., Hanh, T. Using Content-based Features for Author Profiling of Vietnamese Forum Posts. In: Recent Developments in Intelligent Information and Database Systems, pp. 287-296. Springer International Publishing, Berlin (2016).
[9] Goswami, S., Sarkar, S., Rustagi, M. Style-based analysis of bloggers' age and gender. In Eytan Adar, Matthew Hurst, Tim Finin, Natalie S. Glance, Nicolas Nicolov, and Belle L. Tseng, editors, ICWSM. The AAAI Press (2009).
[10] Gressel, G., Hrudya, P., Surendran, K., Thara, S., Aravind, A., Prabaharan, P. Ensemble learning approach for author profiling. Notebook for PAN at CLEF (2014).
[11] Iqbal, F. Messaging Forensic Framework for Cybercrime Investigation. A Thesis in the Department of Computer Science and Software Engineering, Concordia University, Montréal, Canada (2010).
[12] Koppel, M., Argamon, S., Shimoni, A.R. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), pp. 401-412 (2002).
[13] Kucukyilmaz, T., Aykanat, C., Cambazoglu, B.B., Can, F. Chat mining: predicting user and message attributes in computer-mediated communication. Information Processing and Management, 44(4), pp. 1448-1466 (2008).
[14] Mendenhall, T.C. The characteristic curves of composition. Science, 11(11), pp. 237-249 (1887).
[15] Mosteller, F., Wallace, D.L. Inference and disputed authorship: The Federalist. Reading, MA: Addison-Wesley (1964).
[16] Nguyen, D., Smith, N.A., Rosé, C.P. Author age prediction from text using linear regression. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, LaTeCH '11, pp. 115-123, Stroudsburg, PA, USA. Association for Computational Linguistics (2011).
[17] Nguyen, D., Gravel, R., Trieschnigg, D., Meder, T. "How old do you think I am?": A study of language and age in Twitter. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (2013).
[18] Peersman, C., Daelemans, W., Vaerenbergh, L.V. Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, SMUC '11, pp. 37-44, New York, NY, USA. ACM (2011).
[19] Phuong, L.H., Huyen, N.T.M., Rossignol, M., Roussanaly, A. An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts. In Proceedings of Traitement Automatique des Langues Naturelles (TALN-2010), Montreal, Canada (2010).
[20] Rangel, F., Rosso, P. Use of language and author profiling: Identification of gender and age. In Natural Language Processing and Cognitive Science, p. 177 (2013).
[21] Savoy, J.
Authorship attribution based on specific vocabulary. ACM Trans. Inf. Syst., 30(2) (2012).
[22] Schler, J., Koppel, M., Argamon, S., Pennebaker, J. Effects of Age and Gender on Blogging. In Proceedings of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs (2006).
[23] Stamatatos, E., Fakotakis, N., Kokkinakis, G. Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), pp. 471-495 (2000).
[24] Zhang, C., Zhang, P. Predicting gender from blog posts. Technical Report, University of Massachusetts Amherst, USA (2010).
[25] Zheng, R., Chen, H., Huang, Z., Qin, Y. Authorship Analysis in Cybercrime Investigation (Eds.): ISI 2003, LNCS 2665, pp. 59-73 (2003).
[26] Zheng, R., Li, J., Chen, H., Huang, Z. A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), pp. 378-393 (2006).