Cross-language Information Retrieval by Reduced k-means

Authors

  • Jasminka Dobša
  • Dunja Mladenić
  • Jan Rupnik
  • Danijel Radošević

Keywords:

cross-language information retrieval, dimensionality reduction, latent semantic indexing, canonical correlation analysis, Reduced k-means

Abstract

Cross-language information retrieval aims to retrieve relevant documents in one language for a query given in another language. We propose a new approach to cross-language information retrieval based on factorization of a term-document matrix by the iterative method of Reduced k-means clustering. Reduced k-means is intended for simultaneous reduction of objects (documents) and variables (index terms). The proposed method is compared with standard machine learning techniques for cross-language information retrieval, namely latent semantic indexing and canonical correlation analysis. The motivation for applying Reduced k-means to this task comes from the observation that documents in the semantic space obtained by latent semantic indexing are clustered primarily by language rather than by topic. Because Reduced k-means aims to preserve the clustering structure of the data, the idea is that the proposed method could address this problem.
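
To make the factorization idea concrete, the following is a minimal sketch of a generic Reduced k-means alternating scheme. It is not the authors' implementation. It assumes the common formulation that minimizes ||X - U F A^T||^2, where X holds documents as rows and index terms as columns (the transpose of a term-document matrix), U is a binary cluster-membership matrix, F holds cluster centroids in the reduced space, and A is a column-orthonormal loading matrix. The function name reduced_kmeans, its parameters, and the random initialization are illustrative assumptions.

```python
import numpy as np


def reduced_kmeans(X, k, q, n_iter=50, seed=0):
    """Sketch of Reduced k-means (assumed formulation, see lead-in).

    X : (n_docs, n_terms) array, documents as rows.
    k : number of clusters; q : reduced dimension (q <= min(k, n_terms)).
    Returns (labels, F, A): cluster labels, centroids in the reduced
    space (k, q), and column-orthonormal loadings (n_terms, q).
    """
    n, p = X.shape
    assert q <= min(k, p), "reduced dimension cannot exceed min(k, n_terms)"
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)  # random initial memberships

    for _ in range(n_iter):
        # Cluster means in the full term space; an empty cluster is
        # re-seeded with a random document and given zero weight.
        C = np.empty((k, p))
        counts = np.zeros(k)
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                C[c] = X[rng.integers(n)]
            else:
                C[c] = members.mean(axis=0)
                counts[c] = len(members)

        # Loadings: top-q right singular vectors of the size-weighted
        # centroid matrix, i.e. top eigenvectors of X^T U (U^T U)^-1 U^T X.
        _, _, Vt = np.linalg.svd(np.sqrt(counts)[:, None] * C,
                                 full_matrices=False)
        A = Vt[:q].T

        # Centroids and documents projected into the reduced space.
        F = C @ A
        Y = X @ A

        # Reassign each document to its nearest reduced-space centroid.
        dist = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # memberships are stable
        labels = new_labels

    return labels, F, A
```

In a cross-language setting, one plausible use (again an assumption, not a detail taken from the abstract) is to build X from aligned training documents whose term vectors in both languages are concatenated, and then project queries and documents through the corresponding rows of A before comparing them.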

Published

2018-11-07

How to Cite

Jasminka Dobša, Dunja Mladenić, Jan Rupnik, & Danijel Radošević. (2018). Cross-language Information Retrieval by Reduced k-means. International Journal of Computer Information Systems and Industrial Management Applications, 10, 9. Retrieved from https://cspub-ijcisim.org/index.php/ijcisim/article/view/594

Section

Original Articles