Posts tagged ‘latent semantic indexing’

A short bibliography on Latent Semantic Analysis and Indexing

To go a bit further than my previous post, here are a few references that I recently found interesting.

For a definition or other short bibliographies, see Wikipedia or, for a change, Scholarpedia, whose article is “curated” by T.K. Landauer and S.T. Dumais.

  • [2009,techreport] bibtex
    U. Mortensen, "Einführung in die Korrespondenzanalyse," Universität Münster, 2009.
    @techreport{Mortensen:2009, title={{Einführung in die Korrespondenzanalyse}},
      author={Mortensen, U.},
      institution={Universität Münster},
      year = {2009},
    }
  • [2005,conference] bibtex
    G. Gorrell and B. Webb, "Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis," in Ninth European Conference on Speech Communication and Technology, 2005.
    @conference{Gorrell:2005, title={{Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis}},
      author={Gorrell, G. and Webb, B.},
      booktitle={Ninth European Conference on Speech Communication and Technology},
      year = {2005},
    }
  • [2004,book] bibtex
    P. Cibois, Les méthodes d’analyse d’enquêtes, Que sais-je ?, 2004.
    @book{Cibois:2007, title={{Les méthodes d'analyse d'enquêtes}},
      author={Cibois, P.},
      publisher={Que sais-je ?},
      year = {2004},
    }
  • [2004,techreport] bibtex
    B. Pincombe, "Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus," Australian Department of Defence, 2004.
    @techreport{Pincombe:2004, title={{Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus}},
      author={Pincombe, B.},
      institution={Australian Department of Defence},
      year = {2004},
    }
  • [1995,article] bibtex
    M. W. Berry, S. T. Dumais, and G. W. O’Brien, "Using Linear Algebra for Intelligent Information Retrieval," SIAM Review, vol. 37, iss. 4, pp. 573-595, 1995.
    @article{Berry:1995, title = {{Using Linear Algebra for Intelligent Information Retrieval}},
      author = {Berry, Michael W. and Dumais, Susan T. and O'Brien, Gavin W.},
      journal = {SIAM Review},
      volume = {37},
      number = {4},
      pages = {573--595},
      year = {1995},
      publisher = {Society for Industrial and Applied Mathematics},
    }
  • [1992,techreport] bibtex
    S. Dumais, "Enhancing performance in latent semantic indexing (LSI) retrieval," Bellcore, 1992.
    @techreport{Dumais:1992, title={{Enhancing performance in latent semantic indexing (LSI) retrieval}},
      author={Dumais, S.},
      institution={Bellcore},
      year = {1992},
    }
  • [1990,article] bibtex
    S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American society for information science, vol. 41, iss. 6, pp. 391-407, 1990.
    @article{Deerwester:1990, title={{Indexing by latent semantic analysis}},
      author={Deerwester, S. and Dumais, S.T. and Furnas, G.W. and Landauer, T.K. and Harshman, R.},
      journal={Journal of the American society for information science},
      volume={41},
      number={6},
      pages={391--407},
      year = {1990},
      publisher={John Wiley \& Sons}
    }
  • [1975,article] bibtex
    G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, iss. 11, pp. 613-620, 1975.
    @article{Salton:1975, title={{A vector space model for automatic indexing}},
      author={Salton, G. and Wong, A. and Yang, C.S.},
      journal={Communications of the ACM},
      volume={18},
      number={11},
      pages={613--620},
      year = {1975},
    }

[Bibliography generated using the bib2html WordPress plugin.]

Building a topic-specific corpus out of two different corpora

I have two corpora (say, I crawled two websites to get hold of them) which sometimes focus on the same topics. I would like to try to merge them in order to build a balanced and coherent corpus. As this is a much-discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research and do not know how far I am going to go with both corpora, I want to keep it simple.


One of the appropriate techniques (if not the best)

I could do it using LSA (in this particular case latent semantic analysis, not lysergic acid amide!) or, to be more precise, latent semantic indexing.

As this technical report shows, it can perform well in this kind of case: B. Pincombe, Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus, Australian Department of Defence, 2004 (full text available here or through any good search engine; see previous post).

This could be a topic for later research.
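
To make the idea a bit more concrete, here is a minimal sketch of LSA-style document similarity, assuming NumPy is available. The term-document counts are made up for illustration, and the choice of k=2 latent dimensions is arbitrary; this is not the setup used in the report. Documents 0 and 2 share no term at all, yet they end up close in the latent space because both co-occur with document 1.

```python
import numpy as np

def lsa_document_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each row of the result is one document.
    return (np.diag(s[:k]) @ vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy counts (rows: terms, columns: documents), invented for this example.
term_doc = np.array([
    [1, 0, 0, 0, 0],  # car
    [1, 1, 0, 0, 0],  # engine
    [0, 1, 1, 0, 0],  # motor
    [0, 0, 1, 0, 0],  # road
    [0, 0, 0, 2, 1],  # bank
    [0, 0, 0, 1, 2],  # loan
], dtype=float)

doc_vectors = lsa_document_vectors(term_doc, k=2)

raw_sim = cosine(term_doc[:, 0], term_doc[:, 2])        # no shared terms
latent_sim = cosine(doc_vectors[0], doc_vectors[2])     # same latent topic
cross_sim = cosine(doc_vectors[0], doc_vectors[3])      # different topic
print(raw_sim, latent_sim, cross_sim)
```

On a real corpus the matrix would be much larger and sparse, and the counts would usually be weighted before the decomposition.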


The approach that I am working on (not quick and dirty but simpler and hopefully robust)

As an exercise (and who knows, it may even prove efficient), I am using another approach. Instead of applying corpus-wide statistical methods, I plan to take advantage of local grammatical processing of the sentences, i.e. partial parsing (also called surface, chunk or shallow parsing).

The two main steps are:

  1. a term weighting phase (to find important words in the documents)
  2. a vector space search through the corpus (to find documents that have something in common)
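
The two steps above can be sketched as follows. This is only an illustration: tf-idf stands in for the weighting phase (the weighting I am actually working on relies on phrase heads), and the function names and the tiny tokenised documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Step 1: map each tokenised document to a {term: tf-idf weight} dict."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def most_similar(query_index, vectors):
    """Step 2: rank all other documents by similarity to the query document."""
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, reverse=True)

# Hypothetical documents, already tokenised and lowercased.
docs = [
    ["wahl", "partei", "regierung"],
    ["regierung", "partei", "koalition"],
    ["fussball", "tor", "spiel"],
]
vectors = tfidf_vectors(docs)
ranking = most_similar(0, vectors)
print(ranking)
```

Here the second document comes out as the closest match to the first one, and the third one gets a score of zero because it shares no term with it.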

Where I am now : the first phase

So far, I have written a finite-state automaton that identifies the heads of phrases, as these words often carry more significant “weight”.

It takes as input the output of a part-of-speech tagger (the TreeTagger) and reports its successive states, along with whether it found a possible head and/or a possible extension, taking into account that German is a rather head-final language. It works for noun phrases, verb phrases and a few adpositional phrases.

I think that may be enough to trigger a second automaton, which would use this information to try to recover the structure of the sentences; together the two automata would form a kind of finite-state cascade (see Steven Abney, Partial parsing via finite-state cascades, in John Carroll (ed.), Workshop on Robust Parsing (ESSLLI ’96), pages 8–15, 1996). For now, though, it does not seem necessary.
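
A much simplified sketch of such an automaton could look like this. It is not my actual implementation: the state names, the tag groupings and the decision to report the noun inside a prepositional phrase with the type "PP" are assumptions made for the example. The tags follow the STTS tagset that the TreeTagger uses for German.

```python
NOUN_TAGS = {"NN", "NE"}                                  # common and proper nouns
PRENOMINAL_TAGS = {"ART", "ADJA", "PPOSAT", "PDAT", "CARD"}
VERB_TAGS = {"VVFIN", "VVINF", "VVPP"}

def find_heads(tagged):
    """Return (token, phrase_type) pairs for each possible head found."""
    heads = []
    state = "OUTSIDE"
    for token, tag in tagged:
        if tag == "APPR":
            # A preposition opens an adpositional phrase.
            state = "IN_PP"
        elif tag in PRENOMINAL_TAGS:
            # Determiners and adjectives extend a phrase or open a noun phrase.
            if state == "OUTSIDE":
                state = "IN_NP"
        elif tag in NOUN_TAGS:
            # Head-final assumption: the noun closes the phrase and is its head.
            heads.append((token, "PP" if state == "IN_PP" else "NP"))
            state = "OUTSIDE"
        elif tag in VERB_TAGS:
            heads.append((token, "VP"))
            state = "OUTSIDE"
        else:
            # Anything else (punctuation, conjunctions, ...) resets the automaton.
            state = "OUTSIDE"
    return heads

# Hand-tagged example sentence; real input would come from the tagger.
tagged = [
    ("Die", "ART"), ("neue", "ADJA"), ("Regierung", "NN"), ("plant", "VVFIN"),
    ("in", "APPR"), ("diesem", "PDAT"), ("Jahr", "NN"),
    ("eine", "ART"), ("Reform", "NN"), (".", "$."),
]
heads = find_heads(tagged)
print(heads)
```

A single pass over the tags is enough here, which is what makes the approach cheap compared with full parsing.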

Heads that occur several times or in relevant places, such as a title, are stored as tags for a given document, with an indication of their strength/relevance. I am refining the way the tags are derived and am thinking about moving on to phase 2.
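
One way to turn heads into weighted tags is sketched below. The function name, the title boost of 3.0 and the threshold of 2.0 are hypothetical values chosen for the example, not the ones I use.

```python
from collections import Counter

def document_tags(body_heads, title_heads, title_boost=3.0, min_score=2.0):
    """Score heads by frequency, boost those found in the title, keep the strong ones."""
    scores = Counter()
    for head in body_heads:
        scores[head] += 1.0           # one point per occurrence in the body
    for head in title_heads:
        scores[head] += title_boost   # extra weight for a relevant place
    return {head: score for head, score in scores.items() if score >= min_score}

# Hypothetical heads, e.g. as produced by a head-finding automaton.
body_heads = ["Regierung", "Reform", "Regierung", "Jahr"]
title_heads = ["Reform"]
tags = document_tags(body_heads, title_heads)
print(tags)
```

With these values, "Jahr" is dropped because it occurs only once in the body, while "Reform" gets the highest score thanks to the title.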