Microsoft to analyze social networks to determine comprehension level

I recently read that Microsoft was planning to analyze several social networks in order to know more about users, so that the search engine could deliver more appropriate results. See this article on geekwire.com : Microsoft idea: Analyze social networks posts to deduce mood, interests, education.

Among the variables that are considered, the ‘sophistication and education level’ of the posts is mentioned. This is highly interesting, because it assumes a double readability assessment: of what the user writes on the one hand, and of the texts returned by the search engine on the other. More precisely, this could refer to a classification task.

Here is an extract of a patent describing how this is supposed to work.

[0117] In addition to skewing the search results to the user’s inferred interests, the user-following engine 112 may further tailor the search results to a user’s comprehension level. For example, an intelligent processing module 156 may be directed to discerning the sophistication and education level of the posts of a user 102. Based on that inference, the customization engine may vary the sophistication level of the customized search result 510. The user-following engine 112 is able to make determinations about comprehension level several ways, including from a user’s posts and from a user’s stored profile. In one example, the user-following engine 112 may discern whether a user is a younger student or an adult professional. In such an example, the user-following engine may tailor the results so that the professional receives results reflecting a higher comprehension level than the results for the student. Any of a wide variety of differentiations may be made. In a further example, the user-following engine may discern a particular specialty of the user, e.g., the user is a marine biologist or an avid cyclist. In such embodiments, a query from a user related to his or her particular area of specialty may return a more sophisticated set of results than the same query from a user not in that area of specialty.

The main drawback I see in this approach is the determination of a profile based on communication. First of all, people do not necessarily want to read texts that are as easy (or difficult) as those they write. Secondly, people progress in a language by reading words or expressions they do not already know, so by filtering these out Microsoft could prevent young students from developing language skills. Last, communication is an adaptive process : a whole series of adaptations depends on the person or the group one speaks to, and the ‘sophistication level’ varies accordingly, which is not necessarily correlated with an education level.

A general example would be that people usually try to be (or to seem) cool on Facebook, which involves using shorter sentences and colloquial terms. Another example would be the lack of time, and as a result shorter sentences and messages.

It seems that this strategy is based on the false assumption that you can judge a user’s linguistic abilities by starting from a result that is in fact a construct. In other words, it seems like an excessive valuation of performance over competence. There are many reasons why people may speak or write differently in different situations. That is what many sub-disciplines of linguistics are about, and that is what Microsoft is blatantly ignoring in this project. A reasonable explanation would be that the so-called levels are rough estimations and that the profiles are not fine-grained, i.e. that there are only a few of them.


Amazon’s readability statistics by example

I already mentioned Amazon’s text stats in a post where I tried to explain why they were far from being useful in every situation : A note on Amazon’s text readability stats, published last December.

I found an example which shows particularly well why you cannot rely on these statistics when it comes to getting a precise picture of a text’s readability. Here are the screenshots of text statistics describing two different books :

Comparison

The two books look quite similar, except for the length of the second one, which seems to contain significantly more words and sentences.

The first book (on the left) is Pippi Longstocking, by Astrid Lindgren, whereas the second is The Sound and The Fury, by William Faulkner… The writing styles could not be more different. However, the text statistics make them appear quite close to each other.

The criteria used by Amazon are too simplistic, even if they usually perform acceptably on all kinds of texts. The readability formulas that output the first series of results only take the length of words and sentences into account, and their scale is designed for the US school system. In fact, the “readability” and “complexity” factors are the same, so these sections are redundant. Nevertheless, it is an interesting approach to try to discriminate between them.
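To make the point about formulas based on length alone more concrete, here is a minimal sketch of one well-known formula of this family, the Flesch-Kincaid grade level. Whether Amazon computes exactly this formula is an assumption on my part, and the counts below are made up for illustration, they are not taken from the two books.

```javascript
// Flesch-Kincaid grade level computed from raw counts.
// The coefficients are the published ones; that Amazon uses exactly
// this formula is an assumption.
function fleschKincaidGrade(words, sentences, syllables) {
  if (words <= 0 || sentences <= 0) {
    throw new RangeError("words and sentences must be positive");
  }
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}

// Two hypothetical texts (made-up counts) with the same averages get
// the same grade, whatever their style or narrative structure.
const childrensBook = fleschKincaidGrade(1000, 100, 1300);
const modernistNovel = fleschKincaidGrade(9000, 900, 11700);
console.log(childrensBook.toFixed(2), modernistNovel.toFixed(2));
```

Only two ratios enter the computation (words per sentence and syllables per word), which is why texts as different as a children's book and a modernist novel can end up with nearly identical scores.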

It is clear that the formulas lack depth and adaptability. We need to get a much more complete view of the processes that are concerned by readability issues (see for instance this post about Canadian research on readability in the ’90s).

Still, there may be other reasons that make the books comparable on this basic visualization. At the beginning of The Sound and The Fury, the characters are mostly speaking to a child. The ambiguity regarding the sentences and the narrative flow does not make its content that easy to understand, let alone the fact that the described reality is brutal. On the whole, Faulkner’s sentences are not particularly short, and there are even a few luminous counterexamples, so there may be a failure in the text analysis, for instance in the tokenization process.

It is never easy to measure and to give precise data, as I discussed in a post on measurement which involved Lord Kelvin, Bachelard and Dilbert. Once again, one could ask for a more detailed account of readability, based on two main questions : what kind of readability ? and for whom ?


2nd release of the German Political Speeches Corpus

Last Monday, I released an updated version of both corpus and visualization tool on the occasion of the DGfS-CL Poster-Session in Frankfurt, where I presented a poster (in German).

The first version was made available last summer and mentioned on this blog, cf. this post : Introducing the German Political Speeches Corpus and Visualization Tool.

The resource still uses this permanent redirection : http://purl.org/corpus/german-speeches

Description

If you don’t remember it or never heard of it, here is a brief description :

The resource presented here consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources. It provides raw data, metadata and tokenized text with part-of-speech tagging and lemmas in XML TEI format for researchers who are able to use it, and a simple visualization interface for those who want to get a glimpse of what is in the corpus before downloading it or thinking about using more complete tools.

The visualization output is in valid CSS/XHTML format and takes advantage of recent standards. The purpose is to give a sort of Zeitgeist, an insight into the topics developed by a government official and into the evolution in the use of general concepts.

Changes

The corpus has been updated and ships with integrated text enrichment :

  • Tokenization (Perl scripts), POS-tags and lemmatization (TreeTagger) are included.
  • Nearly TEI-compliant XML format.

On the visualization side, there are no major changes, but a lot of improvements :

  • The web pages are lighter, as they are completed on-the-fly by scripts (JavaScript, client-side).
  • There is a list of keywords for each text, which is still experimental but gives a rough idea of what is inside.
  • The script that highlights the selected words in the texts has been improved but still does not get the words beginning with Ä, Ö or Ü, although they are rather frequent.
  • The ugly menu has been replaced by a real tab interface, and overall the CSS files fit more versions of Firefox, Chrome, Safari and Opera.
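Regarding the words beginning with Ä, Ö or Ü : a possible cause (this is an assumption, the highlighting script itself is not shown here) is the \b word boundary of JavaScript regular expressions, which only knows ASCII word characters. Here is a minimal sketch of a Unicode-aware alternative, where the helper names and the hl class are hypothetical.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a matcher that finds `word` as a whole word, using Unicode letter
// and digit classes instead of the ASCII-only \b boundary.
function wholeWordRegExp(word) {
  return new RegExp(
    "(?<![\\p{L}\\p{N}])" + escapeRegExp(word) + "(?![\\p{L}\\p{N}])",
    "gu"
  );
}

// Hypothetical helper: wrap every match in a span with a highlight class.
function highlight(text, word) {
  return text.replace(wholeWordRegExp(word), '<span class="hl">$&</span>');
}

const sentence = "Die Änderung der Ölpreise und das Öl selbst";
// The ASCII-only boundary finds nothing here, since Ö is not a \w character.
console.log(/\bÖl\b/.test(sentence));
// The Unicode-aware version marks the standalone word only.
console.log(highlight(sentence, "Öl"));
```

The lookbehind and the \p{…} classes require a reasonably recent JavaScript engine, so this would not have been an option for older browsers.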

You can use this technical paper to learn more details and to refer to this resource.


A short review of XML standards for language corpora

Document-driven and data-driven, standoff and inline

First of all, the intention of the encoding can be different. Richard Eckart summarizes two main trends: document-driven XML and data-driven XML. While the first uses an « inline approach » and is « usually easily human-readable and meaningful even without the annotations », the latter is « geared towards machine processing and functions like a database record. [...] The order of elements often is meaningless. » (Eckart 2008 p. 3)

In fact, several choices of architecture depend on the goal of an annotation using XML. The main division regards standoff and inline XML (also : stand-off and in-line).

The Paula format (“Potsdamer Austauschformat für linguistische Annotation”, ‘Potsdam Interchange Format for Linguistic Annotation’) chose both approaches. So did Nancy Ide for the ANC Project, where a series of tools enables users to convert the data between well-known formats (GrAF standoff, GrAF inline, GATE or UIMA). This versatility seems to be a good point, since you cannot expect corpus users to change their habits just because of one single corpus. Regarding the way standoff and inline annotation compare, [Dipper et al. 2007] found that the inline format (with pointers) performs better.
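The difference between the two architectures can be illustrated with a small sketch. The token list, the w element and the pos attribute are illustrative assumptions and do not follow the Paula or GrAF schemas; XML escaping is left out for brevity.

```javascript
// Hypothetical token list: surface form plus a part-of-speech tag.
const tokens = [
  { form: "Wir", pos: "PPER" },
  { form: "danken", pos: "VVFIN" },
  { form: "Ihnen", pos: "PPER" },
];

// Inline: the annotation is wrapped around the text itself.
// (No XML escaping here, the forms are assumed to be plain letters.)
function toInline(tokens) {
  return tokens
    .map((t) => `<w pos="${t.pos}">${t.form}</w>`)
    .join(" ");
}

// Standoff: the text stays untouched, annotations point into it
// through character offsets (start inclusive, end exclusive).
function toStandoff(tokens) {
  let text = "";
  const annotations = [];
  tokens.forEach((t, i) => {
    if (i > 0) text += " ";
    const start = text.length;
    text += t.form;
    annotations.push({ start, end: text.length, pos: t.pos });
  });
  return { text, annotations };
}

console.log(toInline(tokens));
console.log(JSON.stringify(toStandoff(tokens)));
```

In the inline version the text cannot be read without parsing the markup, whereas the standoff version keeps the primary data intact and lets several annotation layers point to the same offsets.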

A few trends in linguistic research

Speaking about trends in German research during the last decade, [Woerner et al. 2006] see three main approaches (p. 1) :

  • the timeline-based stand-off format Exmaralda [Schmidt 2004]
  • the hierarchical format Tusnelda that is based on the TEI [Sperberg-McQueen and Burnard 1994]
  • Paula that resembles the Linguistic Annotation Framework [Ide et al. 2003]

Among them, Paula seems interesting :

« The interchange format PAULA has been developed for empirical, data-based research on information structure, a linguistic phenomenon that involves various linguistic levels, such as syntax, phonology, semantics. As a consequence, the data which serve as the basis of this research are marked up with different kinds of annotations: syntax trees or graphs, segment-based phonological properties, etc. The annotations are created by means of different, task-specific annotation tools » (Woerner et al. 2006 p. 5)

The Linguistic Annotation Framework has been developed by Nancy Ide and Laurent Romary, who had already worked on the XCES ISO standard. The accomplishments and models can be seen at work in the American National Corpus (ANC).

The TEI developed a very different and maybe much more complete annotation framework than XCES, although their approaches are similar.

Known issues

The variety of features which can be annotated is a challenge per se. [Witt et al. 2009] document one serious issue : the crossing edges in an XML graph.

« [Linguistically annotated corpora] may contain crossing edges and, thus, require a data structure that is more complex than a simple tree. » (p. 364)

Thus, they try to show how multi-rooted trees can be represented « in an integrated way, by using the TEI tag set for the annotation of feature structures. » (p. 365)
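Under a simplified reading where each annotation is a span of character offsets, the crossing problem can be sketched as follows: two spans cross when they overlap without one being nested in the other, so they cannot both be elements of a single XML tree. The layer names and offsets are hypothetical, and this is not the method described by Witt et al.

```javascript
// Two spans cross when they overlap but neither contains the other.
// Offsets are start-inclusive, end-exclusive.
function spansCross(a, b) {
  const overlap = a.start < b.end && b.start < a.end;
  const aInB = b.start <= a.start && a.end <= b.end;
  const bInA = a.start <= b.start && b.end <= a.end;
  return overlap && !aInB && !bInA;
}

// Return every pair of annotations that cannot live in one XML tree.
function findCrossings(spans) {
  const crossings = [];
  for (let i = 0; i < spans.length; i++) {
    for (let j = i + 1; j < spans.length; j++) {
      if (spansCross(spans[i], spans[j])) {
        crossings.push([spans[i].id, spans[j].id]);
      }
    }
  }
  return crossings;
}

// Hypothetical layers: a sentence, a syntactic phrase and a prosodic unit.
const spans = [
  { id: "sentence", start: 0, end: 40 },
  { id: "phrase", start: 5, end: 20 },
  { id: "prosody", start: 15, end: 30 },
];
console.log(findCrossings(spans));
```

Here only the phrase and the prosodic unit are reported, since the sentence contains both of them and nesting is unproblematic in a tree.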

Given the diversity of formats, one of the main goals should be to ensure interoperability. That is where complying with a standard has a few advantages, described by [Romary 2009] :

« As soon as the corpus to be digitized is planned to be disseminated to a wider audience, one should make sure that the documentation of the corpus objects, both from a library point of view (meta-data, source identification, etc.) and a technical point of view (schema), is adequate for their autonomous processing by third-party users; » (p. 3)
« the definition a finite set of features and corresponding practices is somehow simplified, with very little room for encoding overkill. Still, since the corpus of texts is a constantly evolving matter, there is a need for defining a workflow for constant updating of the underlying schema;» (p. 5)

[Rehm et al. 2009] focus on the fact that interoperability is not as easy as it seems. First of all, data normalization should be considered an important feature, because beyond the mere practical issues, models can be called into question.

« The aspect of data normalization is rarely discussed in academic publications. This is mostly due to the fact that the conversion from one format into another is not regarded as a difficult or challenging task, because tools and specialized programming languages exist that support researchers in converting data sets from one format into another. In practice, however, it turns out that this rather time-consuming task is in fact of interest for researchers within Digital Humanities. The reason is that the specification of transformations may change the model according to which a text resource is annotated. » (p. 201)

Finally, Romary describes a theoretical issue on the linguistic side : the compatibility of linguistic frameworks is anything but guaranteed.

« What if two or more corpora contain data annotated in markup languages that are, from a theoretical linguistics point-of-view, incompatible with each other (for example, if they are based on incompatible theoretical frameworks) – will it be possible to represent terms and concepts in the ontology that contradict each other? » (p.8)

References

  • [2010,article]
    L. Romary, "Stabilizing knowledge through standards – A perspective for the humanities," Going Digital: Evolutionary and Revolutionary Aspects of Digitization, 2010.
    @article{Romary:2010, title={{Stabilizing knowledge through standards - A perspective for the humanities}},
      author={Romary, L.},
      journal={Going Digital: Evolutionary and Revolutionary Aspects of Digitization},
      url={http://arxiv.org/abs/1011.0519},
      year={2010},
      }
  • [2009,article]
    G. Rehm, O. Schonefeld, A. Witt, E. Hinrichs, and M. Reis, "Sustainability of annotated resources in linguistics: A web-platform for exploring, querying, and distributing linguistic corpora and other resources," Literary and Linguistic Computing, vol. 24, iss. 2, pp. 193-210, 2009.
    @article{Rehm:2009, title={{Sustainability of annotated resources in linguistics: A web-platform for exploring, querying, and distributing linguistic corpora and other resources}},
      author={Rehm, G. and Schonefeld, O. and Witt, A. and Hinrichs, E. and Reis, M.},
      journal={Literary and Linguistic Computing},
      volume={24},
      number={2},
      pages={193--210},
      year={2009},
      publisher={ALLC}
    }
  • [2009,article]
    A. Witt, G. Rehm, E. Hinrichs, T. Lehmberg, and J. Stegmann, "SusTEInability of linguistic resources through feature structures," Literary and Linguistic Computing, vol. 24, iss. 3, pp. 363-372, 2009.
    @article{Witt:2009, title={{SusTEInability of linguistic resources through feature structures}},
      author={Witt, A. and Rehm, G. and Hinrichs, E. and Lehmberg, T. and Stegmann, J.},
      journal={Literary and Linguistic Computing},
      volume={24},
      number={3},
      pages={363--372},
      year={2009},
      publisher={ALLC}
    }
  • [2009,article]
    L. Romary, "Questions \& Answers for TEI Newcomers," Jahrbuch für Computerphilologie, vol. 10, 2009.
    @article{Romary:2008, title={{Questions \& Answers for TEI Newcomers}},
      author={Romary, L.},
      journal={Jahrbuch für Computerphilologie},
      volume={10},
      url={http://arxiv.org/abs/0812.3563},
      year={2009}
    }
  • [2008,article]
    R. Eckart, "Choosing an XML database for linguistically annotated corpora," Sprache und Datenverarbeitung, vol. 32, iss. 1, pp. 7-22, 2008.
    @article{Eckart:2008, title={{Choosing an XML database for linguistically annotated corpora}},
      author={Eckart, Richard},
      journal={Sprache und Datenverarbeitung},
      volume={32},
      number={1},
      pages={7--22},
      year={2008},
      publisher={Institut für Kommunikationsforschung und Phonetik der Universität Bonn},
      }
  • [2007,techreport]
    L. Burnard and S. Bauman, Eds., "TEI P5: Guidelines for Electronic Text Encoding and Interchange," Text Encoding Initiative Consortium, 2007.
    @TECHREPORT{Burnard:2007, editor = {Burnard, L. and Bauman, S.},
      title = {{TEI P5: Guidelines for Electronic Text Encoding and Interchange}},
      institution = {Text Encoding Initiative Consortium},
      year = 2007, url = {http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html},
      }
  • [2007,inproceedings]
    N. Ide and K. Suderman, "GrAF: A graph-based format for linguistic annotations," in Proceedings of the Linguistic Annotation Workshop, 2007, pp. 1-8.
    @inproceedings{Ide:2007, title={{GrAF: A graph-based format for linguistic annotations}},
      author={Ide, N. and Suderman, K.},
      booktitle={Proceedings of the Linguistic Annotation Workshop},
      pages={1--8},
      year={2007},
      organization={Association for Computational Linguistics}
    }
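  • [2007,incollection]
    S. Dipper, M. Götze, U. Küssner, and M. Stede, "Representing and querying standoff XML," in Data Structures for Linguistic Resources and Applications – Proceedings of the Biennial GLDV Conference 2007, Rehm, G., Witt, A., and Lemnitzer, L., Eds., Tübingen: Gunter Narr, 2007, pp. 337-346.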
    @incollection{Dipper:2007, title={{Representing and querying standoff XML}},
      author={Dipper, Stefanie and Götze, Michael and Küssner, Uwe and Stede, Manfred},
      booktitle={{Data Structures for Linguistic Resources and Applications – Proceedings of the Biennial GLDV Conference 2007}},
      editor={Rehm, Georg AND Witt, Andreas AND Lemnitzer, Lothar},
      pages={337--346},
      year={2007},
      publisher={Gunter Narr},
      address={Tübingen},
      }
  • [2006,inproceedings]
    K. Wörner, A. Witt, G. Rehm, and S. Dipper, "Modelling Linguistic Data Structures," in Proceedings of Extreme Markup Languages, Montréal, Canada, 2006.
    @INPROCEEDINGS{Woerner:2006,
      author={Wörner, K. and Witt, A. and Rehm, G. and Dipper, S.},
      title={{Modelling Linguistic Data Structures}},
      booktitle={Proceedings of Extreme Markup Languages},
      address={Montréal, Canada},
      year = 2006, }
  • [2003,incollection]
    H. Lobin, "Textauszeichnung und Dokumentgrammatiken," in Texttechnologie, Lobin, H. and Lemnitzer, L., Eds., Stauffenburg Verlag, 2003.
    @incollection{Lobin:2003, title={{Textauszeichnung und Dokumentgrammatiken}},
      author={Lobin, H.},
      booktitle={Texttechnologie},
      editor={Lobin, Henning AND Lemnitzer, Lothar},
      publisher={Stauffenburg Verlag},
      year={2003},
      url={http://www.uni-giessen.de/~g91062/pdf/Lobin-LL-2003c.pdf},
      }
  • [2000,inproceedings]
    N. Ide, P. Bonhomme, and L. Romary, "XCES: An XML-based Encoding Standard for Linguistic Corpora," in Proceedings of the Second Language Resources and Evaluation Conference (LREC), Athens, 2000, pp. 825-830.
    @INPROCEEDINGS{Ide:2000,
      author={Ide, Nancy AND Bonhomme, Patrice AND Romary, Laurent},
      title={{XCES: An XML-based Encoding Standard for Linguistic Corpora}},
      booktitle={Proceedings of the Second Language Resources and Evaluation Conference (LREC)},
      pages={825--830},
      address={Athens},
      year = 2000, }


Completing web pages on the fly with JavaScript

As I am working on a new release of the German Political Speeches Corpus, I looked for a way to make web pages lighter. I have lots of repetitive information, so a little JavaScript is helpful when it comes to saving on file size. Provided that the DOM structure is available, there are elements that may be completed on load.

For example, there are span elements which include specific text. By catching them and testing them against a regular expression the script is able to add attributes (like a class) to the right ones. Without activating JavaScript one still sees the contents of the page, and with it the page appears as I intended. In fact, the attributes match properties defined in a separate CSS file.

I had to look for several JavaScript commands across many websites, which is why I decided to summarize what I found in a post.

First example : append text and a target to a link

These lines of code match all the links that don’t already have a href attribute, and append to them a modified destination as well as a target attribute.


function modLink(txt) {
  // Get all the links
  var list = document.getElementsByTagName("a");
  for (var i = 0; i < list.length; i++) {
    // Only handle links that do not have an href attribute yet
    var first = list[i].firstChild;
    if (!list[i].getAttribute("href") && first && first.nodeType === 3) {
      // Get the text inside the a tag
      var content = first.data;
      // Give the element its destination, built from its text and the argument
      list[i].href = "something/" + content + ".html" + "?var=" + txt;
      // Give the element a target attribute
      list[i].target = "_blank";
    }
  }
}

The result : one has to call this function somewhere in the HTML document (I do it on load), for instance modLink("test"). Then, an a tag including the text ‘AAA’ will get the attribute href="something/AAA.html?var=test". This is very useful if you have to pass arguments.
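If the link text or the argument may contain spaces or characters like ‘&’, the destination should probably be percent-encoded. Here is a minimal sketch of how the string used by modLink could be built; the buildHref helper is hypothetical and not part of the code on the corpus pages.

```javascript
// Build the destination used by modLink, percent-encoding both the link
// text and the argument so that spaces or "&" cannot break the URL.
// buildHref is a hypothetical helper, not part of the original script.
function buildHref(content, txt) {
  return (
    "something/" + encodeURIComponent(content.trim()) + ".html" +
    "?var=" + encodeURIComponent(txt)
  );
}

console.log(buildHref("AAA", "test"));
console.log(buildHref(" A B ", "x&y"));
```

Keeping the string building in a separate function also makes it possible to check the result without a browser.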

Second example : append a class to a span element

These lines of code modify existing span elements : those whose text includes a left parenthesis and digits get a class named i[digits], and all the others get a class named c.


function modSpan() {
  // Get all span elements
  var spanlist = document.getElementsByTagName("span");
  for (var n = 0; n < spanlist.length; n++) {
    // Get their contents, skipping spans that do not start with a text node
    var first = spanlist[n].firstChild;
    if (!first || first.nodeType !== 3) { continue; }
    var inspan = first.data;
    // Find the digits in the contents, if any
    var num = /[0-9]+/.exec(inspan);
    // If there is a left parenthesis and a number, add a class named i[digits]
    if (/\(/.test(inspan) && num) {
      // The className property also works in old versions of Internet Explorer
      spanlist[n].className = "i" + num[0];
    }
    // Otherwise, add a class named c
    else {
      spanlist[n].className = "c";
    }
  }
}

The result : a span element containing text like ‘AAA’ will get a class="c" attribute, whereas another containing something like ‘(505)’ will get class="i505".

A last word regarding the security and the functionality of this code : you might want to check the elements to change against fine-grained expressions (not like in this example) in order to ensure the result cannot be modified by mistake.
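As an illustration of what a more fine-grained check could look like, here is a minimal sketch that only accepts two exact shapes and leaves everything else untouched. The classForSpanText helper is hypothetical, and the assumption that regular spans contain letters only is mine.

```javascript
// Return the class for a span's text, or null when the text matches
// neither of the two expected shapes (hypothetical helper).
function classForSpanText(text) {
  const trimmed = text.trim();
  // Exactly a parenthesized number, e.g. "(505)".
  const indexed = /^\((\d+)\)$/.exec(trimmed);
  if (indexed) return "i" + indexed[1];
  // Exactly a run of letters, e.g. "AAA"; this shape is an assumption.
  if (/^\p{L}+$/u.test(trimmed)) return "c";
  // Anything else is left alone by the caller.
  return null;
}

console.log(classForSpanText("(505)"));
console.log(classForSpanText("AAA"));
console.log(classForSpanText("(12) extra"));
```

Since the expressions are anchored at both ends, a span with unexpected content returns null instead of silently getting a class.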
