Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I presented a paper of mine, Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources, as well as a joint paper with Felix Bildhauer and Roland Schäfer, Focused Web Corpus Crawling.


The comparison I did started from web crawling experiments I performed at the FU Berlin. The fact is that the conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easily gathered that way until search engine companies restricted this access, so that one now has to pay and/or wait longer to send queries.

I tried to evaluate the leading approach and to find decent substitutes using social networks as well as the Open Directory Project and Wikipedia. I took four different languages (Dutch, French, Indonesian and Swedish) as examples in order to compare several web spaces with different, if not opposed, characteristics.

My results distinguish no clear winner: complementary approaches are called for, and it seems possible to replace or at least to complement the existing BootCaT approach. I think that crawling problems such as link/host diversity have not been well studied in a corpus linguistics context, and I wish to bring to linguists’ attention that the first step of web corpus construction can in itself change a lot of parameters.


Slide on search engine URLs
The slides of my talk can be downloaded here.

Additionally, experiments run with Felix Bildhauer and Roland Schäfer at the FU Berlin showed that the crawling seeds I selected tend to perform better than the traditional approach.


For more information, the papers are available online:

Barbaresi A., 2014, Finding Viable Seed URLs for Web Corpora: A Scouting Approach and Comparative Study of Available Sources, Proceedings of WaC-9, pp. 1-8.

Schäfer R., Barbaresi A., Bildhauer F., 2014, Focused Web Corpus Crawling, Proceedings of WaC-9, pp. 9-15.

The toolchain used to perform these experiments is open-source: FLUX: Filtering and Language-identification for URL Crawling Seeds.

Challenges in web corpus construction for low-resource languages

I recently presented a paper at the third LRL Workshop (a joint LTC-ELRA-FLaReNet-META-NET workshop on “Less Resourced Languages, new technologies, new challenges and opportunities”).


The state of the art tools of the “web as corpus” framework rely heavily on URLs obtained from search engines. Recently, this querying process became very slow or impossible to perform on a low budget.

Moreover, there are diverse and partly unknown search biases related to search engine optimization tricks and undocumented PageRank adjustments, so that diverse sources of URL seeds could at least ensure that there is not a single bias, but several ones. Last, the evolving web document structure and a shift from “web AS corpus” to “web FOR corpus” (increasing number of web pages and the necessity to use sampling methods) complete what I call the post-BootCaT world in web corpus construction.

Study: What are viable alternative data sources for lesser-known languages?

Trying to find reliable data sources for Indonesian, the language of a country with a population of 237,424,363 of which 25.90% were internet users (2011, official Indonesian statistics institute), I performed a case study of different kinds of URL sources and crawling strategies.

First, I classified URLs extracted from the Open Directory Project (what are these URLs worth for language studies and web corpus construction?) and Wikipedia (do the links from a particular edition point to relevant websites with respect to the language of the documents they contain?).

I did this for Indonesian, Malay, Danish and Swedish in order to enable comparisons, most notably with the Scandinavian pair of medium-resourced languages. Then I performed web crawls focusing on Indonesian and using the mentioned sources as start URLs.

My scouting approach using open-source software leads to a URL database with metadata which can be used to replace or at least to complement the BootCaT approach.

Slide from the talk: crawling experiments for Indonesian

For more information

A. Barbaresi, “Challenges in web corpus construction for low-resource languages in a post-BootCaT world“, in Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 6th Language & Technology Conference, Less Resourced Languages special track, Zygmunt Vetulani and Hans Uszkoreit (eds.), pp. 69-73, Poznan, 2013.

Article and slides are available here:

The toolchain used in this article is available under an open-source license on GitHub: FLUX-Toolchain, Filtering and Language-identification for URL Crawling Seeds (FLUCS).

Selected references

Barbaresi, A., 2013. “Crawling microblogging services to gather language-classified URLs. Workflow and case study”. In Proceedings of the Annual Meeting of the ACL, Student Research Workshop.
Baroni, M. and Bernardini, S., 2004. “BootCaT: Bootstrapping corpora and terms from the web”. In Proceedings of LREC.
Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E., 2009. “The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora”. Language Resources and Evaluation, 43(3):209–226.
Baykan, E., Henzinger, M., and Weber, I., 2008. “Web Page Language Identification Based on URLs”. Proceedings of the VLDB Endowment, 1(1):176–187.
Borin, L., 2009. “Linguistic diversity in the information society”. In Proceedings of the SALTMIL 2009 Workshop on Information Retrieval and Information Extraction for Less Resourced Languages.
Goldhahn, D., Eckart, T., Quasthoff, U., 2012. “Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages”. In Proceedings of LREC.
Kilgarriff, A., Reddy, S., Pomikalek, J., and Avinesh, PVS, 2010. “A Corpus Factory for Many Languages”. In Proceedings of LREC.
Lui, M., Baldwin, T., 2012. “langid.py: An Off-the-shelf Language Identification Tool”. In Proceedings of the 50th Annual Meeting of the ACL.
Scannell, K. P., 2007. “The Crubadan Project: Corpus building for under-resourced languages”. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, vol. 4.

A one-pass valency-oriented chunker for German

I recently introduced at the LTC’13 conference a tool I developed to help perform fast text analysis on web corpora: a one-pass valency-oriented chunker for German.


“It turns out that topological fields together with chunked phrases provide a solid basis for a robust analysis of German sentence structure.”
E. W. Hinrichs, “Finite-State Parsing of German”, in Inquiries into Words, Constraints and Contexts, A. Arppe and et al. (eds.), Stanford: CSLI Publications, pp. 35–44, 2005.


Non-finite-state parsers provide fine-grained information, but they are computationally demanding, so it can be interesting to see how far a shallow parsing approach is able to go.

The transducer described here consists of a pattern-based matching operation over POS-tags using regular expressions that takes advantage of the characteristics of German grammar. The process aims at finding linguistically relevant phrases with good precision, which in turn enables an estimation of the actual valency of a given verb.

The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency.

This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher: possible applications include simulation of text comprehension on the syntactical level, creation of selective benchmarks and failure analysis.
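The idea of single-pass, regex-based chunking over POS-tags can be sketched as follows. This is an illustration under assumptions, not the tool itself: the pattern below is a deliberately simplified noun-phrase rule over STTS tags (determiner, optional adjectives, noun), whereas the actual patterns described in the paper are far more elaborate.

```python
import re

# Simplified NP rule over STTS tags: optional ART, any number of ADJA, then NN.
# The trailing space enforces tag boundaries (so "NN" cannot match inside "NNE").
NP_PATTERN = re.compile(r'(?:ART )?(?:ADJA )*NN ')

def chunk_nps(tags: list[str]) -> list[str]:
    """Find noun-phrase chunks in one pass over the tag sequence."""
    tag_string = " ".join(tags) + " "
    return [m.group().strip() for m in NP_PATTERN.finditer(tag_string)]

# e.g. "Der kleine Hund bellt" tagged as ART ADJA NN VVFIN
print(chunk_nps(["ART", "ADJA", "NN", "VVFIN"]))  # → ['ART ADJA NN']
```

Joining the tags into a single string lets one regex engine pass do the matching, which is the computational point of the one-pass design: no cascades, no repeated traversals.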


This figure shows a simplified version of the pattern used, for illustration purposes:

Slide from the talk

For more information

A. Barbaresi, “A one-pass valency-oriented chunker for German“, in Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 6th Language & Technology Conference, Zygmunt Vetulani and Hans Uszkoreit (eds.), pp. 157-161, Poznan, 2013.

Article and slides are available here:

A proof of concept is available on GitHub:

Selected references on shallow parsing

Abney, S. P., 1991. “Parsing by chunks”. In Principle-Based Parsing, pp. 257–278.
Barbaresi, A., 2011. “Approximation de la complexité perçue, méthode d’analyse”. In Actes TALN’2011/RECITAL, pp. 229-234.
Hobbs, J. R., Appelt, D., Bear, J., Israel, D., Kameyama, M., Stickel, M., and Tyson, M., 1997. “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text”. Finite-State Language Processing:383–406.
Kermes, H. and Evert, S. 2002. “YAC – A Recursive Chunker for Unrestricted German Text”. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, vol. 5.
Neumann, G., Backofen R., Baur J., Becker M., and Braun C., 1997. “An Information Extraction Core System for Real World German Text Processing”. In Proceedings of the Fifth Conference on Applied Natural Language Processing. Association for Computational Linguistics.
Pereira, F., 1990. “Finite-state approximations of grammars”. In Proceedings of the Annual Meeting of the ACL.
Riloff, E. and Phillips, W., 2004. “An Introduction to the Sundance and AutoSlog Systems”. Technical report, School of Computing, University of Utah.
Schiehlen, M., 2003. “A Cascaded Finite-State Parser for German”. In Proceedings of the 10th conference of the EACL, vol. 2.
Voss, M. J., 2005. “Determining syntactic complexity using very shallow parsing”. Master’s thesis, CASPR, Artificial Intelligence Center, University of Georgia.

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since I found relevant but sparse hints on how to do it.

First of all, let’s say that guessing whether a website uses WordPress by analyzing its HTML code is straightforward if nothing has been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterward is an option that should not be too costly if the number of web pages to analyze is small. However, downloading even a reasonable number of web pages may take a lot of time, which is why other techniques have to be found to address this issue.
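For the record, the HTML-based check mentioned above can be sketched in a few lines. The markers below (generator meta tag, default asset paths) are common WordPress fingerprints I am assuming for illustration; a customized theme can hide all of them.

```python
import re

# Common WordPress fingerprints in raw HTML; not exhaustive, and
# a customized installation may expose none of them.
WP_MARKERS = (
    re.compile(r'<meta[^>]+content="WordPress', re.I),  # generator meta tag
    re.compile(r'/wp-content/', re.I),                  # default asset path
    re.compile(r'/wp-includes/', re.I),                 # core scripts/styles
)

def looks_like_wordpress(html: str) -> bool:
    """Return True if any known WordPress marker appears in the page source."""
    return any(pattern.search(html) for pattern in WP_MARKERS)
```

This is reliable when it fires, but it requires fetching the full page, which is exactly the cost the techniques below are meant to avoid.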

The approach I chose is twofold: the first filter is URL-based, whereas the final selection uses HTTP HEAD requests.

URL Filter

There are webmasters who create a subfolder named “wordpress”, which can be seen clearly in the URL and provides a kind of K.O. victory. If a URL points to a non-text document, the default settings place it in a “wp-content” subfolder, which the URL is then bound to feature, leading to another clear case.

An overview of patterns used by WordPress is available on their website on their Using Permalinks page.

In fact, what WordPress calls “permalinks settings” defines five common URL structures as well as a vocabulary to write a customized one. Here are the so-called “common settings” (which almost every concerned website uses, one way or another):

  • default: ?p= or ?page_id= or ?paged=
  • date: /year/ and/or /month/ and/or /day/ and so on
  • post number: /keyword/number (where keyword is for example “archives”)
  • tag or category: /tag/ or /category/
  • post name: with very long URLs containing a lot of hyphens

The first three patterns yield good results in practice, the only problem with dates being news websites, which tend to use dates very frequently in URLs; in that case the accuracy of the prediction is poor.

The last pattern is used broadly; it does not say much about a website apart from it being prone to feature search engine optimization techniques. Whether one wants to take advantage of it or not mostly depends on one’s objectives with respect to recall and on the characteristics of the URL set: on one hand, whether all possible URLs are to be covered, and on the other hand, whether this pattern seems to be significant. It also depends on how much time one may waste running the second step.

Examples of regular expressions:

  • 20[0-9]{2}/[0-9]{2}/
  • /tag/|/category/|\?cat=
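Combining these expressions into a first-pass filter might look like the following sketch. The function name and the exact set of patterns are my choices for illustration; which patterns to include (notably the noisy date pattern) depends on the recall trade-offs discussed above.

```python
import re

# Permalink patterns from the "common settings" above.
DEFAULT_PATTERN = re.compile(r'\?(?:p|page_id|paged)=')   # default setting
DATE_PATTERN = re.compile(r'20[0-9]{2}/[0-9]{2}/')        # date-based permalinks
TAXONOMY_PATTERN = re.compile(r'/tag/|/category/|\?cat=') # tag or category

def url_filter(url: str) -> bool:
    """First-pass guess: does the URL structure look like WordPress?"""
    return any(p.search(url)
               for p in (DEFAULT_PATTERN, DATE_PATTERN, TAXONOMY_PATTERN))
```

URLs passing this filter are candidates for the second step; URLs failing it can be dropped or kept depending on how much recall matters.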

HEAD requests

The URL analysis relies on the standard configuration, and this step even more so. Customized websites are not easy to detect, and most of the criteria listed here will fail on them.

A few questions found online gave me the right clues to start performing these requests.

HEAD requests are part of the HTTP protocol. Like the most frequent request, GET, which fetches the content, they are supposed to be implemented by every web server. A HEAD request asks for the meta-information written in the response headers without downloading the actual content. That is why no webpage is actually “seen” during the process, which makes it a lot faster.

One or several requests per domain name are sufficient, depending on the desired precision:

  • A request sent to the homepage is bound to yield pingback information to use via the XMLRPC protocol. This is the “X-Pingback” header. Note that if there is a redirect, this header usually points to the “real” domain name and/or path + “xmlrpc.php”
A common extension speeds up page downloads by creating a cached version of every page on the website. This extension adds a “WP-Super-Cache” header to the rest. While the first hint may not be enough to be sure the website is using WordPress, this one does the trick.
    NB: there are webmasters who deliberately give false information, but they are rare.
  • A request sent to “/login” or “/wp-login.php” should yield a HTTP status like 2XX or 3XX, a 401 can also happen.
  • A request sent to “/feed” or “/wp-feed.php” should yield the header “Location”.

The criteria listed above can be used separately or in combination. I chose to use a kind of simple decision tree.

Sending more than one request makes the guess more precise; it also makes it possible to detect redirects and check for the “real” domain name. As this operation sometimes really helps deduplicate a URL list, it is rarely a waste of time.

Last, let’s mention that it is useful to exclude a few common false positives, which can be ruled out using this kind of regular expression:


Review of the Czech internet corpus

Web for “old school” balanced corpus

The Czech internet corpus (Spoustová and Spousta 2012) is a good example of focused web corpora built in order to gather an “old school” balanced corpus encompassing different genres and several text types.

The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience on “web site popularity”. The whole process as well as the target websites are described as follows:

“We have chosen to begin with manually selecting, crawling and cleaning particular web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, young mothers discussion fora etc.).” (p. 311)

Boilerplate removal

The boilerplate removal part is specially crafted for each target; the authors speak of “manually written scripts”. Texts are picked within each website according to their knowledge. Still, as the number of documents remains too high to allow for a completely manual selection, the authors use natural language processing methods to avoid duplicates.


Their workflow includes:

  1. download of the pages,
  2. HTML and boilerplate removal,
  3. near-duplicate removal,
  4. and finally language detection, which does not deal with English text but rather with distinguishing the Czech and Slovak variants.
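The paper does not detail how step 3 is implemented, but near-duplicate removal is commonly done with word-shingling and a Jaccard similarity threshold; the following generic sketch (my assumption, not the authors’ pipeline) illustrates the idea.

```python
def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n])
            for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Overlap of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(doc1: str, doc2: str, threshold: float = 0.8) -> bool:
    """Flag two documents as near-duplicates above a similarity threshold."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```

At corpus scale one would not compare all pairs directly but hash the shingles (e.g. MinHash) to keep the comparison tractable; the threshold is a tuning choice.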

Finally, they divide the corpus into three parts: articles, discussions and blogs. What they do with mixed content is not clear:

“Encouraged by the size, and also by the quality of the texts acquired from the web, we decided to compile the whole corpus only from particular, carefully selected sites, to proceed the cleaning part in the same, sophisticated manner, and to divide the corpus into three parts – articles (from news, magazines etc.), discussions (mainly standalone discussion fora, but also some comments to the articles in acceptable quality) and blogs (also diaries, stories, poetry, user film reviews).” (p. 312)


There are indeed articles and blog posts which, due to long comment threads, are likelier to fall into the discussion category. On so-called “pure players” or “netzines”, the distinction between an article and a blog post is not clear either, because of the content but also for technical reasons related to the publishing software: a content management system like WordPress is very popular among bloggers but also sometimes used to power static websites.

It is interesting to see that “classical” approaches to web texts seem to be valid among the corpus linguistics community, in a shift that could be associated with the “web for corpus” or “corpora from the web” approach.

The workflow replicates steps that are useful for scanned texts, with boilerplate removal somehow replacing OCR corrections. One clear advantage is the availability and quantity of the texts, another is the speed of processing, both are mentioned by the authors who are convinced that their approach can lead to further text collections. A downside is the lack of information about the decisions made during the process, which ought to be encoded as metadata and exported with the corpus, so that the boilerplate removal or the text classification process for example can be evaluated or redesigned using other tools.


Johanka Spoustová and Miroslav Spousta, “A High-Quality Web Corpus of Czech” in Proceedings of LREC, pp. 311-315, 2012.