Bits of Language: corpus linguistics, NLP and text analytics

“Googleology is bad science”: Anatomy of a web corpus infrastructure

This post discusses a seminal article on corpus linguistics by Adam Kilgarriff. It shows which challenges arise when dealing with web corpora and how a corresponding infrastructure can be developed.

more ...

Replicating the BootCat method to build web corpora from search engines

This post describes an easy and modern way to gather web sources using search engines by adapting the BootCat method, whose positive and negative aspects are discussed.

more ...

How to make language detection with langid.py faster

The language detector langid.py has become quite popular. Using the modernized fork py3langid as an example I show how to maintain and optimize a Python package.

more ...

How to download web pages in parallel and follow politeness rules in Python

Optimizing downloads is crucial to gather data from a series of websites. However, one should respect “politeness” rules. Here is a simple way keep an eye on all these constraints as once.

more ...

Web scraping with Trafilatura just got faster

HTML to text extraction just got faster with the dedicated Trafilatura software as measured on the benchmark available on the repository. These follows from from two major changes in the package dependencies charset_normalizer and jusText.

more ...

An easy way to save time and resources: content-aware URL filtering

Avoid wasting bandwidth capacity and processing time for webpages which are probably not worth the effort. Stay away from pages with little text in the target language or focus on other pages to gather links.

more ...

Web scraping with R: Text and metadata extraction

Why choose between R and Python when you can have both? This tutorial shows how to install a Python scraper and use it for content discovery and text extraction, all straight from R.

more ...

Using a rule-based tokenizer for German

Tokenization is a text segmentation process whose objective resides in dividing written text into meaningful units. This post introduces two rule-based methods to perform tokenization on German, English and beyond.

more ...

Using RSS and Atom feeds to collect web pages with Python

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to optimize link discovery and document filtering.

more ...

A simple multilingual lemmatizer for Python

Grouping together the inflected forms of a word allows for analyzing them as a single item, identified by the dictionary form. The Python library Simplemma provides a simple and multilingual approach to look for base forms or lemmata.

more ...