Posts tagged ‘bash’

Batch file conversion to the same encoding on Linux

I recently had to deal with a corpus containing files in different encodings, and I would like to share the solution I found to automatically convert all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script to detect and correct the encoding myself, but it was anything but easy, so I decided to rely on standard UNIX tools instead, assuming they would be adequate and robust enough.

I was not disappointed: file, for example, gives relevant information when used with the syntax file -i filename. There are other tools such as enca, but I had better luck with this one.

input:   file -i filename
output: filename: text/plain; charset=utf-8

grep -Po ‘…\K…’

If everything goes well, one gets an answer of the kind filename: text/plain; charset=utf-8, which then has to be parsed; grep is one option for this. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match is printed and not the whole line, and the \K tells the engine to keep only what comes after it (so -o restricts the output to the match, while \K excludes 'charset=' from the match itself).

So, in order to select the detected charset name and only it: grep -Po 'charset=\K.+?$'

input:   file -i $filename | grep -Po 'charset=\K.+?$'
output: utf-8

The encoding is stored in a variable as the result of a command-line using a pipe:

encoding=$(file -i $filename | grep -Po 'charset=\K.+?$')

if [[ ! $encoding =~ "unknown" ]]

Before the re-encoding can take place, it is necessary to filter out the cases where file could not identify the encoding (this happened to me, but to a lesser extent than with enca).

The exclamation mark negates the test, effectively turning the if into an 'unless' statement, and the =~ operator checks whether the string on the left (here the content of the variable) matches the expression on the right; since the right-hand side is quoted, it is matched literally, i.e. as a substring.
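
As a quick illustration (not from the original post), here is how this test behaves on a value such as unknown-8bit, which file may report when it cannot identify the encoding:

encoding="unknown-8bit"
if [[ ! $encoding =~ "unknown" ]]
then
    echo "encoding detected: $encoding"
else
    echo "encoding could not be identified"    # this branch is taken here
fi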

iconv

Once everything has been cleared, one may proceed to the proper conversion using iconv. The source encoding is specified with the -f switch, and -t sets the target encoding (here UTF-8).

iconv -f $encoding -t UTF-8 < $filename > $destfile

Note that UTF-8//TRANSLIT may also be used if there are too many conversion errors, which should not happen with a UTF-8 target but becomes necessary when converting to ASCII, for example.
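
As a small illustration (not part of the original post; the exact result depends on the iconv implementation and the locale), transliteration towards ASCII replaces accented characters with approximate equivalents:

echo "déjà vu" | iconv -f UTF-8 -t ASCII//TRANSLIT
# typically prints: deja vu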

Finally, there might be cases where the encoding could not be identified at all; that is what the else clause is for, e.g. a targeted per-byte re-encoding…
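
A minimal sketch of such a fallback, assuming GNU sed and files polluted by stray Latin-1 bytes (the small byte-to-character conversion table is purely illustrative and has to be adapted to the corpus at hand):

# treat the file as raw bytes (LC_ALL=C) and map a few known stray Latin-1
# bytes to their UTF-8 counterparts; extend the table as needed
LC_ALL=C sed -e 's/\xe9/é/g' -e 's/\xe8/è/g' -e 's/\xe0/à/g' "$filename" > "$destfile"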

The whole trick

Here is the whole script:


#!/bin/bash

for filename in dir/*        # 'dir' should be changed...
do
    encoding=$(file -i "$filename" | grep -Po 'charset=\K.+?$')
    destfile="dir2/$(basename "$filename")"        # 'dir2' should also be changed...
    if [[ ! $encoding =~ "unknown" ]]
    then
        iconv -f "$encoding" -t UTF-8 < "$filename" > "$destfile"
    else
        # do something like using a conversion table to address targeted problems
        :    # no-op placeholder, as bash does not accept an empty else branch
    fi
done

This bash script proved efficient and enabled me to homogenize my corpus. It runs quite fast and saves time, since one only has to focus on the problematic cases (which ought to be addressed anyway); the rest is taken care of automatically.

Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. At least for the last two, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.

Hypotheses:

  1. These platforms account for a relative diversity of user profiles.
  2. Documents that are most likely to be important are being shared.
  3. It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
  4. Microblogging services are a good alternative to overcome the limitations of seed URL collections (as well as the biases implied by search engine optimization techniques and link classification).

Characteristics so far:

  • The messages themselves are not being stored (links are filtered on the fly using a series of heuristics).
  • The URLs that are obviously pointing to media documents are discarded, as the final purpose is to be able to build a text corpus.
  • This approach is ‘static’: it does not rely on any long-poll requests, it actively fetches the required pages.
  • Starting from the main public timeline, the scripts aim at finding interesting users or friends of users.

Regarding the first three points, the scripts are just a few tweaks away from delivering this kind of content; feel free to contact me if you want them to suit your needs. Other interests include microtext corpus building and analysis, social network sampling, and network visualization, but they are not my priority right now.

The scripts for FriendFeed and Reddit are written in Python, while those for identi.ca are written in Perl. Bash scripts automate series of executions, for instance as cron jobs.

FriendFeed

FriendFeed seems to be the most active of the three microblogging services considered. It works as an aggregator, which makes it interesting.

No explicit API limits are enforced, but sending too many requests quickly leads to non-responding servers.

Among the options I developed, I would like to highlight the so-called ‘smart deep crawl’, which targets interesting users and friends, i.e. those through whom a significant number of relevant URLs has been found or is expected to be found.

identi.ca (public timeline closed in Feb. 2013)

identi.ca is built on open-source tools and open standards, which is why I chose to crawl it first. The Microblog Explorer made it possible to gather both external and internal links. Advantages included the CC license of the messages, the absence of API limitations (to my knowledge), and the relatively small number of messages (which can also be a problem).

The Microblog Explorer featured an hourly crawl, which scanned a few pages of the public timeline, and a long-distance miner, which fetched a given list of users and analyzed them one by one.

Reddit

There are 15 target languages available so far: Croatian, Czech, Danish, Finnish, French, German, Hindi, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish and Swedish.

Target languages are defined using subreddits (via so-called ‘multi-reddit expressions’). Here is an example to target possibly Norwegian users: http://www.reddit.com/r/norge+oslo+norskenyheter
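
As an illustration of what such a request can look like (a rough sketch, not the project's actual code; it assumes curl and jq are installed and that the listing endpoint answers without authentication):

# fetch the newest posts of the multi-reddit above as JSON and print the submitted URLs
curl -sL -A "microblog-explorer-sketch" "http://www.reddit.com/r/norge+oslo+norskenyheter/new/.json?limit=100" |
jq -r '.data.children[].data.url'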

Sadly, due to API limitations it is currently not possible to go back further than 500 posts. Experience shows that user traversals as well as weekly crawls help to address this issue by accumulating a small but significant number of URLs over time.

Filters

The two main problems I tried to address are spam and the large number of URLs linking to web pages in English: the networks analyzed here tend to be dominated by English-speaking users and spammers.

The URL harvesting works as follows: during a social network traversal, obvious spam and URLs leading to non-text documents are filtered out; in some cases, the short message itself is then analyzed by a spell checker to estimate whether it could be English text; optionally, user IDs are recorded for later crawls.

Using a spell checker (enchant and its Python bindings), the scripts apply thresholds (expressed as a percentage of tokens that do not pass the spell check) to discriminate between links whose titles are mostly English and the others, which are thus expected to be in another language. This operation often cuts the number of microtexts in half and makes it possible to select particular users. Tests show that the probability of finding URLs leading to English text is indeed much higher for the lists considered ‘suspicious’. This option can be deactivated.
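
The following lines give a rough idea of the principle with command-line tools only (my own sketch: the project itself relies on the enchant library from Python, and the 50% threshold here is an arbitrary placeholder):

# count the share of tokens an English spell checker does not recognize
# and keep the title when that share is high, i.e. probably not English
title="Dette er en norsk overskrift"
total=$(echo "$title" | wc -w)
unknown=$(echo "$title" | aspell --lang=en list | wc -l)
if [ "$total" -gt 0 ] && [ $((100 * unknown / total)) -ge 50 ]
then
    echo "probably not English, keep this link"
fi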

This approach could be applied to other languages as well, but I have not tried it so far. There is no language filtering on identi.ca, as the number of URLs remaining after the spam filter is small enough to gather them all.

Technology-minded users account for numerous short messages which over-represent their own interests and hobbies, and there is not much to do (or to filter) about it…

References

The code is available on GitHub: https://github.com/adbar/microblog-explorer

Please use this paper if you need a reference for the Microblog Explorer; I will soon write a post to introduce it as well:

  • A. Barbaresi, "Crawling microblogging services to gather language-classified URLs. Workflow and case study," in 51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop, Sofia, Bulgaria, 2013, pp. 9-15.
    @InProceedings{Barbaresi:2013,
      author = {Barbaresi, Adrien},
      title = {{Crawling microblogging services to gather language-classified URLs. Workflow and case study}},
      booktitle = {51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop},
      month = {August},
      year = {2013},
      address = {Sofia, Bulgaria},
      publisher = {Association for Computational Linguistics},
      pages = {9--15},
      url = {http://www.aclweb.org/anthology/P13-3002}
    }

Find and delete LaTeX temporary files

This morning I was looking for a way to delete the dispensable aux, bbl, blg, log, out and toc files that a pdflatex compilation generates. I wanted it to go through directories so that it would eventually find old files and delete them too. I also wanted to do it from the command-line interface and to integrate it within a bash script.

As I didn't find this bash snippet as such, i.e. adapted to LaTeX-generated files, I post it here:

find . -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;

This works on Unix, probably on Mac OS and perhaps on Windows if you have Cygwin installed.
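
Since the point was also to integrate it within a bash script, here is a minimal wrapper (my own sketch, not part of the original snippet) that takes the starting directory as an optional argument:

#!/bin/bash
# delete LaTeX temporary files below the given directory (default: the current one)
dir="${1:-.}"
find "$dir" -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;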

Remarks

  • find goes through all the directories starting from where you are (.); it could also go through absolutely all directories (/) or just search your Desktop, for instance (something like $HOME/Desktop/).
  • The regular expression captures files ending with the given (expandable) series of letters, but also files without an extension whose names end that way (like test-aux).
    If you want it to stick to file extensions, you may prefer this variant:
    find . \( -name "*.aux" -or -name "*.bbl" -or -name "*.blg" ... \)
  • The second part actually removes the files that match the expression. If you are not sure, keep the -i option: it prompts you before each deletion (answer with “y” or “n”).
    If you think you know what you are doing, you can keep track of the deletions with the -v (verbose) option instead.
    If you want it to be quiet, use the -f (force) option at your own risk.

A fast bash pipe for TreeTagger

I have been working with TreeTagger, the part-of-speech tagger developed at the IMS in Stuttgart, since my master's thesis. It performs very well on German texts, as one could easily suppose, since that was one of its primary purposes.
One major problem is that it is poorly documented, so I would like to share the way I found to pass text to TreeTagger through a pipe.

The first thing to know is that TreeTagger does not take Unicode strings, as it dates back to the nineties. So you have to convert whatever you pass to it to ISO-8859-1, which iconv with the //TRANSLIT option does very well; this option roughly means "find an approximate equivalent if the character cannot be converted exactly".

Then you have to define the options that you want to use. I put the most frequent ones in the example.

Benefits

The advantage of a pipe is that you can clean the text while passing it to the tagger. Here is one way of doing it, using the stream editor sed to: 1. remove empty lines, 2. collapse every run of whitespace into a single space, and 3. replace the remaining spaces with newlines.

This way TreeTagger gets one word per line, as required, which is very convenient I think.
Starting from a text file, you get one token per line, together with its tag and its lemma (given the options below).
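
To check the cleaning step alone, one can run the sed part on a sample sentence (my own example; note that punctuation stays attached to the preceding word):

echo "Das ist  ein Test." | sed -e '/^$/d' -e 's/\s\+/ /g' -e 's/ /\n/g'
# prints one token per line: Das / ist / ein / Test.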

Here is the code that I use:


#!/bin/bash
INPUT=~/file.txt
TAGGER=~/something/TreeTagger/bin/tree-tagger
OPTIONS="-token -lemma"        # deliberately left unquoted below so that both options are passed
PARMFILE=~/something/TreeTagger/lib/german.par

# delete empty lines, collapse whitespace runs, put one token per line,
# convert to Latin-1, tag, then convert the result back to UTF-8
< "$INPUT" sed -e '/^$/d' -e 's/\s\+/ /g' -e 's/ /\n/g' |
iconv --from-code=UTF-8 --to-code=ISO-8859-1//TRANSLIT |
"$TAGGER" $OPTIONS "$PARMFILE" |
iconv --from-code=ISO-8859-1 --to-code=UTF-8//TRANSLIT    # ... > redirect the output or process it further

That's all! Please let me know if these lines prove useful.

Update: You can also use the UTF-8-encoded model for German directly; it was apparently trained on the same texts and should behave the same way (although I have no evidence for that).
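
In that case the iconv steps become unnecessary, and the pipe boils down to something like the following hypothetical variant (the name of the UTF-8 parameter file is an assumption and should be checked against the actual TreeTagger installation):

< "$INPUT" sed -e '/^$/d' -e 's/\s\+/ /g' -e 's/ /\n/g' |
"$TAGGER" $OPTIONS ~/something/TreeTagger/lib/german-utf8.par    # parameter file name is a guess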