Posts tagged ‘code snippet’

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a short review of the topic, since the hints I found were relevant but sparse.

First of all, let’s say that guessing whether a website uses WordPress by analysing its HTML code is straightforward if nothing has been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterwards is an option that should not be too costly if the number of web pages to analyze is small. However, downloading even a reasonable number of web pages may take a lot of time, which is why other techniques have to be found to address this issue.

The approach I chose is twofold: the first filter is URL-based, whereas the final selection uses HTTP HEAD requests.

URL Filter

Some webmasters create a subfolder named “wordpress” which is clearly visible in the URL, providing a kind of knockout victory. If the URL points to a non-text document, the default settings store it in a “wp-content” subfolder, which the URL is bound to feature, leading to another clear case.

An overview of the patterns used by WordPress is available on their Using Permalinks page.

In fact, what WordPress calls “permalink settings” defines five common URL structures as well as a vocabulary for writing customized ones. Here are the so-called “common settings” (which almost every such website uses in one way or another):

  • default: ?p= or ?page_id= or ?paged=
  • date: /year/ and/or /month/ and/or /day/ and so on
  • post number: /keyword/number (where keyword is for example “archives”)
  • tag or category: /tag/ or /category/
  • post name: with very long URLs containing a lot of hyphens

The first three patterns yield good results in practice; the only problem with dates is news websites, which tend to use dates very frequently in URLs. In that case the accuracy of the prediction is poor.

The last pattern is used broadly; it does not say much about a website, apart from the fact that it is prone to feature search engine optimization techniques. Whether to take advantage of it mostly depends on one’s objectives with respect to recall and on the characteristics of the URL set: on the one hand, whether all possible URLs are to be covered, and on the other hand, whether this pattern seems significant. It also depends on how much time one is willing to spend on the second step.

Examples of regular expressions:

  • 20[0-9]{2}/[0-9]{2}/
  • /tag/|/category/|\?cat=
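
As a rough illustration, these patterns can be combined in a single bash one-liner. This is only a sketch: url-list.txt and candidates.txt are hypothetical file names, and the pattern merely concatenates the examples above.

grep -E "\?p=|\?page_id=|\?paged=|20[0-9]{2}/[0-9]{2}/|/tag/|/category/|wp-content|/wordpress/" url-list.txt > candidates.txt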

HEAD requests

The URL analysis relies on the standard configuration, and this step does so even more. Customized websites are not easy to detect, and most of the criteria listed here will fail on them.

A few questions on wordpress.stackexchange.com gave me the right clues to start performing these requests.

HEAD requests are part of the HTTP protocol. Like the most frequent request, GET, which fetches the content, they are supposed to be implemented by every web server. A HEAD request asks for the meta-information contained in the response headers without downloading the actual content. That is why no web page is actually “seen” during the process, which makes it a lot faster.

One or several requests per domain name are sufficient, depending on the desired precision:

  • A request sent to the homepage is bound to yield pingback information to be used via the XML-RPC protocol; this is the “X-Pingback” header. Note that if there is a redirect, this header usually points to the “real” domain name and/or path, followed by “xmlrpc.php”.
  • A common extension speeds up page downloads by creating a cached version of every page on the website. This extension adds a “WP-Super-Cache” header to the rest. While the first hint may not be enough to be sure the website is running WordPress, this one does the trick.
    NB: there are webmasters who deliberately send false information, but they are rare.
  • A request sent to “/login” or “/wp-login.php” should yield an HTTP status in the 2XX or 3XX range; a 401 can also happen.
  • A request sent to “/feed” or “/wp-feed.php” should yield a “Location” header.

The criteria listed above can be used separately or in combination. I chose to use a kind of simple decision tree.

Sending more than one request makes the guess more precise; it also makes it possible to detect redirects and check for the “real” domain name. As this operation sometimes really helps deduplicate a URL list, it is rarely a waste of time.
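
To give an idea of what a single check may look like, here is a minimal sketch using curl, assuming the URL to test is passed as the first argument; the header names are the ones mentioned above, everything else (timeout, messages) is arbitrary:

#!/usr/bin/env bash
# Sketch: send a HEAD request and look for WordPress-specific headers.
# -s silent, -I HEAD request, -L follow redirects, --max-time safety timeout
headers=$(curl -s -I -L --max-time 10 "$1")
if echo "$headers" | grep -qiE "x-pingback|wp-super-cache"; then
    echo "$1: probably WordPress"
else
    echo "$1: no conclusive header found"
fi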

Last, let’s mention that it is useful to exclude a few common false positives, which can be ruled out using this kind of regular expression:

\.blogspot\.|\.google\.|\.tumblr\.|\.typepad\.com|\.wp\.com|\.archive\.|akamai|fbcdn|baidu\.com
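
Applied to the hypothetical candidates.txt from the first step, this can be as simple as an inverted grep:

grep -Ev '\.blogspot\.|\.google\.|\.tumblr\.|\.typepad\.com|\.wp\.com|\.archive\.|akamai|fbcdn|baidu\.com' candidates.txt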

Batch file conversion to the same encoding on Linux

I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found to automatically convert all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script to detect and correct the encoding, but it was anything but easy, so I decided to rely on standard UNIX tools instead, assuming they would be adequate and robust enough.

I was not disappointed: file, for instance, gives relevant information when used with this syntax: file -i filename. There are other tools such as enca, but I was luckier with this one.

input:   file -i filename
output: filename: text/plain; charset=utf-8

grep -Po ‘…\K…’

First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. grep is one option for doing this. The -P option unlocks the power of Perl-compatible regular expressions, the -o option ensures that only the match is printed and not the whole line, and the \K tells the engine to keep only what comes after it (here -o and \K work together: -o restricts the output to the match, while \K drops the charset= prefix from that match).

So, in order to select the detected charset name and only it: grep -Po 'charset=\K.+?$'

input:   file -i $filename | grep -Po 'charset=\K.+?$'
output: utf-8

The encoding is stored in a variable as the result of a command-line using a pipe:

encoding=$(file -i "$filename" | grep -Po 'charset=\K.+?$')

if [[ ! $encoding =~ "unknown" ]]

Before the re-encoding can take place, it is necessary to filter out the cases where file could not identify the encoding (this happened to me, but less often than with enca).

The exclamation mark used in an if context transforms it into an unless statement, and the =~ operator attempts to find a match in a string (here in the variable).

iconv

Once everything has been cleared, one may proceed to the conversion proper, using iconv. The encoding of the input file is specified using the -f switch, while -t gives the target encoding (here UTF-8).

iconv -f $encoding -t UTF-8 < $filename > $destfile

Note that UTF-8//TRANSLIT may also be used if there are too many errors, which should not be the case with a UTF-8 target but is necessary when converting to ASCII, for example.
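
For instance, here is a sketch of a conversion to ASCII with transliteration (input.txt and output.txt are placeholder names):

iconv -f UTF-8 -t ASCII//TRANSLIT < input.txt > output.txt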

Last, there might be cases where the encoding could not be retrieved at all; that is what the else clause is for, e.g. a per-byte re-encoding…

The whole trick

Here is the whole script:


#!/usr/bin/env bash

for filename in dir/*        # 'dir' should be changed...
do
    encoding=$(file -i "$filename" | grep -Po 'charset=\K.+?$')
    destfile="dir2/$(basename "$filename")"        # 'dir2' should also be changed...
    if [[ ! $encoding =~ "unknown" ]]
    then
        iconv -f "$encoding" -t UTF-8 < "$filename" > "$destfile"
    else
        # do something like using a conversion table to address targeted problems
        echo "could not detect encoding: $filename" >&2
    fi
done

This bash script proved efficient and enabled me to homogenize my corpus. It runs quite fast and saves time, since one only has to focus on the problematic cases (which have to be addressed anyway); the rest is taken care of.

Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of them, with a focus on linear regression.

Disclaimer: the code lines below work, but I would not claim that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, they are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the examples are from the Titanic track on Kaggle.

Generalized linear models


titanic.glm <- glm(survived ~ pclass + sex + age + sibsp, data = titanic, family = binomial(link = "logit"))
glm.pred <- predict.glm(titanic.glm, newdata = titanicnew, na.action = na.pass, type = "response")
cat(glm.pred)

  • 'cat' actually prints the output
  • One might want to use the na.action switch to deal with incomplete data (as in the Titanic dataset): na.action=na.pass

Link to glm in R manual.

Mixed GAM (Generalized Additive Models) Computation Vehicle

The commands are a little less obvious:

library(mgcv)
titanic.gam <- gam(survived ~ pclass + sex + age + sibsp, data = titanic, family = quasibinomial(link = "logit"), method = "GCV.Cp")
gam.pred <- predict.gam(titanic.gam, newdata = titanicnew, na.action = na.pass, type = "response")
gam.pred <- ifelse(gam.pred <= 0.5, 0, 1)

  • Quasibinomial, Poisson, Gamma and Gaussian are usually usable alternatives with both libraries.
  • The GCV.Cp method selects the smoothing parameters by generalized cross-validation; others are available.
  • The ifelse threshold is not necessarily 0.5. Moreover, this kind of conversion is not always required.

Link to mgcv package manual.

Robust statistics


library(robustbase)
titanic.glm <- glmrob(survived ~ pclass + sex + age + sibsp + parch, data = titanic, family = binomial)

  • use predict.glmrob for prediction.

Link to robustbase package manual.

Partition trees

I mentioned this part in my last post about R.

library(rpart)
titanic.tree <- rpart(survived ~ pclass + sex + age + sibsp, data = titanic, method="anova")
tree.pred <- predict(titanic.tree, newdata = titanicnew)

Link to rpart package manual.

Support vector machines

There are many other implementations available.

library(e1071)
titanic.svm <- svm(formula = survived ~ pclass + sex + age + sibsp, data = titanic, gamma = 10^-1, cost = 10^-1)
pred.svm <- predict(titanic.svm, newdata = titanicnew)

  • optional: decision.values = TRUE
  • A command to automatically tune the parameters (certainly not the most accurate):
    tuned <- tune.svm(survived ~ pclass + sex + age + sibsp, data = titanic, gamma = 10^(-5:5), cost = 10^(-2:2), kernel = "polynomial")


Link to e1071 package manual.

Bagging (in this case bagging of a tree)


library(ipred)
titanic.bt <- bagging(survived ~ pclass + sex + age + sibsp, data = titanic, nbagg = 20, coob = TRUE)
# bagging() is a formula-interface wrapper around ipredbagg()
bt.pred <- predict(titanic.bt, type = "class")

  • Sampling: nbagg bootstrap samples are drawn and a tree is constructed for each of them.
  • coob: an out-of-bag estimate of the error rate is computed.

Link to ipred package manual.

Random forests


library(randomForest)
titanic.rf <- randomForest(survived ~ pclass + sex + age + sibsp, data = titanic, importance=T, na.action=NULL)
rf.pred <- predict(titanic.rf, newdata = titanicnew, type = "response")

  • The mtry and ntree options can be useful here.

Link to randomForest package manual.

Data analysis and modeling in R: a crash course

Let’s pretend you recently installed R (software for statistical computing), you have a text collection you would like to analyze or classify, and some time to spare. Here are a few quick commands that could get you a little further. I also wrote this kind of cheat sheet to remember a set of useful tricks and packages I recently gathered and which I thought could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each letter from a to z (all lowercase). By the way, extracting them is as simple as this using Perl and regular expressions (provided you have a $text variable):

my @letters = ("a" .. "z");
foreach my $letter (@letters) {
    # count the occurrences of the current letter (case-insensitive)
    my $letter_count = () = $text =~ /$letter/gi;
    # print the relative frequency in percent, tab-separated
    printf "%.3f\t", (($letter_count/length($text))*100);
}
print "\n";

 

First tests in R

After starting R (with the ‘R’ command), one usually wants to import data. In this case, my file is a TSV (Tab-Separated Values) file and the first row contains only the column names (from ‘a’ to ‘z’), which comes in handy later. The import is done using the read.table command.

alpha <- read.table("letters_frequency.tsv", header=TRUE, sep="\t")

Then, after examining a glimpse of the dataset (summary(alpha)), for instance to check that everything looks fine, one may compute and visualize a correlation matrix:

correlation_matrix <- cor(alpha)
install.packages("lattice") # if it is not already there
library(lattice)
levelplot(correlation_matrix)

Here is the result. It works out of the box even if you did not label your data; I will use the "choice" column later. The colors do not fully match my taste, but this is a good starting point.

Correlation matrix (or 'heat map') of letter frequencies in a collection of texts written in English.

As far as I am concerned, there are no strong conclusions to draw from this visualization, apart from the fact that the letter frequencies are not distributed randomly. Prettier graphs can be made; see for instance this thread.

 

Models and validation: a few guidelines

Now, let's try to build a model that fits the data. To do this you need a reference, in this case the choice column, which is a binary choice: '1' or 'yes' and '0' or 'no', whether annotated or determined randomly. You may also try to model the frequency of the letter 'a' using the rest of the alphabet.

You could start from a hypothesis regarding the letter frequency, like "the frequency of the letter 'e' is of paramount importance to predict the frequency of the letter 'a'". You may also play around with the functions in R, come up with something interesting and call it 'data-driven analysis'.

One way or another, what you have to do seems simple: find a baseline to compare your results to, find a model and optimize it, then compare and evaluate the results. R is highly optimized for these tasks, and computation time should not be an issue here. Nevertheless, there are a lot of different methods available in R, and choosing one that suits your needs may require a considerable amount of time.

Example 1: Regression Analysis

Going a step further, I will start with regression analysis, more precisely with linear regression, i.e. lm(), since it may be easier to understand what the software does.

Other models of this family include the generalized linear model, glm(), and the generalized additive model, gam() with the mgcv package. Their operation is similar.

The following commands define a baseline, select all the features, print a summary and show a few plots:

alpha.lm.baseline <- lm(choice ~ 1 , data = alpha)
alpha.lm <- lm(choice ~ . , data = alpha)
summary(alpha.lm)
plot(alpha.lm)

'.' means 'every other column', another possibility is 'column_a + column_b + column_d' etc.

A plot of model coverage (or 'fit') and so-called residuals.

In order to determine whether the components (features) of a model are relevant (in regression analysis), the commands exp(), coef() and confint() may be interesting, as they give hints on whether a given model can be expected to generalize well or not.

Selecting features

You may also be able to do more (or at least as much as you already do) with less, by deleting features that bear no statistical relevance. There are a lot of ways to select features, all the more so since this issue links to a whole field named dimensionality reduction.

Fortunately, R has a step() function, which gradually reduces the number of features and stops when no further improvement (by the AIC criterion) can be reached. However, fully automatic processing is not the best way to find a core of interesting features; one usually has to assess which framework to work with depending on the situation.

Frequently mentioned techniques include the lasso, ridge regression, leaps and bounds, and selection using the caret package. Please note that this list is not exhaustive at all...

To get a glimpse of another framework (same principles but another scientific tradition), I think principal components analysis (using princomp()) is worth a look.

This accessible paper presents different issues related to this topic and summarizes several techniques addressing it.

Evaluating the results

In many scientific fields, cross-validation is an acknowledged evaluation method. There are a lot of slightly different ways to perform such a validation; see for instance this compendium.

I will just describe two easy steps suited to the case of a binary choice. First, drawing a prediction table usually works, even if it is not the shortest way, and it shows what evaluating means. Please note that this example is flawed: on one hand the model is only evaluated once, and on the other hand the evaluation is performed on its own training data:

alpha.pred <- predict.lm(alpha.lm)
alpha.pred <- ifelse(alpha.pred <= 0.5, 0, 1)
alpha.tab <- table(alpha.pred, alpha$choice, dnn=c("predicted","observed"))
sum(diag(alpha.tab))/sum(alpha.tab)

The DAAG package (Data Analysis And Graphics) provides quicker ways to estimate results AND to perform a cross-validation on a few kinds of linear models with binary outcomes, for example with this command:

cv.binary(alpha.lm)

Example 2: Decision Trees

Decision trees are another way to discover more about data. By the way, the package used in this example, rpart (recursive partitioning and regression trees) also supports regression algorithms.

library(rpart)
alpha.tree <- rpart(choice ~ . , data = alpha, method="anova")
plotcp(alpha.tree)

Because the anova method was applied, this is actually a regression tree. The "control" option should be used to prevent the tree from growing unnecessarily deep or broad.

The printcp() command shows the variables actually used in tree construction, which may be interesting. Calling summary() can output a lot of text, as the whole tree is detailed node by node. This is easy to understand, since the algorithm takes decisions based on statistical regularities; in this case, letter frequencies are used to split the input into many scenarios (the terminal nodes of the tree), such as "if 'u' is under a certain limit and 'd' above another, then we look at the frequency of 'f' and 'j' and take another decision", etc.


> plot(alpha.tree)
> text(alpha.tree)

All the decisions are accessible; this is called a 'white box' model. Nonetheless, while decision tree learning has advantages, it also has limitations.

 


Find and delete LaTeX temporary files

This morning I was looking for a way to delete the dispensable aux, bbl, blg, log, out and toc files that a pdflatex compilation generates. I wanted it to go through directories recursively so that it would also find and delete old files. I also wanted to run it from the command line and to integrate it into a bash script (a sketch of this is given at the end of this post).

As I didn’t find this bash snippet as such, i.e. adapted to LaTeX-generated files, I post it here:

find . -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;

This works on Unix, probably on Mac OS and perhaps on Windows if you have Cygwin installed.

Remarks

  • find goes through all the directories starting from the current one (.); it could also go through absolutely all directories (/) or search your desktop, for instance (something like $HOME/Desktop/).
  • The regular expression captures files ending with the (expandable) given series of letters, but also files with no extension that end with one of them (like test-aux).
    If you want it to stick to file extensions, you may prefer this variant:
    find . \( -name "*.aux" -or -name "*.bbl" -or -name "*.blg" ... \)
  • The second part actually removes the files that match the expression. If you are not sure, keep the -i option: it will prompt you before each deletion (answer with “y” or “n”).
    If you think you know what you are doing, you can keep track of the deletions by using the -v (verbose) option instead.
    If you want it to be quiet, use the -f (force) option at your own risk.
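
To integrate this into a bash script as mentioned above, a minimal sketch could take an optional target directory as its only argument (the default '.' is an assumption):

#!/usr/bin/env bash
# delete LaTeX temporary files under the given directory (default: current one)
dir="${1:-.}"
find "$dir" -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;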