Objective: Learn the basics of batch processing in GNU/Linux
In the previous section, we saw how to handle streams and text. We can use this knowledge to generate lists of commands instead of text. This is called batch processing.
In everyday life, you may want to run commands sequentially without using pipes.
To run CMD1 and then run CMD2, you can use the ; operator:
CMD1 ; CMD2
To run CMD1 and then run CMD2 only if CMD1 didn’t throw an error, you can use the && operator, which is safer than the ; operator:
CMD1 && CMD2
You can also use the || operator to manage errors and run CMD2 only if CMD1 failed:
CMD1 || CMD2
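The three operators can be compared side by side. The sketch below is only an illustration: it uses the standard true and false commands (which always succeed and always fail) as stand-ins for CMD1, and the function name chain_demo is made up for this example.

```shell
# `true` always succeeds and `false` always fails, so they stand in for CMD1.
chain_demo() {
  false ; echo "semicolon: runs even though the first command failed"
  true && echo "and: runs because the first command succeeded"
  false && echo "and: never printed"
  true || echo "or: never printed"
  false || echo "or: runs because the first command failed"
}

# Capture what the function printed, then display it.
chain_output=$(chain_demo)
echo "$chain_output"
```

Only three of the five echo commands should print something: the one after ; and the ones whose condition on CMD1 is met.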
The easiest option to execute a list of commands is to use xargs. xargs reads arguments from stdin and uses them as arguments for a command. In UNIX systems, the command echo sends a string of characters to stdout. We are going to use this command to learn more about xargs.
echo "hello world"
In general, a string of characters is told apart from a command by placing it between quotes.
The two following commands are equivalent. Why?
echo "file1 file2 file3" | xargs touch
touch file1 file2 file3
You can display the command executed by xargs with the switch -t.
By default, the number of arguments sent by xargs is defined by the system. You can change it with the option -n N, where N is the number of arguments sent to each call of the command. Use the options -t and -n to run the previous command as 3 separate touch commands.
echo "file1 file2 file3" | xargs -t -n 1 touch
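Sometimes the arguments are not separated by spaces but by other characters. You can use the -d option (available in GNU xargs) to specify the separator. Execute touch once from the following command. The -n switch of echo removes the trailing newline, which -d would otherwise keep as part of the last file name:
echo -n "file1;file2;file3"
echo -n "file1;file2;file3" | xargs -t -d \; touch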
To reuse the arguments sent to xargs, you can use the option -I, which defines a placeholder string that is replaced by the argument. Try the following command. What does the manual say about the -c option of the command cut?
ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_%
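The effect of -n and -I can be observed without creating any file by letting xargs call echo. This is a minimal sketch; the prefix "got:" and the variable names are arbitrary choices for the illustration.

```shell
# Without options, xargs passes all three arguments to a single echo call.
one_call=$(echo "a b c" | xargs echo "got:")

# With -n 1, xargs passes one argument per call, so echo runs three times.
three_calls=$(echo "a b c" | xargs -n 1 echo "got:")

# With -I %, each input line replaces every % found in the command.
renamed=$(printf 'a\nb\n' | xargs -I % echo "% -> link_%")

echo "$one_call"
echo "$three_calls"
echo "$renamed"
```

Note that -I makes xargs work line by line: each input line becomes one argument and triggers one call of the command.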
Instead of ls, xargs is often used with the command find, a powerful command to search for files.
Modify the following command to make a non-hidden copy of all the files with a name starting with .bash in your home folder:
find . -name ".bash*" | sed 's|./.||g'
find . -name ".bash*" | sed 's|./.||g' | xargs -t -I % cp .% %
You can try to remove all the files in the /tmp folder with the following command:
find /tmp/ -type f | xargs -t rm
Modify this command to remove every folder in the /tmp folder.
find /tmp/ -type d | xargs -t rm -R
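To experiment with find and xargs without touching your home folder or /tmp, the same pipelines can be run inside a throwaway directory. The sketch below assumes mktemp is available; the file names .bash_a, .bash_b and other are invented for the example, and the sed expression is an anchored variant of the one used above.

```shell
# Work in a throwaway directory so that nothing important is touched.
workdir=$(mktemp -d)
touch "$workdir/.bash_a" "$workdir/.bash_b" "$workdir/other"

# find lists the hidden files, sed strips the leading "./." from each name,
# and xargs copies every hidden file to a visible file of the same name.
(cd "$workdir" && find . -name ".bash*" | sed 's|^\./\.||' | xargs -I % cp .% %)

# The * glob skips hidden files, so this lists only the visible ones.
visible=$(cd "$workdir" && echo *)
echo "$visible"

# Remove every regular file (hidden or not), then count what is left.
find "$workdir" -type f | xargs rm
remaining=$(find "$workdir" -type f | wc -l)
rmdir "$workdir"
```

File names containing spaces or newlines would be split by xargs here; find -print0 combined with xargs -0 is the usual way around that.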
awk commands
xargs is a simple solution for writing batch commands, but if you want to write more complex commands you are going to need to learn awk. awk is a programming language in itself, but you don’t need to know everything about awk to use it.
You can think of awk as an xargs -I $N command where $1 corresponds to the first column, $2 to the second column, etc.
There are also some predefined variables that you can use, such as:
$0 corresponds to all the columns
FS is the field separator used
NF is the number of fields separated by FS
NR is the number of records already read
An awk program is a chain of commands with the form motif { action }, where:
motif (the pattern) defines where the action is executed
action is what you want to do
The motif can be:
BEGIN or END (before reading the first line, and after reading the last line)
a comparison using <, <=, ==, >=, > or !=
a combination using && (AND), || (OR) and ! (negation)
a range of the form motif_1,motif_2
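The motif { action } form and the predefined variables can be tried on a few lines of text. The three-column table below (name, chromosome, length) is invented for this sketch, and the program is only one possible illustration of BEGIN, END, a comparison and a combination with &&.

```shell
# A small space-separated table, fed to awk on stdin.
table='gene1 chr1 120
gene2 chr2 80
gene3 chr1 200'

# BEGIN runs before the first line and END after the last one.
# The middle rule combines two comparisons with && and prints the record
# number (NR), the first column ($1) and the number of fields (NF).
report=$(echo "$table" | awk '
  BEGIN { print "long genes on chr1:" }
  $3 > 100 && $2 == "chr1" { print NR, $1, NF }
  END { print NR " records read" }
')
echo "$report"
```

In the END rule, NR still holds the number of the last record read, which is why it gives the total number of lines.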
With awk you can:
Count the number of lines in a file
awk '{ print NR " : " $0 }' file
Modify this command to only display the total number of lines with awk (like wc -l)
awk 'END{ print NR }' file
Convert a tabulated sequences file into fasta format
awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa
Modify this command to only get a list of sequence names in a fasta file
awk -vOFS='' '{print $1 "\n";}' two_column_sample_tab.txt > seq_name.txt
Convert a multiline fasta file into a single line fasta file
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample.fa > sample1_singleline.fa
Convert fasta sequences to uppercase
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' file.fasta > file_upper.fasta
Modify this command to only get a list of sequence names in a fasta file in lowercase
awk '/^>/ {print(tolower($0))}' file.fasta > seq_name_lower.txt
Return a list of sequence_id sequence_length from a fasta file
awk '/^>/ {if (sequence_id != "") {print(substr(sequence_id, 2)" "sequence_length)}; sequence_length = 0; sequence_id = $0}; /^[^>]/ {sequence_length += length($0)}; END {if (sequence_id != "") {print(substr(sequence_id, 2)" "sequence_length)}}' file.fasta
Count the number of bases in a fastq.gz file
gzip -dc file.fastq.gz | awk 'NR%4 == 2 {basenumber += length($0)} END {print basenumber}'
Keep only the reads of at least 20 bp from a fastq file
awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 20){print header, seq, qheader, qseq}}' < input.fastq > output.fastq
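These recipes are easier to follow on a file small enough to check by hand. The sketch below builds a hypothetical two-sequence fasta file (seq1 and seq2 are invented names) and applies a spaced-out version of the sequence length recipe, together with a one-rule program listing the sequence names.

```shell
# Build a tiny fasta file: seq1 is split over two lines, seq2 has one line.
workdir=$(mktemp -d)
printf '>seq1\nacgt\nac\n>seq2\nggg\n' > "$workdir/file.fasta"

# On each header line, print the previous sequence (if any) and reset the
# counter; on every other line, add the line length to the counter.
lengths=$(awk '
  /^>/ { if (id != "") print id, len; id = substr($0, 2); len = 0 }
  /^[^>]/ { len += length($0) }
  END { if (id != "") print id, len }
' "$workdir/file.fasta")

# Header lines start with ">"; substr drops that first character.
names=$(awk '/^>/ { print substr($0, 2) }' "$workdir/file.fasta")

echo "$lengths"
echo "$names"
rm -r "$workdir"
```

The test on id avoids printing an empty record when the first header is read, since no sequence has been seen yet at that point.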
When you start writing complicated commands, you may want to save them to reuse them later.
You can find everything that you typed in bash in the ~/.bash_history file, but working with this file can be tedious as it also contains all the commands that you mistyped. A good solution for reproducibility is to write bash scripts. A bash script is simply a text file that contains a sequence of bash commands.
As you use bash in your terminal, you can execute a bash script with the following command:
source myscript.sh
It’s usual to write the .sh extension for shell scripts.
Write a bash script named download_hg38.sh that downloads the hg38.ncbiRefSeq.gtf.gz file, then extracts it and reports that it has done so.
The \ character, like in regexp, cancels the meaning of what follows. At the end of a line it cancels the line break, so you can use it to split your one-liner scripts over several lines and still use the && operator.
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz && \
gzip -d hg38.ncbiRefSeq.gtf.gz && \
echo "download and extraction complete"
In your first bash script, the only thing saying that your script is a bash script is its extension. But most of the time UNIX systems don’t care about file extensions: a text file is a text file.
To tell the system that your text file is a bash script, you need to add a shebang. A shebang is a special first line that starts with #! followed by the path of the interpreter for your script.
For example, for a bash script on a system where bash is installed in /bin/bash, the shebang is:
#!/bin/bash
When you are not sure of the path of the interpreter for your script, you can use the following shebang:
#!/usr/bin/env bash
You can add a shebang to your script and give it the executable right:
chmod u+x download_hg38.sh
Now you can execute your script with the command:
./download_hg38.sh
Congratulations, you wrote your first program!
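The whole sequence (write the file, add the shebang, set the executable right, run it) can be rehearsed on a script that downloads nothing. This is a sketch under assumptions: the script name hello.sh and its message are invented, and the script is created in a temporary directory so that it can be removed afterwards.

```shell
workdir=$(mktemp -d)

# Write a two-line script: the shebang first, then a single command.
# The quoted 'EOF' keeps the shell from expanding anything in the text.
cat > "$workdir/hello.sh" <<'EOF'
#!/usr/bin/env bash
echo "hello from a script"
EOF

# Give the owner the right to execute the file, then run it by its path.
chmod u+x "$workdir/hello.sh"
script_output=$("$workdir/hello.sh")
echo "$script_output"

rm -r "$workdir"
```

Without the chmod step, calling the file by its path should fail with a "Permission denied" error, even though source would still be able to read it.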
Where did /usr/bin/env find the information about your bash? Why did we have to write ./ before our script if we are in the same folder?
This is all linked to the PATH bash variable. Like many programming languages, bash has what we call variables. Variables are named storage for temporary information. You can print a list of all your environment variables (variables loaded in your bash memory) with the command printenv.
To create a new variable you can use the following syntax:
VAR_NAME="text"
VAR_NAME2=2
Create an IDENTITY variable with your first and last names.
IDENTITY="First name Last Name"
It’s good practice to write your bash variables in uppercase with _ in place of spaces.
You can access the value of an existing bash variable with the $VAR_NAME syntax.
To display the value of your IDENTITY variable with echo, you can write:
echo $IDENTITY
When you want to mix variable values and text, you can use the two following syntaxes:
echo "my name is "$IDENTITY
echo "my name is ${IDENTITY}"
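The braces matter as soon as the text that follows the variable could be read as part of its name. The sketch below uses a hypothetical value for IDENTITY and an invented _suffix text to show the difference.

```shell
IDENTITY="Ada Lovelace"   # hypothetical value, use your own name

# Inside double quotes, $IDENTITY is replaced by its value.
plain="my name is $IDENTITY"

# The braces mark where the variable name ends, so _suffix stays plain text.
with_braces="my name is ${IDENTITY}_suffix"

# Without braces, bash looks for a variable named IDENTITY_suffix,
# which does not exist here and therefore expands to nothing.
without_braces="my name is $IDENTITY_suffix"

echo "$plain"
echo "$with_braces"
echo "$without_braces"
```

Also note that there must be no space around the = sign in an assignment, otherwise bash tries to run IDENTITY as a command.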
Going back to the output of printenv, you can see a PWD variable that stores your current path, a SHELL variable that stores your current shell, and a PATH variable that stores a lot of file paths separated by :.
The PATH variable lists every folder in which the shell looks for executable programs. Executable programs can be binary files or text files with a shebang.
Display the content of PATH with echo
echo $PATH
You can create a scripts folder and move your download_hg38.sh script into it. Then we can modify the PATH variable to include the scripts folder.
Don’t erase your PATH variable!
mkdir ~/scripts
mv download_hg38.sh ~/scripts/
PATH=$PATH:~/scripts/
You can check the result of your command with echo $PATH
Try to call your download_hg38.sh script from anywhere in the file tree. Congratulations, you installed your first UNIX program!
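The effect of extending PATH can be checked with a harmless script before trusting it with a real one. In this sketch the folder is a temporary directory and say_hi.sh is an invented script name; command -v is the standard way to ask the shell where it would find a command.

```shell
# Create a folder holding one small executable script.
scripts_dir=$(mktemp -d)
cat > "$scripts_dir/say_hi.sh" <<'EOF'
#!/usr/bin/env bash
echo "hi"
EOF
chmod u+x "$scripts_dir/say_hi.sh"

# Append the folder to PATH: the old value is kept, not erased.
PATH=$PATH:$scripts_dir

# The script can now be found and called by its name alone.
found_at=$(command -v say_hi.sh)
said=$(say_hi.sh)
echo "$found_at"
echo "$said"
```

A change made this way only lasts for the current shell session; a new terminal starts again from the default PATH.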
You can pass arguments to your bash scripts. Writing the following command:
my_script.sh arg1 arg2 arg3
means that from within the script:
$0 will give you the name of the script (my_script.sh)
$1, $2, $3, $n will give you the values of the arguments (arg1, arg2, arg3, argn)
$$ the process id of the current shell
$# the total number of arguments passed to the script
$@ the value of all the arguments passed to the script
$? the exit status of the last executed command
$! the process id of the last command run in the background
You can write the following variables.sh script in your scripts folder:
#!/bin/bash
echo "Name of the script: $0"
echo "Total number of arguments: $#"
echo "Values of all the arguments: $@"
And you can try to call it with some arguments!
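A call with arguments might look like the sketch below. It writes a slightly modified copy of variables.sh (printing $1 instead of $0, so that the output does not depend on the path) into a temporary directory; the argument values are arbitrary, and "arg 2" is quoted to show that quotes keep a space inside a single argument.

```shell
workdir=$(mktemp -d)

# The quoted 'EOF' keeps $#, $1 and $@ literal while writing the file.
cat > "$workdir/variables.sh" <<'EOF'
#!/usr/bin/env bash
echo "Total number of arguments: $#"
echo "First argument: $1"
echo "Values of all the arguments: $@"
EOF
chmod u+x "$workdir/variables.sh"

# Call the script with three arguments and keep its exit status from $?.
args_output=$("$workdir/variables.sh" arg1 "arg 2" arg3)
exit_status=$?
echo "$args_output"
echo "exit status: $exit_status"

rm -r "$workdir"
```

The exit status is 0 here because the last command of the script, echo, succeeded.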
In the next session, we are going to learn how to execute commands on other computers with ssh.
We have used the following commands:
echo to display text
xargs to execute a chain of commands
awk to execute a complex chain of commands
;, && and || to chain commands
source to load a script
shebang to specify the language of a script
PATH to install a script