BLAST
Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. BLAST Homepage
Available Modules¶
module load BLAST/2.9.0-gimkl-2018b
BLAST Databases¶
We download the standard NCBI databases quarterly, and create a
corresponding environment module named like BLASTDB/<yyyy-mm>
which
sets the BLASTDB environment variable accordingly. If you want to use
one of these databases then you should find out what our most recent
version is (module avail BLASTDB
) and then load it in your batch
script.
module load BLASTDB
ls $BLASTDB
Because we only keep a few recent versions of the databases, you may be required from time to time to change the BLASTDB module version if you use old job submission scripts as templates for new ones.
Example scripts¶
When given a large amount of query sequence to get through the BLAST search programs will take batches of it, running through the database with each batch and then starting over with the next batch. This can cause the database to be repeatedly read from disk and so limit the speed of your search, and using multiple threads only makes it worse. So there are two reasonable ways to run BLAST programs on our system: single threaded for small jobs, or multithreaded with a local copy of the database for large jobs. If in doubt try the simpler single-thread approach first and see if it takes too long.
Single Thread¶
For jobs which need less than 24 CPU-hours, eg: those that use small databases (< 10 GB) or small amounts of query sequence (< 1 GB), or fast BLAST programs such as blastn with its default (megablast) settings.
#!/bin/bash -e
#SBATCH --job-name BLAST
#SBATCH --time 00:30:00 # ~10 CPU minutes / MB blastn query vs nt
#SBATCH --mem 30G
#SBATCH --cpus-per-task 2
module load BLAST/2.13.0-GCC-11.3.0
module load BLASTDB/2023-01
# This script takes one argument, the FASTA file of query sequences.
QUERIES=$1
FORMAT="6 qseqid qstart qend qseq sseqid sgi sacc sstart send staxids sscinames stitle length evalue bitscore"
BLASTOPTS="-evalue 0.05 -max_target_seqs 10"
BLASTAPP=blastn
DB=nt
#BLASTAPP=blastx
#DB=nr
$BLASTAPP $BLASTOPTS -db $DB -query $QUERIES -outfmt "$FORMAT" \
-out $QUERIES.$DB.$BLASTAPP -num_threads $SLURM_CPUS_PER_TASK
Multiple threads and local database copy¶
For jobs which need more than 24 CPU-hours, eg: those that use large
databases (> 10 GB) or large amounts of query sequence (> 1 GB),
or slow BLAST searches such as classic blastn (blastn -task blastn
).
This script copies the BLAST database into the per-job temporary directory $TMPDIR before starting the search. Since compute nodes do not have local disks, this database copy is in memory, and so must be allowed for in the memory requested by the job. As of mid 2023 that is 283 GB for the nt database, 157 GB for refseq_protein.
#!/bin/bash -e
#SBATCH --job-name BLAST
#SBATCH --time 02:30:00
#SBATCH --mem 120G # 30 GB plus the database
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 36 # half a node
module load BLAST/2.13.0-GCC-11.3.0
module load BLASTDB/2023-01
# This script takes one argument, the FASTA file of query sequences.
QUERIES=$1
FORMAT="6 qseqid qstart qend qseq sseqid sgi sacc sstart send staxids sscinames stitle length evalue bitscore"
BLASTOPTS="-task blastn"
BLASTAPP=blastn
DB=nt
#BLASTAPP=blastx
#DB=nr
# Keep the database in RAM
cp $BLASTDB/{$DB,taxdb}.* $TMPDIR/
export BLASTDB=$TMPDIR
$BLASTAPP $BLASTOPTS -db $DB -query $QUERIES -outfmt "$FORMAT" \
-out $QUERIES.$DB.$BLASTAPP -num_threads $SLURM_CPUS_PER_TASK