搭建本地blast
刚来HKU的时候,专家说要搞本地blast,于是我搭了一个,并做了记录。看了《来香港读博其实是被逼的》这篇文章,很多人问专家是谁,这次的记录主机名称就是专家缩写。
I was asked to set up a local blast for the lab. Blast can be installed directly using apt in debian and it turns out to be easy.
root@jz:/ssd/genomes# apt-get install ncbi-blast+
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
ncbi-blast+
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
Need to get 11.2 MB of archives.
After this operation, 32.8 MB of additional disk space will be used.
Get:1 http://ftp.hk.debian.org/debian/ wheezy/main ncbi-blast+ amd64 2.2.26-3 [11.2 MB]
Fetched 11.2 MB in 1min 16s (146 kB/s)
Selecting previously unselected package ncbi-blast+.
(Reading database ... 252681 files and directories currently installed.)
Unpacking ncbi-blast+ (from .../ncbi-blast+_2.2.26-3_amd64.deb) ...
Processing triggers for man-db ...
Setting up ncbi-blast+ (2.2.26-3) ...
Before the program can be used for sequence alignment, we should prepare the db files:
root@jz:/ssd/genomes/blast/db# makeblastdb -in ../../hg19.fa -out hg19 -dbtype nucl
Building a new DB, current time: 11/21/2013 16:03:05
New DB name: hg19
New DB title: ../../hg19.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 25 sequences in 27.7084 seconds.
That’s it. Now blast is supported in the lab server. I tested blastn with the AHANAK sequence.
ygc@jz:~$ date; blastn -num_threads 8 -query sequence.fasta -db /ssd/genomes/blast/db/hg19 >> testblast.txt ; date
Thu Nov 21 16:19:19 HKT 2013
Thu Nov 21 17:19:45 HKT 2013
ygc@jz:~$ wc -c sequence.fasta
149510 sequence.fasta
Blast search has three distinctive stages: word matching with database scan, ungapped alignment, gapped alignment with traceback. Only word matching stage is multi-threaded. So it tooks 1 hour to align this long gene.
The most significant hit of the blast result indicated that the sequence is located at chr11:62331329-62184015, which is consistent with the information presented in the AHANAK sequence link.
ygc@jz:~$ cat testblast.txt
BLASTN 2.2.26+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), “A greedy algorithm for aligning DNA sequences”, J
Comput Biol 2000; 7(1-2):203-14.
Database: ../../hg19.fa
25 sequences; 3,095,693,983 total letters
Query= gi|224589802:c62331329-62184015 Homo sapiens chromosome 11,
GRCh37.p13 Primary Assembly
Length=147315
Score E
Sequences producing significant alignments: (Bits) Value
chr11 2.720e+05 0.0
chr14 941 0.0
…
> chr11
Length=135006516
Score = 2.720e+05 bits (147315), Expect = 0.0
Identities = 147315/147315 (100%), Gaps = 0/147315 (0%)
Strand=Plus/Minus
Query 1 CATTTGGAAAGGAACCTGGGACCTAGGGTTAAAGGACAGATCCTCTGCTTTGCTACAGTG 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 62331329 CATTTGGAAAGGAACCTGGGACCTAGGGTTAAAGGACAGATCCTCTGCTTTGCTACAGTG 62331270
Query 61 TAATCTTGGACAGGTCATTCTACTTGGTCTCATCTGTAAAGAGAGGGGAGGGTCAGGGTA 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 62331269 TAATCTTGGACAGGTCATTCTACTTGGTCTCATCTGTAAAGAGAGGGGAGGGTCAGGGTA 62331210
...
Database: ../../hg19.fa
Posted date: Nov 21, 2013 4:03 PM
Number of letters in database: 3,095,693,983
Number of sequences in database: 25
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
In addition to the species-specific sequence file, we should download the non-redundent gene and protein sequences from
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz and
ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz respectively.
nt contains all nucleotide sequences of GenBank, RefSeq Nucleotides, EMBL, DDBJ and PDB, while nr contains all non-redundant peptide sequences from GenBank CDS translations, RefSeq Proteins, PDB, SwissProt, PIR and PRF. After generating the db files, we can blast a sequence to multiple species.
makeblastdb -in ../../nr/nr -out nr -dbtype prot
makeblastdb -in ../../nr/nt -out nt -dbtype nucl