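Reference genomes for laboratory mice
The project was completed in 2013, and all of the resulting data are openly available for download: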
SNP and indel calls for Version 3 can be found here:
ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/
SNP and indel calls for 18 mouse genomes are provided as a single
compressed VCF file (bgzip), along with an index file generated by
'tabix' (*.tbi).
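Because bgzip output is a valid gzip stream, a file like this can be read line by line with Python's standard `gzip` module. The sketch below is only an illustration: the helper `iter_vcf_records` and the tiny VCF it reads are made up for the example, and it does not use the `.tbi` index (random access by region would need tabix itself or a dedicated library).

```python
import gzip
import os
import tempfile


def iter_vcf_records(path):
    """Yield (column_names, fields) for each data line of a gzip/bgzip VCF."""
    columns = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("##"):
                continue  # meta-information lines such as ##FILTER
            if line.startswith("#CHROM"):
                columns = line.lstrip("#").split("\t")
                continue
            if line:
                yield columns, line.split("\t")


def demo():
    # A tiny made-up VCF so the example runs without downloading anything.
    text = (
        "##fileformat=VCFv4.1\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAJ\n"
        "1\t3000019\t.\tG\tA\t50\tPASS\t.\tGT:FI\t1/1:1\n"
        "1\t3000185\t.\tG\tT\t20\tMinDP\t.\tGT:FI\t1/1:0\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.vcf.gz")
        with gzip.open(path, "wt") as out:
            out.write(text)
        return [dict(zip(cols, fields)) for cols, fields in iter_vcf_records(path)]


records = demo()
print(len(records), records[0]["FILTER"])
```

Streaming like this keeps memory use flat, which matters for a multi-sample file covering all 18 strains.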
Sequencing depth
Strain (short name) | Strain (full name) | Mean sequencing depth |
---|---|---|
129P2 | (129P2/OlaHsd) | 42 |
129S1 | (129S1/SvImJ) | 55 |
129S5 | (129S5/SvEvBrd) | 18 |
AJ | (A/J) | 38 |
AKR | (AKR/J) | 40 |
BALBcJ | (BALB/cJ) | 52 |
C3HHeJ | (C3H/HeJ) | 49 |
C57BL6NJ | (C57BL/6NJ) | 48 |
CASTEiJ | (CAST/EiJ) | 39 |
CBAJ | (CBA/J) | 43 |
DBA2J | (DBA/2J) | 42 |
FVBNJ | (FVB/NJ) | 61 |
LPJ | (LP/J) | 41 |
NODShiLtJ | (NOD/ShiLtJ) | 48 |
NZO | (NZO/HlLtJ) | 58 |
PWKPhJ | (PWK/PhJ) | 39 |
Spretus | (SPRET/EiJ) | 53 |
WSBEiJ | (WSB/EiJ) | 38 |
Reference genome
All SNP and indel calls are relative to the reference mouse genome
C57BL/6J (GRCm38). A version of the reference genome can be
found here: ftp://ftp-mouse.sanger.ac.uk/ref/
dbSNP annotation
SNPs and indels are annotated with rs IDs from dbSNP Build 137. The
dbSNP data was downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/mouse_10090/VCF/
and the 'vcf-annotate' Perl utility from the VCFtools package
(Danecek et al, 2011) was used to add the rsIDs to calls in this
release. (See below for VCFtools information.)
For SNPs, the position, reference allele and alternative alleles were all compared:
eg: vcf-annotate -c CHROM,POS,ID,REF,ALT
For indels, only positions were matched:
eg: vcf-annotate -c CHROM,POS,ID
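The difference between the two matching rules can be sketched in a few lines of Python. This is not how `vcf-annotate` is implemented; it only illustrates the idea that SNPs are looked up by position plus alleles while indels are looked up by position alone. The function names and the rs IDs below are made up for the example.

```python
def annotation_key(chrom, pos, ref, alt):
    """Lookup key: position plus alleles for SNPs, position only for indels."""
    is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
    if is_snp:
        return (chrom, pos, ref, alt)
    return (chrom, pos)


def build_index(dbsnp_rows):
    """Index dbSNP-like rows (chrom, pos, rsid, ref, alt) by their lookup key."""
    return {
        annotation_key(chrom, pos, ref, alt): rsid
        for chrom, pos, rsid, ref, alt in dbsnp_rows
    }


def annotate(calls, index):
    """Return the rs ID for each call, or '.' when nothing matches."""
    return [index.get(annotation_key(c, p, r, a), ".") for c, p, r, a in calls]


# Made-up dbSNP rows and calls, purely for illustration.
dbsnp = [
    ("1", 100, "rs0000001", "A", "G"),   # SNP
    ("1", 200, "rs0000002", "AT", "A"),  # deletion
]
calls = [
    ("1", 100, "A", "G"),    # same SNP alleles: matches
    ("1", 100, "A", "T"),    # different ALT allele: no match
    ("1", 200, "ATT", "A"),  # indel at the same position: matches on position
]
ids = annotate(calls, build_index(dbsnp))
print(ids)
```

The looser rule for indels reflects that the same indel can be written with different REF and ALT strings, so requiring identical alleles would miss genuine matches.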
Variant calling workflow
# Sequence Data
Sequencing was performed using the Illumina HiSeq platform. All
reads are 100bp paired-end reads, except for strains 129P2 and
129S4. All mice were female and therefore SNPs and indels were
called on chromosomes 1-19 and X only. The BAM files used to call
SNPs and indels are located in this directory:
ftp://ftp-mouse.sanger.ac.uk/REL-1302-BAM-GRCm38/
# Methods in brief
Reads were aligned to the reference genome (GRCm38) using BWA
version 0.5.9-r16 (Li and Durbin, 2009). SNP and indel discovery
was performed with the SAMtools mpileup function and calling
was performed with the BCFtools view function (Li H, 2011). The
vcf-annotate function in VCFtools package (Danecek et al, 2011)
was used to soft-filter the SNP and indel calls.
The Variant Effect Predictor software from Ensembl (McLaren et al.,
2010) was used to predict the functional consequences of SNPs and
indels against Ensembl release 70 mouse gene models.
Definitions of consequence types can be found here:
http://www.ensembl.org/info/docs/variation/predicted_data.html#consequences
Indel calling was performed on each strain independently. The
calls from all 18 strains were then merged into a single VCF
file. SNP calls were also made independently for each strain
initially. Then, a single list of all high confidence polymorphic
sites across the genome was produced from all 18 strains. This
list was then used to call SNPs again, this time across all 18
strains simultaneously, using the 'samtools mpileup -l' option.
This process generates both reference-only genotype calls as well
as calls with non-reference bases across the 18 strains.
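The step that builds the single site list for the second SNP calling pass can be sketched as a set union over the per-strain calls. The code below is a minimal illustration with made-up sites; the function names are hypothetical, and the one-`chrom<TAB>pos`-per-line output is an assumption about what the `-l` option of the samtools version used here accepts.

```python
# Chromosomes called in this release: 1-19 and X.
CHROM_RANK = {str(n): n for n in range(1, 20)}
CHROM_RANK["X"] = 20


def merge_sites(per_strain_sites):
    """Union the per-strain site lists into one sorted, de-duplicated list."""
    merged = set()
    for sites in per_strain_sites.values():
        merged.update(sites)
    return sorted(merged, key=lambda site: (CHROM_RANK[site[0]], site[1]))


def to_position_lines(sites):
    """Format sites as tab-separated 'chrom<TAB>pos' lines, one per site."""
    return ["{}\t{}".format(chrom, pos) for chrom, pos in sites]


# Made-up high-confidence sites for three strains.
per_strain = {
    "AJ": [("1", 3000019), ("X", 5000)],
    "AKR": [("1", 3000019), ("2", 100)],
    "CASTEiJ": [("2", 100), ("19", 42)],
}
merged = merge_sites(per_strain)
lines = to_position_lines(merged)
print("\n".join(lines))
```

Calling every strain at the union of sites is what produces explicit reference genotypes: a strain with no variant at a site still gets a call there instead of being silently absent.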
Information regarding the filtering of SNP and indel calls is
located in the VCF file headers in the '##FILTER' and
'##source_xxxxxx=vcf-annotate' lines.
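The resulting standard VCF variant files
Because the reference mouse genome is C57BL/6J, the closely related C57BL/6NJ strain should carry very few variant sites relative to it.
Per-strain statistics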
Strain | SNPs | ts/tv | Private SNPs | %Private SNPs | ts/tv (Private SNPs) | INDELs | Private INDELs | %Private INDELs |
---|---|---|---|---|---|---|---|---|
129P2/OlaHsd | 5333940 | 2.03 | 24247 | (0.45%) | 1.95 | 869453 | 35585 | (4.09%) |
129S1/SvImJ | 5197051 | 2.03 | 1696 | (0.03%) | 1.67 | 1018654 | 30217 | (2.97%) |
129S5/SvEvBrd | 4929566 | 2.07 | 4134 | (0.08%) | 1.45 | 678932 | 9094 | (1.34%) |
A/J | 4893229 | 2.02 | 42833 | (0.88%) | 2.07 | 922256 | 28695 | (3.11%) |
AKR/J | 4896783 | 2.06 | 84307 | (1.72%) | 2.12 | 931552 | 39740 | (4.27%) |
BALB/cJ | 4578862 | 2.01 | 29733 | (0.65%) | 2.04 | 924897 | 34178 | (3.70%) |
C3H/HeJ | 5093947 | 2.02 | 15371 | (0.30%) | 1.89 | 1014687 | 31161 | (3.07%) |
C57BL/6NJ | 15946 | 0.98 | 1522 | (9.54%) | 1.7 | 20852 | 1646 | (7.89%) |
CAST/EiJ | 20626644 | 2.04 | 5785024 | (28.05%) | 2.1 | 3062289 | 1006241 | (32.86%) |
CBA/J | 5223690 | 2.02 | 34464 | (0.66%) | 2.02 | 1014449 | 34911 | (3.44%) |
DBA/2J | 5169730 | 2.02 | 73319 | (1.42%) | 2.13 | 981471 | 40955 | (4.17%) |
FVB/NJ | 4836968 | 2.03 | 133983 | (2.77%) | 2.12 | 968398 | 54942 | (5.67%) |
LP/J | 5440597 | 2.03 | 53756 | (0.99%) | 2.09 | 1024149 | 36083 | (3.52%) |
NOD/ShiLtJ | 5101268 | 2.04 | 124970 | (2.45%) | 2.1 | 970497 | 52166 | (5.38%) |
NZO/HlLtJ | 5335807 | 2.03 | 214884 | (4.03%) | 2.13 | 1046653 | 82356 | (7.87%) |
PWK/PhJ | 20268163 | 2.03 | 5016466 | (24.75%) | 2.1 | 3044259 | 909692 | (29.88%) |
SPRET/EiJ | 41742349 | 1.94 | 25792444 | (61.79%) | 1.94 | 5077387 | 3279813 | (64.60%) |
WSB/EiJ | 7079907 | 2.03 | 915416 | (12.93%) | 2.12 | 1414664 | 233808 | (16.53%) |
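The %Private columns are simply the private count divided by the total count for that strain. A short Python check, using three rows copied from the table above, reproduces the published percentages; the helper name `percent_private` is made up for the example.

```python
def percent_private(private, total):
    """Share of a strain's variants that are private to it, in percent."""
    return round(100.0 * private / total, 2)


# (SNPs, private SNPs) taken from the table above.
snp_counts = {
    "129P2/OlaHsd": (5333940, 24247),
    "C57BL/6NJ": (15946, 1522),
    "CAST/EiJ": (20626644, 5785024),
}
snp_percent = {
    strain: percent_private(private, total)
    for strain, (total, private) in snp_counts.items()
}
print(snp_percent)
```

The wild-derived strains (CAST/EiJ, PWK/PhJ, SPRET/EiJ) stand out in the table because a large share of their variants is shared with no other sequenced strain.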
The VCF files in detail
# VCF specification and VCFtools
The VCF file format specification can be found here:
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
The VCFtools software package (Danecek et al, 2011) can be used to
query, compare, and annotate VCF files:
http://vcftools.sourceforge.net/
# Notes regarding the mgp.v3 VCF files
- Information regarding the filters applied to the calls is located
in the VCF file header lines at the beginning of the file, marked
with a hash '#' at the beginning of the line.
- Genotypes (GT):
- '.' = no genotype call was made
- '0/0' = genotype is the same as the reference genome
- '1/1' = homozygous alternative allele; can also be '2/2',
'3/3', etc. if more than one alternative allele is present.
- '0/1' = heterozygous genotype; can also be '1/2', '0/2', etc.
- FILTER column and high and low confidence calls:
High and low confidence genotype calls are distinguished by
the 'FI' tag in the FORMAT column for each sample.
eg: in the sample columns NODShiLtJ and NZO:
1/1:99:31:0:255,74,0:1 0/0:.:1:0:0,.,.:0
which corresponds to the tags in the FORMAT column
GT:GQ:DP:SP:PL:FI
In the NODShiLtJ column the genotype is '1/1'
and 'FI' tag is '1' indicating the genotype call
passed all filters and is high-confidence. In NZO,
the genotype is the same as the reference genome,
however 'FI' is '0', meaning the call failed one
or more filters and the call is low-confidence.
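The pairing of FORMAT tags with sample values described above can be sketched in Python as follows. The two sample strings are the ones quoted in the example; the function names are hypothetical helpers, not part of VCFtools.

```python
FORMAT = "GT:GQ:DP:SP:PL:FI"


def parse_sample(format_string, sample_string):
    """Pair each FORMAT tag with its value; return None for a no-call ('.')."""
    if sample_string == ".":
        return None
    return dict(zip(format_string.split(":"), sample_string.split(":")))


def classify_genotype(gt):
    """Describe a GT value such as '0/0', '1/1', '0/1' or '2/2'."""
    if "." in gt:
        return "no call"
    alleles = gt.replace("|", "/").split("/")
    if len(set(alleles)) > 1:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous reference"
    return "homozygous alternative"


def is_high_confidence(call):
    """A call is high-confidence only when its FI tag is '1'."""
    return call is not None and call.get("FI") == "1"


nod = parse_sample(FORMAT, "1/1:99:31:0:255,74,0:1")
nzo = parse_sample(FORMAT, "0/0:.:1:0:0,.,.:0")
print(classify_genotype(nod["GT"]), is_high_confidence(nod))
print(classify_genotype(nzo["GT"]), is_high_confidence(nzo))
```

Checking 'FI' per sample, instead of relying on the site-level FILTER column alone, lets you keep a good call in one strain even when another strain failed a filter at the same site.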
NOTES:
All heterozygous calls have been marked as low confidence with the
'FI' tag set to '0'. 'Het' has also been added to the FILTER
column.
A site is annotated with PASS in the FILTER column only if ALL
strains with a genotype call (including 0/0 genotype calls) at
that site pass all filter criteria. If one or more calls do NOT
pass filtering, filters which the calls have failed are listed in
the FILTER column, and the 'FI' tags are set to '0' for the failed
sample calls. No-call sites, marked as '.', are not included.
eg: FORMAT is GT:GQ:DP:SP:PL:FI
(a) MinDP 1/1:7:3:0:22,0,4:0 . 1/1:99:45:0:255,74,0:1 .
(b) PASS 1/1:99:31:0:255,74,0:1 . . 1/1:99:45:0:255,50,0:1
In example (a), there are 2 no-calls ('.'), the first sample failed
the MinDP filter, and the third sample passed all filters. The
FILTER column is set to 'MinDP'. In example (b), there are also 2
no-calls, and the first and fourth samples passed all filters. The
FILTER column is set to PASS.
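The PASS rule can be expressed as a small check over the sample columns of one site. This sketch only decides whether a site would be PASS; it cannot recover which named filter failed, since that is recorded in the FILTER column itself. The function name is hypothetical, and the third sample of example (a) is written here with all six FORMAT fields, which is an assumption about the intended values.

```python
def site_passes(format_string, sample_strings):
    """True when every sample with a genotype call has FI == '1'.

    No-calls ('.') are ignored, matching the rule described above.
    """
    fi_index = format_string.split(":").index("FI")
    for sample in sample_strings:
        if sample == ".":
            continue
        if sample.split(":")[fi_index] != "1":
            return False
    return True


FORMAT = "GT:GQ:DP:SP:PL:FI"
example_a = ["1/1:7:3:0:22,0,4:0", ".", "1/1:99:45:0:255,74,0:1", "."]
example_b = ["1/1:99:31:0:255,74,0:1", ".", ".", "1/1:99:45:0:255,50,0:1"]

print(site_passes(FORMAT, example_a))
print(site_passes(FORMAT, example_b))
```

Note that a site where every sample is a no-call would also return True under this sketch; whether such sites occur in the release files is not stated here.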
- Functional consequences
Ensembl now uses consequence terms defined by the Sequence Ontology
(SO) by default. All definitions of the predicted functional
consequences can be found here:
http://www.ensembl.org/info/docs/variation/predicted_data.html#consequences
In our release VCF files, predicted functional consequences are indicated by
the 'CSQ' field in the INFO tag. Where no 'CSQ' tag is present, the SNP
or indel is classified under the SO term 'intergenic_variant'.
- Multiple alternative allele and consequences
In cases where different strains have different alternative alleles
which have different consequences, they can be distinguished by
checking the 'Allele' in the 'CSQ' line.
eg: Alternative alleles = G,T and CSQ=ENSMUST00000047577:ENSMUSG00000042414:
missense_variant:601:201:A>P:Grantham,27:Allele,G:Gene,Prdm14+ENSMUST00000047577:
ENSMUSG00000042414:missense_variant:601:201:A>T:Grantham,58:Allele,T:Gene,Prdm14
The strain with GT='1/1' is G/G and has an A>P amino acid
substitution, and the strain with GT='2/2' is T/T and has an A>T
amino acid substitution.
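Matching a genotype to its consequence can be sketched by splitting the CSQ value and comparing its 'Allele' entry with the ALT allele that the genotype index points to. The layout assumed below (entries joined by '+', fields by ':', key and value by ',') is inferred only from the example above, and the function names are hypothetical.

```python
CSQ = (
    "ENSMUST00000047577:ENSMUSG00000042414:missense_variant:601:201:A>P:"
    "Grantham,27:Allele,G:Gene,Prdm14+"
    "ENSMUST00000047577:ENSMUSG00000042414:missense_variant:601:201:A>T:"
    "Grantham,58:Allele,T:Gene,Prdm14"
)


def parse_csq(csq):
    """Split a CSQ value into one dict per transcript/allele entry."""
    entries = []
    for raw in csq.split("+"):
        fields = raw.split(":")
        entry = {
            "transcript": fields[0],
            "gene_id": fields[1],
            "consequence": fields[2],
        }
        for field in fields[3:]:
            if "," in field:  # key,value pairs such as 'Allele,G'
                key, value = field.split(",", 1)
                entry[key] = value
            elif ">" in field:  # amino acid change such as 'A>P'
                entry["amino_acids"] = field
        entries.append(entry)
    return entries


def consequences_for_genotype(gt, alt_alleles, entries):
    """Return the CSQ entries for the non-reference alleles in a genotype."""
    indices = {int(i) for i in gt.replace("|", "/").split("/") if i.isdigit()}
    bases = {alt_alleles[i - 1] for i in indices if i > 0}
    return [e for e in entries if e.get("Allele") in bases]


entries = parse_csq(CSQ)
alts = ["G", "T"]
hom_g = consequences_for_genotype("1/1", alts, entries)
hom_t = consequences_for_genotype("2/2", alts, entries)
print(hom_g[0]["amino_acids"], hom_t[0]["amino_acids"])
```

Genotype indices are 1-based over the ALT list (0 is the reference allele), which is why the lookup uses `alt_alleles[i - 1]`.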
Beginners need to spend 13 hours studying this database carefully.