Reference genomes for laboratory mice

jimmy 生信技能树 2022-08-10

The project was completed in 2013, and all of its results are freely available for download:

SNP and indel calls for Version 3 can be found here:
ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/
SNP and indel calls for 18 mouse genomes are provided as a single
compressed VCF file (bgzip), along with an index file generated by
'tabix' (*.tbi).
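
Since bgzip output is also a valid gzip stream, a release file can be read sequentially with Python's standard `gzip` module. The sketch below is only an illustration: the helper name `iter_vcf_records` and the tiny in-memory VCF are invented for the example, and random access through the `.tbi` index would need a tabix-aware library, which is not shown here.

```python
import gzip
import io

def iter_vcf_records(handle):
    """Yield one dict per VCF data line, keyed by the #CHROM header columns."""
    columns = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # the "#CHROM POS ID ..." header
            continue
        if columns is None:
            raise ValueError("data line found before the #CHROM header")
        yield dict(zip(columns, line.split("\t")))

# Hypothetical two-sample VCF (made-up position), compressed in memory so the
# example runs without downloading anything.
demo_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNODShiLtJ\tNZO\n"
    "1\t3000019\t.\tG\tA\t99\tPASS\t.\tGT:FI\t1/1:1\t0/0:1\n"
)
compressed = gzip.compress(demo_vcf.encode())

with gzip.open(io.BytesIO(compressed), "rt") as fh:
    records = list(iter_vcf_records(fh))

print(records[0]["CHROM"], records[0]["POS"], records[0]["NODShiLtJ"])
```

For a real download, `gzip.open(path, "rt")` with the local file path would replace the in-memory buffer.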

Sequencing depth

Strain (short name)  Strain (full name)  Average sequencing depth (x)
129P2                129P2/OlaHsd        42
129S1                129S1/SvImJ         55
129S5                129S5SvEvBrd        18
AJ                   A/J                 38
AKR                  AKR/J               40
BALBcJ               BALB/cJ             52
C3HHeJ               C3H/HeJ             49
C57BL6NJ             C57BL/6NJ           48
CASTEiJ              CAST/EiJ            39
CBAJ                 CBA/J               43
DBA2J                DBA/2J              42
FVBNJ                FVB/NJ              61
LPJ                  LP/J                41
NODShiLtJ            NOD/ShiLtJ          48
NZO                  NZO/HlLtJ           58
PWKPhJ               PWK/PhJ             39
Spretus              SPRET/EiJ           53
WSBEiJ               WSB/EiJ             38

Reference genome

All SNP and indel calls are relative to the reference mouse genome
C57BL/6J (GRCm38).  A version of the reference genome can be
found here: ftp://ftp-mouse.sanger.ac.uk/ref/

dbSNP annotation

SNPs and indels are annotated with rs IDs from dbSNP Build 137. The
dbSNP data was downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/mouse_10090/VCF/
and the 'vcf-annotate' Perl utility from the VCFtools package
(Danecek et al, 2011) was used to add the rsIDs to calls in this
release. (See below for VCFtools information.)

For SNPs, the position, reference allele and alternative alleles were all compared:
eg: vcf-annotate -c CHROM,POS,ID,REF,ALT

For indels, only positions were matched:
eg: vcf-annotate -c CHROM,POS,ID
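
The two matching rules can be illustrated with a small lookup in Python. This is a sketch of the rule only, not of how `vcf-annotate` works internally; the function names, coordinates and `rs_demo_*` identifiers are all hypothetical.

```python
def build_dbsnp_index(dbsnp_records, match_alleles):
    """Index rs IDs by (chrom, pos, ref, alt) or, for indels, by (chrom, pos)."""
    index = {}
    for chrom, pos, rsid, ref, alt in dbsnp_records:
        key = (chrom, pos, ref, alt) if match_alleles else (chrom, pos)
        index.setdefault(key, rsid)  # keep the first rs ID seen for a key
    return index

def lookup_rsid(call, snp_index, indel_index):
    """Return the rs ID for a call, or '.' when nothing matches."""
    chrom, pos, ref, alt = call
    is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
    if is_snp:
        # SNPs: position, reference allele and alternative alleles must agree.
        return snp_index.get((chrom, pos, ref, alt), ".")
    # Indels: only the position is compared.
    return indel_index.get((chrom, pos), ".")

# Hypothetical dbSNP entries: (chrom, pos, rsid, ref, alt)
dbsnp = [
    ("1", 1000, "rs_demo_1", "A", "G"),
    ("1", 2000, "rs_demo_2", "AT", "A"),
]
snp_index = build_dbsnp_index(dbsnp, match_alleles=True)
indel_index = build_dbsnp_index(dbsnp, match_alleles=False)

same_alleles = lookup_rsid(("1", 1000, "A", "G"), snp_index, indel_index)
other_allele = lookup_rsid(("1", 1000, "A", "T"), snp_index, indel_index)
indel_by_pos = lookup_rsid(("1", 2000, "ATT", "A"), snp_index, indel_index)
print(same_alleles, other_allele, indel_by_pos)
```

The third call shows the practical consequence of position-only matching: an indel can receive an rs ID even when its alleles differ from the dbSNP entry.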

Variant-calling workflow

# Sequence Data
Sequencing was performed using the Illumina HiSeq platform. All
reads are 100bp paired-end reads, except for strains 129P2 and
129S4.  All mice were female and therefore SNPs and indels were
called on chromosomes 1-19 and X only. The BAM files used to call
SNPs and indels are located in this directory:
ftp://ftp-mouse.sanger.ac.uk/REL-1302-BAM-GRCm38/

# Methods in brief
Reads were aligned to the reference genome (GRCm38) using BWA
version 0.5.9-r16 (Li and Durbin, 2009). SNP and indel discovery
was performed with the SAMtools mpileup function and calling
was performed with the BCFtools view function (Li H, 2011). The
vcf-annotate function in VCFtools package (Danecek et al, 2011)
was used to soft-filter the SNP and indel calls.

The Variant Effect Predictor software from Ensembl (McLaren et al.,
2010) was used to predict the functional consequences of SNPs and
indels, queried against Ensembl release 70 mouse gene models.
Definitions of consequence types can be found here:
http://www.ensembl.org/info/docs/variation/predicted_data.html#consequences

Indel calling was performed on each strain independently. The
calls from all 18 strains were then merged into a single VCF
file. SNP calls were also made independently for each strain
initially. Then, a single list of all high confidence polymorphic
sites across the genome was produced from all 18 strains. This
list was then used to call SNPs again, this time across all 18
strains simultaneously, using the 'samtools mpileup -l' option.
This process generates both reference-only genotype calls as well
as calls with non-reference bases across the 18 strains.

Information regarding the filtering of SNP and indel calls is
located in the VCF file headers in the '##FILTER' and
'##source_xxxxxx=vcf-annotate' lines.
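
The step that builds "a single list of all high confidence polymorphic sites" can be pictured as a union of per-strain site sets. The union is written out as a two-column, 1-based position list, one of the formats that `samtools mpileup -l` accepts. The sketch below assumes the per-strain sites have already been filtered for confidence; the coordinates are made up and the function names are hypothetical.

```python
def chrom_sort_key(chrom):
    """Numeric chromosomes first (in numeric order), then named ones such as X."""
    return (0, int(chrom), "") if chrom.isdigit() else (1, 0, chrom)

def merge_sites(per_strain_sites):
    """Union the (chrom, pos) sets of all strains and sort them by coordinate."""
    merged = set().union(*per_strain_sites.values())
    return sorted(merged, key=lambda site: (chrom_sort_key(site[0]), site[1]))

def to_position_list(sites):
    """Format sites as tab-separated 'chrom<TAB>pos' lines."""
    return "".join(f"{chrom}\t{pos}\n" for chrom, pos in sites)

# Hypothetical high-confidence sites for two of the 18 strains.
per_strain_sites = {
    "129P2": {("1", 100), ("2", 50)},
    "AJ": {("1", 250), ("1", 100), ("X", 10)},
}
merged = merge_sites(per_strain_sites)
position_list = to_position_list(merged)
print(position_list, end="")
```

A site found in only one strain still enters the list, which is why the second, joint calling pass also yields reference-only genotypes for the other strains.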

The resulting standard VCF variant files

Because the reference mouse genome is C57BL/6J, the closely related C57BL/6NJ strain should carry very few variant sites.

Per-strain statistics

Strain        SNPs      ts/tv  Private SNPs  %Private  ts/tv (Private SNPs)  INDELs   Private INDELs  %Private
129P2/OlaHsd  5333940   2.03   24247         (0.45%)   1.95                  869453   35585           (4.09%)
129S1/SvImJ   5197051   2.03   1696          (0.03%)   1.67                  1018654  30217           (2.97%)
129S5SvEvBrd  4929566   2.07   4134          (0.08%)   1.45                  678932   9094            (1.34%)
A/J           4893229   2.02   42833         (0.88%)   2.07                  922256   28695           (3.11%)
AKR/J         4896783   2.06   84307         (1.72%)   2.12                  931552   39740           (4.27%)
BALB/cJ       4578862   2.01   29733         (0.65%)   2.04                  924897   34178           (3.70%)
C3H/HeJ       5093947   2.02   15371         (0.30%)   1.89                  1014687  31161           (3.07%)
C57BL/6NJ     15946     0.98   1522          (9.54%)   1.7                   20852    1646            (7.89%)
CAST/EiJ      20626644  2.04   5785024       (28.05%)  2.1                   3062289  1006241         (32.86%)
CBA/J         5223690   2.02   34464         (0.66%)   2.02                  1014449  34911           (3.44%)
DBA/2J        5169730   2.02   73319         (1.42%)   2.13                  981471   40955           (4.17%)
FVB/NJ        4836968   2.03   133983        (2.77%)   2.12                  968398   54942           (5.67%)
LP/J          5440597   2.03   53756         (0.99%)   2.09                  1024149  36083           (3.52%)
NOD/ShiLtJ    5101268   2.04   124970        (2.45%)   2.1                   970497   52166           (5.38%)
NZO/HlLtJ     5335807   2.03   214884        (4.03%)   2.13                  1046653  82356           (7.87%)
PWK/PhJ       20268163  2.03   5016466       (24.75%)  2.1                   3044259  909692          (29.88%)
SPRET/EiJ     41742349  1.94   25792444      (61.79%)  1.94                  5077387  3279813         (64.60%)
WSB/EiJ       7079907   2.03   915416        (12.93%)  2.12                  1414664  233808          (16.53%)
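
The %Private columns are the private count divided by the corresponding total. The short check below recomputes them for three strains using counts taken from the table above; the function name `percent_private` is just for illustration.

```python
def percent_private(private_count, total_count):
    """Share of variants seen only in this strain, as a percentage (2 decimals)."""
    return round(100 * private_count / total_count, 2)

# (SNPs, private SNPs, INDELs, private INDELs), copied from the table.
counts = {
    "129P2/OlaHsd": (5333940, 24247, 869453, 35585),
    "C57BL/6NJ": (15946, 1522, 20852, 1646),
    "SPRET/EiJ": (41742349, 25792444, 5077387, 3279813),
}

private_pct = {
    strain: (percent_private(p_snp, snp), percent_private(p_indel, indel))
    for strain, (snp, p_snp, indel, p_indel) in counts.items()
}
for strain, (snp_pct, indel_pct) in private_pct.items():
    print(f"{strain}: {snp_pct:.2f}% private SNPs, {indel_pct:.2f}% private INDELs")
```

The contrast is the point of the table. C57BL/6NJ has very few variants in total because it is so close to the reference, while the wild-derived SPRET/EiJ carries tens of millions, most of them private.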

The VCF files in detail

# VCF specification and VCFtools
The VCF file format specification can be found here:
   http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
The VCFtools software package (Danecek et al, 2011) can be used to
query, compare, and annotate VCF files:
   http://vcftools.sourceforge.net/

# Notes regarding the mgp.v3 VCF files
- Information regarding the filters applied to the calls is located
  in the VCF file header lines at the beginning of the file, marked
  with a hash '#' at the beginning of the line.
- Genotypes (GT):
     - '.'   = no genotype call was made
     - '0/0' = genotype is the same as the reference genome
     - '1/1' = homozygous alternative allele; can also be '2/2',
       '3/3', etc. if more than one alternative allele is present.
     - '0/1' = heterozygous genotype; can also be '1/2', '0/2', etc.
- FILTER column and high and low confidence calls:
High and low confidence genotype calls are distinguished by
the 'FI' tag in the FORMAT column for each sample.
   eg: in the sample columns NODShiLtJ and NZO:
       1/1:99:31:0:255,74,0:1    0/0:.:1:0:0,.,.:0
       which corresponds to the tags in the FORMAT column
       GT:GQ:DP:SP:PL:FI
       In the NODShiLtJ column the genotype is '1/1'
       and 'FI' tag is '1' indicating the genotype call
       passed all filters and is high-confidence. In NZO,
       the genotype is the same as the reference genome,
       however 'FI' is '0', meaning the call failed one
       or more filters and the call is low-confidence.
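
The pairing of FORMAT tags with per-sample values can be sketched in a few lines of Python, using the two sample strings from the example above. The helper names `parse_sample` and `describe_call` are hypothetical, and the genotype labels follow the GT conventions listed in the notes.

```python
def parse_sample(format_string, sample_string):
    """Pair FORMAT tags with one sample's values; a bare '.' means no call."""
    if sample_string == ".":
        return None
    return dict(zip(format_string.split(":"), sample_string.split(":")))

def describe_call(call):
    """Summarise the genotype class and the confidence encoded by the FI tag."""
    if call is None:
        return "no call"
    alleles = call["GT"].split("/")
    if "." in alleles:
        return "no call"
    if alleles == ["0", "0"]:
        kind = "same as reference"
    elif len(set(alleles)) == 1:
        kind = "homozygous alternative"
    else:
        kind = "heterozygous"
    confidence = "high-confidence" if call.get("FI") == "1" else "low-confidence"
    return f"{kind}, {confidence}"

FORMAT = "GT:GQ:DP:SP:PL:FI"
nod = parse_sample(FORMAT, "1/1:99:31:0:255,74,0:1")  # NODShiLtJ column
nzo = parse_sample(FORMAT, "0/0:.:1:0:0,.,.:0")       # NZO column

print("NODShiLtJ:", describe_call(nod))
print("NZO:", describe_call(nzo))
```

Filtering on `FI == "1"` per sample, and not just on the site-level FILTER column, keeps high-confidence genotypes at sites where some other strain failed a filter.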

NOTES:
All heterozygous calls have been marked as low confidence with the
'FI' tag set to '0'. 'Het' has also been added to the FILTER
column.

A site is annotated with PASS in the FILTER column only if ALL
strains with a genotype call (including 0/0 genotype calls) at
that site pass all filter criteria. If one or more calls does NOT
pass filtering, filters which the calls have failed are listed in
the FILTER column, and the 'FI' tags are set to '0' for the failed
sample calls. No-call sites, marked as '.', are not included.
   eg: FORMAT is GT:GQ:DP:SP:PL:FI
       (a) MinDP   1/1:7:3:0:22,0,4:0    .   1/1:99:45:255:74,0:1 .
       (b) PASS    1/1:99:31:0:255,74,0:1    .   .   1/1:99:45:0:255,50,0:1

In example (a), there are 2 no-calls ('.'), the first sample failed
the MinDP filter, and the third sample passed all filters. The
FILTER column is set to 'MinDP'. In example (b), there are also 2
no-calls, and the first and fourth samples passed all filters. The
FILTER column is set to PASS.
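
The site-level rule can be reproduced from the per-sample 'FI' tags alone. The sketch below applies it to examples (a) and (b). It only answers whether a site would be PASS; the names of the failed filters written to the real FILTER column cannot be recovered from 'FI', and the function name `site_passes` is hypothetical.

```python
def site_passes(format_string, sample_strings):
    """True when every sample with a genotype call has FI == '1'.

    No-calls ('.') are skipped, as described in the notes.
    """
    fi_index = format_string.split(":").index("FI")
    for sample in sample_strings:
        if sample == ".":
            continue  # no-call: not considered
        if sample.split(":")[fi_index] != "1":
            return False  # at least one called sample failed a filter
    return True

FORMAT = "GT:GQ:DP:SP:PL:FI"
example_a = ["1/1:7:3:0:22,0,4:0", ".", "1/1:99:45:255:74,0:1", "."]
example_b = ["1/1:99:31:0:255,74,0:1", ".", ".", "1/1:99:45:0:255,50,0:1"]

passes_a = site_passes(FORMAT, example_a)
passes_b = site_passes(FORMAT, example_b)
print("example (a):", "PASS" if passes_a else "not PASS")
print("example (b):", "PASS" if passes_b else "not PASS")
```

Because heterozygous calls always carry 'FI' = '0', any site with a heterozygous genotype fails this test, consistent with 'Het' being added to its FILTER column.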

- Functional consequences
Ensembl now uses consequence terms defined by the Sequence Ontology
(SO) by default. All definitions of the predicted functional
consequences can be found here:
http://www.ensembl.org/info/docs/variation/predicted_data.html#consequences

In our release VCF files, predicted functional consequences are indicated by
the 'CSQ' field in the INFO tag. Where no 'CSQ' tag is present, the SNP
or indel is classified under the SO term 'intergenic_variant'.

- Multiple alternative alleles and consequences
In cases where different strains have different alternative alleles
which have different consequences, they can be distinguished by
checking the 'Allele' in the 'CSQ' line.
   eg: Alternative alleles = G,T and CSQ=ENSMUST00000047577:ENSMUSG00000042414:
   missense_variant:601:201:A>P:Grantham,27:Allele,G:Gene,Prdm14+ENSMUST00000047577:
   ENSMUSG00000042414:missense_variant:601:201:A>T:Grantham,58:Allele,T:Gene,Prdm14

The strain with GT='1/1' is G/G and has an A>P amino acid
substitution, and the strain with GT='2/2' is T/T and has an A>T amino
acid substitution.
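
The 'CSQ' value in the example can be taken apart with plain string splitting. The layout assumed below is inferred only from that one example: entries separated by '+', fields separated by ':', and 'key,value' pairs such as 'Allele,G'. The function names are hypothetical.

```python
def parse_csq(csq_value):
    """Split a CSQ value into one dict per '+'-separated entry."""
    entries = []
    for raw_entry in csq_value.split("+"):
        parts = raw_entry.split(":")
        # 'key,value' fields such as 'Grantham,27' or 'Allele,G'
        tagged = dict(part.split(",", 1) for part in parts if "," in part)
        # the amino acid change is the field written as 'X>Y'
        aa_change = next((part for part in parts if ">" in part), None)
        entries.append({
            "transcript": parts[0],
            "gene_id": parts[1],
            "consequence": parts[2],
            "aa_change": aa_change,
            **tagged,
        })
    return entries

def aa_change_by_allele(csq_value):
    """Map each alternative allele to its predicted amino acid change."""
    return {e["Allele"]: e["aa_change"] for e in parse_csq(csq_value) if "Allele" in e}

csq = (
    "ENSMUST00000047577:ENSMUSG00000042414:missense_variant:601:201:A>P:"
    "Grantham,27:Allele,G:Gene,Prdm14+ENSMUST00000047577:ENSMUSG00000042414:"
    "missense_variant:601:201:A>T:Grantham,58:Allele,T:Gene,Prdm14"
)
alt_alleles = ["G", "T"]
by_allele = aa_change_by_allele(csq)

for gt in ("1/1", "2/2"):
    allele = alt_alleles[int(gt.split("/")[0]) - 1]  # GT index 1 is the first ALT
    print(gt, allele, by_allele[allele])
```

Looking the consequence up by allele, not by entry order, avoids relying on the CSQ entries being listed in the same order as the ALT column.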


Beginners should set aside a full 13 hours to study this database carefully.
