查看原文
其他

肿瘤全外显子测序数据分析流程大放送

曾健明 生信技能树 2022-06-07

这个一个肿瘤外显子项目的文章发表并且公布的公共数据,我这里给出全套分析流程代码。只需要你肯实践,就可以运行成功。

PS:有些后起之秀自己运营公众号或者博客喜欢批评我们这些老人,一味的堆砌代码不给解释,恶意揣测我们是因为不懂代码的原理。我表示很无语,我写了3000多篇教程,要求我一篇篇都重复提到基础知识,我真的做不到。比如下面的流程,包括软件的用法,软件安装,注释数据库的下载,我博客都说过好几次了,直播我的基因组系列也详细解读过,我告诉你去哪里学,你却不珍惜,不当回事,呵呵。

文章解读

paper;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4812767/ a whole-exome sequencing (WES) study was performed in 161 NPC cases and 895 controls of Southern Chinese descent

测序概述:

We sequenced the blood samples from 161 NPC cases, including 39 EAO cases, 63 FH+ cases from 52 independent families, and 59 sporadic cases by WES and achieved an average coverage of 49-fold on target (range of 32- to 76-fold)

后来还扩大了验证样本集:

An additional 2,160 NPC cases and 2,433 healthy controls from Hong Kong were further examined for the selected candidate variants.

数据全部上传了:

Whole-exome sequencing data for the early-age onset cases have been deposited in the Sequence Read Archive (SRA) database (accession ID SRA291701).

了解WES数据分析步骤: 2016-A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data

选择部分数据如下:

随便选择了5个样本,其ID号及描述如下,

  1. SRX445405    MALE    NPC15   SRR1139956  NPC15F  NO  SRS540548   NPC15F-T

  2. SRX445406    MALE    NPC15   SRR1139958  NPC15F  NO  SRS540549   NPC15F-N

  3. SRX445407    MALE    NPC29   SRR1139966  NPC29F  YES SRS540550   NPC29F-T

  4. SRX445408    MALE    NPC29   SRR1139973  NPC29F  YES SRS540551   NPC29F-N

  5. SRX445409    FEMALE  NPC10   SRR1139999  NPC10F  NO  SRS540552   NPC10F-T

  6. SRX445410    FEMALE  NPC10   SRR1140007  NPC10F  NO  SRS540553   NPC10F-N

  7. SRX445411    FEMALE  NPC34   SRR1140015  NPC34F  NO  SRS540554   NPC34F-T

  8. SRX445412    FEMALE  NPC34   SRR1140023  NPC34F  NO  SRS540555   NPC34F-N

  9. SRX445413    MALE    NPC37   SRR1140044  NPC37F  YES SRS540556   NPC37F-T

  10. SRX445414    MALE    NPC37   SRR1140045  NPC37F  YES SRS540557   NPC37F-N

从SRA数据库下载并转换为fastq测序数据文件

把上面的描述文本存为文件npc.sra.txt下载脚本如下:

  1. cat npc.sra.txt | cut -f 4|while read id

  2. do echo $id

  3. wget -c ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP035/SRP035573/$id/$id.sra

  4. done

转换脚本如下:

  1. cat npc.sra.txt | while read id

  2. do

  3. array=($id)

  4. echo  ${array[3]}.sra  ${array[7]}

  5. ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --gzip --split-3 -A \  

  6. ${array[7]}  ${array[3]}.sra

  7. done

得到的fastq文件如下:

  1. 3.5G Aug 26 09:48 NPC10F-N_1.fastq.gz

  2. 3.6G Aug 26 09:48 NPC10F-N_2.fastq.gz

  3. 3.2G Aug 26 09:48 NPC10F-T_1.fastq.gz

  4. 3.3G Aug 26 09:48 NPC10F-T_2.fastq.gz

  5. 2.7G Aug 26 09:47 NPC15F-N_1.fastq.gz

  6. 2.8G Aug 26 09:47 NPC15F-N_2.fastq.gz

  7. 2.8G Aug 26 09:47 NPC15F-T_1.fastq.gz

  8. 2.9G Aug 26 09:47 NPC15F-T_2.fastq.gz

  9. 2.8G Aug 26 09:47 NPC29F-N_1.fastq.gz

  10. 2.9G Aug 26 09:47 NPC29F-N_2.fastq.gz

  11. 2.4G Aug 26 09:47 NPC29F-T_1.fastq.gz

  12. 2.5G Aug 26 09:47 NPC29F-T_2.fastq.gz

  13. 2.0G Aug 26 09:47 NPC34F-N_1.fastq.gz

  14. 2.0G Aug 26 09:47 NPC34F-N_2.fastq.gz

  15. 2.2G Aug 26 09:48 NPC34F-T_1.fastq.gz

  16. 2.3G Aug 26 09:48 NPC34F-T_2.fastq.gz

  17. 2.1G Aug 26 09:47 NPC37F-N_1.fastq.gz

  18. 2.1G Aug 26 09:47 NPC37F-N_2.fastq.gz

  19. 2.2G Aug 26 09:46 NPC37F-T_1.fastq.gz

  20. 2.2G Aug 26 09:46 NPC37F-T_2.fastq.gz

简单的走一下fastqc+multiqc 看看数据质量,一般都会很不错的,这个数据也不例外。

  1. ls *.gz |xargs ~/biosoft/fastqc/FastQC/fastqc -o ./ -t 5

但是值得注意的是本文章中的测序reads的编码格式是phred+64 而不是传统的33

然后走WES的标准SNP-calling流程

选用的是经典的GATK best practice的流程,代码如下:

  1. module load java/1.8.0_91

  2. GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

  3. INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37

  4. GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar

  5. PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar

  6. DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

  7. SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

  8. INDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz

  9. KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

  10. Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

  11. KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

  12. TMPDIR=/home/jianmingzeng/tmp/software

  13. ## samtools and bwa are in the environment

  14. ## samtools Version: 1.3.1 (using htslib 1.3.1)

  15. ## bwa Version: 0.7.15-r1140

  16. #arr=($1)

  17. #fq1=${arr[0]}

  18. #fq2=${arr[1]}

  19. #sample=${arr[2]}

  20. fq1=$1

  21. fq2=$2

  22. sample=$3

  23. #####################################################

  24. ################ Step 1 : Alignment #################

  25. #####################################################

  26. start=$(date +%s.%N)

  27. echo bwa `date`

  28. bwa mem -t 5 -M  -R "@RG\tID:$sample\tSM:$sample\tLB:WES\tPL:Illumina" $INDEX $fq1 $fq2 1>$sample.sam 2>/dev/null

  29. echo bwa `date`

  30. dur=$(echo "$(date +%s.%N) - $start" | bc)

  31. printf "Execution time for BWA : %.6f seconds" $dur

  32. echo

  33. #####################################################

  34. ################ Step 2: Sort and Index #############

  35. #####################################################

  36. start=$(date +%s.%N)

  37. echo SortSam `date`

  38. java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD SortSam SORT_ORDER=coordinate INPUT=$sample.sam OUTPUT=$sample.bam

  39. samtools index $sample.bam

  40. echo SortSam `date`

  41. dur=$(echo "$(date +%s.%N) - $start" | bc)

  42. printf "Execution time for SortSam : %.6f seconds" $dur

  43. echo

  44. rm $sample.sam

  45. #####################################################

  46. ################ Step 3: Basic Statistics ###########

  47. #####################################################

  48. start=$(date +%s.%N)

  49. echo stats `date`

  50. samtools flagstat $sample.bam > ${sample}.alignment.flagstat

  51. samtools stats  $sample.bam > ${sample}.alignment.stat

  52. echo plot-bamstats -p ${sample}_QC  ${sample}.alignment.stat

  53. echo stats `date`

  54. dur=$(echo "$(date +%s.%N) - $start" | bc)

  55. printf "Execution time for Basic Statistics : %.6f seconds" $dur

  56. echo

  57. #####################################################

  58. ####### Step 4: multiple filtering for bam files ####

  59. #####################################################

  60. ###MarkDuplicates###

  61. start=$(date +%s.%N)

  62. echo MarkDuplicates `date`

  63. java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD MarkDuplicates \

  64. INPUT=$sample.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metrics

  65. echo MarkDuplicates `date`

  66. dur=$(echo "$(date +%s.%N) - $start" | bc)

  67. printf "Execution time for MarkDuplicates : %.6f seconds" $dur

  68. echo

  69. rm $sample.bam  

  70. ###FixMateInfo###

  71. start=$(date +%s.%N)

  72. echo FixMateInfo `date`

  73. java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD FixMateInformation \

  74. INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinate

  75. samtools index ${sample}_marked_fixed.bam

  76. echo FixMateInfo `date`

  77. dur=$(echo "$(date +%s.%N) - $start" | bc)

  78. printf "Execution time for FixMateInfo  : %.6f seconds" $dur

  79. echo

  80. rm ${sample}_marked.bam

  81. #####################################################

  82. ####### Step 5: gatk process bam files ##############

  83. #####################################################

  84. ### SplitNCigar ###

  85. start=$(date +%s.%N)

  86. echo SplitNCigar `date`

  87. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SplitNCigarReads \

  88. -R $GENOME  -I ${sample}_marked_fixed.bam  -o ${sample}_marked_fixed_split.bam \

  89. -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

  90. #--fix_misencoded_quality_scores

  91. ## --fix_misencoded_quality_scores only if phred 64

  92. echo SplitNCigar `date`

  93. dur=$(echo "$(date +%s.%N) - $start" | bc)

  94. printf "Execution time for SplitNCigar : %.6f seconds" $dur

  95. echo

  96. rm ${sample}_marked_fixed.bam

  97. # rm ${sample}.sam ${sample}.bam ${sample}_marked.bam ${sample}_marked_fixed.bam

  98. ###RealignerTargetCreator###

  99. start=$(date +%s.%N)

  100. echo RealignerTargetCreator `date`

  101. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T RealignerTargetCreator \

  102. -I ${sample}_marked_fixed_split.bam -R $GENOME -o ${sample}_target.intervals \

  103. -known $Mills_indels -known $KG_indels -nt 5

  104. echo RealignerTargetCreator `date`

  105. dur=$(echo "$(date +%s.%N) - $start" | bc)

  106. printf "Execution time for RealignerTargetCreator : %.6f seconds" $dur

  107. echo

  108. ###IndelRealigner###

  109. start=$(date +%s.%N)

  110. echo IndelRealigner `date`

  111. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T IndelRealigner \

  112. -I ${sample}_marked_fixed_split.bam  -R $GENOME -targetIntervals ${sample}_target.intervals \

  113. -o ${sample}_realigned.bam -known $Mills_indels -known $KG_indels

  114. echo IndelRealigner `date`

  115. dur=$(echo "$(date +%s.%N) - $start" | bc)

  116. printf "Execution time for IndelRealigner : %.6f seconds" $dur

  117. echo

  118. rm ${sample}_marked_fixed_split.bam

  119. ###BaseRecalibrator###

  120. start=$(date +%s.%N)

  121. echo BaseRecalibrator `date`

  122. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T BaseRecalibrator \

  123. -I ${sample}_realigned.bam -R $GENOME -o ${sample}_temp.table -knownSites $DBSNP

  124. echo BaseRecalibrator `date`

  125. dur=$(echo "$(date +%s.%N) - $start" | bc)

  126. printf "Execution time for BaseRecalibrator : %.6f seconds" $dur

  127. echo

  128. ###PrintReads###

  129. start=$(date +%s.%N)

  130. echo PrintReads `date`

  131. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T PrintReads \

  132. -R $GENOME -I ${sample}_realigned.bam -o ${sample}_recal.bam -BQSR ${sample}_temp.table

  133. samtools index ${sample}_recal.bam

  134. echo PrintReads `date`

  135. dur=$(echo "$(date +%s.%N) - $start" | bc)

  136. printf "Execution time for PrintReads : %.6f seconds" $dur

  137. echo

  138. rm  ${sample}_realigned.bam

  139. chmod uga=r   ${sample}_recal.bam

  140. #####################################################

  141. ############## Step 6: gatk call snp/indel##########

  142. #####################################################

  143. ###

  144. start=$(date +%s.%N)

  145. echo HaplotypeCaller `date`

  146. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T HaplotypeCaller  \

  147. -R $GENOME -I ${sample}_recal.bam --dbsnp $DBSNP  \

  148. -stand_emit_conf 10 -o  ${sample}_raw.snps.indels.vcf

  149. echo HaplotypeCaller `date`

  150. dur=$(echo "$(date +%s.%N) - $start" | bc)

  151. printf "Execution time for HaplotypeCaller : %.6f seconds" $dur

  152. echo

  153. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants  -R $GENOME  \

  154. -selectType SNP \

  155. -V ${sample}_raw.snps.indels.vcf -o ${sample}_raw_snps.vcf

  156. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants  -R $GENOME \

  157. -selectType INDEL  \

  158. -V ${sample}_raw.snps.indels.vcf   -o ${sample}_raw_indels.vcf

  159. ##

  160. :'

  161. '

  162. ## for SNP

  163. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME  \

  164. --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"  \

  165. --filterName "my_snp_filter" \

  166. -V ${sample}_raw_snps.vcf  -o ${sample}_filtered_snps.vcf  

  167. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants -R $GENOME  \

  168. --excludeFiltered \

  169. -V ${sample}_filtered_snps.vcf  -o  ${sample}_filtered_PASS.snps.vcf

  170. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantEval -R $GENOME  \

  171. -eval ${sample}_filtered_PASS.snps.vcf -o  ${sample}_filtered_PASS.snps.vcf.eval

  172. ## for  INDEL

  173. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME  \

  174. --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0"  \

  175. --filterName "my_indel_filter" \

  176. -V ${sample}_raw_indels.vcf  -o ${sample}_filtered_indels.vcf  

  177. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T SelectVariants -R $GENOME  \

  178. --excludeFiltered \

  179. -V ${sample}_filtered_indels.vcf  -o  ${sample}_filtered_PASS.indels.vcf

  180. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T VariantEval -R $GENOME  \

  181. -eval ${sample}_filtered_PASS.indels.vcf -o  ${sample}_filtered_PASS.indels.vcf.eval

比对成功后得到的bam文件如下;

  1. 7.7G Aug 26 19:22 NPC10F-N_marked_fixed.bam

  2. 7.7G Aug 26 22:59 NPC10F-N_marked_fixed_split.bam

  3. 7.7G Aug 26 23:57 NPC10F-N_realigned.bam

  4. 15G Aug 27 03:45 NPC10F-N_recal.bam

  5. 7.0G Aug 26 19:49 NPC10F-T_marked_fixed.bam

  6. 7.0G Aug 26 22:55 NPC10F-T_marked_fixed_split.bam

  7. 7.0G Aug 26 23:48 NPC10F-T_realigned.bam

  8. 13G Aug 27 03:12 NPC10F-T_recal.bam

Snp-calling结束后得到的vcf如下:

  1.    82576 NPC10F-N_raw_indels.vcf

  2.   863243 NPC10F-N_raw_snps.vcf

  3.   945753 NPC10F-N_recal_raw.snps.indels.vcf

  4.    89604 NPC10F-T_raw_indels.vcf

  5.   819657 NPC10F-T_raw_snps.vcf

  6.   909190 NPC10F-T_recal_raw.snps.indels.vcf

这些vcf里面的变异位点还需要进行简单的过滤,或者只提取外显子区域的变异位点。(从WGS测序得到的VCF文件里面提取位于外显子区域的【直播】我的基因组84)

消耗时间如下;

  1. # for NPC10F-N_1.fastq.gz NPC10F-N_2.fastq.gz NPC10F-N

  2. bwa Sat Aug 26 16:04:44 CST 2017

  3. bwa Sat Aug 26 17:07:11 CST 2017

  4. SortSam Sat Aug 26 17:07:11 CST 2017

  5. SortSam Sat Aug 26 17:45:04 CST 2017

  6. stats Sat Aug 26 17:45:04 CST 2017

  7. plot-bamstats -p NPC10F-N_QC NPC10F-N.alignment.stat

  8. stats Sat Aug 26 17:56:07 CST 2017

  9. MarkDuplicates Sat Aug 26 17:56:07 CST 2017

  10. MarkDuplicates Sat Aug 26 18:40:25 CST 2017

  11. FixMateInfo Sat Aug 26 18:40:25 CST 2017

  12. FixMateInfo Sat Aug 26 19:26:39 CST 2017

  13. SplitNCigar Sat Aug 26 22:20:48 CST 2017

  14. SplitNCigar Sat Aug 26 22:59:32 CST 2017

  15. RealignerTargetCreator Sat Aug 26 22:59:32 CST 2017

  16. RealignerTargetCreator Sat Aug 26 23:17:35 CST 2017

  17. IndelRealigner Sat Aug 26 23:17:35 CST 2017

  18. IndelRealigner Sat Aug 26 23:57:35 CST 2017

  19. BaseRecalibrator Sat Aug 26 23:57:35 CST 2017

  20. BaseRecalibrator Sun Aug 27 01:27:39 CST 2017

  21. PrintReads Sun Aug 27 01:27:39 CST 2017

  22. PrintReads Sun Aug 27 03:52:03 CST 2017

  23. #for NPC10F-T_1.fastq.gz NPC10F-T_2.fastq.gz NPC10F-T

  24. bwa Sat Aug 26 16:54:14 CST 2017

  25. bwa Sat Aug 26 17:48:07 CST 2017

  26. SortSam Sat Aug 26 17:48:07 CST 2017

  27. SortSam Sat Aug 26 18:20:48 CST 2017

  28. stats Sat Aug 26 18:20:48 CST 2017

  29. plot-bamstats -p NPC10F-T_QC NPC10F-T.alignment.stat

  30. stats Sat Aug 26 18:30:45 CST 2017

  31. MarkDuplicates Sat Aug 26 18:30:45 CST 2017

  32. MarkDuplicates Sat Aug 26 19:11:40 CST 2017

  33. FixMateInfo Sat Aug 26 19:11:40 CST 2017

  34. FixMateInfo Sat Aug 26 19:52:37 CST 2017

  35. SplitNCigar Sat Aug 26 22:20:54 CST 2017

  36. SplitNCigar Sat Aug 26 22:55:53 CST 2017

  37. RealignerTargetCreator Sat Aug 26 22:55:53 CST 2017

  38. RealignerTargetCreator Sat Aug 26 23:14:19 CST 2017

  39. IndelRealigner Sat Aug 26 23:14:19 CST 2017

  40. IndelRealigner Sat Aug 26 23:48:43 CST 2017

  41. BaseRecalibrator Sat Aug 26 23:48:43 CST 2017

  42. BaseRecalibrator Sun Aug 27 01:10:26 CST 2017

  43. PrintReads Sun Aug 27 01:10:26 CST 2017

  44. PrintReads Sun Aug 27 03:18:33 CST 2017

外显子的QC结果(这个是我自己写的脚本)是:

  1. ## for NPC10F-N

  2. 32541594    2392028455  0.982188933530586   73.5068004044301

  3. 18711840    934414537   0.971564415370185   49.9370739061471

  4. 17075251    505674931   0.895198983425405   29.6144947444696

  5. 15656543    241509710   0.819070615186704   15.4254812189383

  6. ## for NPC10F-T

  7. 32348763    2946257280  0.97636879840624    91.0778962398037

  8. 18182204    1054428864  0.94406442121146    57.9923569221861

  9. 16075260    555403212   0.842772759844003   34.5501853158207

  10. 14181528    265599059   0.741905340358179   18.7285219900141

可以看到T的测序深度高于N,符合实验设计。T的外显子区域平均测序深度高达91X,是比较好的数据了,而且外显子旁边50bp范围区域平均测序深度还有57X,旁边100bp区域还有34X,也说明这款芯片的捕获效率比较好

然后走走somatic mutation calling流程

因为是配对数据,还可以走somatic mutation calling流程

  1. reference=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

  2. GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

  3. TMPDIR=/home/jianmingzeng/tmp/software

  4. normal_bam=NPC10F-N_recal.bam

  5. tumor_bam=NPC10F-T_recal.bam

  6. sample=NPC10F

  7. #####################################################

  8. ################### Step : Run VarScan #############

  9. #####################################################

  10. normal_pileup="samtools mpileup -q 1 -f $reference $normal_bam";

  11. tumor_pileup="samtools mpileup -q 1 -f $reference $tumor_bam";

  12. # Next, issue a system call that pipes input from these commands into VarScan :

  13. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g  -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar \

  14. somatic <($normal_pileup) <($tumor_pileup) $sample

  15. java -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar processSomatic $sample.snp

  16. #####################################################

  17. ################### Step : Run Mutect2 #############

  18. #####################################################

  19. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK  -T MuTect2 \

  20. -R $GENOME -I:tumor $tumor_bam  -I:normal $normal_bam \

  21. --dbsnp  $DBSNP   -o ${sample}-mutect2.vcf

  22. #####################################################

  23. ################### Step : Run Muse#################

  24. #####################################################

  25. ~/biosoft/muse/muse call -O $sample -f $reference $tumor_bam $normal_bam

  26. ~/biosoft/muse/muse sump -I ${sample}.MuSE.txt -E O ${sample}.vcf D $DBSNP

其中varscan耗费两个半小时,结果如下:

  1. 893539210 positions in tumor

  2. 891822458 positions shared in normal

  3. 79827518 had sufficient coverage for comparison

  4. 79718238 were called Reference

  5. 26 were mixed SNP-indel calls and filtered

  6. 102492 were called Germline

  7. 3703 were called LOH

  8. 2569 were called Somatic

  9. 490 were called Unknown

  10. 0 were called Variant

  11. Reading input from NPC10F.snp

  12. Opening output files: NPC10F.snp.Somatic NPC10F.snp.Germline NPC10F.snp.LOH

  13. 101352 VarScan calls processed

  14. 2342 were Somatic (556 high confidence)

  15. 95144 were Germline (4830 high confidence)

  16. 3502 were LOH (3484 high confidence)

这3个软件找到的somatic mutation可以互相对比一下,差异主要是在哪里,都值得自己去探究,这样最终确定的肿瘤外显子测序数据分析流程就更有把握。

需要安装的软件

软件太多了,我就不一一列出具体代码了,还有很多需要下载的参考基因组,变异数据库也是以前直播基因组的时候已经反复提到过,也不赘述啦。

  1. conda install -c bioconda bedtools

  2. conda install -c bioconda bwa

  3. conda install -c bioconda samtools

  4. cd ~/biosoft

  5. mkdir sratoolkit &&  cd sratoolkit

  6. wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz

  7. tar zxvf sratoolkit.current-centos_linux64.tar.gz

  8. ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastdump -h

  9. ## https://sourceforge.net/projects/picard/

  10. ## https://github.com/broadinstitute/picard

  11. cd ~/biosoft

  12. mkdir picardtools &&  cd picardtools

  13. wget http://ncu.dl.sourceforge.net/project/picard/picard-tools/1.119/picard-tools-1.119.zip

  14. unzip picard-tools-1.119.zip

  15. mkdir 2.9.2 && cd 2.9.2

  16. wget https://github.com/broadinstitute/picard/releases/download/2.9.2/picard.jar

  17. cd ~/biosoft

  18. ## https://sourceforge.net/projects/varscan/files/

  19. ## http://varscan.sourceforge.net/index.html

  20. mkdir VarScan  &&  cd VarScan  

  21. wget https://sourceforge.net/projects/varscan/files/VarScan.v2.3.9.jar

  22. cd ~/biosoft

  23. mkdir SnpEff &&  cd SnpEff

  24. ##    http://snpeff.sourceforge.net/

  25. ##    http://snpeff.sourceforge.net/SnpSift.html

  26. ##    http://snpeff.sourceforge.net/SnpEff_manual.html

  27. wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip

  28. ## java -jar snpEff.jar download GRCh37.75

  29. ## java -Xmx4G -jar snpEff.jar -i vcf -o vcf GRCh37.75 example.vcf > example_snpeff.vcf

  30. unzip snpEff_latest_core.zip


老规矩,点击下面的阅读原文,链接都是可以点击的。

另外:不要点击广告,不要点击广告,不要点击广告


猜你喜欢

基因组 游记 | 工作资讯 

学习课程 | 好书分享 


菜鸟入门

Linux | Perl | R语言 | 可视化 

R包 | perl模块 | python模块


数据分析

ChIP-seq(上)ChIP-seq(下)RNA-seq | miRNA

WGS,WES,RNA-seq组与ChIP-seq之间的异同


编程实践

第0题 | 探索人类基因组序列 | 最后一题


直播基因组分析

我的基因组 | 解惑帖

一个标准的基因检测报告目录

生信技能树



您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存