查看原文
其他

TCGA-4.使用RTCGA包获取数据

豆豆花花 生信星球 2022-06-06

 今天是生信星球陪你的第512天


   大神一句话,菜鸟跑半年。我不是大神,但我可以缩短你走弯路的半年~

   就像歌儿唱的那样,如果你不知道该往哪儿走,就留在这学点生信好不好~

   这里有豆豆和花花的学习历程,从新手到进阶,生信路上有你有我!

花花写于2020-01-03 

本系列是我的TCGA学习记录,跟着生信技能树B站课程学的,已获得授权(嗯,真的^_^)。课程链接:https://www.bilibili.com/video/av49363776

目录:

TCGA-1.数据下载
TCGA2.GDC数据整理
TCGA3.R包TCGA-biolinks下载数据

从RTCGA包中获取数据,这个包是将TCGA的数据全部下载下来,在R中用代码提取。有一个缺点,就是更新不及时,不是最新的数据。要最新的数据还是gdc下载。

示例是提取了KIRC的miRNA数据和clinical数据。

一.安装R包

if (!require(RTCGA))BiocManager::install("RTCGA")
if (!require(RTCGA.miRNASeq))BiocManager::install("RTCGA.miRNASeq")
if (!require(RTCGA.clinical))BiocManager::install("RTCGA.clinical")

二.miRNAseq数据

RTCGA.miRNASeq 包中存储着各种癌症的miRNA表达信息,可从帮助文档中找到对应数据集名称。

library(RTCGA.miRNASeq)
#?miRNASeq
# [1] "ACC.miRNASeq"      "BLCA.miRNASeq"     "BRCA.miRNASeq"    
# [4] "CESC.miRNASeq"     "CHOL.miRNASeq"     "COAD.miRNASeq"    
# [7] "COADREAD.miRNASeq" "DLBC.miRNASeq"     "ESCA.miRNASeq"    
# [10] "FPPP.miRNASeq"     "GBM.miRNASeq"      "GBMLGG.miRNASeq"  
# [13] "HNSC.miRNASeq"     "KICH.miRNASeq"     "KIPAN.miRNASeq"   
# [16] "KIRC.miRNASeq"     "KIRP.miRNASeq"     "LAML.miRNASeq"    
# [19] "LGG.miRNASeq"      "LIHC.miRNASeq"     "LUAD.miRNASeq"    
# [22] "LUSC.miRNASeq"     "MESO.miRNASeq"     "OV.miRNASeq"      
# [25] "PAAD.miRNASeq"     "PCPG.miRNASeq"     "PRAD.miRNASeq"    
# [28] "READ.miRNASeq"     "SARC.miRNASeq"     "SKCM.miRNASeq"    
# [31] "STAD.miRNASeq"     "STES.miRNASeq"     "TGCT.miRNASeq"    
# [34] "THCA.miRNASeq"     "THYM.miRNASeq"     "UCEC.miRNASeq"    
# [37] "UCS.miRNASeq"      "UVM.miRNASeq"    

1.探索KIRC.miRNASeq

KIRC.miRNASeq[1:10,1:4]
#>                                                 machine
#> TCGA-A3-3307-01A-01T-0860-13   Illumina Genome Analyzer
#> TCGA-A3-3307-01A-01T-0860-13.1 Illumina Genome Analyzer
#> TCGA-A3-3307-01A-01T-0860-13.2 Illumina Genome Analyzer
#> TCGA-A3-3308-01A-02R-1324-13   Illumina Genome Analyzer
#> TCGA-A3-3308-01A-02R-1324-13.1 Illumina Genome Analyzer
#> TCGA-A3-3308-01A-02R-1324-13.2 Illumina Genome Analyzer
#> TCGA-A3-3311-01A-02R-1324-13   Illumina Genome Analyzer
#> TCGA-A3-3311-01A-02R-1324-13.1 Illumina Genome Analyzer
#> TCGA-A3-3311-01A-02R-1324-13.2 Illumina Genome Analyzer
#> TCGA-A3-3313-01A-02R-1324-13   Illumina Genome Analyzer
#>                                                      miRNA_ID
#> TCGA-A3-3307-01A-01T-0860-13                       read_count
#> TCGA-A3-3307-01A-01T-0860-13.1 reads_per_million_miRNA_mapped
#> TCGA-A3-3307-01A-01T-0860-13.2                   cross-mapped
#> TCGA-A3-3308-01A-02R-1324-13                       read_count
#> TCGA-A3-3308-01A-02R-1324-13.1 reads_per_million_miRNA_mapped
#> TCGA-A3-3308-01A-02R-1324-13.2                   cross-mapped
#> TCGA-A3-3311-01A-02R-1324-13                       read_count
#> TCGA-A3-3311-01A-02R-1324-13.1 reads_per_million_miRNA_mapped
#> TCGA-A3-3311-01A-02R-1324-13.2                   cross-mapped
#> TCGA-A3-3313-01A-02R-1324-13                       read_count
#>                                hsa-let-7a-1 hsa-let-7a-2
#> TCGA-A3-3307-01A-01T-0860-13           5056        10323
#> TCGA-A3-3307-01A-01T-0860-13.1  3938.035142  8040.414709
#> TCGA-A3-3307-01A-01T-0860-13.2            N            N
#> TCGA-A3-3308-01A-02R-1324-13          14503        29238
#> TCGA-A3-3308-01A-02R-1324-13.1  9162.065524 18470.693772
#> TCGA-A3-3308-01A-02R-1324-13.2            N            Y
#> TCGA-A3-3311-01A-02R-1324-13           8147        16325
#> TCGA-A3-3311-01A-02R-1324-13.1  4917.159117  9853.028426
#> TCGA-A3-3311-01A-02R-1324-13.2            N            N
#> TCGA-A3-3313-01A-02R-1324-13           7138        14356
expr1 <- expressionsTCGA(KIRC.miRNASeq)
dim(expr1)
#> [1] 1779 1049
expr1[1:10,1:4]
#> # A tibble: 10 x 4
#>    machine            miRNA_ID             `hsa-let-7a-1` `hsa-let-7a-2`
#>    <chr>              <chr>                <chr>          <chr>         
#>  1 Illumina Genome A… read_count           5056           10323         
#>  2 Illumina Genome A… reads_per_million_m… 3938.035142    8040.414709   
#>  3 Illumina Genome A… cross-mapped         N              N             
#>  4 Illumina Genome A… read_count           14503          29238         
#>  5 Illumina Genome A… reads_per_million_m… 9162.065524    18470.693772  
#>  6 Illumina Genome A… cross-mapped         N              Y             
#>  7 Illumina Genome A… read_count           8147           16325         
#>  8 Illumina Genome A… reads_per_million_m… 4917.159117    9853.028426   
#>  9 Illumina Genome A… cross-mapped         N              N             
#> 10 Illumina Genome A… read_count           7138           14356

发现

  • 从第三列开始是miRNA表达量

  • 每三行是一个样本的信息;我们只需要read_count。

  • expressionsTCGA()函数处理后行名丢失,需要补充回去。

2.整理为表达矩阵

取3-ncol()列,每三行取第一行,补上行名

expr2 = expr1[seq(1, nrow(expr1), by = 3), 3:ncol(expr1)]
dim(expr2)
#> [1]  593 1047
expr2[1:4,1:4]
#> # A tibble: 4 x 4
#>   `hsa-let-7a-1` `hsa-let-7a-2` `hsa-let-7a-3` `hsa-let-7b`
#>   <chr>          <chr>          <chr>          <chr>       
#> 1 5056           10323          5429           17908       
#> 2 14503          29238          14738          37062       
#> 3 8147           16325          8249           28984       
#> 4 7138           14356          7002           6909
#此时表达量均为字符型,改为数值型
expr2 = apply(expr2, 2, as.numeric)
#> Warning in apply(expr2, 2, as.numeric): NAs introduced by coercion
#行名补回去
rownames(expr2) = rownames(KIRC.miRNASeq)[seq(1, nrow(KIRC.miRNASeq), by = 3)]
expr = t(expr2)
expr = na.omit(expr)
expr[1:4,1:4]
#>              TCGA-A3-3307-01A-01T-0860-13 TCGA-A3-3308-01A-02R-1324-13
#> hsa-let-7a-1                         5056                        14503
#> hsa-let-7a-2                        10323                        29238
#> hsa-let-7a-3                         5429                        14738
#> hsa-let-7b                          17908                        37062
#>              TCGA-A3-3311-01A-02R-1324-13 TCGA-A3-3313-01A-02R-1324-13
#> hsa-let-7a-1                         8147                         7138
#> hsa-let-7a-2                        16325                        14356
#> hsa-let-7a-3                         8249                         7002
#> hsa-let-7b                          28984                         6909

3.过滤低表达量的miRNA

有的miRNA在几百个样本中表达量都为零,需去除。过滤的方法不唯一!此处过滤的标准为:在10个以上样本中表达量>1,x>1返回逻辑值,sum()函数处理逻辑值向量,返回结果为TRUE的个数。

dim(expr)
#> [1] 1046  593
expr = expr[apply(expr, 1function(x) {
sum(x > 1) > 10
}), ]
dim(expr)
#> [1] 552 593

过滤掉了四百多个呐。

三.处理临床信息

library(RTCGA.clinical)
clinical <- KIRC.clinical
dim(clinical)
#> [1]  537 2809
#2809列,吓人了。把列名保存下来,挑里面有用的几列就好
clinical = clinical[c(
  'patient.bcr_patient_barcode',
  'patient.vital_status',
  'patient.days_to_death',
  'patient.days_to_last_followup',
  'patient.race',
  'patient.age_at_initial_pathologic_diagnosis',
  'patient.gender' ,
  'patient.stage_event.pathologic_stage'
)]
rownames(clinical) <- NULL
clinical <- tibble::column_to_rownames(clinical,var = 'patient.bcr_patient_barcode')
rownames(clinical) <- stringr::str_to_upper(rownames(clinical))
dim(clinical)
#> [1] 537   7
clinical[1:4,1:4]
#>              patient.vital_status patient.days_to_death
#> TCGA-3Z-A93Z                alive                  <NA>
#> TCGA-6D-AA2E                alive                  <NA>
#> TCGA-A3-3306                alive                  <NA>
#> TCGA-A3-3307                alive                  <NA>
#>              patient.days_to_last_followup              patient.race
#> TCGA-3Z-A93Z                             4 black or african american
#> TCGA-6D-AA2E                           135 black or african american
#> TCGA-A3-3306                          1120                     white
#> TCGA-A3-3307                          1436                      <NA>
dim(expr)
#> [1] 552 593
expr[1:4,1:4]
#>              TCGA-A3-3307-01A-01T-0860-13 TCGA-A3-3308-01A-02R-1324-13
#> hsa-let-7a-1                         5056                        14503
#> hsa-let-7a-2                        10323                        29238
#> hsa-let-7a-3                         5429                        14738
#> hsa-let-7b                          17908                        37062
#>              TCGA-A3-3311-01A-02R-1324-13 TCGA-A3-3313-01A-02R-1324-13
#> hsa-let-7a-1                         8147                         7138
#> hsa-let-7a-2                        16325                        14356
#> hsa-let-7a-3                         8249                         7002
#> hsa-let-7b                          28984                         6909

四.总结

通过RTCGA包,获取到了miRNA表达矩阵和临床信息。

表达矩阵中有593个样本,553个miRNA信息(过滤后的)。临床信息有537个。

这个不是最新的数据,只是简单而已。可以下载原始测序数据进行重新比对,可以拿到更多的miRNA信息。

插个小广告!

生信零基础入门学习小组长期报名中

GEO数据挖掘广州专场课程

再给生信技能树打个call!

全国巡讲第21站(长沙线下培训)

全球公益巡讲招学徒

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存