Recommended introductory course on statistics with R: Data Analysis for the Life Sciences

宏基因组 (Metagenome) 2022-03-28

The following article is from 植物微生物组 (Plant Microbiome). Author: 宏基因组 (Metagenome)

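Data Analysis for the Life Sciences is the textbook for Harvard's PH525x series, Biomedical Data Science. The whole course uses R for both the theory and the practice of statistical analysis. The book is written in R Markdown, which keeps it easy to read while making every analysis reproducible, in line with current expectations for reproducible computation in science. Readers can systematically learn the statistics a biologist needs, implement each analysis in code themselves, and work toward the reproducible-code standard expected by journals such as Cell, Nature and Science (CNS).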

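Traditional statistics texts focus on mathematical principles, whereas this book focuses on carrying out data analysis on a computer. It explains the mathematics through worked examples and provides code so that readers can run each analysis themselves. The entire text is written in R Markdown, so readers can reproduce every analysis.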

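About the authors:

Rafael A Irizarry is a professor of biostatistics and computational biology at the Dana-Farber Cancer Institute and the Harvard School of Public Health, with 17 years of experience analyzing genomic data.

Michael I Love is an assistant professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. His research uses statistical models to uncover biological patterns in genomic data, and he develops open-source statistical software for Bioconductor.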

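Course source code: https://github.com/genomicsclass/labs , including all course source code, test data and results.

Web version: https://genomicsclass.github.io/book/ , with the rendered Rmd pages plus per-section navigation and download links for the Rmd sources.

E-book: https://leanpub.com/dataanalysisforthelifesciences/ , downloadable in several formats for reading on mobile devices.

Interestingly, you can choose to study it for free or pay the authors up to $80.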

Course outline

https://genomicsclass.github.io/book/

PH525x series - Biomedical Data Science

Links and resources

  • R markdown source files

  • ePub version on Leanpub

  • Links to the HarvardX class pages

  • External resources and books

  • Finding more help for data analysis

Chapter 0 - Introduction

  • Introduction [Rmd]

  • Getting started [Rmd]

  • Getting started exercises

  • dplyr introduction [Rmd]

  • dplyr introduction exercises

  • Mathematical notation [Rmd]

Chapter 1 - Inference

  • Random variables [Rmd]

  • Random variables exercises

  • Populations and samples [Rmd]

  • Populations and samples exercises

  • CLT and t-distribution [Rmd]

  • CLT and t-distribution exercises

  • CLT in practice [Rmd]

  • CLT in practice exercises

  • t-test in practice [Rmd]

  • Confidence intervals [Rmd]

  • Power calculations [Rmd]

  • Power calculations exercises

  • Monte carlo [Rmd]

  • Monte carlo exercises

  • Permutation tests [Rmd]

  • Permutation tests exercises

  • Association tests [Rmd]

  • Association tests exercises

Chapter 2 - Exploratory Data Analysis

  • Exploratory data analysis [Rmd]

  • Plots to avoid [Rmd]

  • Exploratory data analysis exercises

Chapter 3 - Robust Statistics

  • Robust summaries [Rmd]

  • Rank tests [Rmd]

  • Robust summaries exercises

Chapter 4 - Matrix Algebra

  • Introduction to using regression [Rmd]

  • Introduction to using regression exercises

  • Matrix notation [Rmd]

  • Matrix notation exercises

  • Matrix operations [Rmd]

  • Matrix operations exercises

  • Matrix algebra examples [Rmd]

  • Matrix algebra examples exercises

Chapter 5 - Linear Models

  • Linear models introduction [Rmd]

  • Linear models introduction exercises

  • Expressing design formula [Rmd]

  • Expressing design formula exercises

  • Linear models in practice [Rmd]

  • Linear models in practice exercises

  • Standard errors [Rmd]

  • Standard errors exercises

  • Interactions and contrasts [Rmd]

  • Interactions and contrasts exercises

  • Collinearity [Rmd]

  • Collinearity exercises

  • QR and regression [Rmd]

  • Linear models going further [Rmd]

Chapter 6 - Inference for High-Dimensional Data

  • Introduction to high-throughput data [Rmd]

  • Introduction to high-throughput data exercises

  • Inference for high-throughput data [Rmd]

  • Inference for high-throughput data exercises

  • Multiple testing [Rmd]

  • Multiple testing exercises

  • EDA for high-throughput data [Rmd]

  • EDA for high-throughput data exercises

Chapter 7 - Statistical Modeling

  • Modeling [Rmd]

  • Modeling exercises

  • Bayes theorem [Rmd]

  • Bayes theorem exercises

  • Hierarchical models [Rmd]

  • Hierarchical models exercises

Chapter 8 - Distance and Dimension Reduction

  • Distance [Rmd]

  • Distance exercises

  • PCA motivation [Rmd]

  • SVD [Rmd]

  • SVD exercises

  • Projections [Rmd]

  • Rotations [Rmd]

  • MDS [Rmd]

  • MDS exercises

  • PCA [Rmd]

Chapter 9 - Practical Machine Learning

  • Clustering and heatmaps [Rmd]

  • Clustering and heatmaps exercises

  • Conditional expectation [Rmd]

  • Conditional expectation exercises

  • Smoothing [Rmd]

  • Smoothing exercises

  • Machine learning [Rmd]

  • Crossvalidation [Rmd]

  • Crossvalidation exercises

Chapter 10 - Batch Effects

  • Introduction to batch effects [Rmd]

  • Confounding [Rmd]

  • Confounding exercises

  • EDA with PCA [Rmd]

  • EDA with PCA exercises

  • Adjusting with linear models [Rmd]

  • Adjusting with linear models exercises

  • Factor analysis [Rmd]

  • Factor analysis exercises

  • Adjusting with factor analysis [Rmd]

  • Adjusting with factor analysis exercises

Chapter 11 - Introduction to Bioconductor

  • Mike Love’s general reference card

  • Motivations and core values (optional)

  • Installing Bioconductor and finding help [Rmd]

  • Data structure and management for genome scale experiments [Rmd]

    • Coordinating multiple tables: ExpressionSet

    • Institutional archives: GEO, ArrayExpress

  • Interlude: Working with general genomic features using GenomicRanges

    • IRanges introduced

    • Intra-range operations

    • Inter-range operations

    • GRanges

    • Calculating overlaps

  • Range-oriented solutions for current experimental paradigms

    • SummarizedExperiment: for RNA-seq and 450k methylation

    • External storage for very large assays

    • GenomicFiles for families of BAM or BED

    • DNA Variants: VCF handling with VariantAnnotation and VariantTools

    • Handling multiomic archives like TCGA

    • Cloud-oriented solutions: e.g., Google BigQuery

  • Short read mapping/alignment software (optional) [Rmd]

Chapter 12 - Genomic Annotation with Bioconductor

  • More details on GRanges [Rmd]

    • Run-length encoding, views

    • Application to genomic landmarks

    • Application to 450k methylation array visualization

  • General overview of Bioconductor annotation [Rmd]

    • Levels: reference sequence, regions of interest, pathways

    • Discovering reference sequence

    • A build of the human genome

    • Gene/Transcript/Exon catalogs from UCSC and Ensembl

    • Importing and exporting regions and scores

    • AnnotationHub: brokering thousands of annotation resources

    • OrgDb: simple interface to annotation databases

    • Finding and managing gene sets

    • OrganismDb: unifying diverse annotation

  • Cheat sheet on Bioconductor annotation [Rmd]

  • Translating addresses between genome builds: liftOver [Rmd]

Chapter 13 - Genome-scale hypothesis testing with Bioconductor

  • Distinguishing biological and technical variability [Rmd]

    • An experiment with pooled and individual samples

    • Measuring technical variation

    • Measuring biological variation

    • Interpretation

  • Multiple comparisons with genewise t-tests [Rmd]

    • Gene-wise testing

    • Naive enumeration of genes

    • Demonstrating danger of multiple testing with a set of sham comparisons

    • Adjusting for multiplicity with qvalue

    • Adjusted counts in the sham case

  • Moderated t tests via limma [Rmd]

    • A spike-in dataset

    • Naive t-tests

    • Three steps with limma: lmFit, eBayes, topTable

    • Exposing the spiked-in genes

    • A view of the shrinkage of variance estimates

  • Introducing gene sets and gene set analysis [Rmd]

    • Identifier remapping

    • Categorical testing

    • Statistical summaries for sets: Wilcoxon

    • Statistical summaries for sets: t statistics

    • A dataset for comparing expression by gender

    • Finding surrogate variables/batch effect correction

    • Data wrangling

    • The Broad Institute MsigDb

    • Adjusting for within-set correlation

    • A permutation procedure

Chapter 14 - Visualization of genome scale data

  • A basic overview of visualization tasks and strategies [Rmd]

    • Gene models

    • Gene models plus data

    • Driving visualizations with functions

    • Using the browser to drive visualization functions via shiny

    • Queriable dynamic displays with plotly

  • Annotation-oriented visualizations

    • Sketching the binding landscape over chromosomes with ggbio’s karyogram layout [Rmd]

    • Plotting data in the context of genomic features with Gviz [Rmd]

  • Visualizing NGS data [Rmd]

  • Interactive visualization

    • Graphical user interfaces for multivariate data with shiny [Rmd]

    • Clustering gene expression data with shiny [Rmd]

  • Final remarks on visualization [Rmd]

Chapter 15 - Pursuing scalability in genomic analysis: parallelism and out-of-memory data

  • Parallel computing with R and Bioconductor [Rmd]

    • Demonstrating simple speedup in multicore environments

    • Implicit parallelism with BiocParallel and GenomicAlignments

  • External data: data interfaces that spare RAM [Rmd]

    • SQLite for annotation

    • Tabix-indexed BAM

    • HDF5

    • An illustration of NoSQL with S4: mongodb and RaggedMongoExpt [Rmd]

  • Benchmarking various out-of-memory solutions [Rmd]

  • Introduction to Bioconductor’s Amazon Machine Instance for cluster creation and use in EC2 [Rmd]

  • Sharded GRanges for scalable integrative analysis [Rmd]

Chapter 16 - Multi-omic data integration

  • Basic examples of multi-omic integration [Rmd]

    • Transcription factor (TF) binding and gene coexpression in yeast

    • TF binding and GWAS hits in humans

  • Using RTCGAToolbox outputs to integrate clinical, mutation, expression and methylation assays [Rmd]

    • Associating tumor stage with expression patterns

    • Linking DNA methylation with expression patterns

    • Defining a severity marker

    • Extracting survival times

    • Basic data acquisition

    • Working with clinical data

    • Working with mutations

    • Curation tasks for discrepant identifier formats

    • Working with expression data

  • Application to visualization: kataegis and rainfall plot [Rmd]

Chapter 17 - Fostering reproducible genome-scale analysis

  • Overview of unit on reproducibility [Rmd]

    • Basic definitions

    • Infrastructure requirements

    • Statistical aspects of reproducibility

    • Analysis of reproducibility probability (Boos and Stefanski 2011)

    • Costs of highly reproducible designs

  • Package structure, creation, installation, management [Rmd]

    • create() to set up folders and DESCRIPTION

    • Composing documentation plus code

    • document(), install()

    • What is a package?

    • Using package.skeleton

    • Using makeOrganismPackage

    • Using devtools

    • Conclusions, including a link to a recent Nature Toolbox article on Bioconductor

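How to study

We recommend reading the web version online and practicing with the source code alongside it.

https://genomicsclass.github.io/book/ Work through it section by section. There is a lot of material, so readers can simply pick the chapters that suit them.

Every hands-on section comes with Rmd source code; download it and open it in a local RStudio.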

Download all resources in bulk

Windows download: https://github.com/genomicsclass/labs/archive/master.zip

On Linux, download with git or wget

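# Option 1: download the zip archive; it unpacks into the labs-master directory
wget -c https://github.com/genomicsclass/labs/archive/master.zip
unzip master.zip
# Option 2: clone into the labs directory (HTTPS shown; with a GitHub SSH key configured, git@github.com:genomicsclass/labs.git also works)
git clone https://github.com/genomicsclass/labs.git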
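After downloading, it can help to check which Rmd files actually arrived before opening them in RStudio. The following is a minimal sketch, not part of the course materials: the helper name list_rmd and the LABS_DIR variable are hypothetical, and the labs-master default assumes the zip route above (use labs if you cloned with git).

```shell
#!/bin/sh
# Hypothetical helper (not from the course repository): print every .Rmd
# file under the given directory, sorted, one path per line.
list_rmd() {
    find "$1" -type f -name '*.Rmd' | sort
}

# Assumed location of the downloaded materials; override by setting LABS_DIR.
LABS_DIR="${LABS_DIR:-labs-master}"

if [ -d "$LABS_DIR" ]; then
    # Count the files, then preview the first ten paths.
    echo "Rmd files found: $(list_rmd "$LABS_DIR" | wc -l | tr -d ' ')"
    list_rmd "$LABS_DIR" | head -n 10
else
    echo "Directory $LABS_DIR not found; download the materials first." >&2
fi
```

Run it from the directory where you unpacked or cloned the materials, for example with LABS_DIR=labs set for the git route.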
