Recommended introductory R statistics course: Data Analysis for the Life Sciences

宏基因组 (Metagenome), 2022-03-28

Originally published by 植物微生物组 (Plant Microbiome); author: 宏基因组

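Data Analysis for the Life Sciences comes from Harvard University's PH525x course series (PH525x series - Biomedical Data Science). The course teaches statistical theory and hands-on analysis entirely in R. The textbook is written in R Markdown, which keeps it easy to read while making every analysis reproducible, in line with current expectations for reproducible computational research. Readers can systematically learn the statistics a biologist needs, implement each analysis in code themselves, and work toward the reproducible-code standards expected by top journals such as Cell, Nature, and Science (CNS).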

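Traditional statistics texts focus on mathematical theory, whereas this book focuses on carrying out data analysis on a computer. It explains the mathematical principles through worked examples and provides the code so that readers can run each analysis themselves. The whole text is written in R Markdown, so readers can reproduce every analysis in it.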

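About the authors:

Rafael A Irizarry is a professor of biostatistics and computational biology at the Dana-Farber Cancer Institute and the Harvard School of Public Health, with 17 years of experience analyzing genomic data.

Michael I Love is an assistant professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. His research uses statistical models to uncover biological patterns in genomic data, and he has developed open-source statistical software for Bioconductor.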

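Course source code: https://github.com/genomicsclass/labs (all course source code, test data, and results)

Web version: https://genomicsclass.github.io/book/ (the rendered Rmd pages, with per-section navigation and download links for the Rmd sources)

E-book: https://leanpub.com/dataanalysisforthelifesciences/ (downloadable in several formats for reading on mobile devices)

Interestingly, you can choose to read it for free or pay the authors up to US$80.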

Course outline

https://genomicsclass.github.io/book/

PH525x series - Biomedical Data Science

Links and resources

R markdown source files
ePub version on Leanpub
Links to the HarvardX class pages
External resources and books
Finding more help for data analysis

Chapter 0 - Introduction

Introduction [Rmd]
Getting started [Rmd]
Getting started exercises
dplyr introduction [Rmd]
dplyr introduction exercises
Mathematical notation [Rmd]

Chapter 1 - Inference

Random variables [Rmd]
Random variables exercises
Populations and samples [Rmd]
Populations and samples exercises
CLT and t-distribution [Rmd]
CLT and t-distribution exercises
CLT in practice [Rmd]
CLT in practice exercises
t-test in practice [Rmd]
Confidence intervals [Rmd]
Power calculations [Rmd]
Power calculations exercises
Monte carlo [Rmd]
Monte carlo exercises
Permutation tests [Rmd]
Permutation tests exercises
Association tests [Rmd]
Association tests exercises

Chapter 2 - Exploratory Data Analysis

Exploratory data analysis [Rmd]
Plots to avoid [Rmd]
Exploratory data analysis exercises

Chapter 3 - Robust Statistics

Robust summaries [Rmd]
Rank tests [Rmd]
Robust summaries exercises

Chapter 4 - Matrix Algebra

Introduction to using regression [Rmd]
Introduction to using regression exercises
Matrix notation [Rmd]
Matrix notation exercises
Matrix operations [Rmd]
Matrix operations exercises
Matrix algebra examples [Rmd]
Matrix algebra examples exercises

Chapter 5 - Linear Models

Linear models introduction [Rmd]
Linear models introduction exercises
Expressing design formula [Rmd]
Expressing design formula exercises
Linear models in practice [Rmd]
Linear models in practice exercises
Standard errors [Rmd]
Standard errors exercises
Interactions and contrasts [Rmd]
Interactions and contrasts exercises
Collinearity [Rmd]
Collinearity exercises
QR and regression [Rmd]
Linear models going further [Rmd]

Chapter 6 - Inference for High-Dimensional Data

Introduction to high-throughput data [Rmd]
Introduction to high-throughput data exercises
Inference for high-throughput data [Rmd]
Inference for high-throughput data exercises
Multiple testing [Rmd]
Multiple testing exercises
EDA for high-throughput data [Rmd]
EDA for high-throughput data exercises

Chapter 7 - Statistical Modeling

Modeling [Rmd]
Modeling exercises
Bayes theorem [Rmd]
Bayes theorem exercises
Hierarchical models [Rmd]
Hierarchical models exercises

Chapter 8 - Distance and Dimension Reduction

Distance [Rmd]
Distance exercises
PCA motivation [Rmd]
SVD [Rmd]
SVD exercises
Projections [Rmd]
Rotations [Rmd]
MDS [Rmd]
MDS exercises
PCA [Rmd]

Chapter 9 - Practical Machine Learning

Clustering and heatmaps [Rmd]
Clustering and heatmaps exercises
Conditional expectation [Rmd]
Conditional expectation exercises
Smoothing [Rmd]
Smoothing exercises
Machine learning [Rmd]
Crossvalidation [Rmd]
Crossvalidation exercises

Chapter 10 - Batch Effects

Introduction to batch effects [Rmd]
Confounding [Rmd]
Confounding exercises
EDA with PCA [Rmd]
EDA with PCA exercises
Adjusting with linear models [Rmd]
Adjusting with linear models exercises
Factor analysis [Rmd]
Factor analysis exercises
Adjusting with factor analysis [Rmd]
Adjusting with factor analysis exercises

Chapter 11 - Introduction to Bioconductor

Mike Love’s general reference card
Motivations and core values (optional)
Installing Bioconductor and finding help [Rmd]
Data structure and management for genome scale experiments [Rmd]

Coordinating multiple tables: ExpressionSet
Institutional archives: GEO, ArrayExpress

Interlude: Working with general genomic features using GenomicRanges

IRanges introduced
Intra-range operations
Inter-range operations
GRanges
Calculating overlaps

Range-oriented solutions for current experimental paradigms

SummarizedExperiment: for RNA-seq and 450k methylation
External storage for very large assays
GenomicFiles for families of BAM or BED
DNA Variants: VCF handling with VariantAnnotation and VariantTools
Handling multiomic archives like TCGA
Cloud-oriented solutions: e.g., Google BigQuery

Short read mapping/alignment software (optional) [Rmd]

Chapter 12 - Genomic Annotation with Bioconductor

More details on GRanges [Rmd]

Run-length encoding, views
Application to genomic landmarks
Application to 450k methylation array visualization

General overview of Bioconductor annotation [Rmd]

Levels: reference sequence, regions of interest, pathways
Discovering reference sequence
A build of the human genome
Gene/Transcript/Exon catalogs from UCSC and Ensembl
Importing and exporting regions and scores
AnnotationHub: brokering thousands of annotation resources
OrgDb: simple interface to annotation databases
Finding and managing gene sets
OrganismDb: unifying diverse annotation

Cheat sheet on Bioconductor annotation [Rmd]
Translating addresses between genome builds: liftOver [Rmd]

Chapter 13 - Genome-scale hypothesis testing with Bioconductor

Distinguishing biological and technical variability [Rmd]

An experiment with pooled and individual samples
Measuring technical variation
Measuring biological variation
Interpretation

Multiple comparisons with genewise t-tests [Rmd]

Gene-wise testing
Naive enumeration of genes
Demonstrating danger of multiple testing with a set of sham comparisons
Adjusting for multiplicity with qvalue
Adjusted counts in the sham case

Moderated t tests via limma [Rmd]

A spike-in dataset
Naive t-tests
Three steps with limma: lmFit, eBayes, topTable
Exposing the spiked-in genes
A view of the shrinkage of variance estimates

Introducing gene sets and gene set analysis [Rmd]

Identifier remapping
Categorical testing
Statistical summaries for sets: Wilcoxon
Statistical summaries for sets: t statistics
A dataset for comparing expression by gender
Finding surrogate variables/batch effect correction
Data wrangling
The Broad Institute MsigDb
Adjusting for within-set correlation
A permutation procedure

Chapter 14 - Visualization of genome scale data

A basic overview of visualization tasks and strategies [Rmd]

Gene models
Gene models plus data
Driving visualizations with functions
Using the browser to drive visualization functions via shiny
Queriable dynamic displays with plotly

Annotation-oriented visualizations

Sketching the binding landscape over chromosomes with ggbio’s karyogram layout [Rmd]
Plotting data in the context of genomic features with Gviz [Rmd]

Visualizing NGS data [Rmd]
Interactive visualization

Graphical user interfaces for multivariate data with shiny [Rmd]
Clustering gene expression data with shiny [Rmd]

Final remarks on visualization [Rmd]

Chapter 15: Pursuing scalability in genomic analysis: parallelism and out-of-memory data

Parallel computing with R and Bioconductor [Rmd]

Demonstrating simple speedup in multicore environments
Implicit parallelism with BiocParallel and GenomicAlignments

External data: data interfaces that spare RAM [Rmd]

SQLite for annotation
Tabix-indexed BAM
HDF5
An illustration of NoSQL with S4: mongodb and RaggedMongoExpt [Rmd]

Benchmarking various out-of-memory solutions [Rmd]
Introduction to Bioconductor’s Amazon Machine Instance for cluster creation and use in EC2 [Rmd]
Sharded GRanges for scalable integrative analysis [Rmd]

Chapter 16: Multi-omic data integration

Basic examples of multi-omic integration [Rmd]

Transcription factor (TF) binding and gene coexpression in yeast
TF binding and GWAS hits in humans

Using RTCGAToolbox outputs to integrate clinical, mutation, expression and methylation assays [Rmd]

Associating tumor stage with expression patterns
Linking DNA methylation with expression patterns
Defining a severity marker
Extracting survival times
Basic data acquisition
Working with clinical data
Working with mutations
Curation tasks for discrepant identifier formats
Working with expression data

Application to visualization: kataegis and rainfall plot [Rmd]

Chapter 17: Fostering reproducible genome-scale analysis

Overview of unit on reproducibility [Rmd]

Basic definitions
Infrastructure requirements
Statistical aspects of reproducibility
Analysis of reproducibility probability (Boos and Stefanski 2011)
Costs of highly reproducible designs

Package structure, creation, installation, management [Rmd]

create() to set up folders and DESCRIPTION
Composing documentation plus code
document(), install()
What is a package?
Using package.skeleton
Using makeOrganismPackage
Using devtools
Conclusions, including a link to a recent Nature Toolbox article on Bioconductor

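How to study

We recommend reading the web version online and practicing with the source code alongside it.

https://genomicsclass.github.io/book/ Read it section by section. There is a lot of material, so readers can simply pick the chapters that suit them.

Every hands-on section comes with its Rmd source code; download it and open it in a local RStudio.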

Download all resources in bulk

Windows download: https://github.com/genomicsclass/labs/archive/master.zip

On Linux, download with git or wget

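# Method 1: download the zip archive; it unpacks into the labs-master directory
wget -c https://github.com/genomicsclass/labs/archive/master.zip
unzip master.zip

# Method 2: clone into the labs directory (the HTTPS URL needs no SSH key)
git clone https://github.com/genomicsclass/labs.git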
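
Once the repository has been downloaded by either method above, it can help to see how many Rmd files each top-level folder holds before picking chapters to study. The sketch below is only an illustration: `count_rmd` is a hypothetical helper name, not part of the course materials, and it assumes the checkout sits in a directory called `labs` (pass `labs-master` for the zip route). It makes no assumption about the folder names inside the repository.

```shell
# count_rmd is a hypothetical helper, not part of the course materials.
# It prints "<folder><TAB><number of .Rmd files>" for each top-level
# folder of a local checkout that contains at least one .Rmd file.
count_rmd() {
  repo="${1:-labs}"   # assumed checkout directory; use labs-master for the zip
  for d in "$repo"/*/; do
    [ -d "$d" ] || continue   # skip when the glob matches nothing
    # Count .Rmd files at any depth below this folder.
    n=$(find "$d" -type f -name '*.Rmd' | wc -l | tr -d ' ')
    if [ "$n" -gt 0 ]; then
      printf '%s\t%s\n' "$(basename "$d")" "$n"
    fi
  done
}
```

Calling `count_rmd labs` after the clone, or `count_rmd labs-master` after unzipping, lists the folders that contain Rmd sources; nothing is printed for a directory without any.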


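Afterword

To encourage discussion among readers and help resolve research problems quickly, we run a dedicated "宏基因组" (Metagenome) discussion group, which more than 2,200 active researchers in China and abroad have already joined. To take part and get expert answers, add the editor as a contact and include a note in the form "name - institution - research field - title/year". When you need technical help, first read 《如何优雅的提问》 ("How to Ask Questions Gracefully") to learn a problem-solving approach; if the issue is still unresolved, raise it in the group rather than in private messages, so that the answer also helps your peers.

To learn research strategies and hands-on analysis for 16S amplicon and metagenomic data, follow "宏基因组".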
