edgeRDESeq2分析RNAseq差异表达.docx
- 文档编号:9817343
- 上传时间:2023-02-06
- 格式:DOCX
- 页数:12
- 大小:18.88KB
edgeRDESeq2分析RNAseq差异表达.docx
《edgeRDESeq2分析RNAseq差异表达.docx》由会员分享,可在线阅读,更多相关《edgeRDESeq2分析RNAseq差异表达.docx(12页珍藏版)》请在冰豆网上搜索。
edgeRDESeq2分析RNAseq差异表达
edgeR包的安装
edgeR包是基于 Bioconductor 平台发布的,所以安装不能直接用 () 命令从CRAN上来下载
安装:
#tryifURLsarenotsupported
>source("")
>biocLite("edgeR")
数据导入
由于edgeR对测序结果的下游分析是依赖count计数来进行基因差异表达分析的,在这里使用的是featureCounts 来进行统计`.bam` 文件中Map的结果
count结果如下:
>library(edgeR)
>mydata<-("",header=TRUE,quote='\t',skip=1)
>sampleNames<-c("CA_1","CA_2","CA_3","CC_1","CC_2","CC_3")
>names(mydata)[7:
12]<-sampleNames
>head(mydata)
GeneidChrStartEndStrandLengthCA_1CA_2CA_3CC_1CC_2CC_3
1gene131412571745+489000000
2gene131521153452+1338000000
3gene131638564680+825000000
4gene131748665435-570000000
5gene131860666836-771000000
6gene131972949483+2190000000
在这里我们只是需要Geneid和后6列的样本的count信息来组成矩阵,所以要处理下
>countMatrix<-(mydata[7:
12])
>rownames(countMatrix)<-mydata$Geneid
>head(countMatrix)
CA_1CA_2CA_3CC_1CC_2CC_3
gene1314000000
gene1315000000
gene1316000000
gene1317000000
gene1318000000
gene1319000000
*要导入的矩阵由3v3样本组成(三组生物学重复)
创建DEGlist
>group<-factor(c("CA","CA","CA","CC","CC","CC"))
>y<-DGEList(counts=countMatrix,group=group)
>y
Anobjectofclass"DGEList"
$counts
CA_1CA_2CA_3CC_1CC_2CC_3
gene1314000000
gene1315000000
gene1316000000
gene1317000000
gene1318000000
14212morerows...
$samples
group
CA_1CA_117885371
CA_2CA_218255461
CA_3CA_319030171
CC_1CC_118260421
CC_2CC_221244681
CC_3CC_320250631
过滤
过滤掉那些count结果都为0的数据,这些没有表达的基因对结果的分析没有用,过滤又两点好处:
1可以减少内存的压力2可以减少计算的压力
>keep<-rowSums(cpm(y)>1)>=2
>y<-y[keep,,objectofclass"DGEList"
$counts
CA_1CA_2CA_3CC_1CC_2CC_3
gene1321161138129218194220
gene1322231133
gene1323202733475146
gene132460877986100132
gene1325322921587556
3877morerows...
$samples
group
CA_1CA_117883621
CA_2CA_218253081
CA_3CA_319027961
CC_1CC_118258891
CC_2CC_221241551
CC_3CC_320247861
标准化处理
edgeR采用的是TMM方法进行标准化处理,只有标准化处理后的数据才又可比性
>y<-calcNormFactors(y)
>y
Anobjectofclass"DGEList"
$counts
CA_1CA_2CA_3CC_1CC_2CC_3
gene1321161138129218194220
gene1322231133
gene1323202733475146
gene132460877986100132
gene1325322921587556
3877morerows...
$samples
group
CA_1CA_11788362
CA_2CA_21825308
CA_3CA_31902796
CC_1CC_11825889
CC_2CC_22124155
CC_3CC_32024786
设计矩阵
为什么要一个设计矩阵呢,道理很简单,有了一个设计矩阵才能够更好的分组分析
>subGroup<-factor(substring(colnames(countMatrix),4,4))
>design<-(~subGroup+group)
>rownames(design)<-colnames(y)
>design
(Intercept)subGroup2subGroup3groupCC
CA_11000
CA_21100
CA_31010
CC_11001
CC_21101
CC_31011
attr(,"assign")
[1]0112
attr(,"contrasts")
attr(,"contrasts")$subGroup
[1]""
attr(,"contrasts")$group
[1]""
评估离散度
>y<-estimateDisp(y,design,robust=TRUE)
>y$
[1]
#plot
>plotBCV(y)
差异表达基因
>fit<-glmQLFit(y,design,robust=TRUE)
>qlf<-glmQLFTest(fit)
>topTags(qlf)
Coefficient:
groupCC
logFClogCPMFPValueFDR
gene7024
gene6612
gene2743
gene12032
gene491
gene8941
gene2611
gene6242
gene7252
gene6125
查看差异表达基因原始的CMP
>top<-rownames(topTags(qlf))
>cpm(y)[top,]
CA_1CA_2CA_3CC_1CC_2CC_3
gene7024
gene6612
gene2743
gene12032
gene491
gene8941
gene2611
gene6242
gene7252
gene6125
查看上调和下调基因的数目
>summary(dt<-decideTestsDGE(qlf))
[,1]
-1536
02793
1553
挑选出差异表达基因的名字
>isDE<-(dt)
>DEnames<-rownames(y)[isDE]
>head(DEnames)
[1]"gene1325""gene1326""gene1327""gene1331""gene1340""gene1343"
差异表达基因画图
>plotSmear(qlf,=DEnames)
>abline(h=c(-1,1),col="blue")
DESeq2包的安装
安装:
##tryifURLsarenotsupported
>source("")
>biocLite("DESeq2")
数据导入
导入count矩阵,导入数据的方式很多这里直接导入count矩阵
count结果如下:
library(DESeq2)
sampleNames<-c("CA_1","CA_2","CA_3","CC_1","CC_2","CC_3")
mydata<-("",header=TRUE,quote='\t',skip=1)
names(mydata)[7:
12]<-sampleNames
countMatrix<-(mydata[7:
12])
rownames(countMatrix)<-mydata$Geneid
table2<-(name=c("CA_1","CA_2","CA_3","CC_1","CC_2","CC_3"),condition=("CA","CA","CA","CC","CC","CC"))
rownames(table2)<-sampleNames
head(countMatrix)
CA_1CA_2CA_3CC_1CC_2CC_3
gene1314000000
gene1315000000
gene1316000000
gene1317000000
gene1318000000
gene1319000000
把count矩阵转化为DESeq2 的数据格式
>dds<-DESeqDataSetFromMatrix(countMatrix,colData=table2,design=~condition)
>dds
class:
DESeqDataSet
dim:
142176
metadata(0):
assays
(1):
counts
rownames(14217):
gene1314gene1315...gene6710gene6709
rowRangesmetadatacolumnnames(0):
colnames(6):
CA_1CA_2...CC_2CC_3
colDatanames
(2):
namecondition
过滤
过滤掉那些count结果都为0的数据,这些没有表达的基因对结果的分析没有用
dds<-dds[rowSums(counts(dds))>1,]
dds
class:
DESeqDataSet
dim:
41906
metadata(0):
assays
(1):
counts
rownames(4190):
gene1321gene1322...gene6712gene6710
rowRangesmetadatacolumnnames(0):
colnames(6):
CA_1CA_2...CC_2CC_3
colDatanames
(2):
namecondition
PCA分析
rld<-rlog(dds)
plotPCA(rld,intgroup=c("name","condition"))
当然也可以使用ggplot2 来画 PCA图
library(ggplot2)
rld<-rlog(dds)
data<-plotPCA(rld,intgroup=c("condition","name"),returnData=TRUE)
percentVar<-round(100*attr(data,"percentVar"))
p<-ggplot(data,aes(PC1,PC2,color=condition,shape=name))+
geom_point(size=3)+
xlab(paste0("PC1:
",percentVar[1],"%variance"))+
ylab(paste0("PC2:
",percentVar[2],"%variance"))
p
注意在进行PCA分析前不要 library(DESeq) 否则无法进行PCA分析
差异表达基因分析
分析结果输出
library(DESeq)
dds<-DESeq(dds)
res<-results(dds)
(res,"",sep=",",=TRUE)
head(res)
log2foldchange(MAP):
conditionCCvsCA
Waldtestp-value:
conditionCCvsCA
DataFramewith6rowsand6columns
baseMeanlog2FoldChangelfcSEstatpvalue
gene13210.
gene1322
gene13230.
gene13240.
gene13250.
gene1326
padj
gene1321
gene1322
gene1323
gene1324
gene1325
gene1326
注:
(1)rownames:
基因ID
(2)baseMean:
所有样本矫正后的平均reads数(3)log2FoldChange:
取log2后的表达量差异(4)pvalue:
统计学差异显著性检验指标(5)padj:
校正后的pvalue,padj越小,表示基因表达差异越显著
summary 查看整体分析结果
summary(res)
outof4190withnonzerototalreadcount
adjustedp-value<
LFC>0(up):
595,14%
LFC<0(down):
644,15%
outliers[1]:
0,0%
lowcounts[2]:
325,%
(meancount<1)
[1]see'cooksCutoff'argumentof?
results
[2]see'independentFiltering'argumentof?
results
MA图
library(geneplotter)
plotMA(res,main="DESeq2",ylim=c(-2,2))
Heatmap图
sum(res$padj<,=TRUE)
library("pheatmap")
select<-order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:
1000]
nt<-normTransform(dds)#defaultstolog2(x+1)
<-assay(nt)[select,]
df<-"name","condition")])
pdf('',width=6,height=7)
pheatmapcluster_rows=TRUE,show_rownames=FALSE,
cluster_cols=TRUE,annotation_col=df)
()
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- edgeRDESeq2 分析 RNAseq 差异 表达