edgeRDESeq2分析RNAseq差异表达.docx
- 文档编号:29371215
- 上传时间:2023-07-22
- 格式:DOCX
- 页数:15
- 大小:154.10KB
edgeRDESeq2分析RNAseq差异表达.docx
《edgeRDESeq2分析RNAseq差异表达.docx》由会员分享,可在线阅读,更多相关《edgeRDESeq2分析RNAseq差异表达.docx(15页珍藏版)》请在冰豆网上搜索。
edgeRDESeq2分析RNAseq差异表达
edgeR包的安装
∙edgeR包是基于 Bioconductor 平台发布的,所以安装不能直接用 install.packages() 命令从CRAN上来下载
∙安装:
#tryhttp:
//ifhttps:
//URLsarenotsupported
>source("https:
//bioconductor.org/biocLite.R")
>biocLite("edgeR")
数据导入
∙由于edgeR对测序结果的下游分析是依赖count计数来进行基因差异表达分析的,在这里使用的是featureCounts 来进行统计`.bam` 文件中Map的结果
∙count结果如下:
>library(edgeR)
>mydata<-read.table("counts.txt",header=TRUE,quote='\t',skip=1)
>sampleNames<-c("CA_1","CA_2","CA_3","CC_1","CC_2","CC_3")
>names(mydata)[7:
12]<-sampleNames
>head(mydata)
GeneidChrStartEndStrandLengthCA_1CA_2CA_3CC_1CC_2CC_3
1gene1314NW_139421.112571745+489000000
2gene1315NW_139421.121153452+1338000000
3gene1316NW_139421.138564680+825000000
4gene1317NW_139421.148665435-570000000
5gene1318NW_139421.160666836-771000000
6gene1319NW_139421.172949483+2190000000
∙在这里我们只是需要Geneid和后6列的样本的count信息来组成矩阵,所以要处理下
>countMatrix<-as.matrix(mydata[7:
12])
>rownames(countMatrix)<-mydata$Geneid
>head(countMatrix)
CA_1CA_2CA_3CC_1CC_2CC_3
gene1314000000
gene1315000000
gene1316000000
gene1317000000
gene1318000000
gene1319000000
*要导入的矩阵由3v3样本组成(三组生物学重复)
创建DEGlist
>group<-factor(c("CA","CA","CA","CC","CC","CC"))
>y<-DGEList(counts=countMatrix,group=group)
>y
Anobjectofclass"DGEList"
$counts
CA_1CA_2CA_3CC_1CC_2CC_3
gene1314000000
gene1315000000
gene1316000000
gene1317000000
gene1318000000
14212morerows...
$samples
grouplib.sizenorm.factors
CA_1CA_117885371
CA_2CA_218255461
CA_3CA_319030171
CC_1CC_118260421
CC_2CC_221244681
CC_3CC_320250631
过滤
∙过滤掉那些count结果都为0的数据,这些没有表达的基因对结果的分析没有用,过滤又两点好处:
1可以减少内存的压力2可以减少计算的压力
>keep<-rowSums(cpm(y)>1)>=2
>y<-y[keep,,keep.lib.sizes=FALSE]
>y
Anobjectofclass"DGEList"
$counts
CA_1CA_2CA_3CC_1CC_2CC_3
gene1321161138129218194220
gene1322231133
gene1323202733475146
gene132460877986100132
gene1325322921587556
3877morerows...
$samples
grouplib.sizenorm.factors
CA_1CA_117883621
CA_2CA_218253081
CA_3CA_319027961
CC_1CC_118258891
CC_2CC_221241551
CC_3CC_320247861
标准化处理
∙edgeR采用的是TMM方法进行标准化处理,只有标准化处理后的数据才又可比性
>y<-calcNormFactors(y)
>y
Anobjectofclass"DGEList"
$counts
CA_1CA_2CA_3CC_1CC_2CC_3
gene1321161138129218194220
gene1322231133
gene1323202733475146
gene132460877986100132
gene1325322921587556
3877morerows...
$samples
grouplib.sizenorm.factors
CA_1CA_117883620.9553769
CA_2CA_218253080.9052539
CA_3CA_319027960.9686232
CC_1CC_118258890.9923455
CC_2CC_221241551.1275178
CC_3CC_320247861.0668754
设计矩阵
∙为什么要一个设计矩阵呢,道理很简单,有了一个设计矩阵才能够更好的分组分析
>subGroup<-factor(substring(colnames(countMatrix),4,4))
>design<-model.matrix(~subGroup+group)
>rownames(design)<-colnames(y)
>design
(Intercept)subGroup2subGroup3groupCC
CA_11000
CA_21100
CA_31010
CC_11001
CC_21101
CC_31011
attr(,"assign")
[1]0112
attr(,"contrasts")
attr(,"contrasts")$subGroup
[1]"contr.treatment"
attr(,"contrasts")$group
[1]"contr.treatment"
评估离散度
>y<-estimateDisp(y,design,robust=TRUE)
>y$common.dispersion
[1]0.02683622
#plot
>plotBCV(y)
差异表达基因
>fit<-glmQLFit(y,design,robust=TRUE)
>qlf<-glmQLFTest(fit)
>topTags(qlf)
Coefficient:
groupCC
logFClogCPMFPValueFDR
gene7024-5.5156489.612809594.92326.431484e-442.496702e-40
gene66125.1302828.451143468.20601.557517e-393.023140e-36
gene27434.3774925.586773208.02683.488383e-264.513967e-23
gene120324.7343835.098148192.93784.359649e-254.231040e-22
gene491-2.73391010.412673190.98396.104188e-254.739291e-22
gene89412.9971856.839106177.76146.332836e-244.097345e-21
gene2611-2.8469247.216173174.73321.099339e-236.096619e-21
gene62422.5291259.897771169.26583.022914e-231.466869e-20
gene72523.7323156.137670188.20943.890569e-231.678132e-20
gene61252.8754236.569935160.31891.656083e-226.428914e-20
查看差异表达基因原始的CMP
>top<-rownames(topTags(qlf))
>cpm(y)[top,]
CA_1CA_2CA_3CC_1CC_2CC_3
gene70241711.3830021405.8618991480.12111533.1141837.1604029.62696
gene661217.55864912.10384826.585753403.99298582.457961044.35046
gene27434.6823061.8155775.96823062.9169487.26431114.34156
gene120321.7558652.4207702.71283265.6764647.5987275.45617
gene4912811.1397272059.4696692222.351938444.83381385.38258253.68087
gene894123.99682024.81288824.415488131.35291244.67410225.90560
gene2611245.821088310.463691225.16505243.0484326.3045539.81123
gene6242231.188880299.570228298.4115151348.298991343.619882191.93237
gene72529.36461313.3142325.42566492.71970108.55847181.92807
gene612523.41153214.52461729.841152145.70239160.75005185.16852
查看上调和下调基因的数目
>summary(dt<-decideTestsDGE(qlf))
[,1]
-1536
02793
1553
挑选出差异表达基因的名字
>isDE<-as.logical(dt)
>DEnames<-rownames(y)[isDE]
>head(DEnames)
[1]"gene1325""gene1326""gene1327""gene1331""gene1340""gene1343"
差异表达基因画图
>plotSmear(qlf,de.tags=DEnames)
>abline(h=c(-1,1),col="blue")
DESeq2包的安装
∙安装:
##tryhttp:
//ifhttps:
//URLsarenotsupported
>source("https:
//bioconductor.org/biocLite.R")
>biocLite("DESeq2")
数据导入
∙导入count矩阵,导入数据的方式很多这里直接导入count矩阵
∙count结果如下:
library(DESeq2)
sampleNames<-c("CA_1","CA_2","CA_3","CC_1","CC_2","CC_3")
mydata<-read.table("counts.txt",header=TRUE,quote='\t',skip=1)
names(mydata)[7:
12]<-sampleNames
countMatrix<-as.matrix(mydata[7:
12])
rownames(countMatrix)<-mydata$Geneid
table2<-data.frame(name=c("CA_1","CA_2","CA_3","CC_1","CC_2","CC_3"),condition=("CA","CA","CA","CC","CC","CC"))
rownames(table2)<-sampleNames
head(countMatrix)
CA_1CA_2CA_3CC_1CC_2CC_3
gene1314000000
gene1315000000
gene1316000000
gene1317000000
gene1318000000
gene1319000000
∙把count矩阵转化为DESeq2 的数据格式
>dds<-DESeqDataSetFromMatrix(countMatrix,colData=table2,design=~condition)
>dds
class:
DESeqDataSet
dim:
142176
metadata(0):
assays
(1):
counts
rownames(14217):
gene1314gene1315...gene6710gene6709
rowRangesmetadatacolumnnames(0):
colnames(6):
CA_1CA_2...CC_2CC_3
colDatanames
(2):
namecondition
过滤
∙过滤掉那些count结果都为0的数据,这些没有表达的基因对结果的分析没有用
dds<-dds[rowSums(counts(dds))>1,]
dds
class:
DESeqDataSet
dim:
41906
metadata(0):
assays
(1):
counts
rownames(4190):
gene1321gene1322...gene6712gene6710
rowRangesmetadatacolumnnames(0):
colnames(6):
CA_1CA_2...CC_2CC_3
colDatanames
(2):
namecondition
PCA分析
rld<-rlog(dds)
plotPCA(rld,intgroup=c("name","condition"))
∙当然也可以使用ggplot2 来画 PCA图
library(ggplot2)
rld<-rlog(dds)
data<-plotPCA(rld,intgroup=c("condition","name"),returnData=TRUE)
percentVar<-round(100*attr(data,"percentVar"))
p<-ggplot(data,aes(PC1,PC2,color=condition,shape=name))+
geom_point(size=3)+
xlab(paste0("PC1:
",percentVar[1],"%variance"))+
ylab(paste0("PC2:
",percentVar[2],"%variance"))
p
∙注意在进行PCA分析前不要 library(DESeq) 否则无法进行PCA分析
差异表达基因分析
分析结果输出
library(DESeq)
dds<-DESeq(dds)
res<-results(dds)
write.table(res,"result.csv",sep=",",row.names=TRUE)
head(res)
log2foldchange(MAP):
conditionCCvsCA
Waldtestp-value:
conditionCCvsCA
DataFramewith6rowsand6columns
baseMeanlog2FoldChangelfcSEstatpvalue
gene1321173.2886810.262679590.20499831.28137422.000623e-01
gene13222.118367-0.052379520.4989589-0.10497769.163936e-01
gene132335.9737010.500545800.30380961.64756419.944215e-02
gene132488.4216610.176776050.24027270.73573094.618945e-01
gene132543.0018280.811431040.29193962.77944865.445127e-03
gene1326662.136259-1.053561050.1752230-6.01268801.824720e-09
padj
gene13213.790396e-01
gene13229.559679e-01
gene13232.337858e-01
gene13246.565731e-01
gene13252.447141e-02
gene13264.520861e-08
∙注:
(1)rownames:
基因ID
(2)baseMean:
所有样本矫正后的平均reads数(3)log2FoldChange:
取log2后的表达量差异(4)pvalue:
统计学差异显著性检验指标(5)padj:
校正后的pvalue,padj越小,表示基因表达差异越显著
∙summary 查看整体分析结果
summary(res)
outof4190withnonzerototalreadcount
adjustedp-value<0.1
LFC>0(up):
595,14%
LFC<0(down):
644,15%
outliers[1]:
0,0%
lowcounts[2]:
325,7.8%
(meancount<1)
[1]see'cooksCutoff'argumentof?
results
[2]see'independentFiltering'argumentof?
results
MA图
library(geneplotter)
plotMA(res,main="DESeq2",ylim=c(-2,2))
Heatmap图
sum(res$padj<0.1,na.rm=TRUE)
library("pheatmap")
select<-order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:
1000]
nt<-normTransform(dds)#defaultstolog2(x+1)
log2.norm.counts<-assay(nt)[select,]
df<-as.data.frame(colData(dds)[,c("name","condition")])
pdf('heatmap1000.pdf',width=6,height=7)
pheatmap(log2.norm.counts,cluster_rows=TRUE,show_rownames=FALSE,
cluster_cols=TRUE,annotation_col=df)
dev.off()
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- edgeRDESeq2 分析 RNAseq 差异 表达