Cluster Analysis: Tutorial with R
Jari Oksanen
January 26, 2014
Contents
1 Introduction
2 Hierarchic Clustering
  2.1 Description of Classes
  2.2 Numbers of Classes
  2.3 Clustering and Ordination
  2.4 Reordering a Dendrogram
  2.5 Minimum Spanning Tree
  2.6 Cophenetic Distance
3 Interpretation of Classes
  3.1 Environmental Interpretation
  3.2 Community Summaries
4 Optimized Clustering at a Given Level
  4.1 Optimum Number of Classes
5 Fuzzy Clustering
1 Introduction
In this tutorial we inspect classification. Classification and ordination are alternative strategies of simplifying data. Ordination tries to simplify data into a map showing similarities among points. Classification simplifies data by putting similar points into the same class. The task of describing a high number of points is simplified to an easier task of describing a low number of classes.
2 Hierarchic Clustering
The classification methods are available in standard R packages. The vegan package does not have many support functions for classification, but we still load vegan to have access to its data sets and some of its support functions. [1]
[1] If the package is not installed and you get an error message, install it with install.packages("vegan") or from the installation menu.
Figure 1: Distance between two clusters A and B defined by single, complete and average linkage. Mark each of the linkage types in the connecting line. The fusion level in the cluster dendrogram would be the length of the corresponding connecting line of the linkage type.
R> library(vegan)
R> data(dune)
Hierarchic clustering (function hclust) is in standard R and available without loading any specific libraries. Hierarchic clustering needs dissimilarities as its input. Standard R has function dist to calculate many dissimilarity measures, but for community data we may prefer vegan function vegdist with ecologically useful dissimilarity indices. The default index in vegdist is Bray-Curtis:
R> d <- vegdist(dune)
Ecologically useful indices in vegan have an upper limit of 1 for absolutely different sites (no shared species), and they are based on differences of abundances. In contrast, the standard Euclidean distance has no upper limit, but varies with the sum of total abundances of compared sites when there are no shared species, and uses squares of differences of abundances. There are many other ecologically useful indices in vegdist, but Bray-Curtis is usually not a bad choice.
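The contrast between the two kinds of indices can be checked by hand. The following is a minimal base R sketch, assuming the usual Bray-Curtis formula (sum of absolute abundance differences divided by the summed abundances of both sites). The helper bray and the three sites are hypothetical and are not part of the dune data.

```r
# Bray-Curtis dissimilarity written out by hand (base R only):
# sum of absolute abundance differences divided by total abundance.
# `bray` is a hypothetical helper, not a vegan function.
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Hypothetical abundances of four species at three sites.
site1 <- c(5, 3, 0, 0)
site2 <- c(0, 0, 2, 1)     # shares no species with site1
site3 <- c(0, 0, 20, 10)   # shares no species with site1, higher abundances

bc12 <- bray(site1, site2)             # no shared species
bc13 <- bray(site1, site3)             # no shared species, larger totals
eu12 <- sqrt(sum((site1 - site2)^2))   # Euclidean distance, squared differences
eu13 <- sqrt(sum((site1 - site3)^2))

print(c(bray12 = bc12, bray13 = bc13, euclid12 = eu12, euclid13 = eu13))
```

Both Bray-Curtis values reach the upper limit of 1, whereas the Euclidean distance between sites without shared species grows with the total abundances.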
There are several alternative clustering methods in the standard function hclust. We shall inspect three basic methods: single linkage, complete linkage and average linkage. All these start in the same way: they fuse the two most similar points into a cluster. They differ in the way they combine clusters to each other, or new points to existing clusters (Fig. 1). In single linkage (a.k.a. nearest neighbour, or neighbour joining tree in genetics) the distance between two clusters is the shortest possible distance among members of the clusters, or the best of the friends. In complete linkage (a.k.a. furthest neighbour) the distance between two clusters is the longest possible distance between the groups, or the worst among the friends. In average linkage, the distance between the clusters is the distance between cluster centroids. There are several alternative ways of defining the average and defining the closeness, and hence a huge number of average linkage methods. We only use one of these methods, commonly known as UPGMA. The lecture slides discuss the methods in more detail.
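The three definitions can be written out for two small clusters. This is a sketch under stated assumptions: the six points and the cluster labels A and B are hypothetical, and the average is taken as the mean of all between-cluster pairwise distances, the usual UPGMA definition. In this one-dimensional example that mean coincides with the distance between the cluster centroids.

```r
# Six hypothetical points on a line: cluster A = first three, B = last three.
x <- c(a1 = 0, a2 = 1, a3 = 2, b1 = 5, b2 = 7, b3 = 9)
m <- as.matrix(dist(x))     # full symmetric distance matrix with labels
A <- c("a1", "a2", "a3")
B <- c("b1", "b2", "b3")
between <- m[A, B]          # all pairwise distances between A and B

single   <- min(between)    # nearest neighbour: best of the friends
complete <- max(between)    # furthest neighbour: worst among the friends
average  <- mean(between)   # mean pairwise distance (UPGMA)

# Distance between the cluster centroids, for comparison.
centroid <- abs(mean(x[A]) - mean(x[B]))

print(c(single = single, complete = complete,
        average = average, centroid = centroid))
```

Each of these numbers would be the fusion level at which A and B are joined under the corresponding linkage.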
In the following we will compare three different clustering strategies. If you want to plot three graphs side by side, you can divide the screen into three panels by
R> par(mfrow=c(1,3))
This defines three panels side by side. You probably want to stretch the plotting window if you are using this option. Alternatively, you can have three panels above each other with
R> par(mfrow=c(3,1))
You can get back to the single panel mode with
R> par(mfrow=c(1,1))
You may also wish to use narrower empty margins for the panels:
R> par(mar=c(3,4,1,1)+.1)
The mar parameter defines plot margins in the order bottom, left, top, right, using row height (text height) as a unit.
The single linkage clustering can be found with:
R> csin <- hclust(d, method="single")
R> csin
The dendrogram can be plotted with:
R> plot(csin)
The default is to plot an inverted tree with the root at the top, and branches hanging down. You can force the branches down to the baseline by giving the hang argument:
R> plot(csin, hang=-1)
If you plotted the csin tree twice, you consumed two of your three panels, and there will not be space for the next two trees in the same plot. In that case you can start a new plot by issuing the mfrow command again and then drawing csin again.
The complete linkage and average linkage methods are found in the same way:
R> ccom <- hclust(d, method="complete")
R> plot(ccom, hang=-1)
R> caver <- hclust(d, method="average")
R> plot(caver, hang=-1)
The vertical axes of the cluster dendrogram show the fusion level. The two most similar observations are combined first, and they are at the same level in all dendrograms. At the upper fusion levels, the scales diverge: they are the shortest dissimilarities among cluster members in single linkage, the longest possible dissimilarities in complete linkage, and the distances among cluster centroids in average linkage (Fig. 1).
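The fusion levels are stored in the height component of an hclust result, so the statement about the first fusion can be checked directly. The sketch below uses ten hypothetical random points with Euclidean distances from dist instead of the dune data, so that it runs in base R without vegan.

```r
set.seed(1)                          # hypothetical data, fixed for reproducibility
xy <- matrix(runif(20), ncol = 2)    # ten random points in the plane
d0 <- dist(xy)                       # Euclidean distances (base R)

h_single   <- hclust(d0, method = "single")$height
h_complete <- hclust(d0, method = "complete")$height
h_average  <- hclust(d0, method = "average")$height

# First fusion: the two most similar points, at the same level in all methods.
first_levels <- c(smallest = min(d0), single = h_single[1],
                  complete = h_complete[1], average = h_average[1])
print(first_levels)

# Last fusion: the scales have diverged between the methods.
n_fusions <- length(h_single)        # n - 1 fusions for n observations
print(c(single = h_single[n_fusions], average = h_average[n_fusions],
        complete = h_complete[n_fusions]))
```

The first line of output repeats the smallest dissimilarity four times, while the last fusion levels depend on the linkage method.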
Figure 2: Vegemite is an Australian national delicacy made of yeast extract. The vegemite function was named because its output is just as dense as Vegemite.
2.1 Description of Classes
One problem with hierarchic clustering is that it gives a classification of observations (plots, sampling units), but it does not tell how these classes differ from each other. For community data, there is no information on how the species composition differs between classes (we return to this subject in Section 3.2).
The vegan package has function vegemite (Fig. 2) that can produce compact community tables ordered by a dendrogram, ordination or environmental variables. With the help of these tables it is possible to see which species differ in classification:
R> vegemite(dune, caver)
The vegemite command will always use one-character columns. If the observed values do not fit one character, vegemite refuses to work. With argument scale you can recode the values to one-character width. The vegemite function has a graphical sister function tabasco that is described in Section 2.4.
2.2 Numbers of Classes
The hierarchic clustering methods produce all possible levels of classifications. The extremes are all observations in a single class, and each observation in its private class. The user normally wants to have a clustering into a certain number of classes. The fixed classification can be visually demonstrated with the rect.hclust function:
R> plot(csin, hang=-1)
R> rect.hclust(csin, 3)
R> plot(ccom, hang=-1)
R> rect.hclust(ccom, 3)
R> plot(caver, hang=-1)
R> rect.hclust(caver, 3)
Single linkage has a tendency to chain observations: the most common case is to fuse a single observation to an existing class, because the single link is the nearest neighbour, and a close neighbour is more probably in a large group than in a small group or a lonely point. Complete linkage has a tendency to produce compact bunches: complete link minimizes the spread within the cluster. The average linkage is between these two extremes.
We can extract the classification at a certain level using function cutree:
R> cl <- cutree(ccom, 3)
R> cl
This gives a numeric classification vector of cluster identities. The clusters are numbered in the order the observations appear in the data: the first item will always belong to cluster 1, and the numbering does not match the dendrogram.
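The numbering rule is easy to see with a tiny data set. In this sketch the five observations are hypothetical, and the first one is deliberately an outlier that joins the tree last, yet it still receives cluster number 1.

```r
# Five hypothetical one-dimensional observations; obs1 is an outlier.
x <- c(obs1 = 10, obs2 = 1, obs3 = 1.2, obs4 = 5, obs5 = 5.3)
tree <- hclust(dist(x), method = "complete")

cl3 <- cutree(tree, k = 3)   # one cluster number per observation, in data order
print(cl3)

# Leaf order used when drawing the dendrogram; it follows the tree,
# not the cluster numbers.
print(tree$labels[tree$order])
```

The cluster numbers follow the order of first appearance in the data, so they should be read as labels and not as positions in the dendrogram.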
We can tabulate the numbers of observations in each cluster:
R> table(cl)
We can compare two clustering schemes by cross-tabulation, which gives us a confusion matrix:
R> table(cl, cutree(csin, 3))
R> table(cl, cutree(caver, 3))
The confusion matrix tabulates the classifications against each other. The rows give the first classification, and the columns the second classification. If the classifications match and there is no "confusion", each row and each column has only one non-zero entry, but if a class is divided between several classes in the second classification, its row has several non-zero entries.
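Reading a confusion matrix can be practised on made-up classifications. The three vectors below are hypothetical: second has the same groups as first under different labels, while third splits the first class of first.

```r
# Three hypothetical classifications of the same eight observations.
first  <- c(1, 1, 1, 2, 2, 3, 3, 3)
second <- c(2, 2, 2, 1, 1, 3, 3, 3)   # same groups, different labels
third  <- c(1, 1, 2, 2, 2, 3, 3, 3)   # class 1 of `first` is split

agree  <- table(first, second)
differ <- table(first, third)
print(agree)
print(differ)

# Number of non-zero entries per row: 1 everywhere means no "confusion".
nz_agree  <- rowSums(agree > 0)
nz_differ <- rowSums(differ > 0)
print(nz_agree)
print(nz_differ)
```

Matching classifications need not have their non-zero entries on the diagonal, because the cluster numbers are arbitrary labels. What matters is that each row and column has a single non-zero entry.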
2.3 Clustering and Ordination
We can use ordination to display the observed dissimilarities among points. A natural choice is to use metric scaling, a.k.a. principal coordinates analysis (PCoA), that maps observed dissimilarities linearly onto a low-dimensional graph using the same dissimilarities we had in our clustering.
The metric scaling can be performed with standard R function cmdscale:
R> ord <- cmdscale(d)
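What metric scaling does can be illustrated with distances whose true configuration is known. The sketch below assumes five hypothetical points in the plane and Euclidean distances, so two dimensions are enough to reproduce the input distances. With non-Euclidean dissimilarities such as Bray-Curtis, a two-dimensional map would in general only approximate them.

```r
# Hypothetical corners of a 3 x 4 rectangle plus its centre.
pts <- rbind(c(0, 0), c(3, 0), c(3, 4), c(0, 4), c(1.5, 2))
d0  <- dist(pts)                 # Euclidean distances among the five points

ord0 <- cmdscale(d0, k = 2)      # metric scaling into two dimensions
print(ord0)                      # one row of coordinates per point

# Distances among the ordination scores compared with the input distances.
max_err <- max(abs(dist(ord0) - d0))
print(max_err)
```

The coordinates may be rotated or reflected relative to pts, since only the distances among points are used.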
We can display the results using vegan function ordiplot that can plot results of any vegan ordination function and many non-vegan ordination functions, such as cmdscale, prcomp and princomp (the latter for principal components analysis):
R> ordiplot(ord)
We got a warning because ordiplot tries to plot both