Hierarchical Clustering
On this page…
Introduction to Hierarchical Clustering
Algorithm Description
Similarity Measures
Linkages
Dendrograms
Verify the Cluster Tree
Create Clusters
Introduction to Hierarchical Clustering
Hierarchical clustering groups data over a variety of scales by creating a cluster tree or dendrogram. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level are joined as clusters at the next level. This allows you to decide the level or scale of clustering that is most appropriate for your application. The Statistics Toolbox™ function clusterdata supports agglomerative clustering and performs all of the necessary steps for you. It incorporates the pdist, linkage, and cluster functions, which you can use separately for more detailed analysis. The dendrogram function plots the cluster tree.
Algorithm Description
To perform agglomerative hierarchical cluster analysis on a data set using Statistics Toolbox functions, follow this procedure:
1. Find the similarity or dissimilarity between every pair of objects in the data set. In this step, you calculate the distance between objects using the pdist function. The pdist function supports many different ways to compute this measurement. See Similarity Measures for more information.
2. Group the objects into a binary, hierarchical cluster tree. In this step, you link pairs of objects that are in close proximity using the linkage function. The linkage function uses the distance information generated in step 1 to determine the proximity of objects to each other. As objects are paired into binary clusters, the newly formed clusters are grouped into larger clusters until a hierarchical tree is formed. See Linkages for more information.
3. Determine where to cut the hierarchical tree into clusters. In this step, you use the cluster function to prune branches off the bottom of the hierarchical tree, and assign all the objects below each cut to a single cluster. This creates a partition of the data. The cluster function can create these clusters by detecting natural groupings in the hierarchical tree or by cutting off the hierarchical tree at an arbitrary point.
The following sections provide more information about each of these steps.
Note: The Statistics Toolbox function clusterdata performs all of the necessary steps for you. You do not need to execute the pdist, linkage, or cluster functions separately.
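As a minimal sketch of the procedure above, the following runs the three steps separately and then lets clusterdata do the same work in one call. The five-point matrix X and the request for two clusters ('maxclust', 2) are illustrative assumptions, not values prescribed by the procedure.

```matlab
% Sample data: five objects, each a pair of x,y coordinates (illustrative).
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];

% Step 1: pairwise distances (Euclidean by default).
Y = pdist(X);

% Step 2: binary hierarchical cluster tree (single linkage by default).
Z = linkage(Y);

% Step 3: cut the tree; asking for at most two clusters is an assumption.
T = cluster(Z, 'maxclust', 2);

% The same partition in one call.
T2 = clusterdata(X, 'maxclust', 2);
```

With this data, both T and T2 place object 2 in one cluster and objects 1, 3, 4, and 5 in the other. The numeric labels themselves carry no meaning beyond identifying the groups.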
Similarity Measures
You use the pdist function to calculate the distance between every pair of objects in a data set. For a data set made up of m objects, there are m*(m - 1)/2 pairs in the data set. The result of this computation is commonly known as a distance or dissimilarity matrix.
There are many ways to calculate this distance information. By default, the pdist function calculates the Euclidean distance between objects; however, you can specify one of several other options. See pdist for more information.
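To make the pair count concrete, this short sketch checks that pdist returns m*(m - 1)/2 distances and shows one non-default metric. The matrix X is illustrative, and 'cityblock' is only one of the options listed on the pdist reference page.

```matlab
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];   % illustrative data, m = 5 objects
m = size(X, 1);
nPairs = m*(m - 1)/2;            % 10 pairs for m = 5
Y = pdist(X);                    % default metric: Euclidean
Ycity = pdist(X, 'cityblock');   % sum of absolute coordinate differences
```

Both Y and Ycity are row vectors with nPairs elements; only the values differ between metrics.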
Note: You can optionally normalize the values in the data set before calculating the distance information. In a real world data set, variables can be measured against different scales. For example, one variable can measure Intelligence Quotient (IQ) test scores and another variable can measure head circumference. These discrepancies can distort the proximity calculations. Using the zscore function, you can convert all the values in the data set to use the same proportional scale. See zscore for more information.
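A small sketch of that normalization follows. The IQ and head-circumference values are made up for illustration and only serve to show the scale mismatch.

```matlab
% Hypothetical data: column 1 = IQ score, column 2 = head circumference (cm).
data = [100 55.0; 115 57.5; 90 54.0; 130 58.0];

dataZ = zscore(data);   % each column rescaled to mean 0, standard deviation 1
Yraw = pdist(data);     % distances dominated by the larger-valued IQ column
Yz = pdist(dataZ);      % both variables contribute on a comparable scale
```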
For example, consider a data set, X, made up of five objects where each object is a set of x,y coordinates.
Object 1: 1, 2
Object 2: 2.5, 4.5
Object 3: 2, 2
Object 4: 4, 1.5
Object 5: 4, 2.5
You can define this data set as a matrix
rng('default') % For reproducibility
X = [1 2; 2.5 4.5; 2 2; 4 1.5; ...
    4 2.5];
and pass it to pdist. The pdist function calculates the distance between object 1 and object 2, object 1 and object 3, and so on until the distances between all the pairs have been calculated. The following figure plots these objects in a graph. The Euclidean distance between object 2 and object 3 is shown to illustrate one interpretation of distance.
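For instance, the Euclidean distance between object 2 and object 3 can be computed by hand and compared with what pdist returns for that pair. The variable names below are illustrative.

```matlab
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];

% By hand: square root of the summed squared coordinate differences.
d23 = sqrt(sum((X(2,:) - X(3,:)).^2));   % sqrt(0.5^2 + 2.5^2) = sqrt(6.5)

% pdist applied to just these two objects gives the same value (about 2.5495).
d23pdist = pdist(X([2 3], :));
```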
Distance Information
The pdist function returns this distance information in a vector, Y, where each element contains the distance between a pair of objects.
Y = pdist(X)
Y =
  Columns 1 through 7
    2.9155    1.0000    3.0414    3.0414    2.5495    3.3541    2.5000
  Columns 8 through 10
    2.0616    2.0616    1.0000
To make it easier to see the relationship between the distance information generated by pdist and the objects in the original data set, you can reformat the distance vector into a matrix using the squareform function. In this matrix, element i,j corresponds to the distance between object i and object j in the original data set. In the following example, element 1,1 represents the distance between object 1 and itself (which is zero). Element 1,2 represents the distance between object 1 and object 2, and so on.
squareform(Y)
ans =
         0    2.9155    1.0000    3.0414    3.0414
    2.9155         0    2.5495    3.3541    2.5000
    1.0000    2.5495         0    2.0616    2.0616
    3.0414    3.3541    2.0616         0    1.0000
    3.0414    2.5000    2.0616    1.0000         0
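The sketch below looks up one distance both ways: through the square matrix and directly in the vector. The index expression assumes the pair ordering (1,2), (1,3), ..., (m-1,m) visible in the output of pdist, and it applies only for i < j. The variable names are illustrative.

```matlab
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Y = pdist(X);
D = squareform(Y);      % 5-by-5 symmetric matrix with a zero diagonal

i = 2; j = 3;           % look up the pair (2,3), with i < j
m = size(X, 1);
k = (i - 1)*(m - i/2) + j - i;   % position of pair (i,j) in Y; here k = 5
dMatrix = D(i, j);      % distance read from the matrix
dVector = Y(k);         % the same distance read from the vector

Yback = squareform(D);  % squareform also converts the matrix back to a vector
```

The matrix form is easier to read, while the vector form stores each pair only once, which is why linkage works from Y.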
Linkages
Once the proximity between objects in the data set has been computed, you can determine how objects in the data set should be grouped into clusters, using the linkage function.
The linkage function takes the distance information generated by pdist and links pairs of objects that are close together into binary clusters (clusters made up of two objects). The linkage function then links these newly formed clusters to each other and to other objects to create bigger clusters until all the objects in the original data set are linked together in a hierarchical tree.
For example, given the distance vector Y generated by pdist from the sample data set of x- and y-coordinates, the linkage function generates a hierarchical cluster tree, returning the linkage information in a matrix, Z.
Z = linkage(Y)
Z =
    4.0000    5.0000    1.0000
    1.0000    3.0000    1.0000
    6.0000    7.0000    2.0616
    2.0000    8.0000    2.5000
In this output, each row identifies a link between objects or clusters. The first two columns identify the objects that have been linked. The third column contains the distance between these objects. For the sample data set of x- and y-coordinates, the linkage function begins by grouping objects 4 and 5, which have the closest proximity (distance value = 1.0000).
The linkage function continues by grouping objects 1 and 3, which also have a distance value of 1.0000.
The third row indicates that the linkage function grouped objects 6 and 7. If the original sample data set contained only five objects, what are objects 6 and 7?
Object 6 is the newly formed binary cluster created by the grouping of objects 4 and 5. When the linkage function groups two objects into a new cluster, it must assign the cluster a unique index value, starting with the value m + 1, where m is the number of objects in the original data set. (Values 1 through m are already used by the original data set.) Similarly, object 7 is the cluster formed by grouping objects 1 and 3.
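The numbering rule can be traced with a short loop over the rows of Z: row r creates cluster m + r. This is a sketch for reading the output, with illustrative variable names.

```matlab
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Z = linkage(pdist(X));

m = size(Z, 1) + 1;               % number of original objects (Z has m-1 rows)
newIndex = m + (1:size(Z, 1))';   % index assigned to the cluster each row creates

for r = 1:size(Z, 1)
    fprintf('Cluster %d joins %d and %d at distance %.4f\n', ...
        newIndex(r), Z(r, 1), Z(r, 2), Z(r, 3));
end
```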
linkage uses distances to determine the order in which it clusters objects. The distance vector Y contains the distances between the original objects 1 through 5. But linkage must also be able to determine distances involving clusters that it creates, such as objects 6 and 7. By default, linkage uses a method known as single linkage. However, there are a number of different methods available. See the linkage reference page for more information.
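As a sketch of what single linkage means here, the distance between cluster 6 (objects 4 and 5) and cluster 7 (objects 1 and 3) is the smallest distance between any member of one and any member of the other. The 'complete' call is shown only as one example of an alternative method.

```matlab
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Y = pdist(X);
D = squareform(Y);
Z = linkage(Y);                        % default method: single linkage

% Single linkage: minimum over all cross-cluster pairs of original objects.
dSingle = min(min(D([4 5], [1 3])));   % matches Z(3,3), about 2.0616

% Complete linkage uses the maximum cross-cluster distance instead.
Zcomplete = linkage(Y, 'complete');
```

Changing the method changes the heights in the third column of the output and can change which clusters are merged at each step.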
As the final cluster, the linkage function grouped object 8, the newly formed cluster made up of objects 6 and 7, with object 2 from the original data set. The following figure graphically illustrates the way linkage groups the objects into a hierarchy of clusters.
Dendrograms
The hierarchical, binary cluster tree created by the linkage function is most easily understood when viewed graphically. The Statistics Toolbox function dendrogram plots the tree as follows.
dendrogram(Z)
In the figure, the numbers along the horizontal axis represent the indices of the objects in the original data set. The links between objects are represented as upside-down U-shaped lines. The height of the U indicates the distance between the objects. For example, the link representing the cluster containing objects 1 and 3 has a height of 1. The link representing the cluster that groups object 2 together with objects 1, 3, 4, and 5 (which are already clustered as object 8) has a height of 2.5. The height represents the distance linkage computes between objects 2 and 8. For more information about creating a dendrogram diagram, see the dendrogram reference page.
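A minimal plotting sketch, assuming a graphics display is available, is shown here. The axis labels are additions for readability and are not part of the dendrogram call itself.

```matlab
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Z = linkage(pdist(X));

figure
H = dendrogram(Z);        % H holds handles to the plotted lines
xlabel('Object index')
ylabel('Distance')

linkHeights = Z(:, 3);    % heights of the U-shaped links, lowest to highest
```

The values in linkHeights are the same heights described above: 1 for the two lowest links and 2.5 for the link that joins object 2 to the rest.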
Verify the Cluster Tree
After linking the objects in a data set into a hierarchical cluster tree, you might want to verify that the distances (that is, heights) in the tree reflect the original distances accurately. In addition, you might want to investigate natural divisions that exist among links between objects. Statistics Toolbox functions are available for both of these tasks, as described in the following sections.
Verify Dissimilarity
In a hierarchical cluster tree, any two objects in the original data set are eventually linked together at some level. The height of the link represents the distance between the two clusters that contain those two objects. This height is known as the cophenetic distance between the two objects. One way to measure how well the cluster tree generated by the linkage function reflects your data is to compare the cophenetic distances with the original distance data generated by the pdist function. If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the distance vector. The cophenet function compares these two sets of values and computes their correlation, returni
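A minimal sketch of that comparison with cophenet, using the sample data from the earlier sections, could look like this. The second output and the cross-check with corrcoef are additions for illustration.

```matlab
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Y = pdist(X);
Z = linkage(Y);

% c: correlation between cophenetic and original distances.
% d: cophenetic distances, in the same pair order as Y.
[c, d] = cophenet(Z, Y);

% Cross-check: c should agree with the linear correlation of Y and d.
R = corrcoef(Y, d);
```

Values of c closer to 1 indicate that the tree heights track the original distances more faithfully.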