Weka [31]: crossValidation Source Code Analysis

Author: Koala++/屈伟

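People on the Weka study forum have helped me download a number of papers, and the group owner would like me to help build up the group forum. The first thing I came across there was cross validation, and the introduction felt a bit too simple, so here I only go one step deeper.

I will again mainly follow Ng's presentation.

When we look at learning algorithms, the goal is always to minimize the empirical error, for example with gradient descent, Newton's method, Lagrange multipliers, or coordinate ascent, all of which I have already covered on my blog.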

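Suppose we obtain a model with the following steps:

1. Train several models on the data set.

2. Pick the model with the smallest training error.

A figure illustrating the problem with this procedure can be found in Chapter 1 of Pattern Recognition and Machine Learning (its original purpose there has little to do with this point).

Roughly, the figure says that once a model is complex enough to represent the concept being learned, moving to an even more complex model lowers its training error but raises its test error.

The figure also carries a deeper message: you can use it to pick an appropriately complex model, instead of adding samples the moment you see a high test error.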

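An alternative to this procedure is hold-out cross validation, also called simple cross validation.

1. Split the data set into two parts: a training set (say 70% of the data) and a test set (say 30%). The test set is also called the hold-out cross validation set.

2. Train several models on the training set.

3. Compute each model's classification error on the test set and pick the model with the smallest error.

The test set is usually between 1/4 and 1/3 of the data set. 30% is a typical choice.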
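The split in step 1 can be sketched in a few lines of Java. This is only an illustration: `HoldOutSplit`, `split`, and the `seed` parameter are names made up here, not part of Weka, and the sketch works on row indices rather than on real instances.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of Weka: a hold-out split over row indices. */
final class HoldOutSplit {
    /** Returns a list holding {trainIndices, testIndices}; testFraction is the share held out. */
    static List<List<Integer>> split(int numRows, double testFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numRows; i++) {
            indices.add(i);
        }
        // Shuffle first so the split does not depend on the original row order.
        Collections.shuffle(indices, new Random(seed));
        int numTest = (int) Math.round(numRows * testFraction);
        List<Integer> test = new ArrayList<>(indices.subList(0, numTest));
        List<Integer> train = new ArrayList<>(indices.subList(numTest, numRows));
        List<List<Integer>> result = new ArrayList<>();
        result.add(train);
        result.add(test);
        return result;
    }
}
```

Calling `HoldOutSplit.split(10, 0.3, 42L)` yields 7 training indices and 3 test indices. Every candidate model would then be trained on the first list and scored on the second.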

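The reason for doing this: if minimizing the empirical error is the only goal, the model that ends up being chosen is an overfit one.

The drawback of this method is that it wastes 30% of the data. Even if we pick the model we consider reasonable and then retrain it on all of the data, there is still no guarantee that this model is the best one.

With plenty of training samples this hardly matters. With too few it does, because different models need different numbers of samples to train (I am not sure how best to put this; see learning theory, which derives the minimum number of samples required). To put it more plainly: when the samples are insufficient, or when we do not know whether they are sufficient, model A beating model B on 70% of the data does not imply that model A beats model B on 100% of the data.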

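The next method is k-fold cross validation. I will not write pseudocode for it, since we are about to read the real code. A typical choice is k=10. Compared with the hold-out method it sets aside only 1/k of the data in each round, but it also has to train k times, k times as often as before (that is not entirely fair, since even with a single train/test split you would hardly test only once).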
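To make "holds out 1/k of the data" concrete, here is a small sketch that computes which contiguous block of rows serves as the test set in each fold. `KFoldRanges` and `testRange` are hypothetical names, and I am not claiming this is Weka's exact implementation, only one reasonable way to cut n rows into k blocks whose sizes differ by at most one.

```java
/** Hypothetical helper, not Weka code: contiguous test blocks for k-fold cross validation. */
final class KFoldRanges {
    /** Returns {first, size} of the test block for fold `fold` (0-based) out of `k` folds over `n` rows. */
    static int[] testRange(int n, int k, int fold) {
        int base = n / k;   // every fold gets at least this many rows
        int extra = n % k;  // the first `extra` folds get one additional row
        int size = base + (fold < extra ? 1 : 0);
        int first = fold * base + Math.min(fold, extra);
        return new int[] {first, size};
    }
}
```

For n=10 and k=3 the test blocks start at rows 0, 4 and 7 with sizes 4, 3 and 3. The training set of each fold is everything outside its block. Setting k equal to n makes every block a single row.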

There is also the point made in Pattern Recognition and Machine Learning:

A further problem with techniques such as cross-validation that use separate data to assess performance is that we might have multiple complexity parameters for a single model (for instance, there might be several regularization parameters). Exploring combinations of settings for such parameters could, in the worst case, require a number of training runs that is exponential in the number of parameters.

When data is really scarce, a special form of cross validation is called for: leave-one-out cross validation, in which each training run leaves out exactly one sample and uses that sample for testing.

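In Ng's course, one of his students asked a very interesting question. In learning theory (which matters quite a bit, and which Ng treats as one of the marks of whether you have really gotten started), can't we derive, under certain conditions, the number of samples a model needs in order to learn? Why then do we still need methods such as model selection?

Below is Ng's answer, with each passage followed by my rough paraphrase: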

It turns out that when you're proving learning theory bounds, very often the bounds will be extremely loose because you're sort of proving the worse case upper bound that holds true even for very bad – what is it – so the bounds that I proved just now; right? That holds true for absolutely any probability distribution over training examples; right? So just assume the training examples we've drawn, iid from some distribution script d, and the bounds I proved hold true for absolutely any probability distribution over script d. And chances are whatever real life distribution you get over, you know, houses and their prices or whatever, is probably not as bad as the very worse one you could've gotten; okay? And so it turns out that if you actually plug in the constants of learning theory bounds, you often get extremely large numbers.

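Paraphrase: when you prove learning theory bounds, the bounds are usually extremely loose, because what you prove is a worst-case upper bound, one that holds even in very bad situations, that is, whatever probability distribution the training samples follow. Real-world data is probably not as bad as the worst case. When you plug the constants into the learning theory bounds (the constants that, as in time and space analysis of algorithms, are normally ignored), you usually get a very large number.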

Take logistic regression – logistic regression you have ten parameters and 0.01 error, and with 95 percent probability. How many training examples do I need? If you actually plug in actual constants into the text for learning theory bounds, you often get extremely pessimistic estimates with the number of examples you need. You end up with some ridiculously large numbers. You would need 10,000 training examples to fit ten parameters. So a good way to think of these learning theory bounds is – and this is why, also, when I write papers on learning theory bounds, I quite often use big-O notation to just absolutely just ignore the constant factors because the bounds seem to be very loose.

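Paraphrase: take logistic regression with 10 parameters, where you want an error below 0.01 with 95% probability. How many samples do you need? If you plug the constants into the bound, you get a very pessimistic estimate, a ridiculously large number such as 10,000 training samples to fit 10 parameters. So a good way to read these learning bounds is to ignore the constants, because the bounds are very loose.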
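To get a feel for how pessimistic such numbers can become, here is a sketch that plugs constants into one standard bound. Which bound to use is my assumption, not something stated in the lecture excerpt: for a finite hypothesis class of size k, Hoeffding's inequality plus the union bound give that m >= (1 / (2γ²)) · ln(2k / δ) samples suffice for every hypothesis to have training error within γ of its generalization error with probability at least 1 − δ. Treating each parameter as a 64-bit number, so that k = 2^(64 · numParams), is a further simplifying assumption. `SampleBound` and `requiredSamples` are made-up names.

```java
/** Illustrative only: plugs constants into a finite-hypothesis-class sample bound. */
final class SampleBound {
    /**
     * Smallest integer m with m >= (1 / (2 * gamma^2)) * ln(2k / delta),
     * where ln k = numParams * bitsPerParam * ln 2 (an assumed encoding of the parameters).
     */
    static long requiredSamples(int numParams, int bitsPerParam, double gamma, double delta) {
        double logK = numParams * (double) bitsPerParam * Math.log(2.0);
        double logTerm = Math.log(2.0 / delta) + logK;
        return (long) Math.ceil(logTerm / (2.0 * gamma * gamma));
    }
}
```

With `requiredSamples(10, 64, 0.01, 0.05)` the result is on the order of two million samples for ten parameters. That differs from the 10,000 quoted in the lecture, which presumably rests on a different bound or on rounder numbers, but the moral is the same: the magnitude is far too pessimistic to use directly, while the shape (linear in the number of parameters) is the useful part.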

There are some attempts to use these bounds to give guidelines as to what model to choose, and so on. But I personally tend to use the bounds – again, intuition about – for example, what are the number of training examples you need grows linearly in the number of parameters or what are your grows x dimension in number of parameters; whether it goes quadratic – parameters? So it's quite often the shape of the bounds. The fact that the number of training examples – the fact that some complexity is linear in the VC dimension, that's sort of a useful intuition you can get from these theories. But the actual magnitude of the bound will tend to be much looser than will hold true for a particular problem you are working on.

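Paraphrase: there are approaches that use these bounds to guide the choice of model, but I personally tend to use the bounds for intuition, for example whether the number of samples you need grows linearly or quadratically in the number of parameters. So what you usually take away is the shape of the relationship. The actual magnitude of the bound is much looser than what holds for the particular problem you are working on.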

The code lives in the classifiers.Evaluation class:

/**
 * Performs a (stratified if class is nominal) cross-validation
 * for a classifier on a set of instances. Now performs
 * a deep copy of the classifier before each call to
 * buildClassifier() (just in case the classifier is not
 * initialized properly).
 */
public void crossValidateModel(Classifier classifier, Instances data,
                               int numFolds, Random random) throws Exception {

  // Make a copy of the data we can reorder
  data = new Instances(data);
  data.randomize(random);
  if (data.classAttribute().isNominal()) {
    data.stratify(numFolds);
  }

  // Do the folds
  for (int i = 0; i < numFolds; i++) {
    Instances train = data.trainCV(numFolds, i, random);
    setPriors(train);
    // Train a fresh copy so that no state leaks from one fold to the next
    Classifier copiedClassifier = Classifier.makeCopy(classifier);
    copiedClassifier.buildClassifier(train);
    Instances test = data.testCV(numFolds, i);
    // Statistics accumulate inside this Evaluation object across folds
    evaluateModel(copiedClassifier, test);
  }
  m_NumFolds = numFolds;
}
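The control flow above can be mimicked without Weka. The sketch below is a toy, not Weka code: `CrossValidationSketch` and `errorRate` are made-up names, the "classifier" just predicts the majority class of the training rows, and row i is assigned to test fold i % numFolds, which is a simplification of my own rather than what trainCV and testCV do. It skips shuffling and stratification and keeps only the part that matters here: a fresh model per fold, with errors accumulated over all folds.

```java
/** Toy stand-in, not Weka: k-fold loop with a majority-class baseline over int labels. */
final class CrossValidationSketch {
    /** Returns the fraction of rows misclassified when each row is predicted by a model that never saw it. */
    static double errorRate(int[] labels, int numClasses, int numFolds) {
        int errors = 0;
        for (int fold = 0; fold < numFolds; fold++) {
            // "Train" a fresh model on every row that is not in this fold.
            int[] counts = new int[numClasses];
            for (int i = 0; i < labels.length; i++) {
                if (i % numFolds != fold) {
                    counts[labels[i]]++;
                }
            }
            int majority = 0;
            for (int c = 1; c < numClasses; c++) {
                if (counts[c] > counts[majority]) {
                    majority = c;
                }
            }
            // Evaluate on the held-out rows and accumulate across folds.
            for (int i = 0; i < labels.length; i++) {
                if (i % numFolds == fold && labels[i] != majority) {
                    errors++;
                }
            }
        }
        return (double) errors / labels.length;
    }
}
```

Every row is tested exactly once, by a model trained without it, so the returned rate covers the whole data set.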

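randomize is simple:

public void randomize(Random random) {
  for (int j = numInstances() - 1; j > 0; j--)
    swap(j, random.nextInt(j + 1));
}

Note that randomize shuffles from the back of the data set to the front, which does make it a little simpler to write.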
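The same back-to-front loop on a plain int array looks like this. `ShuffleSketch` is a made-up name. The loop is the familiar Fisher-Yates shuffle: slot j receives an element drawn uniformly from positions 0 through j, and is then never touched again.

```java
import java.util.Random;

/** Standalone sketch of the back-to-front shuffle on a plain int array. */
final class ShuffleSketch {
    static void shuffle(int[] values, Random random) {
        for (int j = values.length - 1; j > 0; j--) {
            // nextInt(j + 1) returns a value in [0, j], so slot j may also stay as it is.
            int pick = random.nextInt(j + 1);
            int tmp = values[j];
            values[j] = values[pick];
            values[pick] = tmp;
        }
    }
}
```

Passing a `Random` built from a fixed seed makes the permutation reproducible, which is why crossValidateModel takes the generator as an argument.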

public void stratify(int numFolds) {

  if (numFolds <= 0) {
    throw new IllegalArgumentException(
        "Number of folds must be greater than 1");
  }
  if (m_ClassIndex < 0) {
    throw new UnassignedClassException(
        "Class index is negative (not set)!");
  }
  if (classAttribute().isNominal()) {

    // sort by class
    int index = 1;
    while (index < numInstances()) {
      Instance instance1 = instance(index - 1);
      for (int j = index; j < numInstances(); j++) {
        Instance instance2 = instance(j);
        if ((instance1.classValue() == instance2.classValue())
            || (instance1.classIsMissing() && instance2.classIsMissing())) {
          // Pull the matching instance forward, next to the rest of its class
          swap(index, j);
          index++;
        }
      }
      index++;
    }
    stratStep(numFolds);
  }
}

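If the class is nominal, the two nested loops place samples of the same class next to each other. "Stratify" literally means to arrange in layers: once the loops finish, the class sequence looks like class0, class0, …, class0, class1, …, class1, …, classN, …, classN.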
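Running the same two loops on a bare array of class labels shows what they do. `GroupByClass` and `group` are made-up names, and the sketch leaves out the missing-class case.

```java
/** Standalone sketch of the grouping loops in stratify, on int class labels. */
final class GroupByClass {
    static void group(int[] labels) {
        int index = 1;
        while (index < labels.length) {
            // The class of the element just before `index` is the one being collected.
            int current = labels[index - 1];
            for (int j = index; j < labels.length; j++) {
                if (labels[j] == current) {
                    // Move the match up to position `index`, extending the current group.
                    int tmp = labels[index];
                    labels[index] = labels[j];
                    labels[j] = tmp;
                    index++;
                }
            }
            // Skip over the first element of the next group; it becomes `current`.
            index++;
        }
    }
}
```

For the input {0, 1, 0, 2, 1, 0} the array ends up as {0, 0, 0, 2, 1, 1}. Each class forms one contiguous run, but the runs come out in the order in which the loops reach them, not sorted by class value. That is enough for the interleaving step that follows.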

The code of stratStep is as follows:

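protected void stratStep(int numFolds) {

  FastVector newVec = new FastVector(m_Instances.capacity());
  int start = 0, j;

  // create stratified batch
  while (newVec.size() < numInstances()) {
    j = start;
    // Take every numFolds-th instance, beginning at offset start
    while (j < numInstances()) {
      newVec.addElement(instance(j));
      j = j + numFolds;
    }
    start++;
  }
  m_Instances = newVec;
}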

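It picks every numFolds-th sample. The aim is to spread the classes out as much as possible, so that no training or test set ends up made of a single class (or, more likely, ends up with a very uneven class distribution).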
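Here is the same stride-based reordering on a list of class labels that has already been grouped. `StratStepSketch` and `interleave` are made-up names, and the sketch assumes numFolds is at least 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of stratStep on a list of already grouped class labels. */
final class StratStepSketch {
    static List<Integer> interleave(List<Integer> grouped, int numFolds) {
        List<Integer> result = new ArrayList<>(grouped.size());
        int start = 0;
        while (result.size() < grouped.size()) {
            // Take positions start, start + numFolds, start + 2 * numFolds, ...
            for (int j = start; j < grouped.size(); j += numFolds) {
                result.add(grouped.get(j));
            }
            start++;
        }
        return result;
    }
}
```

With the labels [0, 0, 0, 0, 1, 1, 1, 1] and numFolds = 2 the result is [0, 0, 1, 1, 0, 0, 1, 1]. Cutting that into two contiguous halves gives two folds that each hold two samples of each class, the same proportions as the full data set.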

In crossValidateModel, trainCV produces the training set, a classifier is learned on it, testCV produces the test set, and the classifier is tested on that test set by evaluateModel:

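public double[] evaluateModel(Classifier classifier, Instances data)
    throws Exception {

  double predictions[] = new double[data.numInstances()];

  // Classify each test instance in turn and record the prediction
  for (int i = 0; i < data.numInstances(); i++) {
    predictions[i] = evaluateModelOnce((Classifier) classifier, data.instance(i));
  }
  return predictions;
}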

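At this point we can more or less call it done. updateStatsForClassifier, which evaluateModelOnce calls, can be read when the need arises; it is already some distance away from cross validation itself.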
