Implementing the Random Forest Algorithm in R, with a Detailed Code Walkthrough

Uploaded: 2022-12-07
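library(caret)
library(dplyr)

# The start of this statement (and the data-loading step before it) is cut off
# in this excerpt; the column names are reconstructed from the data printed
# further down. abalone_data is assumed to be loaded already.
colnames(abalone_data) <- c("sex", "length", "diameter", "height",
                            "whole.weight", "shucked.weight",
                            "viscera.weight", "shell.weight", "age")

# dichotomize the outcome: add a logical variable "old" (age > 10)
abalone_data <- abalone_data %>%
  mutate(old = age > 10) %>%
  # remove the "age" variable
  select(-age)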

# split the data into a training set and a test set (90% / 10%)
set.seed(23489)
train_index <- sample(1:nrow(abalone_data), 0.9 * nrow(abalone_data))
abalone_train <- abalone_data[train_index, ]
abalone_test <- abalone_data[-train_index, ]

# remove the original dataset
rm(abalone_data)

# view the first 6 rows of the training data
head(abalone_train)

This prints the first six rows of the training data.

The next step is to fit the random forest model.

rf_fit <- train(as.factor(old) ~ .,
                data = abalone_train,
                method = "ranger")

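By default, when train is called without further arguments it refits the model on 25 bootstrap samples and across 3 values of the tuning parameter. For ranger the tuning parameter is mtry: the number of predictors randomly selected at each split in a tree.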
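The candidate values that train tries can also be spelled out explicitly. The following is a minimal sketch, not part of the original walkthrough: it uses the built-in iris data so that it runs on its own, and the grid values, the seed and the reduced number of bootstrap repetitions are illustrative assumptions. It relies on the tuneGrid argument of train and on the three tuning parameters that caret exposes for ranger (mtry, splitrule, min.node.size).

```r
library(caret)

# Hypothetical tuning grid: the candidate values are illustrative only
tune_grid <- expand.grid(mtry = c(1, 2, 4),
                         splitrule = "gini",
                         min.node.size = 1)

set.seed(1)
fit <- train(Species ~ .,
             data = iris,
             method = "ranger",
             tuneGrid = tune_grid,
             # 5 bootstrap repetitions instead of the default 25, to keep it fast
             trControl = trainControl(method = "boot", number = 5))

# One row of resampled performance per candidate, and the selected candidate
fit$results[, c("mtry", "splitrule", "Accuracy", "Kappa")]
fit$bestTune
```

Supplying the grid makes the search reproducible and documents which values were compared, at the cost of having to choose sensible candidates yourself.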

rf_fit
## Random Forest
##
## 3759 samples
##    8 predictor
##    2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3759, 3759, 3759, 3759, 3759, 3759, ...
## Resampling results across tuning parameters:
##
##   mtry  splitrule   Accuracy   Kappa
##   2     gini        0.7828887  0.5112202
##   2     extratrees  0.7807373  0.4983028
##   5     gini        0.7750120  0.4958132
##   5     extratrees  0.7806244  0.5077483
##   9     gini        0.7681104  0.4819231
##   9     extratrees  0.7784264  0.5036977
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 1.

Evaluating the model on an independent test set is just as simple with the built-in predict function.

# predict the outcome on a test set
abalone_rf_pred <- predict(rf_fit, abalone_test)

# compare predicted outcome and true outcome
confusionMatrix(abalone_rf_pred, as.factor(abalone_test$old))

## Confusion Matrix and Statistics
##
##           Reference
## Prediction FALSE TRUE
##      FALSE   231   52
##      TRUE     42   93
##
##                Accuracy : 0.7751
##                  95% CI : (0.732, 0.8143)
##     No Information Rate : 0.6531
##     P-Value [Acc > NIR] : 3.96e-08
##
##                   Kappa : 0.4955
##  Mcnemar's Test P-Value : 0.3533
##
##             Sensitivity : 0.8462
##             Specificity : 0.6414
##          Pos Pred Value : 0.8163
##          Neg Pred Value : 0.6889
##              Prevalence : 0.6531
##          Detection Rate : 0.5526
##    Detection Prevalence : 0.6770
##       Balanced Accuracy : 0.7438
##
##        'Positive' Class : FALSE
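To see where the reported statistics come from, the main ones can be recomputed by hand from the four cell counts of the confusion matrix above. This is a small base R sketch added for illustration; the variable names are made up here, and FALSE is treated as the positive class, as in the output.

```r
# Cell counts from the confusion matrix: rows = prediction, columns = reference
cm <- matrix(c(231, 42, 52, 93), nrow = 2,
             dimnames = list(Prediction = c("FALSE", "TRUE"),
                             Reference  = c("FALSE", "TRUE")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])  # true positives / all reference positives
specificity <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])     # true negatives / all reference negatives
prevalence  <- sum(cm[, "FALSE"]) / n
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, prevalence = prevalence,
        balanced_accuracy = balanced_accuracy, kappa = kappa), 4)
```

Note that the prevalence of the positive class equals the no-information rate here, because FALSE is also the majority class.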

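We have now seen how to fit a model, together with the default resampling scheme (bootstrapping) and parameter selection.

That is already useful, but caret can do much more.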

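Pre-processing (preProcess)

caret makes many pre-processing steps easy to carry out.

Several stand-alone caret functions target specific problems that can come up when setting up a model.

These include:

dummyVars: creates dummy variables from categorical variables with multiple categories

nearZeroVar: identifies zero-variance and near-zero-variance predictors (which can cause problems when subsampling)

findCorrelation: identifies correlated predictors

findLinearCombos: identifies linear dependencies between predictors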
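The four helper functions listed above can be tried on a small made-up data frame. The sketch below is an illustration only: the toy data, its column names and the 0.9 correlation cutoff are assumptions chosen so that each function has something to detect.

```r
library(caret)

set.seed(42)
toy <- data.frame(
  colour   = factor(rep(c("red", "green", "blue"), length.out = 60)),
  x1       = rnorm(60),
  constant = rep(1, 60)                        # zero-variance column
)
toy$x2 <- toy$x1 * 2 + rnorm(60, sd = 0.01)    # almost perfectly correlated with x1
toy$x3 <- rnorm(60)
toy$x4 <- toy$x1 + toy$x3                      # exact linear combination of x1 and x3

# dummyVars: expand the factor into one indicator column per level
dummies <- dummyVars(~ colour, data = toy)
colour_dummies <- predict(dummies, newdata = toy)
head(colour_dummies)

# nearZeroVar: names of predictors with zero or near-zero variance
nzv_cols <- nearZeroVar(toy, names = TRUE)
nzv_cols

# findCorrelation: columns suggested for removal, given a correlation matrix
high_cor <- findCorrelation(cor(toy[, c("x1", "x2", "x3")]),
                            cutoff = 0.9, names = TRUE)
high_cor

# findLinearCombos: exact linear dependencies in a numeric matrix
combos <- findLinearCombos(as.matrix(toy[, c("x1", "x3", "x4")]))
combos
```

Each function only reports candidates; dropping or transforming the flagged columns is left to the user.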

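Besides these individual functions, there is the preProcess function, which handles the more common tasks such as centering and scaling, imputation and transformation.

preProcess takes the data frame to be processed and a method, which can be any of "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica", "spatialSign", "corr", "zv", "nzv" and "conditionalX".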

# center and scale the numeric predictors
# identify and remove variables with near zero variance
# perform pca
abalone_no_nzv_pca <- preProcess(select(abalone_train, -old),
                                 method = c("center", "scale", "nzv", "pca"))
abalone_no_nzv_pca

## Created from 3759 samples and 8 variables
##
## Pre-processing:
##   - centered (7)
##   - ignored (1)
##   - principal component signal extraction (7)
##   - scaled (7)
##
## PCA needed 3 components to capture 95 percent of the variance

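# identify which variables were ignored, centered, scaled, etc
abalone_no_nzv_pca$method
## $center
## [1] "length"         "diameter"       "height"         "whole.weight"
## [5] "shucked.weight" "viscera.weight" "shell.weight"
##
## $scale
## [1] "length"         "diameter"       "height"         "whole.weight"
## [5] "shucked.weight" "viscera.weight" "shell.weight"
##
## $pca
## [1] "length"         "diameter"       "height"         "whole.weight"
## [5] "shucked.weight" "viscera.weight" "shell.weight"
##
## $ignore
## [1] "sex"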

# identify the principal components
abalone_no_nzv_pca$rotation
##                       PC1         PC2        PC3
## length         -0.3833860 -0.02483364  0.5915467
## diameter       -0.3837457 -0.05161255  0.5853768
## height         -0.3464346 -0.87775177 -0.2975826
## whole.weight   -0.3909385  0.22610064 -0.2335635
## shucked.weight -0.3785309  0.33107101 -0.2537499
## viscera.weight -0.3818968  0.24715579 -0.2842531
## shell.weight   -0.3791751  0.06675157 -0.1382400
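A preProcess object only stores the estimated transformation; the data are transformed by calling predict on it. The following sketch is an addition for illustration, using the built-in iris data, an arbitrary seed and an 80/20 split as assumptions. It estimates centering, scaling and PCA on the training rows only and then applies the same transformation to both training and held-out rows.

```r
library(caret)

set.seed(7)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_x <- iris[in_train, 1:4]
test_x  <- iris[-in_train, 1:4]

# Estimate means, standard deviations and PCA loadings on the training rows only
pp <- preProcess(train_x, method = c("center", "scale", "pca"))

# Apply the stored transformation to both sets
train_pcs <- predict(pp, newdata = train_x)
test_pcs  <- predict(pp, newdata = test_x)

head(train_pcs)
colMeans(train_pcs)   # numerically zero, because the training data were centered
```

Fitting the transformation on the training set alone keeps information from the test set out of the model-building step.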

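Data splitting (createDataPartition and groupKFold)

The createDataPartition function makes it easy to generate subsets of the data.

Although it can be used simply to produce training and test sets, it can also subset the data while respecting important groupings that exist within it.

First, we show an example of ordinary sample splitting that generates 10 different 80% subsamples.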

# identify the indices of 10 80% subsamples of the iris data
train_index <- createDataPartition(iris$Species,
                                   p = 0.8,
                                   list = FALSE,
                                   times = 10)

# look at the first 6 indices of each subsample
head(train_index)

##      Resample01 Resample02 Resample03 Resample04 Resample05 Resample06
## [1,]          1          2          1          1          2          1
## [2,]          2          3          2          2          4          2
## [3,]          3          4          3          3          5          5
## [4,]          4          5          4          5          6          6
## [5,]          5          6          5          7          7          7
## [6,]          7          7          6          9          8          8
##      Resample07 Resample08 Resample09 Resample10
## [1,]          1          1          1          1
## [2,]          2          2          2          2
## [3,]          3          3          3          3
## [4,]          4          4          4          4
## [5,]          6          5          5          5
## [6,]          7          6          6          6

Useful as this is, the same thing is also very easy to do with a for loop.

Not so exciting.
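For comparison, here is roughly what the for-loop version could look like in base R. It is a sketch under the assumption that "80% subsample" means 80% of the rows of each species, which mirrors the stratified sampling that createDataPartition performs for a factor outcome; the variable names and the seed are made up for this illustration.

```r
set.seed(1)
n_times <- 10   # number of subsamples
p <- 0.8        # fraction of each class to keep

# Row indices of iris, grouped by species
strata <- split(seq_len(nrow(iris)), iris$Species)

subsamples <- vector("list", n_times)
for (i in seq_len(n_times)) {
  # Draw round(p * class size) rows from every class, then pool them
  picked <- lapply(strata, function(idx) sample(idx, round(p * length(idx))))
  subsamples[[i]] <- sort(unlist(picked, use.names = FALSE))
}

lengths(subsamples)                      # size of each subsample
table(iris$Species[subsamples[[1]]])     # class balance of the first subsample
```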

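What is more exciting is the ability to run K-fold cross-validation that respects groupings in the data.

The groupKFold function does exactly that!

As an example, consider the following made-up abalone groups, in which each consecutive set of 5 abalone in the dataset belongs to the same group.

For simplicity, we only consider the first 50 abalone.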

# add a made-up grouping variable that groups each subsequent 5 abalone together
# filter to the first 50 abalone for simplicity
abalone_grouped <- cbind(abalone_train[1:50, ], group = rep(1:10, each = 5))
head(abalone_grouped, 10)

##      sex length diameter height whole.weight shucked.weight viscera.weight
## 3670   I  0.585    0.460  0.140       0.7635         0.3260         0.1530
## 249    I  0.305    0.245  0.075       0.1560         0.0675         0.0380
## 498    F  0.605    0.485  0.165       1.0105         0.4350         0.2090
## 1889   F  0.565    0.445  0.125       0.8305         0.3135         0.1785
## 3488   I  0.510    0.405  0.130       0.5990         0.3065         0.1155
## 1852   I  0.485    0.370  0.115       0.4570         0.1885         0.0965
## 2880   I  0.470    0.385  0.130       0.5870         0.2640         0.1170
## 3203   F  0.620    0.485  0.220       1.5110         0.5095         0.2840
## 365    F  0.620    0.500  0.175       1.1860         0.4985         0.3015
## 2230   M  0.370    0.280  0.095       0.2225         0.0805         0.0510
##      shell.weight   old group
## 3670       0.2650 FALSE     1
## 249        0.0450 FALSE     1
## 498        0.3000  TRUE     1
## 1889       0.2300  TRUE     1
## 3488       0.1485 FALSE     1
## 1852       0.1500 FALSE     2
## 2880       0.1740 FALSE     2
## 3203       0.5100  TRUE     2
## 365        0.3500  TRUE     2
## 2230       0.0750 FALSE     2

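The following code sets up 10-fold cross-validation while respecting the groups in the abalone data.

That is, all abalone in a group must always appear together in the same fold.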

# create grouped K-fold indices (each element holds the training rows of one fold)
group_folds <- groupKFold(abalone_grouped$group, k = 10)
group_folds

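## $Fold01
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## [26] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##
## $Fold02
##  [1]  1  2  3  4  5 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## [26] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##
## $Fold03
##  [1]  1  2  3  4  5  6  7  8  9 10 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## [26] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##
## $Fold04
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 21 22 23 24 25 26 27 28 29 30
## [26] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##
## $Fold05
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 26 27 28 29 30
## [26] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##
## $Fold06
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##
## $Fold07
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##
## $Fold08
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 41 42 43 44 45 46 47 48 49 50
##
## $Fold09
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 46 47 48 49 50
##
## $Fold10
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45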
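The property that matters in the output above is that no group is ever split between the training rows and the held-out rows of a fold. The short check below is an added sketch: it rebuilds the same made-up grouping vector on its own (10 groups of 5) and uses an arbitrary seed, rather than depending on the abalone objects.

```r
library(caret)

group <- rep(1:10, each = 5)   # same made-up grouping as above: 10 groups of 5 rows
set.seed(3)
folds <- groupKFold(group, k = 10)

# Each list element holds the TRAINING rows; the held-out rows are the complement
leaks <- sapply(folds, function(train_rows) {
  held_out <- setdiff(seq_along(group), train_rows)
  # Number of groups that appear on both sides of the split (should be 0)
  length(intersect(group[held_out], group[train_rows]))
})
leaks

# Which groups are held out in each fold
lapply(folds, function(train_rows) unique(group[-train_rows]))
```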

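Resampling options (trainControl)

One of the most important parts of training ML models is tuning parameters.

The trainControl function lets you specify many settings for the training procedure, including the resampling parameters.

The object returned by trainControl is then passed to train as an argument.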
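As a quick illustration of what such a control object holds, the sketch below builds one for repeated cross-validation and inspects a few of its fields. The choice of 5 folds repeated 3 times is an arbitrary assumption made for this example.

```r
library(caret)

# Hypothetical settings: 5-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# The result is a plain list of settings that train() reads via its trControl argument
class(ctrl)
ctrl$method
ctrl$number
ctrl$repeats
```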

set.seed(998)

# create a testing and training set
in_training <- createDataPartition(abalone_train$old, p = .75, list = FALSE)
training <- abalone_train[in_training, ]
testing <- abalone_train[-in_training, ]

# specify the resampling method
fit_control <- trainControl(## 10-fold CV
                            method = "cv",
                            number = 10)

# run a random forest model
set.seed(825)
rf_fit <- train(as.factor(old) ~ .,
                data = abalone_train,
                method = "ranger",
                trControl = fit_control)
rf_fit

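## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3382, 3383, 3384, 3383, 3383, 3382, ...
## Resampling results across tuning parameters:
##
##   mtry  splitrule   Accuracy   Kappa
##   2     gini        0.7871944  0.5188557
##   2     extratrees  0.7869327  0.5113079
##   5     gini        0.7821376  0.5100375
##   5     extratrees  0.7858653  0.5172792
##   9     gini        0.7762887  0.4969119
##   9     extratrees  0.7855958  0.5175170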

Alternatively, we can use the grouped folds (rather than random CV folds) by setting the index argument of trainControl to group_folds.

group_fit_control <- trainControl(## use grouped CV folds
                                  index = group_folds,
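The grouped-fold idea can be tried end to end on a small built-in dataset. The sketch below is not the original author's code: the shuffled iris data, the made-up grouping of 30 groups with 5 rows each, the choice of 5 folds, and the settings tuneLength = 2 and num.trees = 100 are all assumptions made to keep the example self-contained and quick.

```r
library(caret)

set.seed(10)
dat <- iris[sample(nrow(iris)), ]     # shuffle so each group mixes species
group <- rep(1:30, each = 5)          # hypothetical grouping: 30 groups of 5 rows

# Training-row indices for each fold; whole groups are held out together
folds <- groupKFold(group, k = 5)

# Hand the pre-computed folds to caret through the index argument
ctrl <- trainControl(method = "cv", index = folds)

fit <- train(Species ~ .,
             data = dat,
             method = "ranger",
             trControl = ctrl,
             tuneLength = 2,
             num.trees = 100)         # passed through to ranger

fit$resample    # one row of Accuracy/Kappa per grouped fold for the selected model
```

Because the folds are supplied explicitly, caret evaluates every candidate on exactly these splits instead of drawing its own random ones.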
