Even the popular book "Data mining: Practical machine learning tools and techniques with Java" [4] (which covers mostly machine learning material) was originally to be named just "Practical machine learning", and the term "data mining" was only added for marketing reasons.[5] Often the more general terms "(large scale) data analysis" or "analytics" are more appropriate.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of data in order to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). These patterns can then be seen as a kind of summary of the input data, and used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall data mining process as additional steps.
The related terms data dredging, data fishing and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
Contents
∙ 1 Background
  o 1.1 Research and evolution
∙ 2 Process
  o 2.1 Pre-processing
  o 2.2 Data mining
  o 2.3 Results validation
∙ 3 Standards
∙ 4 Notable uses
  o 4.1 Games
  o 4.2 Business
  o 4.3 Science and engineering
  o 4.4 Spatial data mining
    ▪ 4.4.1 Challenges
  o 4.5 Visual Data Mining
  o 4.6 Surveillance
    ▪ 4.6.1 Pattern mining
    ▪ 4.6.2 Subject-based data mining
∙ 5 Privacy concerns and ethics
∙ 6 Marketplace surveys
∙ 7 Groups and associations
∙ 8 See also
  o 8.1 Methods
  o 8.2 Application domains
  o 8.3 Application examples
  o 8.4 Miscellaneous
  o 8.5 Related topics
  o 8.6 Commercial data-mining software and applications
  o 8.7 Free libre open source data-mining software and applications
∙ 9 References
∙ 10 Further reading
∙ 11 External links
Background
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection, storage and manipulation. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1990s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[6] It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)
A primary reason for using data mining is to assist in the analysis of collections of observations of behavior. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub-)set(s) of data being analyzed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviors that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as choice modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.
Research and evolution
The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 they have hosted an annual international conference and published its proceedings,[7] and since 1999 have published a biannual academic journal titled "SIGKDD Explorations".[8]
Computer science conferences on data mining include:
∙ CIKM – ACM Conference on Information and Knowledge Management
∙ DMIN – International Conference on Data Mining
∙ DMKD – Research Issues on Data Mining and Knowledge Discovery
∙ ECDM – European Conference on Data Mining
∙ ECML-PKDD – European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
∙ EDM – International Conference on Educational Data Mining
∙ ICDM – IEEE International Conference on Data Mining
∙ KDD – ACM SIGKDD Conference on Knowledge Discovery and Data Mining
∙ MLDM – Machine Learning and Data Mining in Pattern Recognition
∙ PAKDD – The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining
∙ PAW – Predictive Analytics World
∙ SDM – SIAM International Conference on Data Mining (SIAM)
∙ SSTD – Symposium on Spatial and Temporal Databases
Data mining topics are also present at most data management and database conferences.
Process
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages (1) Selection, (2) Preprocessing, (3) Transformation, (4) Data Mining, and (5) Interpretation/Evaluation.[1] It exists, however, in many variations on this theme, such as the CRoss Industry Standard Process for Data Mining (CRISP-DM), which defines six phases: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Modeling, (5) Evaluation, and (6) Deployment; or a simplified process such as (1) Pre-processing, (2) Data mining, and (3) Results validation.
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyse the multivariate data sets before data mining.
The target set is then cleaned. Data cleaning removes the observations with noise and missing data.
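As a rough illustration of the cleaning step, the sketch below drops observations that have missing values or that fall outside a plausible range. The field names, the sample records and the range used to flag noise are all hypothetical, and a real project would choose its own rules for what counts as noise.

```python
import math

def _is_missing(value):
    """A value counts as missing if it is None or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def clean_records(records, required_fields, valid_ranges):
    """Return only the records that are complete and within plausible ranges.

    Assumption: every field named in valid_ranges is also a required field.
    """
    cleaned = []
    for rec in records:
        # Drop observations with missing data (absent key, None, or NaN).
        if any(_is_missing(rec.get(f)) for f in required_fields):
            continue
        # Drop observations treated as noise: values outside the given range.
        if any(not (lo <= rec[f] <= hi) for f, (lo, hi) in valid_ranges.items()):
            continue
        cleaned.append(rec)
    return cleaned

# Hypothetical target data set.
raw = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},      # missing age
    {"age": 29, "income": float("nan")},   # missing income
    {"age": 412, "income": 51000.0},       # implausible age, treated as noise
    {"age": 45, "income": 61000.0},
]

cleaned = clean_records(raw, ["age", "income"], {"age": (0, 120)})
print(cleaned)
```

Simply discarding incomplete rows is the most basic policy. Depending on the data, imputing missing values may preserve more of the target set.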
Data mining
Data mining involves six common classes of tasks:[1]
∙ Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
∙ Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
∙ Clustering – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
∙ Classification – The task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.
∙ Regression – Attempts to find a function which models the data with the least error.
∙ Summarization – Providing a more compact representation of the data set, including visualization and report generation.
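To make the market basket example from the list above more concrete, here is a minimal sketch that counts how often pairs of products are bought together and derives simple rules of the form "customers who buy X also buy Y". The baskets, the thresholds and the function name `pair_rules` are invented for illustration. Support and confidence are the usual measures for such rules, but this brute-force pair count is not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Find rules lhs -> rhs between single items.

    support    = fraction of baskets containing both items
    confidence = fraction of baskets containing lhs that also contain rhs
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))                 # ignore duplicates in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))  # every unordered pair once

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical purchase records.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets, the only rule that passes both thresholds is milk -> butter: every basket containing milk also contains butter, while bread and butter co-occur often but not reliably enough in either direction.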
Results validation
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set
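The hold-out evaluation described above can be sketched with a toy spam filter. Everything here is an assumption made for illustration: the messages are invented, and the "algorithm" simply learns the words that occur only in spam in the training set. It is deliberately naive so that it picks up a spurious pattern, which the test set then exposes.

```python
def train(examples):
    """Learn the set of words seen in spam but never in legitimate mail."""
    spam_words, ham_words = set(), set()
    for text, is_spam in examples:
        (spam_words if is_spam else ham_words).update(text.split())
    return spam_words - ham_words

def predict(spam_only_words, text):
    """Flag a message as spam if it contains any learned spam-only word."""
    return any(word in spam_only_words for word in text.split())

def accuracy(spam_only_words, examples):
    """Fraction of examples whose predicted label matches the desired label."""
    correct = sum(predict(spam_only_words, text) == is_spam
                  for text, is_spam in examples)
    return correct / len(examples)

# Hypothetical labelled messages: (text, is_spam).
training_set = [
    ("win cash now", True),
    ("cheap pills online", True),
    ("win a free cruise", True),
    ("meeting at noon", False),
    ("lunch tomorrow", False),
    ("project report attached", False),
]
test_set = [
    ("win free pills", True),
    ("cash prize waiting", True),
    ("see you at a meeting", False),   # contains "a", a spurious spam-only word
    ("report due tomorrow", False),
]

model = train(training_set)
train_accuracy = accuracy(model, training_set)
test_accuracy = accuracy(model, test_set)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"test accuracy:     {test_accuracy:.2f}")
```

The filter scores perfectly on the messages it was trained on but misclassifies a legitimate test message, because the word "a" happened to appear only in spam during training. That gap between training and test accuracy is the symptom of overfitting that the test set is meant to reveal.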