Data Preprocessing Techniques (PPT document format).ppt
- Document ID: 14695277
- Uploaded: 2022-10-24
- Format: PPT
- Pages: 50
- Size: 180 KB
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=“”
- noisy: containing errors or outliers
  - e.g., Salary=“-10”
- inconsistent: containing discrepancies in codes or names
  - e.g., Age=“42”, Birthday=“03/07/1997”
  - e.g., was rating “1, 2, 3”, now rating “A, B, C”
  - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data comes from
  - “n/a” data values when collected
  - different considerations between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
- Noisy data comes from the process of data
  - collection
  - entry
  - transmission
- Inconsistent data comes from
  - different data sources
  - functional dependency violations

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse. (Bill Inmon)

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories:
  - intrinsic, contextual, representational, and accessibility.

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

[Figure: Forms of data preprocessing]

II. Data Cleaning
- Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with
  - a global constant: e.g., “unknown”, a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort data and partition into (equi-depth) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and check by human (e.g., deal with possible outliers)
- Regression
  - smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
  - The most straightforward, but outliers may dominate presentation
  - Skewed data is not handled well.
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Cluster Analysis
[Figure: Cluster Analysis]

Regression
[Figure: Regression, axes x and y, fitted line y = x + 1, points X1, Y1, Y1']

III. Data
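Returning to "How to Handle Missing Data?": the three automatic fill strategies can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own code. The function names, the `income` and `cls` attribute names, and the sample rows are all hypothetical, and missing values are assumed to be represented as `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_with_constant(rows, attr, constant="unknown"):
    """Replace missing values of `attr` with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values with the mean of `attr` within the same class.

    Assumes every class has at least one observed value for `attr`.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    class_means = {c: mean(vals) for c, vals in observed.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical sample data: income is missing for two tuples.
rows = [
    {"income": 30, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 50, "cls": "low"},
    {"income": 100, "cls": "high"},
    {"income": None, "cls": "high"},
]

by_mean = fill_with_mean(rows, "income")
by_class = fill_with_class_mean(rows, "income", "cls")
```

With this sample, the overall mean pulls the missing "low" income toward the single high earner, while the class-conditional mean stays within each group, which is why the slides call the latter "smarter".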
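The two partitioning schemes from "Simple Discretization Methods: Binning" can be compared on the price data with a short sketch. The function names are hypothetical. The equal-width version assumes the attribute has at least two distinct values (otherwise W would be zero) and places the maximum value in the last bin, a boundary convention the slides do not specify.

```python
def equal_width_bins(values, n):
    """Split the range [A, B] into n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value would map to index n, so clamp it into the last bin.
        i = min(int((v - a) / w), n - 1)
        bins[i].append(v)
    return w, bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins holding approximately equal counts."""
    s = sorted(values)
    size, extra = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(s[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, by_width = equal_width_bins(prices, 3)
by_depth = equal_depth_bins(prices, 3)
```

On this data the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 each and match the partition used in the smoothing example.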
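The worked example in "Binning Methods for Data Smoothing" can be reproduced with the sketch below. It assumes, as the slide values suggest, that bin means are rounded to the nearest integer. It also assumes that a value equidistant from both boundaries goes to the lower one, a tie rule the slides do not state. The function names are hypothetical.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.

    Each bin must already be sorted; ties go to the lower boundary (assumption).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equi-depth bins from the slide's sorted price data.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

means = smooth_by_means(bins)
boundaries = smooth_by_boundaries(bins)
# means      -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
# boundaries -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The unrounded means of Bin 2 and Bin 3 are 22.75 and 29.25; the slide reports them as 23 and 29.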
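For the regression approach to noisy data, one simple reading of "smooth by fitting the data into regression functions" is an ordinary least-squares line: fit y = slope * x + intercept and replace each observed y with the fitted value. The sketch below makes that assumption. The data points are hypothetical, chosen to scatter around the line y = x + 1 shown in the slide's figure, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    return slope, mean_y - slope * mean_x


def smooth_by_regression(xs, ys):
    """Replace each y with its value on the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations around y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```

For this sample the fitted slope and intercept come out close to 1, so the smoothed values lie near the underlying line rather than on the noisy observations.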
- Keywords: data preprocessing techniques