Data Types Generalization for Data Mining Algorithms
Abstract
With the growth of database applications, mining interesting information from huge databases has become a major concern, and a variety of mining algorithms have been proposed in recent years. The data processed in data mining may be obtained from many sources in which different data types are used. However, no algorithm can be applied to all applications, because the data types an algorithm accepts may not fit the data at hand; the selection of an appropriate mining algorithm therefore depends not only on the goal of the application but also on data fittability. Transforming a non-fitting data type into the target one is thus an important task in data mining, but the work is often tedious or complex because many data types exist in the real world. Merging the similar data types of a selected mining algorithm into a generalized data type seems to be a good approach to reducing the transformation complexity. In this work, the data types fittability problem for six kinds of widely used data mining techniques is discussed, and a data type generalization process including merging and transforming phases is proposed. In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data type generalization process, the user can select an appropriate mining algorithm based solely on the goal of the application, without considering the data types.
1. Introduction
In recent years, the amount of data of all kinds has grown rapidly. Widely available, low-cost computer technology now makes it possible both to collect historical data and to institute on-line analysis of newly arriving data. Automated data generation and gathering leads to tremendous amounts of data stored in databases. Although we are flooded with data, we lack knowledge. Data mining is the automated discovery of non-trivial, previously unknown, and potentially useful knowledge embedded in databases. Different kinds of data mining methods and algorithms have been proposed, each of which has its own advantages and suitable application domains. However, it is difficult for users to choose an appropriate one by themselves. This is because the data provided often cannot be used directly by data mining algorithms. Since most data mining algorithms can only be applied to some specific data types, the types of data stored in databases restrict the choice of data mining methods. If certain kinds of knowledge need to be obtained using some data mining algorithm, data types transformation should be done first; this is what we call "the data types fittability problem" for data mining. For the time being, there is no tool that can help users do this kind of data types transformation. In this paper, we survey and analyze the data types fittability problem for data mining algorithms, and then propose a "data types generalization process" to solve the problem for the attributes in relational databases.
The "data types generalization process", which consists of a merging phase and a transforming phase, is a procedure for transforming the data types of the attributes contained in relations (tables). In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data type generalization process, the user can select an appropriate mining algorithm based solely on the goal of the application, without considering the data types.
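The two phases can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the merge table, the two generalized types ("numeric" and "categorical"), and the converter functions are hypothetical and are not taken from the paper, which does not specify its generalized types in this section.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical merge table (assumption): original column type -> generalized type.
MERGE_TABLE: Dict[str, str] = {
    "smallint": "numeric",
    "integer": "numeric",
    "float": "numeric",
    "char": "categorical",
    "varchar": "categorical",
    "boolean": "categorical",
}


def to_float(values: List) -> List[float]:
    """Convert every value to a float."""
    return [float(v) for v in values]


def to_codes(values: List) -> List[int]:
    """Map each distinct value to an integer code, in order of first appearance."""
    codes: Dict = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def to_labels(values: List) -> List[str]:
    """Treat every value as a discrete label."""
    return [str(v) for v in values]


# Hypothetical transform table (assumption):
# (generalized type, target type required by the algorithm) -> converter.
TRANSFORMS: Dict[Tuple[str, str], Callable[[List], List]] = {
    ("numeric", "numeric"): to_float,
    ("categorical", "numeric"): to_codes,
    ("numeric", "categorical"): to_labels,
    ("categorical", "categorical"): to_labels,
}


def generalize(values: List, original_type: str, target_type: str) -> List:
    """Run the merging phase, then the transforming phase, on one attribute."""
    generalized = MERGE_TABLE[original_type]           # merging phase
    convert = TRANSFORMS[(generalized, target_type)]   # transforming phase
    return convert(values)


print(generalize(["red", "blue", "red"], "varchar", "numeric"))
print(generalize(["3", "4.5"], "integer", "numeric"))
```

The point of merging first is that the transform table only needs one entry per pair of generalized and target types, rather than one per pair of original and target types.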
2. Related work
As mentioned above, because many data mining algorithms can only be applied to a restricted range of data types, users may need to perform data types transformation before the selected algorithm is executed. In this paper, we propose a general concept called the "data types generalization process", which provides a procedure for this kind of data types transformation. Data types generalization can be seen as a pre-processing step of data mining. Of course, other pre-processing steps such as data selection, data cleaning, dimension (attribute) reduction, and missing data handling may also need to be performed before running the selected data mining algorithm. Taken together, the whole process is the so-called KDD (knowledge discovery in databases) process, as shown in Figure 1.
Figure 1: The KDD process and the role of data types generalization.
There is a major difference between the data types generalization process and other data mining pre-processes. Other pre-processes (such as missing value handling) are independent of the selected data mining method; that is, they can be done without knowing which data mining algorithm will be used. The data types generalization process, by contrast, clearly depends on the desired mining method. The aim of data transformation through data types generalization is to make the specified data set suitable for the mining algorithm. To achieve this goal, we must survey both the data types in databases and their relations with various data mining methods. The flow of solving a data mining problem with data transformation is illustrated in Figure 2.
Figure 2: Solving data mining problems with data types transformation.
Some researchers have proposed generalizing the data contained in attributes using "attribute-oriented induction". Attribute-oriented induction, which allows the generalization of data, offers two major advantages for the mining of large databases. First, it allows the raw data to be handled at higher conceptual levels. Generalization is performed with the use of "attribute concept hierarchies", where the leaves of a given attribute's concept hierarchy correspond to the attribute's values in the data (referred to as primitive level data). Generalization of the training data is achieved by replacing primitive level data with higher level concepts.
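As an illustration of replacing primitive level data with higher level concepts, the following Python sketch climbs a small concept hierarchy stored as a child-to-parent mapping. The "city" attribute, the hierarchy contents, and the function names are hypothetical examples, not part of the cited work.

```python
from typing import Dict, List

# Hypothetical concept hierarchy for a "city" attribute (assumption): child -> parent.
# The leaves (cities) are the primitive level data.
HIERARCHY: Dict[str, str] = {
    "Taipei": "Taiwan",
    "Kaohsiung": "Taiwan",
    "Tokyo": "Japan",
    "Osaka": "Japan",
    "Taiwan": "Asia",
    "Japan": "Asia",
}


def climb(value: str, hierarchy: Dict[str, str], levels: int = 1) -> str:
    """Replace a value by its ancestor `levels` steps up; stop early at the root."""
    for _ in range(levels):
        if value not in hierarchy:
            break
        value = hierarchy[value]
    return value


def generalize_column(values: List[str], hierarchy: Dict[str, str], levels: int = 1) -> List[str]:
    """Generalize every value of an attribute by the same number of levels."""
    return [climb(v, hierarchy, levels) for v in values]


cities = ["Taipei", "Tokyo", "Osaka", "Kaohsiung"]
countries = generalize_column(cities, HIERARCHY, levels=1)
print(countries)
print(len(set(cities)), "distinct values before,", len(set(countries)), "after")
```

Each level climbed can only keep or reduce the number of distinct values in the attribute, which is the property the generalization relies on.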
In fact, data generalization using attribute concept hierarchies is a kind of data type transformation that reduces the number of distinct values contained in attributes. We first provide a typical description of the data types fittability problem and a data types generalization process to define and solve the data types transformation problem for attributes. Hence, data generalization using concept hierarchies is included in the process for performing the specified data types transformation.
Another line of related work surveys how to transform data into numerical values. Almost all data-driven algorithms utilize numeric inputs. From a computer processing point of view, handling computations with numbers is easier and more efficient. Therefore, if the input values are non-numeric (e.g., text strings), in many cases they should be intelligently converted to meaningful numerical values. Numerical values can be seen as a data type, and transforming data into numerical values is a kind of data types transformation. These strategies are included in the data types generalization process for performing data types transformation.
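Two common ways of converting text values to numbers are sketched below in Python: ordinal encoding for values with a natural order, and one-hot encoding for values without one. These are standard techniques chosen here for illustration; the text does not say which conversion strategies the surveyed work uses, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def ordinal_encode(values: Sequence[str], order: Sequence[str]) -> List[int]:
    """Encode ordered labels by their rank in `order` (which the caller supplies)."""
    rank = {label: i for i, label in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Encode unordered labels as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Ordered labels keep their order as numbers.
print(ordinal_encode(["low", "high", "medium"], order=["low", "medium", "high"]))

# Unordered labels get one indicator column each, so no false order is implied.
print(one_hot_encode(["red", "blue", "red"]))
```

Ordinal codes are only "meaningful" when the labels really are ordered; for unordered labels, indicator columns avoid implying that one category is larger than another.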
3. Analysis of the data types fittability problem
In recent years, due to the explosion of information and the rapid growth of database applications, data mining techniques have become more and more important. For this reason, different kinds of data mining methods and algorithms have been proposed. However, it is difficult for users without prior knowledge of data mining to choose a suitable one by themselves. In fact, the kind of data mining method that should be applied depends on both the characteristics of the data to be mined and the kind of knowledge to be found through the data mining process. Hence, the types of data stored in databases play an important role during the data mining process and restrict the data mining methods that users can choose. Each kind of data mining method can only be applied to the particular databases suitable for it, and this is what we call "the data types fittability problem" for data mining. To solve this problem, we need to investigate the relationships between the characteristics of the data to be mined and the various kinds of data mining techniques. With these relationships, we can clearly analyze the data types fittability problem and determine whether the data types transformation can be performed. Hence, analyzing these relationships is preparatory work for our data types generalization process, and it explains why the data types generalization process can solve the data fittability problem. We now present the analysis.
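A fittability check of this kind can be sketched in Python as a lookup from mining method to the data types it accepts. The table below is a hypothetical illustration with only two entries, based on the general facts that k-means clustering operates on numeric attributes and association rule mining operates on discrete items; it is not the relationship analysis developed in the paper.

```python
from typing import Dict, List, Set

# Hypothetical table (assumption): mining method -> generalized data types it accepts.
ACCEPTED_TYPES: Dict[str, Set[str]] = {
    "k-means clustering": {"numeric"},
    "association rule mining": {"categorical"},
}


def non_fitting_columns(column_types: Dict[str, str], algorithm: str) -> List[str]:
    """Return the attributes whose type the chosen algorithm does not accept."""
    accepted = ACCEPTED_TYPES[algorithm]
    return sorted(name for name, t in column_types.items() if t not in accepted)


# Hypothetical relation schema: attribute name -> generalized data type.
schema = {"age": "numeric", "income": "numeric", "city": "categorical"}

# Attributes listed here would need data types transformation before mining.
print(non_fitting_columns(schema, "k-means clustering"))
print(non_fitting_columns(schema, "association rule mining"))
```

An empty result means the data already fits the chosen method; a non-empty result identifies exactly which attributes need transformation.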
3.1 Four kinds of data forms for data mining
Data mining techniques can usually be applied to four kinds of data forms: textual, temporal, transactional and relational forms. Different kinds of data forms are used to store different kinds of data types. We describe each kind of data form below:
(1) Textual data forms: Textual data forms are used to represent texts or documents. Basically, this kind of data form can be seen as a very large set of characters.
(2) Temporal data forms: Time-series data is stored in temporal data forms. Data that varies with time (such as historical data) can be stored in the form of numerical time series.
(3) Transactional data forms: For example, the past transactions of a market can be stored in transactional data forms. Each transaction records a list of the items bought in that transaction.
(4) Relational data forms: Relational data forms are the most widely used data forms and can store different kinds of data. The basic units of relational data forms are relations (