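Hadoop Cloud Computing: Foreign-Language Literature in Translation
(The document contains the English original alongside its Chinese translation.)
Original text: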
Meet Hadoop
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
-- Grace Hopper
Data!
We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006, and forecast a tenfold growth to 1.8 zettabytes by 2011. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world.
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• A, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines), or scientific or financial institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals?
I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer, and took photographs throughout his adult life. His entire corpus of medium format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos that my family took last year, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife’s grandfather did, and the rate is increasing every year as it becomes easier to take more and more photos.
More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual’s interactions (phone calls, emails, documents) were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte a month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that.
The trend is for every individual’s data footprint to grow, but perhaps more importantly the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions: all of these contribute to the growing mountain of data.
The volume of data being made publicly available increases every year too. Organizations no longer have to merely manage their own data: success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data. Initiatives such as Public Data Sets on Amazon Web Services, Infochimps.org, and theinfo.org exist to foster the “information commons,” where data can be freely (or in the case of AWS, for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications.
Take, for example, the A project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image, and identifies which part of the sky it is from, and any interesting celestial bodies, such as stars or galaxies. Although it’s still a new and experimental service, it shows the kind of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator.
It has been said that “More data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).
The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.
Data Storage and Analysis
The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds, the rate at which data can be read from drives, have not kept up. One typical drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Almost 20 years later one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
This is a long time to read all data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
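The read times quoted above follow from dividing capacity by transfer speed. The short Python sketch below reproduces the arithmetic; the function name is illustrative, and it assumes a decimal terabyte (1,000,000 MB) and data spread perfectly evenly across drives.

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float, drives: int = 1) -> float:
    """Seconds needed to read capacity_mb split evenly over `drives` disks read in parallel."""
    return capacity_mb / drives / transfer_mb_per_s


# Figures from the text; 1 TB is taken as 1,000,000 MB (an assumption).
drive_1990 = full_read_seconds(1370, 4.4)
drive_1tb = full_read_seconds(1_000_000, 100)
hundred_drives = full_read_seconds(1_000_000, 100, drives=100)

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"1 TB drive: {drive_1tb / 3600:.2f} hours")
print(f"100 drives in parallel: {hundred_drives:.0f} seconds")
```

The single-drive figures come out at roughly five minutes and a little under three hours, while the 100-drive case takes 100 seconds, consistent with "under two minutes".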
Only using one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much.
There’s more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later.
The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has reliability built in.
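To make the map and reduce phases a little more concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the names map_reduce, word_mapper, and sum_reducer are hypothetical, the two lists merely stand in for data read from different disks, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Grouping by key is where data from different sources gets mixed.
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    """Combine all the counts emitted for one word."""
    return sum(values)


# Hypothetical inputs standing in for records stored on two different disks.
disk_a = ["big data is here", "data beats algorithms"]
disk_b = ["more data more insight"]

counts = map_reduce(disk_a + disk_b, word_mapper, sum_reducer)
print(counts["data"])  # 3
print(counts["more"])  # 2
```

The mapper and reducer only ever see keys and values, never disks or files, which is the abstraction the paragraph above describes.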
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
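The hardware-failure argument made earlier in this section can be checked with a little probability. The sketch below assumes drives fail independently and uses a purely illustrative 3% failure probability per drive over some period; neither the figure nor the three-copy example comes from the text.

```python
def prob_any_failure(n_drives: int, p_single: float) -> float:
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives


def prob_all_copies_lost(replicas: int, p_single: float) -> float:
    """Probability that every replica of a piece of data is lost (independent failures)."""
    return p_single ** replicas


P_FAIL = 0.03  # assumed per-drive failure probability, for illustration only

print(f"One drive fails:            {prob_any_failure(1, P_FAIL):.3f}")
print(f"At least one of 100 fails:  {prob_any_failure(100, P_FAIL):.3f}")
print(f"All 3 copies lost:          {prob_all_copies_lost(3, P_FAIL):.6f}")
```

Under these assumptions a failure somewhere among 100 drives is close to certain, while losing every copy of a replicated block remains very unlikely, which is why replication is the usual remedy.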
Comparison with Other Systems
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset, or at least a good portion of it, is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data, and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.
For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words:
“This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.”
By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 14.
RDBMS
Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than it would to stream through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
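A rough cost model shows where the crossover between seeking and streaming might lie. The sketch below is a deliberate simplification: it charges one seek per updated record, ignores caching and B-Tree depth, and assumes a 10 ms seek time, 100-byte records, and a 1 TB dataset. Only the 100 MB/s transfer rate comes from the text; the other figures and the function names are illustrative.

```python
def seek_update_seconds(n_updates: int, seek_s: float) -> float:
    """Cost of in-place updates when each update pays one disk seek."""
    return n_updates * seek_s


def streaming_rewrite_seconds(dataset_bytes: float, transfer_bytes_per_s: float) -> float:
    """Cost of reading the whole dataset once and writing it back once at the transfer rate."""
    return 2 * dataset_bytes / transfer_bytes_per_s


def break_even_fraction(n_records: int, record_bytes: int,
                        seek_s: float, transfer_bytes_per_s: float) -> float:
    """Fraction of records updated at which seeking costs as much as a full rewrite."""
    rewrite = streaming_rewrite_seconds(n_records * record_bytes, transfer_bytes_per_s)
    return rewrite / seek_update_seconds(n_records, seek_s)


# Assumed workload: 10 billion 100-byte records (1 TB), 10 ms seeks, 100 MB/s transfer.
N_RECORDS = 10**10
fraction = break_even_fraction(N_RECORDS, 100, 0.010, 100e6)

print(f"Full streaming rewrite: {streaming_rewrite_seconds(1e12, 100e6) / 3600:.1f} hours")
print(f"Break-even update fraction: {fraction:.4%}")
```

Under these assumptions, seek-based updates win only while a very small share of the records changes; beyond that, streaming through and rebuilding the whole dataset is cheaper, which matches the contrast drawn above.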
Inm