Hadoop Cloud Computing: Translated Foreign-Language Literature (Hadoop云计算外文翻译文献.docx)

Document ID: 26100584
Uploaded: 2023-06-17
Format: DOCX
Pages: 18
Size: 32.09 KB

(The document contains the English original and its Chinese translation.)

Original text:

Meet Hadoop

In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.

- Grace Hopper

Data!

We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006, and is forecasting a tenfold growth by 2011 to 1.8 zettabytes. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world.
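The unit relationships in this paragraph can be checked with a few lines of arithmetic. The sketch below uses decimal (power-of-ten) units, as the text does. The one-terabyte drive size used for the last figure is an assumption made here for illustration; the text itself does not say what size of drive it has in mind.

```python
# Decimal storage units, as used in the text (1 zettabyte = 10^21 bytes).
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21

# The equivalences stated in the paragraph.
exabytes_per_zb = ZETTABYTE // EXABYTE
petabytes_per_zb = ZETTABYTE // PETABYTE
terabytes_per_zb = ZETTABYTE // TERABYTE

# The 2011 forecast of 1.8 zettabytes, kept as an integer to avoid rounding.
forecast_bytes = 18 * 10**20

# Assumption for illustration: every drive holds one terabyte.
one_tb_drives_needed = forecast_bytes // TERABYTE

print(exabytes_per_zb, petabytes_per_zb, terabytes_per_zb)
print(one_tb_drives_needed)
```

Under that assumption the forecast corresponds to 1.8 billion one-terabyte drives, which is within one order of magnitude of a world population of several billion people.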

This flood of data is coming from many sources. Consider the following:

•The New York Stock Exchange generates about one terabyte of new trade data per day.

•Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.

•A, the genealogy site, stores around 2.5 petabytes of data.

•The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.

•The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines), or scientific or financial institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals?

I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer, and took photographs throughout his adult life. His entire corpus of medium format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos that my family took last year, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife’s grandfather did, and the rate is increasing every year as it becomes easier to take more and more photos.

More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual’s interactions (phone calls, emails, documents) were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte a month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that.

The trend is for every individual’s data footprint to grow, but perhaps more importantly the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions: all of these contribute to the growing mountain of data.

The volume of data being made publicly available increases every year too. Organizations no longer have to merely manage their own data: success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data. Initiatives such as Public Data Sets on Amazon Web Services, Infochimps.org, and theinfo.org exist to foster the “information commons,” where data can be freely (or in the case of AWS, for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications.

Take, for example, the A project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image, and identifies which part of the sky it is from, and any interesting celestial bodies, such as stars or galaxies. Although it’s still a new and experimental service, it shows the kind of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator.

It has been said that “More data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).

The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.

Data Storage and Analysis

The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Almost 20 years later one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
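The two read times quoted above follow from dividing capacity by transfer speed. The sketch below reproduces that arithmetic. The function name is made up for this example, and treating one terabyte as 1,000,000 MB is an assumption consistent with the decimal units used elsewhere in the text.

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float) -> float:
    """Time to stream an entire drive once at its sustained transfer speed."""
    return capacity_mb / transfer_mb_per_s


# 1990 drive from the text: 1,370 MB at 4.4 MB/s.
seconds_1990 = full_read_seconds(1370, 4.4)

# Later drive from the text: one terabyte (assumed 1,000,000 MB) at 100 MB/s.
seconds_1tb = full_read_seconds(1_000_000, 100)

print(f"1990 drive: {seconds_1990 / 60:.1f} minutes")
print(f"1 TB drive: {seconds_1tb / 3600:.2f} hours")
```

Capacity grew by a factor of roughly 730 between the two drives while transfer speed grew by a factor of roughly 23, which is why the full-drive read time went from minutes to hours.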

This is a long time to read all data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
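The "under two minutes" figure can be checked with a small calculation. This is an idealized sketch: it assumes the data is split evenly, every drive sustains the same 100 MB/s, and there is no coordination overhead. The function name is hypothetical.

```python
def parallel_read_seconds(total_mb: float, drives: int,
                          transfer_mb_per_s: float) -> float:
    """Idealized read time when the data is split evenly across drives
    that are all read at the same time."""
    per_drive_mb = total_mb / drives
    return per_drive_mb / transfer_mb_per_s


# One terabyte (assumed 1,000,000 MB) spread over 100 drives at 100 MB/s each.
seconds = parallel_read_seconds(1_000_000, drives=100, transfer_mb_per_s=100)
print(f"{seconds:.0f} seconds, about {seconds / 60:.1f} minutes")
```

Each drive only has to deliver 10,000 MB, so the whole read finishes in the time a single drive needs for its own share.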

Only using one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much.

There’s more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later. The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has reliability built in.
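To make the two-part shape of the computation concrete, here is a toy, single-process sketch of a map step, a grouping step, and a reduce step over keys and values. It is not Hadoop's API and has none of its distribution or reliability; the function names and the word-count task are assumptions chosen only for illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map: turn one input record into zero or more (key, value) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: combine all the values collected for one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    # The "mixing" between map and reduce: values emitted anywhere in the
    # input are brought together under their key before reduce runs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


counts = run_mapreduce(
    ["big data is here", "data beats algorithms"], map_fn, reduce_fn
)
print(counts)
```

The map and reduce functions never touch disks or other records directly. Only the grouping step in the middle needs to see data from every input, which is the part a distributed framework has to handle across machines.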

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

Comparison with Other Systems

The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset, or at least a good portion of it, is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data, and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.

For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words:

“This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.”

By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 14.

RDBMS

Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?

The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.

If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
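A rough back-of-the-envelope model shows where the trade-off flips. The numbers below are assumptions made for illustration and do not come from the text: a 10 ms seek per updated record, a 100 MB/s transfer rate, a one-terabyte dataset, and a rewrite that reads the data once and writes it once at the same rate. Real B-Tree updates and real Sort/Merge jobs involve more than this.

```python
def seek_update_seconds(num_updates: int, seek_s: float = 0.010) -> float:
    """Cost of updating records in place, paying one seek per record."""
    return num_updates * seek_s


def streaming_rewrite_seconds(dataset_mb: float,
                              transfer_mb_per_s: float = 100.0) -> float:
    """Cost of rebuilding the whole dataset: one full read plus one full write."""
    return 2 * dataset_mb / transfer_mb_per_s


def break_even_updates(dataset_mb: float, seek_s: float = 0.010,
                       transfer_mb_per_s: float = 100.0) -> float:
    """Number of seek-based updates that costs as much as one full rewrite."""
    return streaming_rewrite_seconds(dataset_mb, transfer_mb_per_s) / seek_s


DATASET_MB = 1_000_000  # one terabyte, assumed for illustration

rewrite = streaming_rewrite_seconds(DATASET_MB)
few_updates = seek_update_seconds(1_000)
threshold = break_even_updates(DATASET_MB)

print(f"full rewrite: {rewrite / 3600:.1f} hours")
print(f"1,000 in-place updates: {few_updates:.0f} seconds")
print(f"break-even at about {threshold:,.0f} updates")
```

Under these assumptions a thousand in-place updates take seconds while a full rewrite takes hours, but once the number of updates passes the break-even point the seek-bound approach becomes the slower one.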

Inm