Proposal of Spark SQL on HBase

Yan Zhou
July 12, 2015
Version 2.2

Contents

Modification History
Overview
1. Code Structure
2. Supported HBase Version
3. APIs
4. Command Line Interface
5. Metadata Persistence
6. Interactive shell and DSL
7. RDD and Partitions
8. Design Principles
9. Deployment
10. Configuration
11. HBase Connection Handling
12. HBase Scanner Caching and Partition Caching
13. Row Key Composition
14. Partition Pruning and Predicate Pushdown
15. Utilities
16. SQL Support
16.1 DDL
16.1.1 CREATE TABLE
16.1.2 DROP TABLE
16.1.3 ALTER TABLE
16.2 DML
16.2.1 INSERT
16.2.2 Bulk Loading
16.3 Miscellaneous Commands
17. New files, classes and functionalities
17.1 HBaseCatalog extends Catalog
17.2 HBaseSqlParser extends SqlParser
17.3 HBaseSQLContext extends SQLContext
17.4 HBasePartition extends Partition
17.5 HBaseSQLReaderRDD extends RDD
17.6 Class Diagram
18. DataFrame
19. Java Support
20. Python Support
20.1 Python Shell
20.2 Python Script
21. Coprocessor
21.1 Availability and Loading of Coprocessor
21.2 Coprocessor Sub-Plan
21.3 The coprocessor execution by the Region Server
21.4 HBaseRelation caching and HTablePool
21.5 Phases of Development
22. Custom Filters
22.1 Row skips from Filters on Non-leading Dimension Key
22.2 Filter on any portion of the Row Key
22.3 The "other" Filters
23. Limitations
24. Related Work
25. Development Phases
26. Supported Spark Releases
27. Future Work
28. FAQs
Modification History

Version 2.1:
Coprocessor
Overview

Apache HBase is a distributed key-value store of data on HDFS. It is modeled after Google's BigTable and provides APIs to query the data. The data is organized, partitioned and distributed by its "row keys". Per partition, the data is further physically partitioned by "column families" that specify collections of "columns" of data. The data model is designed for wide and sparse tables, where columns are dynamic and may well be sparse.

Although HBase is a very useful big data store, its access mechanism is primitive: only client-side APIs, Map/Reduce interfaces and interactive shells are available. SQL access to HBase data is available either through Map/Reduce-based mechanisms such as Apache Hive and Impala, or through "native" SQL technologies like Apache Phoenix. While the former is usually cheaper to implement and use, its latency and efficiency often cannot compare favorably with the latter, and it is often suitable only for offline analysis. The latter category, in contrast, often performs better and qualifies more as an online engine; such systems are often built on top of purpose-built execution engines.

Currently Spark supports queries against HBase data through HBase's Map/Reduce interface (i.e., TableInputFormat). Spark SQL supports the use of Hive data, which theoretically should be able to support HBase data access out of the box through HBase's Map/Reduce interface, and therefore falls into the first category of the "SQL on HBase" technologies.

We believe that, as a unified big data processing engine, Spark is in a good position to provide better HBase support.
1. Code Structure

The source files will be in the sql/hbase subdirectory of the Spark source tree, which contains the subdirectories util, dsl, api, catalyst and execution, whose purposes will be explained in later sections.
2. Supported HBase Version

HBase 0.98 will be supported.
3. APIs

Python APIs will be provided.
4. Command Line Interface

An (enhanced) command line interface (CLI) will be provided to support the new DDL/DML commands introduced in this project.
5. Metadata Persistence

Table metadata will be stored in an HBase table named "SPARK_SQL_HBASE_TABLE", with a single column family named "CF". Each SQL table will use a single row in this HBase table, referred to as the "metatable" hereafter, and each column will store the name and type encoding of one column of the SQL table.

Metadata-related code is in the "meta" subdirectory.
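To illustrate the layout described above, the following is a plain-Python sketch of how one SQL table maps to one metatable row. The helper name and the exact cell encoding are hypothetical; the actual on-disk representation is an implementation detail of the project.

```python
def to_metatable_row(sql_table, columns):
    """Map a SQL table definition to a metatable row: the row key is the
    table name, and each SQL column becomes one cell under the single
    column family "CF", whose value is the column's type encoding."""
    row_key = sql_table
    cells = {"CF:%s" % name: type_enc for name, type_enc in columns.items()}
    return row_key, cells

row_key, cells = to_metatable_row("sales", {"id": "INT", "amount": "FLOAT"})
# row_key == "sales"; cells == {"CF:id": "INT", "CF:amount": "FLOAT"}
```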
6. Interactive shell and DSL

The interactive shell is essentially a combination of the Spark shell and the HBase shell. The Spark shell will provide the same functionalities as what Spark and Spark SQL currently do. The functionalities from the HBase shell will add HBase-specific ones to the Spark shell. This will facilitate seasoned HBase users' and admins' transition to the Spark world. Furthermore, Spark/Scala power can be applied to results from the HBase server. For this purpose, a DSL of existing HBase shell commands will be created. The output from this DSL can then be at hand for further processing, which is not possible with the HBase shell by itself. An example is as follows:

spark-hbase> scan 'mytable'
res0: ((String, String), Seq[(String, String)]) =
ROW     COLUMN+CELL
Row1    column=cf1.c1, timestamp=12345678, value=v1
Row2    column=cf2.c2, timestamp=12345679, value=v2
spark-hbase> res0._2.filter(_._1.equals("Row2"))
res1: Seq[(String, String)] = List((Row2, column=cf2.c2, timestamp=12345679, value=v2))

Code for the DSL will be put in the "dsl" subdirectory.

Note that this feature is not to be confused with SchemaRDD's DSL, which is still supported for HBase-based SchemaRDDs.

These functionalities are primarily for user convenience, and could be put in a separate "contrib" source subdirectory instead of the main source tree.

This feature is not supported in the first version.
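Since a Python shell is also planned (section 20), the same post-processing idea carries over. The following plain-Python analogue of the filtering step reuses the data values from the shell example above; the list-of-pairs representation is only illustrative, not the actual API.

```python
# Scan results represented as (row_key, cell_description) pairs,
# mirroring the res0._2 sequence in the shell example above.
res0_cells = [
    ("Row1", "column=cf1.c1, timestamp=12345678, value=v1"),
    ("Row2", "column=cf2.c2, timestamp=12345679, value=v2"),
]

# Keep only the cells belonging to row "Row2", as in the DSL example.
res1 = [cell for cell in res0_cells if cell[0] == "Row2"]
# res1 == [("Row2", "column=cf2.c2, timestamp=12345679, value=v2")]
```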
7. RDD and Partitions

A new type of relation, HBaseRelation, will be introduced to work as a bridge between the Spark SQL physical runtime and the HBase-specific data access mechanism. Its functionalities include management of HBase configuration and connections, and the provision of various mapping and conversion mechanisms and utilities, including the (logical) table schema, key/column mappings, key composition and extraction convenience methods, etc. The connection at the client side will be made by the catalog during its relation lookup process and kept with the HBaseRelation instance. At the executor side, the connections will be made either by each task individually and closed at its finish, or by an external resource pool. HBaseRelation will be used by HBaseSQLReaderRDD, which supports filtered scans and the application of HBase coprocessors, and will be used by a new data source node in the physical plan, HBaseSQLTableScan. It will also be contained in a new ShuffledRDD, HBaseLoadRDD, for the reducer work of the bulk load operator, LoadIntoHBaseTable. Note that HBaseRelation must be serializable so that it is usable by the slaves.

Partitioning of HBase data is through the HBase regions. Specifically, the RDD's getPartitions will return the range partitions as embodied by the HBase regions; the RDD's getPreferredLocations will return the hosts of HBase's region servers if HBase and Spark are collocated on the same set of machines.

HTable's methods getStartKeys and getRegionLocations can be used to fetch the region information from the HBase server.

The co-located execution thus attempted can minimize network traffic. In the future, when Spark supports long-running services, and either the Spark executor or the HBase region server can be "enginized", even the (local) data transfer between the region server and the Spark executor could be avoided.
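The region-to-partition mapping above can be sketched as follows. This is plain Python for illustration only; the actual implementation would be Scala code calling HTable's getStartKeys/getRegionLocations, and the function name here is hypothetical.

```python
from typing import List, Optional, Tuple

def regions_to_partitions(start_keys: List[bytes]) -> List[Tuple[int, bytes, Optional[bytes]]]:
    """Turn sorted HBase region start keys into (index, start, end) range
    partitions, one RDD partition per region; the last region is
    open-ended (end=None), as in getPartitions over regions."""
    partitions = []
    for i, start in enumerate(start_keys):
        end = start_keys[i + 1] if i + 1 < len(start_keys) else None
        partitions.append((i, start, end))
    return partitions

parts = regions_to_partitions([b"", b"row500", b"row900"])
# → [(0, b"", b"row500"), (1, b"row500", b"row900"), (2, b"row900", None)]
```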
8. Design Principles

There are a few design principles we would like to follow to maximize user convenience on Spark as a unified distributed execution engine.

1) It is intended to have the least intrusive code impact on other modules outside of this subproject. This principle includes:

a. Maximize the support of Spark SQL's functionalities. The few remaining areas where differences have to be present, most noticeably DDL, will be implemented in an extension of SQLContext, called HBaseSQLContext, where the differentiation will be implemented through an extended parser, physical optimizer and execution engine specific to HBase-based tables. Another advantage of this compatibility principle is to allow for queries against a mix of HBase tables and others.

b. On the development side, every effort will be made to isolate changes to the hbase subdirectories. If needed, code copy/paste could be applied when access restrictions forbid class inheritance or method overriding, until a resolution can be reached on potential changes in the parent project, Spark SQL.

c. Existing coding patterns/models/paradigms are to be honored to the maximum possible degree.

2) Access to the HBase data source is provided through an implementation of the new Spark SQL 1.3 "foreign data source interface".
9. Deployment

It is required that all Spark slave machines be configured as HBase (and, implicitly, ZooKeeper) clients. It is preferable that the Spark and HBase clusters be co-located on the same set of physical or virtual boxes, but this is not actually a must.

Coprocessor- and custom-filter-related HBase configurations, and the necessary jars containing the corresponding logic from Spark SQL, will be deployed to the HBase cluster.

Specifically, the following four lines will need to be added to hbase-site.xml:

In the hbase-env.sh script, HBASE_CLASSPATH needs to include the Spark jar and the spark-hbase jar of this product.
10. Configuration

HBase configuration will be through the Spark configuration mechanism, with the conventional "spark.sql.hbase" prefix.

Currently, there are four supported configuration flags:

1. spark.sql.hbase.partition.expiration specifies the expiration time (in seconds) of the cached HBase table region information. The default is 600, i.e., 10 minutes.
2. spark.sql.hbase.scanner.fetchsize specifies the HBase scanner fetch size and defaults to 1000.
3. spark.sql.hbase.coprocessor is a Boolean to switch on/off the use of coprocessors.
4. spark.sql.hbase.customfilter is a Boolean to switch on/off the use of custom filters.
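The flag handling described above can be sketched as follows. The flag names are taken from the list above; the helper function is hypothetical, and the defaults for the two Boolean flags are assumptions (the document does not state them).

```python
# Documented defaults for the first two flags; the Boolean defaults
# below are assumed for illustration only.
DEFAULTS = {
    "spark.sql.hbase.partition.expiration": "600",   # seconds (10 minutes)
    "spark.sql.hbase.scanner.fetchsize": "1000",
    "spark.sql.hbase.coprocessor": "true",           # assumed default
    "spark.sql.hbase.customfilter": "true",          # assumed default
}

def effective_conf(user_conf):
    """Overlay user-provided Spark conf entries on the defaults, keeping
    only keys under the conventional "spark.sql.hbase" prefix."""
    conf = dict(DEFAULTS)
    conf.update({k: v for k, v in user_conf.items()
                 if k.startswith("spark.sql.hbase")})
    return conf

conf = effective_conf({"spark.sql.hbase.scanner.fetchsize": "5000"})
# conf["spark.sql.hbase.scanner.fetchsize"] == "5000"
```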