Proposal of Spark SQL on HBase

Yan Zhou
July 12, 2015
Version 2.2

Contents

Modification History
Overview
1. Code Structure
2. Supported HBase Version
3. APIs
4. Command Line Interface
5. Metadata Persistence
6. Interactive shell and DSL
7. RDD and Partitions
8. Design Principles
9. Deployment
10. Configuration
11. HBase Connection Handling
12. HBase Scanner Caching and Partition Caching
13. Row Key Composition
14. Partition Pruning and Predicate Pushdown
15. Utilities
16. SQL Support
16.1 DDL
16.1.1 CREATE TABLE
16.1.2 DROP TABLE
16.1.3 ALTER TABLE
16.2 DML
16.2.1 INSERT
16.2.2 Bulk Loading
16.3 Miscellaneous Commands
17. New files, classes and functionalities
17.1 HBaseCatalog extends Catalog
17.2 HBaseSqlParser extends SqlParser
17.3 HBaseSQLContext extends SQLContext
17.4 HBasePartition extends Partition
17.5 HBaseSQLReaderRDD extends RDD
17.6 Class Diagram
18. DataFrame
19. Java Support
20. Python Support
20.1 Python Shell
20.2 Python Script
21. Coprocessor
21.1 Availability and Loading of Coprocessor
21.2 Coprocessor Sub-Plan
21.3 The coprocessor execution by the Region Server
21.4 HBaseRelation caching and HTablePool
21.5 Phases of Development
22. Custom Filters
22.1 Row skips from Filters on Non-leading Dimension Key
22.2 Filter on any portion of the Row Key
22.3 The "other" Filters
23. Limitations
24. Related Work
25. Development Phases
26. Supported Spark Releases
27. Future Work
28. FAQs
Modification History

Version 2.1:
Coprocessor
Overview

Apache HBase is a distributed key-value store of data on HDFS. It is modeled after Google's BigTable and provides APIs to query the data. The data is organized, partitioned and distributed by its "row keys". Per partition, the data is further physically partitioned by "column families" that specify collections of "columns" of data. The data model is designed for wide and sparse tables, where columns are dynamic and may well be sparse.

Although HBase is a very useful big data store, its access mechanism is primitive: only client-side APIs, Map/Reduce interfaces and interactive shells are available. SQL access to HBase data is available either through Map/Reduce-based mechanisms such as Apache Hive and Impala, or through "native" SQL technologies like Apache Phoenix. While the former is usually cheaper to implement and use, its latency and efficiency often cannot compare favorably with the latter, and it is often suitable only for offline analysis. The latter category, in contrast, often performs better and qualifies more as an online engine; such systems are often built on top of purpose-built execution engines.

Currently Spark supports queries against HBase data through HBase's Map/Reduce interface (i.e., TableInputFormat). Spark SQL supports the use of Hive data, which theoretically should be able to support HBase data access out of the box through HBase's Map/Reduce interface, and therefore falls into the first category of the "SQL on HBase" technologies.

We believe that, as a unified big data processing engine, Spark is in a good position to provide better HBase support.
1. Code Structure

The source files will be in the sql/hbase subdirectory of the Spark source tree, which contains the subdirectories util, dsl, api, catalyst and execution, whose purposes will be explained in later sections.
2. Supported HBase Version

HBase 0.98 will be supported.
3. APIs

Python APIs will be provided.
4. Command Line Interface

An (enhanced) command line interface (CLI) will be provided to support the new DDL/DML commands introduced in this project.
5. Metadata Persistence

Table metadata will be stored in an HBase table named "SPARK_SQL_HBASE_TABLE", with a single column family named "CF". Each SQL table will use a single row in this HBase table, referred to as the "metatable" hereafter, and each column will store the name and type encoding of one column of the SQL table.

Metadata-related code is in the "meta" subdirectory.
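To illustrate the layout described above, the following is a plain-Python sketch of how one SQL table maps to one metatable row. The helper name and the exact cell encoding are hypothetical; the actual on-disk representation is an implementation detail of the project.

```python
def to_metatable_row(sql_table, columns):
    """Map a SQL table definition to a metatable row: the row key is the
    table name, and each SQL column becomes one cell under the single
    column family "CF", whose value is the column's type encoding."""
    row_key = sql_table
    cells = {"CF:%s" % name: type_enc for name, type_enc in columns.items()}
    return row_key, cells

row_key, cells = to_metatable_row("sales", {"id": "INT", "amount": "FLOAT"})
# row_key == "sales"; cells == {"CF:id": "INT", "CF:amount": "FLOAT"}
```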
6. Interactive shell and DSL

The interactive shell is essentially a combination of the Spark shell and the HBase shell. The Spark shell will provide the same functionalities as what Spark and Spark SQL currently do. The functionalities from the HBase shell will add HBase-specific ones to the Spark shell. This will facilitate seasoned HBase users' and admins' transition to the Spark world. Furthermore, Spark/Scala power can be applied to results from the HBase server. For this purpose, a DSL of existing HBase shell commands will be created. The output from this DSL can then be at hand for further processing, which is not possible with the HBase shell by itself. An example is as follows:

spark-hbase> scan 'mytable'
res0: ((String, String), Seq[(String, String)]) =
ROW     COLUMN+CELL
Row1    column=cf1.c1, timestamp=12345678, value=v1
Row2    column=cf2.c2, timestamp=12345679, value=v2
spark-hbase> res0._2.filter(_._1.equals("Row2"))
res1: Seq[(String, String)] = List((Row2, column=cf2.c2, timestamp=12345679, value=v2))

Code for the DSL will be put in the "dsl" subdirectory.

Note that this feature is not to be confused with SchemaRDD's DSL, which is still supported for HBase-based SchemaRDDs.

These functionalities are primarily for user convenience, and could be put in a separate "contrib" source subdirectory instead of the main source tree.

This feature is not supported in the first version.
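Since a Python shell is also planned (section 20), the same post-processing idea carries over. The following plain-Python analogue of the filtering step reuses the data values from the shell example above; the list-of-pairs representation is only illustrative, not the actual API.

```python
# Scan results represented as (row_key, cell_description) pairs,
# mirroring the res0._2 sequence in the shell example above.
res0_cells = [
    ("Row1", "column=cf1.c1, timestamp=12345678, value=v1"),
    ("Row2", "column=cf2.c2, timestamp=12345679, value=v2"),
]

# Keep only the cells belonging to row "Row2", as in the DSL example.
res1 = [cell for cell in res0_cells if cell[0] == "Row2"]
# res1 == [("Row2", "column=cf2.c2, timestamp=12345679, value=v2")]
```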
7. RDD and Partitions

A new type of relation, HBaseRelation, will be introduced to work as a bridge between the Spark SQL physical runtime and the HBase-specific data access mechanism. Its functionalities include management of HBase configuration and connections, and the provision of various mapping and conversion mechanisms and utilities, including the (logical) table schema, key/column mappings, key composition and extraction convenience methods, etc. The connection at the client side will be made by the catalog during its relation lookup process and kept with the HBaseRelation instance. At the executor side, the connections will be made either by each task individually and closed at its finish, or by an external resource pool. HBaseRelation will be used by HBaseSQLReaderRDD, which supports filtered scans and the application of HBase coprocessors, and will be used by a new data source node in the physical plan, HBaseSQLTableScan. It will also be contained in a new ShuffledRDD, HBaseLoadRDD, for the reducer work of the bulk load operator, LoadIntoHBaseTable. Note that HBaseRelation must be serializable so that it is usable by the slaves.

Partitioning of HBase data is through the HBase regions. Specifically, the RDD's getPartitions will return the range partitions as embodied by the HBase regions; the RDD's getPreferredLocations will return the hosts of HBase's region servers if HBase and Spark are collocated on the same set of machines.

HTable's methods getStartKeys and getRegionLocations can be used to fetch the region information from the HBase server.

The co-located execution thus attempted can minimize network traffic. In the future, when Spark supports long-running services, and either the Spark executor or the HBase region server can be "enginized", even the (local) data transfer between the region server and the Spark executor could be avoided.
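The region-to-partition mapping above can be sketched as follows. This is plain Python for illustration only; the actual implementation would be Scala code calling HTable's getStartKeys/getRegionLocations, and the function name here is hypothetical.

```python
from typing import List, Optional, Tuple

def regions_to_partitions(start_keys: List[bytes]) -> List[Tuple[int, bytes, Optional[bytes]]]:
    """Turn sorted HBase region start keys into (index, start, end) range
    partitions, one RDD partition per region; the last region is
    open-ended (end=None), as in getPartitions over regions."""
    partitions = []
    for i, start in enumerate(start_keys):
        end = start_keys[i + 1] if i + 1 < len(start_keys) else None
        partitions.append((i, start, end))
    return partitions

parts = regions_to_partitions([b"", b"row500", b"row900"])
# → [(0, b"", b"row500"), (1, b"row500", b"row900"), (2, b"row900", None)]
```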
8. Design Principles

There are a few design principles we would like to follow to maximize user convenience on Spark as a unified distributed execution engine.

1) It is intended to have the least intrusive code impact on other modules outside of this subproject. This principle includes:

a. Maximize the support of Spark SQL's functionalities. The few remaining areas where differences have to be present, most noticeably DDL, will be implemented in an extension of SQLContext, called HBaseSQLContext, where the differentiation will be implemented through an extended parser, physical optimizer and execution engine specific to HBase-based tables. Another advantage of this compatibility principle is to allow for queries against a mix of HBase tables and others.

b. On the development side, every effort will be made to isolate changes to the hbase subdirectories. If needed, code copy/paste could be applied when access restrictions forbid class inheritance or method overriding, until a resolution can be reached on potential changes in the parent project, Spark SQL.

c. Existing coding patterns/models/paradigms are to be honored to the maximum possible degree.

2) Access to the HBase data source is provided through an implementation of the new Spark SQL 1.3 "foreign data source interface".
9. Deployment

It is required that all Spark slave machines be configured as HBase (and, implicitly, ZooKeeper) clients. It is preferable that the Spark and HBase clusters be co-located on the same set of physical or virtual boxes, but this is not actually a must.

Coprocessor- and custom-filter-related HBase configurations, and the necessary jars containing the corresponding logic from Spark SQL, will be deployed to the HBase cluster.

Specifically, the following four lines will need to be added to hbase-site.xml:

In the hbase-env.sh script, HBASE_CLASSPATH needs to include the Spark jar and the spark-hbase jar of this product.
10. Configuration

HBase configuration will be through the Spark configuration mechanism, with the conventional "spark.sql.hbase" prefix.

Currently, there are four supported configuration flags:

1. spark.sql.hbase.partition.expiration specifies the expiration time (in seconds) of the cached HBase table region information. The default is 600, i.e., 10 minutes.
2. spark.sql.hbase.scanner.fetchsize specifies the HBase scanner fetch size and defaults to 1000.
3. spark.sql.hbase.coprocessor is a Boolean to switch on/off the use of coprocessors.
4. spark.sql.hbase.customfilter is a Boolean to switch on/off the use of custom filters.
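The flag handling described above can be sketched as follows. The flag names are taken from the list above; the helper function is hypothetical, and the defaults for the two Boolean flags are assumptions (the document does not state them).

```python
# Documented defaults for the first two flags; the Boolean defaults
# below are assumed for illustration only.
DEFAULTS = {
    "spark.sql.hbase.partition.expiration": "600",   # seconds (10 minutes)
    "spark.sql.hbase.scanner.fetchsize": "1000",
    "spark.sql.hbase.coprocessor": "true",           # assumed default
    "spark.sql.hbase.customfilter": "true",          # assumed default
}

def effective_conf(user_conf):
    """Overlay user-provided Spark conf entries on the defaults, keeping
    only keys under the conventional "spark.sql.hbase" prefix."""
    conf = dict(DEFAULTS)
    conf.update({k: v for k, v in user_conf.items()
                 if k.startswith("spark.sql.hbase")})
    return conf

conf = effective_conf({"spark.sql.hbase.scanner.fetchsize": "5000"})
# conf["spark.sql.hbase.scanner.fetchsize"] == "5000"
```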