Spark Integration with Kafka 0.10.0: New Features (Part 2)

Picking up where Spark Integration with Kafka 0.10.0: New Features (Part 1) left off, we start from the canonical usage example:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))
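Because enable.auto.commit is set to false above, offset handling is left to the application. For reference, this is the standard pattern from the Spark Streaming + Kafka 0.10 documentation for inspecting and committing each batch's offset ranges (it assumes the stream defined above):

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // RDDs produced by the direct stream expose their Kafka offset ranges
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val o = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  // After outputs have completed, offsets can be committed back to Kafka
  // through CanCommitOffsets, which DirectKafkaInputDStream (seen below) mixes in
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}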
Having covered the location strategies and consumer strategies, let's look at the concrete implementation of org.apache.spark.streaming.kafka010.KafkaUtils#createDirectStream:
@Experimental
def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V]
  ): InputDStream[ConsumerRecord[K, V]] = {
  val ppc = new DefaultPerPartitionConfig(ssc.sparkContext.getConf)
  createDirectStream[K, V](ssc, locationStrategy, consumerStrategy, ppc)
}

The return type is InputDStream[ConsumerRecord[K, V]]. Here is the ConsumerRecord type:

/**
 * A key/value pair to be received from Kafka. This consists of a topic name and a partition number, from which the
 * record is being received, and an offset that points to the record in a Kafka partition.
 */
public final class ConsumerRecord<K, V> {
    public static final long NO_TIMESTAMP = Record.NO_TIMESTAMP;
    public static final int NULL_SIZE = -1;
    public static final int NULL_CHECKSUM = -1;

    private final String topic;
    private final int partition;
    private final long offset;
    private final long timestamp;
    private final TimestampType timestampType;
    private final long checksum;
    private final int serializedKeySize;
    private final int serializedValueSize;
    private final K key;
    private final V value;

    // ... remainder omitted
}

The details of InputDStream are omitted here. From its class hierarchy (shown as a diagram in the original post), the concrete type that createDirectStream returns is DirectKafkaInputDStream.

Inside createDirectStream a DefaultPerPartitionConfig is created. DefaultPerPartitionConfig simply fixes the maximum rate at which messages are fetched from each partition, configured through the spark.streaming.kafka.maxRatePerPartition parameter set on SparkConf. Its source:

package org.apache.spark.streaming.kafka010

import org.apache.kafka.common.TopicPartition
import org.apache.spark.SparkConf
import org.apache.spark.annotation.Experimental

/**
 * :: Experimental ::
 * Interface for user-supplied configurations that can't otherwise be set via Spark properties,
 * because they need tweaking on a per-partition basis.
 */
@Experimental
abstract class PerPartitionConfig extends Serializable {
  /**
   * Maximum rate (number of records per second) at which data will be read
   * from each Kafka partition.
   */
  def maxRatePerPartition(topicPartition: TopicPartition): Long
}

/**
 * Default per-partition configuration
 */
private class DefaultPerPartitionConfig(conf: SparkConf)
  extends PerPartitionConfig {
  val maxRate = conf.getLong("spark.streaming.kafka.maxRatePerPartition", 0)

  // Maximum rate (records per second) at which data is read from each Kafka partition
  def maxRatePerPartition(topicPartition: TopicPartition): Long = maxRate
}
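Because PerPartitionConfig is a public (if experimental) abstract class, and because the overload of createDirectStream shown next accepts one, a caller can in principle supply its own per-partition rates instead of the single global maximum. A minimal sketch, with a hypothetical topic name and made-up rate numbers:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.PerPartitionConfig

// Hypothetical policy: throttle partitions of "hot_topic" to 500 records/sec
// and allow everything else up to 5000 records/sec
class PerTopicRateConfig extends PerPartitionConfig {
  override def maxRatePerPartition(tp: TopicPartition): Long =
    if (tp.topic == "hot_topic") 500L else 5000L
}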
After the PerPartitionConfig is created, the overloaded createDirectStream is invoked:

def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V],
    perPartitionConfig: PerPartitionConfig
  ): InputDStream[ConsumerRecord[K, V]] = {
  new DirectKafkaInputDStream[K, V](ssc, locationStrategy, consumerStrategy, perPartitionConfig)
}

Next, let's focus on the constructor of DirectKafkaInputDStream (note: in Scala, the entire class body, from the opening { to the closing }, belongs to the primary constructor):

package org.apache.spark.streaming.kafka010

import java.{util => ju}
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicReference

import scala.annotation.tailrec
import scala.collection.JavaConverters._
import scala.collection.mutable

import org.apache.kafka.clients.consumer._
import org.apache.kafka.common.{PartitionInfo, TopicPartition}

import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream._
import org.apache.spark.streaming.scheduler.{RateController, StreamInputInfo}
import org.apache.spark.streaming.scheduler.rate.RateEstimator

/**
 * A DStream where each given Kafka topic/partition corresponds to an RDD partition.
 * The spark configuration spark.streaming.kafka.maxRatePerPartition gives the maximum number
 * of messages per second that each '''partition''' will accept.
 *
 * @param locationStrategy In most cases, pass in [[PreferConsistent]],
 *   see [[LocationStrategy]] for more details.
 * @param executorKafkaParams Kafka
 *   <a href="http://kafka.apache.org/documentation.html#newconsumerconfigs">configuration parameters</a>.
 *   Requires "bootstrap.servers" to be set with Kafka broker(s),
 *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
 * @param consumerStrategy In most cases, pass in [[Subscribe]],
 *   see [[ConsumerStrategy]] for more details
 * @tparam K type of Kafka message key
 * @tparam V type of Kafka message value
 */
private[spark] class DirectKafkaInputDStream[K, V](
    _ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V],
    ppc: PerPartitionConfig
  ) extends InputDStream[ConsumerRecord[K, V]](_ssc) with Logging with CanCommitOffsets {

  val executorKafkaParams = {
    val ekp = new ju.HashMap[String, Object](consumerStrategy.executorKafkaParams)
    // Adjust the Kafka parameters for use on executors, to avoid problems there
    KafkaUtils.fixKafkaParams(ekp)
    ekp
  }

  // Holds the current offset for each TopicPartition
  protected var currentOffsets = Map[TopicPartition, Long]()

  @transient private var kc: Consumer[K, V] = null

  // Lazily creates the driver-side consumer via the consumer strategy
  def consumer(): Consumer[K, V] = this.synchronized {
    if (null == kc) {
      kc = consumerStrategy.onStart(currentOffsets.mapValues(l => new java.lang.Long(l)).asJava)
    }
    kc
  }

  override def persist(newLevel: StorageLevel): DStream[ConsumerRecord[K, V]] = {
    logError("Kafka ConsumerRecord is not serializable. " +
      "Use .map to extract fields before calling .persist or .window")
    super.persist(newLevel)
  }

  protected def getBrokers = {
    val c = consumer
    val result = new ju.HashMap[TopicPartition, String]()
    val hosts = new ju.HashMap[TopicPartition, String]()
    // assignment() returns the set of TopicPartitions assigned to this consumer
    val assignments = c.assignment().iterator()
    while (assignments.hasNext()) {
      val tp: TopicPartition = assignments.next()
      // If we don't know the host of this TopicPartition yet, fetch the
      // partition metadata for its topic from the Kafka cluster
      if (null == hosts.get(tp)) {
        // partitionsFor returns metadata for the given topic's partitions,
        // issuing an RPC if it is not cached locally
        val infos = c.partitionsFor(tp.topic).iterator()
        while (infos.hasNext()) {
          val i = infos.next()
          // TopicPartition overrides equals, so it works as a hash map key
          hosts.put(new TopicPartition(i.topic(), i.partition()), i.leader.host())
        }
      }
      // At this point we have each partition and the host of its leader
      result.put(tp, hosts.get(tp))
    }
    result
  }

  protected def getPreferredHosts: ju.Map[TopicPartition, String] = {
    locationStrategy match {
      case PreferBrokers => getBrokers
      case PreferConsistent => ju.Collections.emptyMap[TopicPartition, String]()
      case PreferFixed(hostMap) => hostMap
    }
  }
  // ... (listing continues below)
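Pausing the listing for a moment: getPreferredHosts is where the location strategy from Part 1 takes effect. PreferBrokers resolves each partition's leader via getBrokers, PreferConsistent lets Spark spread partitions evenly across executors, and PreferFixed uses a caller-supplied mapping. A minimal sketch of building such a mapping (the host names are hypothetical):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// Pin topicA's partitions to specific executor hosts (hypothetical hosts)
val fixedHosts = Map(
  new TopicPartition("topicA", 0) -> "host1.example.com",
  new TopicPartition("topicA", 1) -> "host2.example.com"
)
val locationStrategy = LocationStrategies.PreferFixed(fixedHosts)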
Back to the listing, now at the rate-control members of the class:

  // Keep this consistent with how other streams are named (e.g. "Flume polling stream [2]")
  private[streaming] override def name: String = s"Kafka 0.10 direct stream [$id]"

  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData

  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new DirectKafkaRateController(id,
        RateEstimator.create(ssc.conf, context.graph.batchDuration)))
    } else {
      None
    }
  }

  protected[streaming] def maxMessagesPerPartition(
      offsets: Map[TopicPartition, Long]): Option[Map[TopicPartition, Long]] = {
    val estimatedRateLimit = rateController.map(_.getLatestRate())

    // calculate a per-partition rate limit based on current lag
    val effectiveRateLimitPerPartition = estimatedRateLimit.filter(_ > 0) match {
      case Some(rate) =>
        // Lag per partition: how far the latest available offset is ahead
        // of the offset consumed so far
        val lagPerPartition = offsets.map { case (tp, offset) =>
          tp -> Math.max(offset - currentOffsets(tp), 0)
        }
        val totalLag = lagPerPartition.values.sum

        // Split the estimated rate across partitions in proportion to lag,
        // capped by the per-partition maximum if one is configured
        lagPerPartition.map { case (tp, lag) =>
          val maxRateLimitPerPartition = ppc.maxRatePerPartition(tp)
          val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
          tp -> (if (maxRateLimitPerPartition > 0) {
            Math.min(backpressureRate, maxRateLimitPerPartition)
          } else backpressureRate)
        }
      case None =>
        // With no rate estimate yet, fall back to the configured per-partition maximum
        offsets.map { case (tp, offset) => tp -> ppc.maxRatePerPartition(tp) }
    }
    // (the excerpt ends here; the remainder of the method is omitted)
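To make the proportional split in maxMessagesPerPartition concrete, here is a self-contained sketch of the arithmetic with made-up numbers (plain Scala, no Spark API involved):

// Suppose the RateController estimates 1000 records/sec for the whole stream
val rate = 1000L

// Hypothetical lag (latest offset minus consumed offset) for two partitions
val lagPerPartition = Map("topicA-0" -> 300L, "topicA-1" -> 100L)
val totalLag = lagPerPartition.values.sum // 400

// Each partition receives a share of the rate proportional to its lag
val perPartitionRate = lagPerPartition.map { case (tp, lag) =>
  tp -> Math.round(lag / totalLag.toFloat * rate)
}
// perPartitionRate == Map(topicA-0 -> 750, topicA-1 -> 250)

Backpressure itself is switched on with spark.streaming.backpressure.enabled=true, and spark.streaming.kafka.maxRatePerPartition, if positive, caps the resulting per-partition rate.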