A First Look at Hadoop's Old and New APIs, plus Chained MapReduce
2013-11-16 17:04:34 | Category: Big Data
1. Background
While working on my graduation project I hit a problem that called for a chained MapReduce job, so I looked into how ChainMapper and ChainReducer are used, and took the opportunity to sort out the differences between the old and new APIs.
First, a point worth stating up front: starting with release 0.20.0, Hadoop's API changed, but the old API is still retained. Its package is:
org.apache.hadoop.mapred
The new API's package is:
org.apache.hadoop.mapreduce
Detailed documentation for every released Hadoop version can be found here.
2. Differences Between the Old and New APIs
Taking the map/reduce definitions as an example.
2.1 map
Class declaration
// New API
public static class MyMapper extends Mapper<Object, Text, Text, Text>
// Old API
public static class MyMapper extends MapReduceBase implements Mapper<Object, Text, Text, Text>
Imports involved
// New API
import org.apache.hadoop.mapreduce.Mapper;
// Old API
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
Method signature
// New API
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {}
// Old API
public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {}
Imports involved
// New API
import java.io.IOException;
// Old API
import java.io.IOException;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
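Putting these pieces together, here is a minimal sketch of a complete new-API mapper. The Object/Text/Text/Text types match the fragments above, but the pass-through body and the placeholder key are purely illustrative, not from the original post:

// Minimal new-API mapper sketch; the emitted key is an illustrative placeholder.
public static class MyMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // In the new API, output goes through the Context object, which
        // replaces both OutputCollector and Reporter from the old API.
        context.write(new Text("line"), value);
    }
}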
2.2 reduce
Class declaration
// New API
public static class MyReducer extends Reducer<Text, Text, Text, Text>
// Old API
public static class MyReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text>
Imports involved
// New API
import org.apache.hadoop.mapreduce.Reducer;
// Old API
import org.apache.hadoop.mapred.Reducer;
Method signature
// New API
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {}
// Old API
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {}
Imports involved
// New API
Same as for map.
// Old API
One import more than for map:
import java.util.Iterator;
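Likewise, a minimal sketch of a complete new-API reducer; the value-concatenation body is illustrative only, not from the original post:

public static class MyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // New API: values arrive as an Iterable (the old API handed you an
        // Iterator), so the enhanced for-loop works directly.
        StringBuilder joined = new StringBuilder();
        for (Text val : values) {
            joined.append(val.toString()).append(' ');
        }
        context.write(key, new Text(joined.toString().trim()));
    }
}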
2.3 The main method
New API
Configuration conf = new Configuration();
Job job = new Job(conf, "mapreduce");
job.setJarByClass(MyMapReduce.class);
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(path1));
FileOutputFormat.setOutputPath(job, new Path(path2));
job.waitForCompletion(true);
Old API
JobConf conf = new JobConf(MyMapReduce.class);
conf.setJobName("mapreduce");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(MyMapper.class);
conf.setCombinerClass(MyReducer.class);
conf.setReducerClass(MyReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(path1));
FileOutputFormat.setOutputPath(conf, new Path(path2));
JobClient.runJob(conf);
In other words, the new API replaces the old JobConf with Job, and a number of methods were redefined along the way; the package comparison below shows the specifics.
Imports involved
New API
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
Old API
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
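One caveat on the new-API driver above: on Hadoop 2.x and later, the Job(Configuration, String) constructor is deprecated in favor of a static factory method, so a version-safe driver would start like the short sketch below (reusing the class names from the example above):

Configuration conf = new Configuration();
// Preferred over "new Job(conf, ...)" on Hadoop 2.x and later.
Job job = Job.getInstance(conf, "mapreduce");
job.setJarByClass(MyMapReduce.class);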
3. Chained MapReduce
ChainMapper and ChainReducer live in the old-API package. A chained MapReduce job allows only one reducer, but it may have multiple mappers, both before and after the reduce. The chain I implemented looks like this:
map -> reduce -> map
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class GraphPartition {

    // NOTE: the original post's formatting stripped the generic type
    // arguments, loop bounds, and the spaces inside string literals from
    // this listing; they are reconstructed below (space-delimited fields
    // are assumed) so that the code compiles and stays internally consistent.

    // First mapper of the chain: parses each numeric input line into a
    // head ("vertex label") and a tail list, and emits <tail, src_label>
    // pairs plus the vertex's own <vertex, label> pair.
    public static class LabelCompareMapper
            extends MapReduceBase implements Mapper<Object, Text, Text, Text> {

        public void map(Object key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            if (line.substring(0, 1).matches("[0-9]{1}")) {
                String[] values = line.split("\t");
                String[] heads = values[0].split(" ");
                if (values[1].contains("_")) {
                    values[1] = values[1].replace("_", " ");
                    String[] tails = values[1].split(" ");
                    String src_ver = heads[0];
                    String label = heads[1];
                    for (int i = 0; i < tails.length; i++) {
                        output.collect(new Text(tails[i]),
                                new Text(src_ver + "_" + label)); // <2 1_1>
                    }
                }
                output.collect(new Text(heads[0]), new Text(heads[1])); // <2 2>
            }
        }
    }

    // The single reducer of the chain: for each vertex, separates its own
    // label (no "_") from the incoming "source_label" tokens and joins them.
    public static class LabelCompareReducer
            extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

        private Text result = new Text();

        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String head = "";
            String tail = "";
            while (values.hasNext()) {
                String val = values.next().toString();
                if (val.contains("_")) {
                    tail = tail + val + " ";
                } else {
                    head = val;
                }
            }
            if (tail.contains("_")) {
                result.set(head + " " + tail);
                output.collect(key, result); // <2 2 1_1>
            }
        }
    }

    // Second mapper of the chain, run after the reducer: compares each
    // neighbour's label with the vertex's own label and emits an edge
    // record keyed by "neighbour vertex label".
    public static class GraphPartitionMapper
            extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

        public void map(Text key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String ver = key.toString();
            String tail = value.toString();
            String[] sps = tail.split(" ");
            String ver_lab = sps[0];
            for (int i = 0; i < sps.length; i++) {
                if (sps[i].length() >= 1) {
                    if (sps[i].contains("_")) {
                        String[] blocks = sps[i].split("_");
                        // The original compared strings with "=="; equals() is the fix.
                        if (blocks[1].equals(ver_lab)) {
                            output.collect(new Text(blocks[0] + " " + ver + " " + blocks[1]),
                                    new Text());
                        } else {
                            output.collect(new Text(blocks[0] + " " + ver + " " + ver_lab),
                                    new Text());
                        }
                    } else {
                        ver_lab = sps[i];
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String path1 = "lbp/input";
        String path2 = "lbp/out1";

        JobConf job = new JobConf(GraphPartition.class);
        job.setJobName("ChainJob");
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);

        // Mapper 1 of the chain.
        JobConf labelCompareMapperConf = new JobConf(false);
        ChainMapper.addMapper(job, LabelCompareMapper.class,
                Object.class, Text.class, Text.class, Text.class,
                true, labelCompareMapperConf);

        // The single reducer allowed in a chain.
        JobConf labelCompareReducerConf = new JobConf(false);
        ChainReducer.setReducer(job, LabelCompareReducer.class,
                Text.class, Text.class, Text.class, Text.class,
                true, labelCompareReducerConf);

        // Mapper 2, appended after the reducer via ChainReducer.addMapper.
        JobConf graphPartitionMapperConf = new JobConf(false);
        ChainReducer.addMapper(job, GraphPartitionMapper.class,
                Text.class, Text.class, Text.class, Text.class,
                true, graphPartitionMapperConf);

        job.setJarByClass(GraphPartition.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(path1));
        FileOutputFormat.setOutputPath(job, new Path(path2));
        JobClient.runJob(job);
    }
}

The code above has been tested; I have always believed that practice is the sole criterion for testing truth. I say this because several blog posts about chained MapReduce that I found online never state which Hadoop version they target, and their code is full of holes, so I felt it was worth writing this post in the hope that it gives later readers something to build on.
One small gripe: NetEase Blog's code editor really is terrible... I hope it doesn't hurt readability.
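A postscript for readers on Hadoop 2.x or later: new-API versions of ChainMapper and ChainReducer exist under org.apache.hadoop.mapreduce.lib.chain. The driver wiring there looks roughly like the sketch below; it is an assumption-laden sketch that reuses the class names above and presumes they have been ported to the new-API Mapper/Reducer base classes. Note that the byValue flag is gone and plain Configuration objects replace the per-stage JobConf:

// New-API chain wiring sketch (Hadoop 2.x+); FileInputFormat and
// FileOutputFormat here are the new-API ones (mapreduce.lib.*).
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "ChainJob");
job.setJarByClass(GraphPartition.class);
ChainMapper.addMapper(job, LabelCompareMapper.class,
        Object.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.setReducer(job, LabelCompareReducer.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, GraphPartitionMapper.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));
FileInputFormat.addInputPath(job, new Path("lbp/input"));
FileOutputFormat.setOutputPath(job, new Path("lbp/out1"));
job.waitForCompletion(true);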