HadoopArchives Explained

Document ID: 17420161
Uploaded: 2022-12-01
Format: DOCX
Pages: 35
Size: 48.92 KB


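See the analysis in the next section for details.

2.1.2 Program flow (class diagram analysis)

Execution starts in the main function of HadoopArchives, which uses the ToolRunner class to invoke the run method of HadoopArchives.

The run method validates and extracts the command-line arguments, converts them into input paths and an output directory, and passes these to the archive method.

Let us look at the run method in detail: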

/** the main driver for creating the archives
 *  it takes at least two command line parameters. The src and the
 *  dest. It does an lsr on the source paths.
 *  The mapper creates the archives and the reducer creates
 *  the archive index.
 */
public int run(String[] args) throws Exception {
  try {
    List<Path> srcPaths = new ArrayList<Path>();
    Path destPath = null;
    // check we were supposed to archive or
    // unarchive
    String archiveName = null;
    if (args.length < 4) {
      System.out.println(usage);
      throw new IOException("Invalid usage.");
    }
    if (!"-archiveName".equals(args[0])) {
      System.out.println(usage);
      throw new IOException("Archive Name not specified.");
    }
    archiveName = args[1];
    if (!checkValidName(archiveName)) {
      System.out.println(usage);
      throw new IOException("Invalid name for archives. " + archiveName);
    }
    // the last argument is the destination, everything before it is a source
    for (int i = 2; i < args.length; i++) {
      if (i == (args.length - 1)) {
        destPath = new Path(args[i]);
      } else {
        srcPaths.add(new Path(args[i]));
      }
    }
    if (srcPaths.size() == 0) {
      System.out.println(usage);
      throw new IOException("Invalid Usage: No input sources specified.");
    }
    // do a glob on the srcPaths and then pass it on
    List<Path> globPaths = new ArrayList<Path>();
    for (Path p : srcPaths) {
      FileSystem fs = p.getFileSystem(getConf());
      FileStatus[] statuses = fs.globStatus(p);
      for (FileStatus status : statuses) {
        globPaths.add(fs.makeQualified(status.getPath()));
      }
    }
    archive(globPaths, archiveName, destPath);
  } catch (IOException ie) {
    System.err.println(ie.getLocalizedMessage());
    return -1;
  }
  return 0;
}

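The fs.makeQualified call above (highlighted in red in the original document) completes each input path, which comes from the command-line arguments, by adding the scheme and authority so that it becomes a fully qualified path.

The main job of this method is to extract three variables, globPaths, archiveName and destPath, and pass them as arguments to the archive method. Their meanings are straightforward:

globPaths: all of the input paths.
archiveName: the file name of the har file to generate.
destPath: the output directory, that is, the directory that will hold the file named archiveName.

Some error checking is done while these three values are extracted; see the code for details.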
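The argument handling described above can be sketched without any Hadoop dependency. This is only an illustration: the class ArchiveArgsSketch, the ParsedArgs holder and the parse method are hypothetical names, plain strings stand in for Path objects, and the ".har" suffix check is an assumption standing in for checkValidName, whose body is not shown in this text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of the checks done in run(); not Hadoop code.
class ArchiveArgsSketch {

    // Simple holder for the three values that run() hands to archive().
    static final class ParsedArgs {
        final String archiveName;
        final List<String> srcPaths;
        final String destPath;

        ParsedArgs(String archiveName, List<String> srcPaths, String destPath) {
            this.archiveName = archiveName;
            this.srcPaths = srcPaths;
            this.destPath = destPath;
        }
    }

    static ParsedArgs parse(String[] args) {
        // At least: -archiveName <name> <src> <dest>
        if (args.length < 4) {
            throw new IllegalArgumentException("Invalid usage.");
        }
        if (!"-archiveName".equals(args[0])) {
            throw new IllegalArgumentException("Archive Name not specified.");
        }
        String archiveName = args[1];
        // Assumption: stands in for checkValidName, which is not shown in the text.
        if (!archiveName.endsWith(".har")) {
            throw new IllegalArgumentException("Invalid name for archives. " + archiveName);
        }
        // Everything between the name and the last argument is a source path.
        List<String> srcPaths = new ArrayList<>();
        for (int i = 2; i < args.length - 1; i++) {
            srcPaths.add(args[i]);
        }
        // The last argument is the destination directory.
        String destPath = args[args.length - 1];
        return new ParsedArgs(archiveName, srcPaths, destPath);
    }
}
```

For example, parse(new String[]{"-archiveName", "foo.har", "/a", "/b", "/out"}) yields the sources /a and /b and the destination /out, while a call with fewer than four arguments is rejected.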

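Next comes the archive method, which does three things. First, it configures the job parameters. Second, it generates a SequenceFile from the input files; this file stores the directory information for all of the input files and serves as the input to the Map phase. Finally, it launches the job.

Here is the source of the archive method: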

/** archive the given source paths into
 * the dest
 * @param srcPaths the src paths to be archived
 * @param dest the dest dir that will contain the archive
 */
public void archive(List<Path> srcPaths, String archiveName, Path dest)
    throws IOException {
  checkPaths(conf, srcPaths);
  int numFiles = 0;
  long totalSize = 0;
  conf.set(DST_HAR_LABEL, archiveName);
  Path outputPath = new Path(dest, archiveName);
  FileOutputFormat.setOutputPath(conf, outputPath);
  FileSystem outFs = outputPath.getFileSystem(conf);
  if (outFs.exists(outputPath) || outFs.isFile(dest)) {
    throw new IOException("Invalid Output.");
  }
  conf.set(DST_DIR_LABEL, outputPath.toString());
  final String randomId = DistCp.getRandomId();
  Path jobDirectory = new Path(new JobClient(conf).getSystemDir(),
      NAME + "_" + randomId);
  conf.set(JOB_DIR_LABEL, jobDirectory.toString());
  // get a tmp directory for input splits
  FileSystem jobfs = jobDirectory.getFileSystem(conf);
  jobfs.mkdirs(jobDirectory);
  Path srcFiles = new Path(jobDirectory, "_har_src_files");
  conf.set(SRC_LIST_LABEL, srcFiles.toString());
  SequenceFile.Writer srcWriter = SequenceFile.createWriter(jobfs, conf,
      srcFiles, LongWritable.class, Text.class,
      SequenceFile.CompressionType.NONE);
  // get the list of files
  // create single list of files and dirs
  try {
    // write the top level dirs in first
    writeTopLevelDirs(srcWriter, srcPaths);
    srcWriter.sync();
    // these are the input paths passed
    // from the command line
    // we do a recursive ls on these paths
    // and then write them to the input file
    // one at a time
    for (Path src : srcPaths) {
      FileSystem fs = src.getFileSystem(conf);
      ArrayList<FileStatus> allFiles = new ArrayList<FileStatus>();
      recursivels(fs, src, allFiles);
      for (FileStatus stat : allFiles) {
        String toWrite = "";
        long len = stat.isDir() ? 0 : stat.getLen();
        if (stat.isDir()) {
          toWrite = "" + fs.makeQualified(stat.getPath()) + " dir ";
          // get the children
          FileStatus[] list = fs.listStatus(stat.getPath());
          StringBuffer sbuff = new StringBuffer();
          sbuff.append(toWrite);
          for (FileStatus stats : list) {
            sbuff.append(stats.getPath().getName() + " ");
          }
          toWrite = sbuff.toString();
        } else {
          toWrite += fs.makeQualified(stat.getPath()) + " file ";
        }
        srcWriter.append(new LongWritable(len), new Text(toWrite));
        srcWriter.sync();
        numFiles++;
        totalSize += len;
      }
    }
  } finally {
    srcWriter.close();
  }
  // increase the replication of src files
  jobfs.setReplication(srcFiles, (short) 10);
  conf.setInt(SRC_COUNT_LABEL, numFiles);
  conf.setLong(TOTAL_SIZE_LABEL, totalSize);
  int numMaps = (int) (totalSize / partSize);
  // run at least one map.
  conf.setNumMapTasks(numMaps == 0 ? 1 : numMaps);
  conf.setNumReduceTasks(1);
  conf.setInputFormat(HArchiveInputFormat.class);
  conf.setOutputFormat(NullOutputFormat.class);
  conf.setMapperClass(HArchivesMapper.class);
  conf.setReducerClass(HArchivesReducer.class);
  conf.setMapOutputKeyClass(IntWritable.class);
  conf.setMapOutputValueClass(Text.class);
  conf.set("hadoop.job.history.user.location", "none");
  FileInputFormat.addInputPath(conf, jobDirectory);
  // make sure no speculative execution is done
  conf.setSpeculativeExecution(false);
  JobClient.runJob(conf);
  // delete the tmp job directory
  try {
    jobfs.delete(jobDirectory, true);
  } catch (IOException ie) {
    LOG.info("Unable to clean tmp directory " + jobDirectory);
  }
}

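The call highlighted in red in the original document,

fs.makeQualified(stat.getPath())

completes the path into a fully qualified path.

This is probably needed because the subdirectories and the files inside them are obtained through the getPath method, which yields only the directory path; the Path and URI classes are worth reading for the details.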
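To make "completing a path" concrete, here is a minimal sketch that uses java.net.URI instead of Hadoop's Path and FileSystem classes. It is not how makeQualified is implemented; it only shows the effect of filling in a missing scheme and authority from a default file system URI. The class QualifySketch, the method qualify and the address hdfs://namenode:8020 are hypothetical, and relative paths (which Hadoop resolves against the working directory) are not handled.

```java
import java.net.URI;

// Hypothetical illustration of path qualification; not the Hadoop implementation.
class QualifySketch {

    // Returns the path with scheme and authority taken from defaultFs
    // when the path itself does not carry a scheme.
    static URI qualify(URI defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p; // already has a scheme, leave it untouched (simplification)
        }
        // Resolving an absolute path against the base keeps the base's
        // scheme and authority and replaces only the path component.
        return defaultFs.resolve(p);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020/");
        System.out.println(qualify(defaultFs, "/user/a"));
        System.out.println(qualify(defaultFs, "file:///tmp/x"));
    }
}
```

With the assumed default file system, /user/a becomes hdfs://namenode:8020/user/a, while file:///tmp/x is returned unchanged because it already names its scheme.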

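The method first checks the input paths and then sets a number of parameters. These fall into two groups: settings stored in conf, and the input and output paths of the MapReduce job.

Some of the conf settings are the same as in any ordinary MapReduce job and are not covered here. A few others are more important and deserve a mention: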

conf.set(DST_HAR_LABEL, archiveName);
conf.set(DST_DIR_LABEL, outputPath.toString());
conf.set(JOB_DIR_LABEL, jobDirectory.toString());
conf.setLong(TOTAL_SIZE_LABEL, totalSize);

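The meaning of these parameters is fairly clear when read alongside the code.

They are set because later parts of the program use them, for example when the input splits are computed, which is discussed further below.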

FileInputFormat.addInputPath(conf, jobDirectory);

……

The MapReduce job configuration itself is skipped here.

Next, let us see how the input file for the Map phase is generated. It is stored at srcFiles:

Path srcFiles = new Path(jobDirectory, "_har_src_files");

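Its content has the following format:

file size (0 for a directory) + path name + dir or file + [children] (only for a directory that has children; each child is listed by its name within the current directory, that is, relative to that directory rather than as a full path)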
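The record layout just described can be sketched as plain string building. In this sketch the key is the size and the value is the text part; the class SrcListRecordSketch and its methods are hypothetical names, the single-space separators are an assumption, and the sample paths are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how one record of the source list is laid out.
class SrcListRecordSketch {

    // Key of a record: 0 for a directory, the file length otherwise.
    static long key(boolean isDir, long len) {
        return isDir ? 0L : len;
    }

    // Value for a directory: qualified path, the marker "dir",
    // then the bare names of its children (not full paths).
    static String dirValue(String qualifiedPath, List<String> childNames) {
        StringBuilder sb = new StringBuilder(qualifiedPath).append(" dir ");
        for (String child : childNames) {
            sb.append(child).append(" ");
        }
        return sb.toString();
    }

    // Value for a file: qualified path followed by the marker "file".
    static String fileValue(String qualifiedPath) {
        return qualifiedPath + " file ";
    }

    public static void main(String[] args) {
        System.out.println(key(true, 4096) + "\t"
            + dirValue("hdfs://nn/a", Arrays.asList("b", "c.txt")));
        System.out.println(key(false, 1234) + "\t"
            + fileValue("hdfs://nn/a/c.txt"));
    }
}
```

Note that the directory record carries a key of 0 regardless of the length passed in, so directories do not contribute to the total size that is later used for sizing the splits.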

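It is a SequenceFile, and two methods are mainly responsible for producing it: writeTopLevelDirs and recursivels.

Their roles are as follows:

writeTopLevelDirs extracts every level of the parent directories of all input files and directories. For example, for /a/b/c/d.txt it extracts /, /a/, /a/b/ and /a/b/c/.

recursivels recursively collects the information of all files and directories under the current directory, including the directory itself. If the path is a file, it collects only the information of that file.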
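The parent-directory extraction that writeTopLevelDirs performs can be illustrated with a small helper. This sketch only computes the set of ancestor directories; it does not write SequenceFile records and is not the Hadoop implementation. The class TopLevelDirsSketch and the method ancestors are hypothetical names, and the input is assumed to consist of absolute paths without a trailing slash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: collect every ancestor directory of the given paths.
class TopLevelDirsSketch {

    static List<String> ancestors(List<String> paths) {
        // TreeSet removes duplicates and keeps parents before their children.
        TreeSet<String> dirs = new TreeSet<>();
        for (String p : paths) {
            int idx = p.lastIndexOf('/');
            // Strip one path component at a time until only the root is left.
            while (idx > 0) {
                p = p.substring(0, idx);
                dirs.add(p);
                idx = p.lastIndexOf('/');
            }
            dirs.add("/");
        }
        return new ArrayList<>(dirs);
    }

    public static void main(String[] args) {
        List<String> in = new ArrayList<>();
        in.add("/a/b/c/d.txt");
        System.out.println(ancestors(in)); // prints [/, /a, /a/b, /a/b/c]
    }
}
```

When several inputs share parents, such as /a/b/c/d.txt and /a/x/y, the shared levels / and /a appear only once in the result.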

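Once the input file has been written and all of the configuration is complete, the method launches a MapReduce job. When the job finishes, the method deletes the working directory.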

Before analyzing the Map phase, let us first look at the input format of this MapReduce job:

conf.setInputFormat(HArchiveInputFormat.class);

/**
 * Input format of a hadoop archive job responsible for
 * generating splits of the file list
 */
static class HArchiveInputFormat implements InputFormat<LongWritable, Text> {

  // generate input splits from the src file lists
  public InputSplit[] getSplits(JobConf jconf, int numSplits)
      throws IOException {
    String srcfilelist = jconf.get(SRC_LIST_LABEL, "");
    if ("".equals(srcfilelist)) {
      throw new IOException("Unable to get the " +
          "src file for archive generation.");
    }
    long totalSize = jconf.getLong(TOTAL_SIZE_LABEL, -1);
    if (totalSize == -1) {
      throw new IOException("Invalid size of files to archive");
    }
    // we should be safe since this is set by our own code
    Path src = new Path(srcfilelist);
    FileSystem fs = src.getFileSystem(jconf);
    FileStatus fstatus = fs.getFileStatus(src);
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    LongWritable key = new LongWritable();
    Text value = new Text();
    SequenceFile.Reader reader = null;
    // the remaining bytes in the file split
    long remaining = fstatus.getLen();
    // the count of sizes calculated till now
    long currentCount = 0L;
    // the end position of the split
    long lastPos = 0L;
    // the start position of the split
    long startPos = 0L;
    long targetSize = totalSize / numSplits;
    // create splits of size target size so that all the maps
    // have equal sized data to read and write to.
    try {
      reader = new SequenceFile.Reader(fs, src, jconf);
      while (reader.next(key, value)) {
        if (currentCount + key.get() > targetSize && currentCount != 0) {
          long size = lastPos - startPos;
          splits.add(new FileSplit(src, startPos, size, (String[]) null));
          remaining = remaining - size;
          startPos = lastPos;
          currentCount = 0L;
        }
        currentCount += key.get();
        lastPos = reader.getPosition();
      }
      // the remaining not equal to the target size.
      if (remaining != 0) {
        splits.add(new FileSplit(src, startPos, remaining, (String[]) null));
      }
    } finally {
      reader.close();
    }
    return splits.toArray(new FileSplit[splits.size()]);
  }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new SequenceFileRecordReader<LongWritable, Text>(job,
        (FileSplit) split);
  }
}

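This input format exists mainly to provide split generation; reading the records is delegated to SequenceFile.

Let us now see how it generates the splits.

The input file is in SequenceFile format, so it is splittable.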
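The grouping logic of getSplits can be tried out without Hadoop by replacing the SequenceFile reader with two arrays. This is a sketch under stated assumptions, not the actual implementation: the class SplitSketch, the method splits and its parameters are hypothetical, lens[i] stands for the key of record i (the size of the archived file), endPos[i] stands for what reader.getPosition() would return after record i, and the total size is recomputed from lens rather than read from the job configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the split computation over an in-memory record list.
class SplitSketch {

    // Returns the splits as {startPosition, length} pairs.
    static List<long[]> splits(long[] lens, long[] endPos, long fileLen, int numSplits) {
        long totalSize = 0;
        for (long len : lens) {
            totalSize += len;
        }
        long targetSize = totalSize / numSplits;
        List<long[]> out = new ArrayList<>();
        long remaining = fileLen;  // bytes of the list file not yet assigned
        long currentCount = 0L;    // archived bytes gathered in the current split
        long lastPos = 0L;         // end position of the last record read
        long startPos = 0L;        // start position of the current split
        for (int i = 0; i < lens.length; i++) {
            // Close the current split before it would exceed the target,
            // but never emit an empty split.
            if (currentCount + lens[i] > targetSize && currentCount != 0) {
                long size = lastPos - startPos;
                out.add(new long[] {startPos, size});
                remaining -= size;
                startPos = lastPos;
                currentCount = 0L;
            }
            currentCount += lens[i];
            lastPos = endPos[i];
        }
        // Whatever is left of the list file forms the final split.
        if (remaining != 0) {
            out.add(new long[] {startPos, remaining});
        }
        return out;
    }
}
```

For four records with sizes 60, 50, 40 and 30 ending at positions 100, 200, 300 and 400 of a 400-byte list file, asking for two splits gives a target of 90 and produces three splits: {0, 100}, {100, 200} and {300, 100}. The requested number of splits is therefore only a hint. The splits are balanced by the sizes of the files to be archived, not by the byte length of the list file itself.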

First, it reads the path of the input file from the configuration; this path was set in the archive method, at which point ……

Link: https://www.bdocx.com/doc/17420161.html
