Document ID: 4487814
Uploaded: 2022-12-01
Format: DOCX
Pages: 35
Size: 48.92 KB


Hadoop Archives Explained

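The Hadoop har File System Explained

1. Overview of the har file system
1.1 Purpose of the har file system
1.2 Structure and organization of the har file system
1.3 How to use the har file system
1.4 Limitations of the har file system
2. Class analysis of the har file system
2.1 Analysis of the HadoopArchives class
2.1.1 Function overview
2.1.2 Program flow (class diagram analysis)
2.1.3 Related classes and tools (class diagram analysis)
2.2 Analysis of the HarFileSystem class
2.2.1 Function overview
2.2.2 Analysis of the file-reading function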

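1. Overview of the har file system

This part introduces har files and the har file system.

1.1 Purpose of the har file system

1.2 Structure and organization of the har file system

1.3 How to use the har file system

1.4 Limitations of the har file system

2. Class analysis of the har file system

2.1 Analysis of the HadoopArchives class

The most important member variable of this class is conf. (A class diagram analysis is still to be added.)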

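2.1.1 Function overview

The main purpose of the HadoopArchives class is to generate har files. It is a utility class that implements the Tool interface.

Under the hood, it runs a MapReduce job.

Generating a har file involves four main stages: the run method, the archive method, the Map phase, and the Reduce phase.

See the next section for the details.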

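2.1.2 Program flow (class diagram analysis)

Execution starts in the main function of HadoopArchives, which uses the ToolRunner class to invoke the run method of HadoopArchives.

The run method validates and extracts the command-line arguments, converts them into the input paths and the output directory, and passes these to the archive method.

Let's look at the run method in detail: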

/** the main driver for creating the archives
 *  it takes at least two command line parameters. The src and the
 *  dest. It does an lsr on the source paths.
 *  The mapper creates archives and the reducer creates
 *  the archive index.
 */
public int run(String[] args) throws Exception {
  try {
    List<Path> srcPaths = new ArrayList<Path>();
    Path destPath = null;
    // check we were supposed to archive or
    // unarchive
    String archiveName = null;
    if (args.length < 4) {
      System.out.println(usage);
      throw new IOException("Invalid usage.");
    }
    if (!"-archiveName".equals(args[0])) {
      System.out.println(usage);
      throw new IOException("Archive Name not specified.");
    }
    archiveName = args[1];
    if (!checkValidName(archiveName)) {
      System.out.println(usage);
      throw new IOException("Invalid name for archives. " + archiveName);
    }
    // the last argument is the destination, all others are sources
    for (int i = 2; i < args.length; i++) {
      if (i == (args.length - 1)) {
        destPath = new Path(args[i]);
      }
      else {
        srcPaths.add(new Path(args[i]));
      }
    }
    if (srcPaths.size() == 0) {
      System.out.println(usage);
      throw new IOException("Invalid Usage: No input sources specified.");
    }
    // do a glob on the srcPaths and then pass it on
    List<Path> globPaths = new ArrayList<Path>();
    for (Path p : srcPaths) {
      FileSystem fs = p.getFileSystem(getConf());
      FileStatus[] statuses = fs.globStatus(p);
      for (FileStatus status : statuses) {
        globPaths.add(fs.makeQualified(status.getPath()));
      }
    }
    archive(globPaths, archiveName, destPath);
  } catch (IOException ie) {
    System.err.println(ie.getLocalizedMessage());
    return -1;
  }
  return 0;
}

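In the code above, the call fs.makeQualified(status.getPath()) (highlighted in red in the original document) completes each input path taken from the command-line arguments by adding the scheme and authority, producing a fully qualified path. The qualified paths are then passed on.

The main job of this method is to extract three variables, globPaths, archiveName, and destPath, and pass them as arguments to the archive method. Their meanings are straightforward:

globPaths: all of the input paths.

archiveName: the file name of the har file to generate.

destPath: the output directory, which will hold the file named archiveName.

Some error checking is done while these three parameters are extracted.

See the code for the details.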
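The argument handling described above can be sketched without Hadoop. The class below is a minimal illustration, not the HadoopArchives code itself: HarArgsSketch, Parsed, and parse are hypothetical names, paths are kept as plain strings, and the checkValidName test is left out because its body is not shown in the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, Hadoop-free sketch of the argument layout that run() expects. */
final class HarArgsSketch {

    /** Holder for the three values that run() hands to archive(). */
    static final class Parsed {
        final String archiveName;
        final List<String> srcPaths;
        final String destPath;

        Parsed(String archiveName, List<String> srcPaths, String destPath) {
            this.archiveName = archiveName;
            this.srcPaths = srcPaths;
            this.destPath = destPath;
        }
    }

    /** Expected layout: -archiveName NAME src [src ...] dest */
    static Parsed parse(String[] args) {
        if (args.length < 4) {
            throw new IllegalArgumentException("Invalid usage.");
        }
        if (!"-archiveName".equals(args[0])) {
            throw new IllegalArgumentException("Archive Name not specified.");
        }
        String archiveName = args[1];
        List<String> srcPaths = new ArrayList<>();
        String destPath = null;
        // The last argument is the destination; everything before it is a source.
        for (int i = 2; i < args.length; i++) {
            if (i == args.length - 1) {
                destPath = args[i];
            } else {
                srcPaths.add(args[i]);
            }
        }
        return new Parsed(archiveName, srcPaths, destPath);
    }

    public static void main(String[] args) {
        Parsed p = parse(new String[] {
            "-archiveName", "logs.har", "/data/a", "/data/b", "/archives"});
        System.out.println(p.archiveName + " " + p.srcPaths + " -> " + p.destPath);
    }
}
```

The real run method additionally expands globs in each source path and qualifies the results before calling archive.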

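Next comes the archive method, which does three things. First, it configures the job parameters. Second, it generates a SequenceFile from the input files; this file stores the directory information for all of the input files and serves as the input to the Map phase. Third, it launches the job.

Here is the source of the archive method: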

/** archive the given source paths into
 *  the dest
 *  @param srcPaths the src paths to be archived
 *  @param dest the dest dir that will contain the archive
 */
public void archive(List<Path> srcPaths, String archiveName, Path dest)
    throws IOException {
  checkPaths(conf, srcPaths);
  int numFiles = 0;
  long totalSize = 0;
  conf.set(DST_HAR_LABEL, archiveName);
  Path outputPath = new Path(dest, archiveName);
  FileOutputFormat.setOutputPath(conf, outputPath);
  FileSystem outFs = outputPath.getFileSystem(conf);
  if (outFs.exists(outputPath) || outFs.isFile(dest)) {
    throw new IOException("Invalid Output.");
  }
  conf.set(DST_DIR_LABEL, outputPath.toString());
  final String randomId = DistCp.getRandomId();
  Path jobDirectory = new Path(new JobClient(conf).getSystemDir(),
      NAME + "_" + randomId);
  conf.set(JOB_DIR_LABEL, jobDirectory.toString());
  // get a tmp directory for input splits
  FileSystem jobfs = jobDirectory.getFileSystem(conf);
  jobfs.mkdirs(jobDirectory);
  Path srcFiles = new Path(jobDirectory, "_har_src_files");
  conf.set(SRC_LIST_LABEL, srcFiles.toString());
  SequenceFile.Writer srcWriter = SequenceFile.createWriter(jobfs, conf,
      srcFiles, LongWritable.class, Text.class,
      SequenceFile.CompressionType.NONE);
  // get the list of files
  // create single list of files and dirs
  try {
    // write the top level dirs in first
    writeTopLevelDirs(srcWriter, srcPaths);
    srcWriter.sync();
    // these are the input paths passed
    // from the command line
    // we do a recursive ls on these paths
    // and then write them to the input file
    // one at a time
    for (Path src : srcPaths) {
      FileSystem fs = src.getFileSystem(conf);
      ArrayList<FileStatus> allFiles = new ArrayList<FileStatus>();
      recursivels(fs, src, allFiles);
      for (FileStatus stat : allFiles) {
        String toWrite = "";
        // directories are recorded with length 0
        long len = stat.isDir() ? 0 : stat.getLen();
        if (stat.isDir()) {
          toWrite = "" + fs.makeQualified(stat.getPath()) + " dir ";
          // get the children
          FileStatus[] list = fs.listStatus(stat.getPath());
          StringBuffer sbuff = new StringBuffer();
          sbuff.append(toWrite);
          for (FileStatus stats : list) {
            sbuff.append(stats.getPath().getName() + " ");
          }
          toWrite = sbuff.toString();
        }
        else {
          toWrite += fs.makeQualified(stat.getPath()) + " file ";
        }
        srcWriter.append(new LongWritable(len), new Text(toWrite));
        srcWriter.sync();
        numFiles++;
        totalSize += len;
      }
    }
  } finally {
    srcWriter.close();
  }
  // increase the replication of src files
  jobfs.setReplication(srcFiles, (short) 10);
  conf.setInt(SRC_COUNT_LABEL, numFiles);
  conf.setLong(TOTAL_SIZE_LABEL, totalSize);
  int numMaps = (int) (totalSize / partSize);
  // run at least one map.
  conf.setNumMapTasks(numMaps == 0 ? 1 : numMaps);
  conf.setNumReduceTasks(1);
  conf.setInputFormat(HArchiveInputFormat.class);
  conf.setOutputFormat(NullOutputFormat.class);
  conf.setMapperClass(HArchivesMapper.class);
  conf.setReducerClass(HArchivesReducer.class);
  conf.setMapOutputKeyClass(IntWritable.class);
  conf.setMapOutputValueClass(Text.class);
  conf.set("hadoop.job.history.user.location", "none");
  FileInputFormat.addInputPath(conf, jobDirectory);
  // make sure no speculative execution is done
  conf.setSpeculativeExecution(false);
  JobClient.runJob(conf);
  // delete the tmp job directory
  try {
    jobfs.delete(jobDirectory, true);
  } catch (IOException ie) {
    LOG.info("Unable to clean tmp directory " + jobDirectory);
  }
}

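The call highlighted in the original document,

fs.makeQualified(stat.getPath())

completes the path into a fully qualified path.

This is probably needed because the subdirectories and the files in them are obtained through the getPath method, which appears to yield only the directory path, presumably without the scheme and authority. The Path and URI classes are worth reading for the details.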
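What "qualifying" a path means can be illustrated with the standard java.net.URI class. This is only a sketch under stated assumptions: QualifySketch and qualify are hypothetical names, the default file system URI hdfs://namenode:8020/ is made up, and only absolute paths are handled (Hadoop's makeQualified also resolves relative paths against a working directory, which is not modeled here).

```java
import java.net.URI;

/** Hypothetical sketch: add scheme and authority to an absolute path. */
final class QualifySketch {

    /**
     * Returns the path unchanged if it already carries a scheme; otherwise
     * resolves it against the given default file system URI so that the
     * result has a scheme and an authority.
     */
    static URI qualify(URI defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p; // already fully qualified
        }
        return defaultFs.resolve(p);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020/"); // assumed value
        System.out.println(qualify(defaultFs, "/user/foo/data.txt"));
        System.out.println(qualify(defaultFs, "hdfs://other:9000/x"));
    }
}
```

Storing qualified paths in the source list means that a map task can open each entry later without depending on whatever default file system happens to be configured where it runs.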

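The method first checks the input paths and then sets a number of parameters. These fall into two groups: settings on conf, and the input and output paths of the MapReduce job.

Some of the conf settings are the same as for any MapReduce job and are not discussed here. A few others are more important and worth mentioning:

conf.set(DST_HAR_LABEL, archiveName);

conf.set(DST_DIR_LABEL, outputPath.toString());

conf.set(JOB_DIR_LABEL, jobDirectory.toString());

conf.set(SRC_LIST_LABEL, srcFiles.toString());

conf.setInt(SRC_COUNT_LABEL, numFiles);

conf.setLong(TOTAL_SIZE_LABEL, totalSize);

The meaning of these parameters is fairly clear from the code.

They are set because later parts of the program use them, for example when the input splits are computed; this is discussed later.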
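The hand-off through the configuration can be pictured with a plain map. The sketch below is an assumption-laden stand-in, not Hadoop's JobConf: ConfHandOffSketch and its methods are hypothetical, and the key strings are invented because the actual values of SRC_LIST_LABEL and TOTAL_SIZE_LABEL are not shown in the text. The default of -1 and the error message mirror the getSplits listing later in this section.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of passing values from the driver to a later stage. */
final class ConfHandOffSketch {
    // Invented key strings, for illustration only.
    static final String SRC_LIST_LABEL = "sketch.har.src.list";
    static final String TOTAL_SIZE_LABEL = "sketch.har.total.size";

    /** Driver side: record what the later stages will need. */
    static Map<String, String> driverSide(String srcList, long totalSize) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SRC_LIST_LABEL, srcList);
        conf.put(TOTAL_SIZE_LABEL, Long.toString(totalSize));
        return conf;
    }

    /** Consumer side: read the value back, treating -1 as "not set". */
    static long totalSizeOrFail(Map<String, String> conf) {
        long totalSize = Long.parseLong(conf.getOrDefault(TOTAL_SIZE_LABEL, "-1"));
        if (totalSize == -1) {
            throw new IllegalStateException("Invalid size of files to archive");
        }
        return totalSize;
    }

    public static void main(String[] args) {
        Map<String, String> conf = driverSide("/tmp/job/_har_src_files", 4096L);
        System.out.println(totalSizeOrFail(conf));
    }
}
```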

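FileOutputFormat.setOutputPath(conf, outputPath);

FileInputFormat.addInputPath(conf, jobDirectory);

...

The rest of the MapReduce job configuration is skipped here.

Next, let's see how the input file for the Map phase is generated. It is stored in srcFiles:

Path srcFiles = new Path(jobDirectory, "_har_src_files");

Each record in it has the following format:

file size (0 for a directory) + path name + "dir" or "file" + [children] (only for a directory that has children; each child is listed by its name within that directory, not by its full path)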
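The record layout described above can be sketched as a pair of small helpers. This is an illustration rather than the original code: SrcRecordSketch, key, and value are hypothetical names, and the single spaces around the dir and file markers and after each child name are an assumption, since whitespace inside the string literals did not survive in the listing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of one record in the _har_src_files list. */
final class SrcRecordSketch {

    /** Record key: 0 for a directory, the file length otherwise. */
    static long key(boolean isDir, long len) {
        return isDir ? 0L : len;
    }

    /** Record value: qualified path, type marker, then child names for a directory. */
    static String value(String qualifiedPath, boolean isDir, List<String> childNames) {
        StringBuilder sb = new StringBuilder(qualifiedPath);
        if (isDir) {
            sb.append(" dir ");
            for (String child : childNames) {
                sb.append(child).append(' '); // child name only, not a full path
            }
        } else {
            sb.append(" file ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(key(true, 0) + "\t"
            + value("hdfs://nn/a", true, Arrays.asList("b", "c.txt")));
        System.out.println(key(false, 1234) + "\t"
            + value("hdfs://nn/a/c.txt", false, Collections.<String>emptyList()));
    }
}
```

Because the key of each record is the size of the entry, a later stage can add up the keys to balance work without opening the files themselves.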

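It is a SequenceFile, and it is generated mainly by two methods: writeTopLevelDirs and recursivels.

These two methods work as follows.

writeTopLevelDirs extracts every level of the parent directories of all input files and directories. For example, for /a/b/c/d.txt it extracts /, /a/, /a/b/, and /a/b/c/. recursivels recursively collects information about all files and directories under the given directory, including the directory itself; if the path is a file, it collects only the information for that file.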
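The parent-directory extraction attributed to writeTopLevelDirs can be sketched on plain strings. The class below is a hypothetical, Hadoop-free illustration (TopLevelDirsSketch, ancestors, and topLevelDirs are made-up names); it assumes absolute, slash-separated paths and prints directories without the trailing slash used in the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: collect every ancestor directory of the input paths. */
final class TopLevelDirsSketch {

    /** Ancestors of one absolute path, root first, excluding the path itself. */
    static List<String> ancestors(String absolutePath) {
        List<String> result = new ArrayList<>();
        result.add("/");
        String[] parts = absolutePath.split("/");
        StringBuilder current = new StringBuilder();
        // parts[0] is the empty text before the leading slash;
        // the last part is the entry itself, so it is skipped.
        for (int i = 1; i < parts.length - 1; i++) {
            current.append('/').append(parts[i]);
            result.add(current.toString());
        }
        return result;
    }

    /** Union of the ancestors of all inputs, sorted and without duplicates. */
    static TreeSet<String> topLevelDirs(List<String> paths) {
        TreeSet<String> dirs = new TreeSet<>();
        for (String p : paths) {
            dirs.addAll(ancestors(p));
        }
        return dirs;
    }

    public static void main(String[] args) {
        System.out.println(ancestors("/a/b/c/d.txt"));
        System.out.println(topLevelDirs(Arrays.asList("/a/b/c/d.txt", "/a/x/y")));
    }
}
```

Collecting the ancestors into a set matters when several inputs share parents: each shared directory should appear in the list only once.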

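Once the input file has been written and all of the configuration is done, the method launches a MapReduce job. When the job finishes, the method deletes the working directory.

Before analyzing the Map phase, let's first look at the input format of this MapReduce job:

conf.setInputFormat(HArchiveInputFormat.class);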

/**
 * Input format of a hadoop archive job responsible for
 * generating splits of the file list
 */
static class HArchiveInputFormat implements InputFormat<LongWritable, Text> {

  // generate input splits from the src file lists
  public InputSplit[] getSplits(JobConf jconf, int numSplits)
      throws IOException {
    String srcfilelist = jconf.get(SRC_LIST_LABEL, "");
    if ("".equals(srcfilelist)) {
      throw new IOException("Unable to get the " +
          "src file for archive generation.");
    }
    long totalSize = jconf.getLong(TOTAL_SIZE_LABEL, -1);
    if (totalSize == -1) {
      throw new IOException("Invalid size of files to archive");
    }
    // we should be safe since this is set by our own code
    Path src = new Path(srcfilelist);
    FileSystem fs = src.getFileSystem(jconf);
    FileStatus fstatus = fs.getFileStatus(src);
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    LongWritable key = new LongWritable();
    Text value = new Text();
    SequenceFile.Reader reader = null;
    // the remaining bytes in the file split
    long remaining = fstatus.getLen();
    // the count of sizes calculated till now
    long currentCount = 0L;
    // the end position of the split
    long lastPos = 0L;
    // the start position of the split
    long startPos = 0L;
    long targetSize = totalSize / numSplits;
    // create splits of size target size so that all the maps
    // have equal sized data to read and write to.
    try {
      reader = new SequenceFile.Reader(fs, src, jconf);
      while (reader.next(key, value)) {
        if (currentCount + key.get() > targetSize && currentCount != 0) {
          long size = lastPos - startPos;
          splits.add(new FileSplit(src, startPos, size, (String[]) null));
          remaining = remaining - size;
          startPos = lastPos;
          currentCount = 0L;
        }
        currentCount += key.get();
        lastPos = reader.getPosition();
      }
      // the remaining not equal to the target size.
      if (remaining != 0) {
        splits.add(new FileSplit(src, startPos, remaining, (String[]) null));
      }
    }
    finally {
      reader.close();
    }
    return splits.toArray(new FileSplit[splits.size()]);
  }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new SequenceFileRecordReader<LongWritable, Text>(job,
        (FileSplit) split);
  }
}

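This input format exists mainly to provide split generation; reading the records is delegated to SequenceFileRecordReader.

Let's now look at how it generates the splits.

The input file is in SequenceFile format, so it is splittable.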
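The grouping logic of getSplits can be tried out on plain arrays. The sketch below is a simplified model under stated assumptions, not the Hadoop implementation: SplitSketch and splits are hypothetical names, keys[i] stands for the size stored as the key of record i, endPos[i] stands for the reader position right after record i, and each split is returned as a {start, length} pair instead of a FileSplit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the size-balanced split generation in getSplits. */
final class SplitSketch {

    static List<long[]> splits(long[] keys, long[] endPos, long fileLen,
                               long totalSize, int numSplits) {
        List<long[]> out = new ArrayList<>();
        long targetSize = totalSize / numSplits;
        long remaining = fileLen;   // bytes of the list file not yet assigned
        long currentCount = 0L;     // sum of the sizes in the current split
        long lastPos = 0L;          // end position of the last record read
        long startPos = 0L;         // start position of the current split
        for (int i = 0; i < keys.length; i++) {
            // Close the current split before it would exceed the target size.
            if (currentCount + keys[i] > targetSize && currentCount != 0) {
                long size = lastPos - startPos;
                out.add(new long[] {startPos, size});
                remaining -= size;
                startPos = lastPos;
                currentCount = 0L;
            }
            currentCount += keys[i];
            lastPos = endPos[i];
        }
        // Whatever is left forms the final split.
        if (remaining != 0) {
            out.add(new long[] {startPos, remaining});
        }
        return out;
    }

    public static void main(String[] args) {
        long[] keys = {0, 60, 50, 70, 20};
        long[] endPos = {10, 20, 30, 40, 50};
        for (long[] s : splits(keys, endPos, 50, 200, 2)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Note that the splits are balanced by the sizes of the files to be archived (the record keys), not by the byte length of the list file, and that the number of splits produced can differ from numSplits.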

First, it reads the path of the input file from the job configuration, which was set in the archive method; at that time ...
