Survey of Search Tools
Open-Source Full-Text Search Systems
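▪ Apache Solr
▪ BaseX
▪ Clusterpoint Server (freeware licence for a single server)
▪ DataparkSearch
▪ Ferret
▪ ht://Dig (searching for it returns almost nothing but news about its security vulnerabilities)
▪ Hyper Estraier
▪ KinoSearch (Perl): borrows many ideas from Lucene, and its performance-critical core is implemented in C, so it is reasonably efficient. It has the potential to become a rising star in the Perl world.
▪ Lemur/Indri
▪ Lucene
▪ mnoGoSearch (not free)
▪ Sphinx
▪ Swish-e
▪ Xapian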
Lemur Project
About the Lemur Project
● started in 2000
● by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University
● using statistical language models
● typically makes its major software releases in June and December of each year
● supported in part by the National Science Foundation
● develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software
● best known for its Indri search engine, Lemur Toolbar, and ClueWeb09 dataset
● emphasizes state-of-the-art accuracy, flexibility, and efficiency
● released under open-source licenses
Indri
Indri is a search engine that provides state-of-the-art text search and a rich structured query language for text collections of up to 50 million documents (single machine) or 500 million documents (distributed search). Available for Linux and Windows.
Features
Powerful Query Interface
∙ Supports popular structured query operators from INQUERY
∙ Suffix-based wildcard term matching
∙ Field retrieval
∙ Passage retrieval
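As an illustration of suffix-based wildcard matching, the sketch below expands a pattern with a trailing `*` against a sorted vocabulary. It assumes that "suffix-based" means the wildcard stands in for the end of the term. The function name `expand_wildcard` and the sample vocabulary are hypothetical, and this is not a description of how Indri implements the feature internally.

```python
from bisect import bisect_left


def expand_wildcard(pattern, sorted_vocab):
    """Return the vocabulary terms matched by a pattern such as 'retriev*'.

    sorted_vocab must be a lexicographically sorted list of terms.
    """
    if not pattern.endswith("*"):
        # No wildcard: the pattern matches only itself, if it is indexed.
        return [pattern] if pattern in sorted_vocab else []
    prefix = pattern[:-1]
    # All terms sharing the prefix are contiguous in a sorted list,
    # so binary-search for the first candidate and scan forward.
    start = bisect_left(sorted_vocab, prefix)
    matches = []
    for term in sorted_vocab[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


vocab = sorted(["index", "indexing", "retrieval", "retrieve", "retrieved", "search"])
print(expand_wildcard("retriev*", vocab))
```

A query engine could then treat the expanded terms as synonyms of one another when scoring, though that step is outside this sketch.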
Flexible Indexing and Document Support
∙ Supports UTF-8 encoded text
∙ Language-independent tokenization of UTF-8 encoded documents
∙ Parses PDF, HTML, XML, and TREC documents
∙ Word and PowerPoint parsing (Windows only)
∙ Text annotations
∙ Document metadata
Package Versatility
∙ Open source, with a flexible BSD-inspired license
∙ Includes both command-line tools and a Java user interface
∙ API can be used from Java, PHP, or C++
∙ Works on Windows, Linux, Solaris and Mac OS X
Scalability and Efficiency
∙ Best-in-class ad hoc retrieval performance
∙ Can be used on a cluster of machines for faster indexing and retrieval
∙ Scales to terabyte-sized collections
About Indri
Indri is a text search engine developed at UMass. It is a part of the Lemur project.
From an academic perspective, Indri is interesting because it combines inference networks with language modeling. The query language, which is reminiscent of the Inquery query language, allows researchers to experiment with proximity, document structure, text passages, and other document features without writing code. Like other academic engines, Indri can parse TREC newswire and web collections, and it is able to return results in the TREC standard format.
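For readers unfamiliar with the TREC standard result format mentioned above, each result line conventionally has six whitespace-separated fields: query ID, the literal `Q0`, document ID, rank, score, and a run tag. The following minimal sketch formats a scored list that way. The function name, document IDs, scores, and run tag are all made up for illustration and are not output from Indri.

```python
def format_trec_run(query_id, scored_docs, run_tag="example"):
    """Format (docno, score) pairs as TREC result lines, best score first."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [
        f"{query_id} Q0 {docno} {rank} {score:.4f} {run_tag}"
        for rank, (docno, score) in enumerate(ranked, start=1)
    ]


# Hypothetical document IDs and log-probability scores.
lines = format_trec_run("301", [("DOC-B", -5.25), ("DOC-A", -4.5)])
for line in lines:
    print(line)
```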
From an industrial perspective, Indri is interesting because it is efficient, supported, and easy to integrate. Indri is freely available from UMass with a flexible BSD-inspired license. Indri includes an API that is accessible from C++, Java, C# and PHP. Indri also can be distributed across a cluster of nodes for high-speed query performance. In version 2.0, Indri adds true multithreaded operation, so documents can be added, queried and deleted concurrently.
About the Indri Applications
The IndriBuildIndex application can build Indri repositories from TREC formatted documents, HTML documents, text documents, and PDF files. Additionally, on Windows it can index Word and PowerPoint documents. IndriBuildIndex understands tags in HTML/XML documents, and it can be instructed to index them as well.
The IndriRunQuery application evaluates queries against one or more Indri repositories, and returns the results in a ranked list of documents. IndriRunQuery can be instructed to print the document text as well, or the text of passages if the query is a passage retrieval query.
The IndriDaemon application is a repository server. It waits for connections from IndriRunQuery (or from other applications using the QueryEnvironment interface) and processes queries from network requests. One copy of IndriRunQuery can connect to many IndriDaemon instances at once, making retrieval using a cluster of machines possible.
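The general idea of combining results from several repository servers can be sketched as a k-way merge of per-server ranked lists. This is only a conceptual illustration: it assumes each server returns its results sorted by descending score and that scores from different servers are directly comparable. It does not reproduce how IndriRunQuery and IndriDaemon actually exchange or combine results, and all names and data are hypothetical.

```python
import heapq
from itertools import islice


def merge_ranked_lists(per_server_results, k):
    """Merge ranked (docno, score) lists into a single top-k list.

    Each input list must already be sorted by descending score.
    """
    # heapq.merge expects ascending order, so order by negated score.
    merged = heapq.merge(*per_server_results, key=lambda result: -result[1])
    return list(islice(merged, k))


# Hypothetical results from two servers.
server_a = [("a1", -3.0), ("a2", -6.0)]
server_b = [("b1", -4.0), ("b2", -5.0)]
top = merge_ranked_lists([server_a, server_b], k=3)
print(top)
```

Because the merge is lazy, only as many results as needed are pulled from each list.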
Using the Indri API
Indri provides the QueryEnvironment and IndexEnvironment classes, which can be used from C++, Java, C# or PHP (although indexing is not supported from PHP). The IndriBuildIndex and IndriRunQuery applications use these classes exclusively. Please keep in mind that we reserve the right to change any classes within Indri that are not in the indri::api namespace. If you write your code to use only indri::api classes, we will do our best to make sure they still work in future versions of Indri.
IndexEnvironment understands many different file types. However, you can create your own file type, as long as it is XML-like, and tell IndexEnvironment how to index it. Then, using the addFile method, IndexEnvironment can index your document(s). If you want to do more complex processing on your data, or if your data is arriving in real time, you may parse your document into a ParsedDocument structure. The IndexEnvironment object can index these structures directly.
QueryEnvironment allows you to run queries and retrieve a ranked list of results. You can use runAnnotatedQuery to retrieve match information (annotations), which is useful for highlighting matched words in documents. By using the addIndex method with an instance of IndexEnvironment, you can evaluate queries on an index that is currently being built. The addServer method allows you to connect to IndriDaemon processes for distributed retrieval.
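To make the highlighting use case concrete, the sketch below marks matched words given a list of match extents. The data shapes are assumptions for illustration only: a document as a list of tokens, and each match as a half-open `(begin, end)` range of token positions. They are not the actual structures returned by runAnnotatedQuery.

```python
def highlight(tokens, extents, open_mark="[", close_mark="]"):
    """Wrap every token covered by a (begin, end) extent in marker strings."""
    marked = set()
    for begin, end in extents:
        # Extents are half-open: begin is included, end is excluded.
        marked.update(range(begin, end))
    out = []
    for position, token in enumerate(tokens):
        if position in marked:
            out.append(f"{open_mark}{token}{close_mark}")
        else:
            out.append(token)
    return " ".join(out)


tokens = "indri combines inference networks with language modeling".split()
highlighted = highlight(tokens, [(2, 4), (5, 7)])
print(highlighted)
```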
How do I use the Indri API from Java?
First, you need to build Indri including the Java wrappers. On Unix, you do this by adding the --enable-java option when running the configure script. The script should find your Java installation automatically, but if it doesn't, you can show it where to find Java by using the --with-javahome parameter. If you are using Windows, use the swig project file from Visual Studio to build the Java API. You may need to change the include path on the project to point to your Java installation.
Once that's built, indri.jar and liblemur_jni.so should be in your lemur/swig/obj/java directory. If you are using Mac OS X, liblemur_jni.so will be called liblemur_jni.jnilib. The indri.jar file contains all of the Java support files for Indri, while liblemur_jni.so contains the Indri C++ code.
If you run an application that uses the indri.jar file, it will attempt to load the liblemur_jni.so file automatically. For this to work, you need to set the java.library.path variable correctly. You can do this on the java command line:
`java -cp indri.jar \
-Djava.library.path=lemur/swig/obj/java \
MyIndriApplication`
From http://www.lemurproject.org/lemur.php
The Lemur Toolkit
The Lemur Toolkit APIs have been deprecated. The final released version of the Lemur Toolkit is version 4.12, released 06/21/2010.
The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval (IR), where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval with structured queries, cross-language IR, summarization, filtering, and categorization. The system's underlying architecture was built to support the technologies above. We provide many useful sample applications, but have designed the toolkit to allow you to easily program your own customizations and applications.
Features
∙ Sophisticated structured query languages (using InQuery and Indri)
∙ Support for XML and structured document retrieval
∙ Used commonly with a wide range of research test collections (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
∙ Index your web pages with an "out-of-the-box" site search capability
∙ Interactive interfaces for Windows, Linux, and Web
∙ Distributed information retrieval and document clustering applications
∙ Cross-platform, fast and modular code written in C++
∙ C++, Java and C# APIs
∙ Free and open-source software
∙ In use for over 6 years by a large and growing user community
Indexing
∙ Multiple indexing methods for small, medium and large-scale (terabyte) collections
∙ Built-in support for English, Chinese and Arabic text
∙ Porter and Krovetz word stemming
∙ Incremental indexing
∙ Out-of-the-box indexing support for TREC Text, TREC Web, plain text, HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
∙ Indexes inline and offset text annotations (e.g., part-of-speech and named entities)
∙ Indexes document attributes
Retrieval
∙ Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
∙ Relevance and pseudo-relevance feedback
∙ Wildcard term expansion (using Indri)
∙ Passage and XML element retrieval
∙ Cross-lingual retrieval
∙ Smoothing via Dirichlet priors and Markov chains
∙ Supports arbitrary document priors (e.g., PageRank, URL depth)
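Two of the items above, Dirichlet-prior smoothing and document priors, can be illustrated with the textbook query-likelihood formula: score(Q, D) = log prior(D) + Σ log((tf(q, D) + μ·P(q|C)) / (|D| + μ)). The sketch below is a minimal implementation of that formula, not Lemur's code. The function name, the toy collection, and the value of `mu` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len,
                    mu=2.0, log_prior=0.0):
    """Query log-likelihood with Dirichlet smoothing plus a log document prior."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = log_prior
    for term in query_terms:
        # Background probability of the term in the whole collection.
        p_collection = collection_tf.get(term, 0) / collection_len
        p_term = (doc_tf[term] + mu * p_collection) / (doc_len + mu)
        if p_term == 0.0:
            # Term never occurs in the collection: the query has zero likelihood.
            return float("-inf")
        score += math.log(p_term)
    return score


# Toy collection of two documents (hypothetical data).
docs = {"d1": ["a", "b"], "d2": ["b", "c"]}
collection_tf = Counter(term for terms in docs.values() for term in terms)
collection_len = sum(collection_tf.values())

s1 = dirichlet_score(["a"], docs["d1"], collection_tf, collection_len)
s2 = dirichlet_score(["a"], docs["d2"], collection_tf, collection_len)
print(s1 > s2)
```

A prior such as PageRank or URL depth would enter through `log_prior`, shifting a document's score independently of the query terms.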
From http://www.lemurproject.org/lemur/overview.php
Overview of the Lemur Toolkit
Contents
1. What is Lemur?
2. What kinds of things can Lemur do?
3. How can it be useful?
4. What have people used Lemur for?
5. How can I use Lemur?
6. What does Lemur come with?
7. What was Lemur written in, and what platforms does it work on?
1. What is Lemur?
Lemur is a toolkit designed to facilitate research in language modeling and information retrieval (IR), where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, with structured queries, cross-language IR, summarization, filtering, and categorization. The system's underlying architecture was built to support the technologies above. We provide many useful sample applications, but have designed the toolkit to allow you to easily program your own customizations and applications.
2. What kinds of things can Lemur do?
TheLemurtool