Multilevel error annotation in FALKO
1. Introduction
Learner corpora, principled collections of learner language, provide interesting insights into the mechanisms by which a foreign language is acquired. For overviews of the current state of learner corpus research see Granger (2002, to appear), Nesselhauf (2004), and Pravec (2002).
Learner corpora are used to test hypotheses in the theory of acquisition in two main ways. First, learner corpora can be used for the so-called contrastive interlanguage analysis (CIA), i.e. the quantitative comparison of learner language and native language to find patterns of overuse or underuse. For CIA, a corpus does not have to be tagged.
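The quantitative comparison behind CIA can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `per_million` and `usage_ratio`, the per-million normalization, and the toy token lists are invented here and are not taken from Falko or from any CIA tool.

```python
from collections import Counter

def per_million(tokens):
    """Map each word form to its frequency per million tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}

def usage_ratio(word, learner_tokens, native_tokens):
    """Ratio of learner to native relative frequency.

    Values above 1 suggest overuse, values below 1 suggest underuse.
    Returns None if the word does not occur in the native sample.
    """
    learner = per_million(learner_tokens).get(word, 0.0)
    native = per_million(native_tokens).get(word, 0.0)
    if native == 0.0:
        return None
    return learner / native

# Invented toy data, for illustration only.
learner_sample = "und dann und dann aber und".split()
native_sample = "und dann jedoch aber deshalb somit".split()

ratio = usage_ratio("und", learner_sample, native_sample)
```

A real study would also test whether such a difference is statistically significant; the ratio alone only points to candidate patterns.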
In this article we are concerned with the second main area of learner corpus research: error tagging. While error tagging is problematic in many theoretical respects, it is probably no longer controversial that error-tagged learner corpora can be useful for a number of research questions if the tagging follows certain guidelines. In this paper we do not argue for the need for error annotation (see Granger, to appear, for a motivation) or discuss the theoretical problems involved but are concerned only with issues of error tagging and corpus architecture. We argue for a multi-level standoff architecture (rather than a flat token-tag architecture) for error-tagged learner corpora. Using the German learner corpus Falko as an example, we show how multi-level approaches to learner corpora can help solve some of the problems that occur in error tagging if flat annotation models are used.
1.1. The German learner corpora situation
While there are many learner corpora for English and some for other languages, for example French and Norwegian (see Granger, to appear, for an overview), there are only very few learner corpora for German as a Foreign Language (GFL). Most German learner corpora are small and/or not publicly available. Ursula Weinberger from Lancaster University collected a corpus (95 texts, 27635 words) of German learners with English as their L1 (Weinberger 2002). The corpus is the only error-tagged GFL corpus we are aware of, but it is not publicly available. Julie Belz and her colleagues at Pennsylvania State University are building a corpus of telecollaborative data (Belz 2004), which is also not publicly available. In addition, there is a well-known collection of learner errors which is available on CD (Heringer 1995). This collection is not a corpus in that it does not contain full texts.
1.2. Falko
Before we describe the architecture of our example learner corpus Falko in the following sections, we want to say a few words about its design and content. Falko, which stands for Fehlerannotiertes Lernerkorpus ‘error-annotated learner corpus’, is still under construction. The corpus is currently small, but growing. From summer 2005 on, Falko will be available online at http://www.linguistik.hu-berlin.de/korpuslinguistik/projekte/falko/index.php.
A number of learner language studies have already used Falko data in their research (Hirschmann 2005; Lippert, in preparation; Schmidt & Walter, to appear; Walter, Schmidt & Dittmar, submitted).
Falko contains several distinct sets of data. Each text is annotated with detailed header information so that the data sets can be combined into sub-corpora according to the needs of the researcher.
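Combining texts into sub-corpora by header information could look roughly like the following Python sketch. The header fields `group` and `task`, the `Text` class, and the function `select_subcorpus` are hypothetical names chosen for illustration; the actual Falko header format is not specified here.

```python
from dataclasses import dataclass

@dataclass
class Text:
    text_id: str
    header: dict   # metadata, e.g. {"group": "L2", "task": "summary"}
    tokens: list

def select_subcorpus(texts, **criteria):
    """Return the texts whose header matches every given key/value pair."""
    return [t for t in texts
            if all(t.header.get(key) == value for key, value in criteria.items())]

# Invented miniature corpus, for illustration only.
corpus = [
    Text("t1", {"group": "L2", "task": "summary"}, ["Die", "Erklärung"]),
    Text("t2", {"group": "L1", "task": "summary"}, ["Das", "Phänomen"]),
    Text("t3", {"group": "L2", "task": "essay"}, ["Eine", "Kommunikation"]),
]

l2_summaries = select_subcorpus(corpus, group="L2", task="summary")
```

Because the selection is driven only by header values, the same texts can belong to several sub-corpora without being copied or re-annotated.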
Core corpus: The core corpus is a highly controlled set of summaries of academic texts written by advanced GFL learners (henceforth L2) and native German speakers (henceforth L1) under comparable conditions. All texts are produced by students of German after approximately two years of study. All of the L2 students have passed the DSH exam (Deutsche Sprachprüfung für den Hochschulzugang ausländischer Studienbewerber, roughly comparable to the TOEFL test for English). As of June 2005, the core corpus consists of 59 L2 texts and 41 L1 texts (together 35949 tokens). Further data sets are collected every term. Because students often copy whole sequences from the original texts, the originals are provided in the same format for reference. The learner texts were written by hand and later digitized. Up to now, error tagging has only been done on the core corpus.
In addition we have several extension text sets. Most important among these is the longitudinal data collected at Georgetown University in Washington D.C. This data consists of so-called prototypical performance tasks collected from students at the end of four consecutive curricular levels and coded for clause types (Byrnes 2002). Other extension sets are composed of summaries of the same original texts as the core corpus texts, collected in foreign countries, and of essays written in advanced linguistics classes.
All texts in Falko are tokenized and tagged for part-of-speech using the TreeTagger (Schmid 1994). We evaluated the tagging error rate and found that, although it is slightly higher than the error rate for newspaper texts, it is still low enough for the data to be usable.
2. Error tagging and learner corpus architecture
Learner corpora, as well as other text corpora, differ with respect to how much linguistic information is added to the raw text. Whereas most available learner corpora provide a header for their texts that specifies information such as the L1 of the learner, the task, the learner history, etc., most learner corpora do not have any further linguistic annotation (Granger, to appear).
In this article we are concerned with error-annotated learner corpora. Most error-tagged learner corpora that are currently available use some kind of flat file format. We want to illustrate why this format is problematic for learner data and motivate a multi-layer standoff model as a more appropriate annotation model for learner data. Simply stated, in multi-layer standoff annotation (originally developed for speech corpora, Carletta et al. 2003) the original text is coded in a reference line and each annotation is coded in an independent level with pointers to the reference line. Before we explain the format in Section 3, we discuss three properties of learner language and error annotation that are problematic for flat annotation models.
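The idea of a reference line with independent annotation levels can be made concrete with a small Python sketch. It is a simplified model under stated assumptions, not the Falko format itself: the classes `Span` and `StandoffDocument`, the level names `pos` and `error`, and the use of token indices as pointers are all chosen here for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    start: int   # index of the first token covered (inclusive)
    end: int     # index one past the last token covered (exclusive)
    value: str

@dataclass
class StandoffDocument:
    tokens: list                                  # the reference line
    levels: dict = field(default_factory=dict)    # level name -> list of Span

    def annotate(self, level, start, end, value):
        """Add an annotation to a level without touching the reference line."""
        if not 0 <= start < end <= len(self.tokens):
            raise ValueError("span lies outside the reference line")
        self.levels.setdefault(level, []).append(Span(start, end, value))

    def covered(self, span):
        """Return the reference tokens a span points to."""
        return self.tokens[span.start:span.end]

doc = StandoffDocument("die Erklärung für diese Phänomen ist einfach".split())
doc.annotate("pos", 0, 1, "ART")        # part-of-speech level
doc.annotate("error", 3, 4, "Gender")   # error level, independent of "pos"
```

Each level only stores pointers into the token list, so new levels can be added later and the original text is never modified.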
2.1. Target hypothesis
All error annotation implies an interpretation on the part of an annotator. This fact has often been discussed and is, in fact, one of the main arguments against error annotation. Consider the following learner utterance with an error tag from Weinberger (2002, 25).
(1) die Erklärung für <MoArInGn> diese Phänomen ist einfach
    the explanation for these phenomenon is simply
    (Mo = morphology, Ar = article, In = inflection, Gn = gender)
In this utterance diese ‘this’ and Phänomen ‘phenomenon’ do not agree. Weinberger interprets this as a gender error, most likely on the basis of the surrounding context, which is not given in the description (it should be dieses Phänomen; Phänomen is neuter and the determiner should agree in gender). Without further information, the error could, however, also be seen as a number error (diese Phänomene, plural). In this case, the error would be marked on the noun and classified differently.
In error tagging it is impossible not to interpret, which leads to two problems in flat annotation models. First, the ‘target hypothesis’ or “reconstruction of those utterances in the target language” (Ellis 1994: 54) is usually implicit in a sentence, as in (1). Second, encoding several different hypotheses is not possible. This is due to the fact that the error tags are coded in a flat file together with the original data.
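To see why a flat format leaves no room for a second hypothesis, it can help to read an inline tag such as the one in (1) programmatically. The following Python sketch makes several assumptions: that a tag is a sequence of two-letter codes in angle brackets, that it attaches to the token following it, and that only the four codes listed under (1) exist. The function name `parse_flat` is hypothetical and the sketch is not Weinberger's actual tool.

```python
import re

# Only the four codes explained under example (1); other codes raise KeyError.
CODES = {"Mo": "morphology", "Ar": "article", "In": "inflection", "Gn": "gender"}

TAG = re.compile(r"<((?:[A-Z][a-z])+)>")

def parse_flat(line):
    """Split a flat-annotated line into tokens and (token index, labels) pairs.

    Assumption: a tag is attached to the token that follows it.
    """
    tokens, tags = [], []
    for item in line.split():
        match = TAG.fullmatch(item)
        if match:
            codes = re.findall(r"[A-Z][a-z]", match.group(1))
            tags.append((len(tokens), [CODES[code] for code in codes]))
        else:
            tokens.append(item)
    return tokens, tags

tokens, tags = parse_flat("die Erklärung für <MoArInGn> diese Phänomen ist einfach")
```

The result holds exactly one classification for one position in the text; the alternative reading as a number error on the noun has no place in this representation.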
In Falko we have therefore chosen to give an explicit target hypothesis TARGET1 for each deviant section (sequence of tokens) of the text. The errors are then coded with regard to this target hypothesis (see Section 4). Different target hypotheses TARGET2, TARGET3, etc. can be stated and errors can be coded with regard to each target. The above example would appear as follows:
utterance   Die   Erklärung   für   diese    Phänomen    ...
TARGET1                             dieses
ErrorTag                            Gender
TARGET2                                      Phänomene
ErrorTag                                     Number
Table 1: Illustration of several target hypotheses for one learner error
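The two hypotheses in Table 1 can be represented side by side in a short Python sketch. The dictionary layout and the function names `apply_hypothesis` and `error_tags` are assumptions made for illustration; they show the principle of keeping each target hypothesis on its own level, not the storage format used in Falko.

```python
# Reference line: the learner utterance, left unchanged by all hypotheses.
utterance = ["Die", "Erklärung", "für", "diese", "Phänomen"]

# Each hypothesis maps token indices to (corrected form, error tag).
hypotheses = {
    "TARGET1": {3: ("dieses", "Gender")},
    "TARGET2": {4: ("Phänomene", "Number")},
}

def apply_hypothesis(tokens, corrections):
    """Return the target-language reconstruction for one hypothesis."""
    return [corrections[i][0] if i in corrections else token
            for i, token in enumerate(tokens)]

def error_tags(corrections):
    """Return (token index, error tag) pairs in text order."""
    return [(i, tag) for i, (_, tag) in sorted(corrections.items())]

target1 = apply_hypothesis(utterance, hypotheses["TARGET1"])
target2 = apply_hypothesis(utterance, hypotheses["TARGET2"])
```

Adding a TARGET3 would mean adding one more dictionary entry; neither the utterance nor the existing hypotheses would have to change.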
2.2. Error exponent
Often an error is not confined to one word. Consider the following example from Falko (orthographic errors are interpreted in the gloss):
(2) Eine nicht informations übermittelnde Kommunikation mit nicht ernsthaften
    A not information transmitting communication with not serious
    Menschen kann nur dann stadt finden, wenn sie entweder sich über
    people can only then place take, if they either REFL about
    das Thema der Diskusion nich geeignet haben, oder sie wollen kein
    the topic the-POSS discussion not agreed have, or they want no
    Gewinn erziehlen (sondern reden nur so dahin).
    profit realize (but talk only so there)
    [Translation (with interpretation): A non-information-transmitting communication with unserious people can only take place if either they have not agreed on the topic of the discussion, or they do not want to make a profit (but only chat). (Falko 1.1.L2)]
Most orthographic errors can be tied to a single token (the correction is given after the slash): Diskusion/Diskussion, nich/nicht, erziehlen/erzielen. Other orthographic errors should ideally be tied to two tokens: informations übermittelnde/informationsübermittelnde, stadt finden/stattfinden. In the second case, we have two orthographic errors: stadt/statt and the superfluous space (spatium). (The whole error could also be classified as a word formation error, see below.) Other errors, like word order errors, can be marked on a sequence of several tokens. An example is [...] weil sie entweder sich.
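Error exponents of different lengths, including overlapping ones, can be modeled as token spans, as in the following Python sketch. The names `ErrorSpan`, `exponent`, and `errors_covering` are hypothetical, the token list is an excerpt of (2) as it appears there, and the extent given for the word order error is an assumption made for illustration.

```python
from typing import NamedTuple

class ErrorSpan(NamedTuple):
    start: int      # first token index (inclusive)
    end: int        # last token index + 1 (exclusive)
    category: str

# Excerpt from example (2), tokenized on whitespace and punctuation.
tokens = ["kann", "nur", "dann", "stadt", "finden", ",",
          "wenn", "sie", "entweder", "sich"]

errors = [
    ErrorSpan(3, 4, "orthography"),   # stadt -> statt (one token)
    ErrorSpan(3, 5, "orthography"),   # stadt finden -> stattfinden (two tokens)
    ErrorSpan(7, 10, "word order"),   # assumed extent of the word order error
]

def exponent(span):
    """Return the tokens that make up the exponent of an error."""
    return tokens[span.start:span.end]

def errors_covering(index):
    """Return every error whose exponent includes the token at `index`."""
    return [error for error in errors if error.start <= index < error.end]
```

Here the token stadt is covered by two errors at once, which a model with a single tag slot per token cannot express.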