Software Fault Tolerance
Abstract: This paper presents and discusses the rationale behind a method for structuring complex computing systems by the use of what we term “recovery blocks,” “conversations,” and “fault-tolerant interfaces.” The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Index Terms: Acceptance test, alternate block, checkpoint, conversation, error detection, error recovery, recovery block, recursive cache.
The concept of “fault-tolerant computing” has existed for a long time. The first book on the subject [10] was published no less than ten years ago, but the notion of fault tolerance has remained almost exclusively the preserve of the hardware designer. Hardware structures have been developed which can “tolerate” faults, i.e., continue to provide the required facilities despite occasional failures, either transient or permanent, of internal components and modules. However, hardware component failures are only one source of unreliability in computing systems, decreasing in significance as component reliability improves, while software faults have become increasingly prevalent with the steadily increasing size and complexity of software systems.
In general, fault-tolerant hardware designs are expected to be correct, i.e., the tolerance applies to component failures rather than design inadequacies, although the dividing line between the two may on occasion be difficult to define. But all software faults result from design errors. The relative frequency of such errors reflects the much greater logical complexity of the typical software design compared to that of a typical hardware design. The difference in complexity arises from the fact that the “machines” that hardware designers produce have a relatively small number of distinctive internal states, whereas the designer of even a small software system has, by comparison, an enormous number of different states to consider; thus one can usually afford to treat hardware designs as being “correct,” but often cannot do the same with software even after extensive validation efforts. (The difference in scale is evidenced by the fact that a software simulator of a computer, written at the level of detail required by the hardware designers to analyze and validate their logical design, is usually one or more orders of magnitude smaller than the operating system supplied with that computer.)
If all design inadequacies could be avoided or removed this would suffice to achieve software reliability. (We here use the term “design” to include “implementation,” which is actually merely low-level design, concerning itself with detailed design decisions whose correctness nevertheless can be as vital to the correct functioning of the software as that of any high-level design decision.) Indeed many writers equate the terms “software reliability” and “program correctness.” However, until reliable correctness proofs (relative to some correct and adequately detailed specification), which cover even implementation details, can be given for systems of a realistic size, the only alternative means of increasing software reliability is to incorporate provisions for software fault tolerance.
In fact there exist sophisticated computing systems, designed for environments requiring near-continuous service, which contain ad hoc checks and checkpointing facilities that provide a measure of tolerance against some software errors as well as hardware failures [11]. They incidentally demonstrate the fact that fault tolerance does not necessarily require diagnosing the cause of the fault, or even deciding whether it arises from the hardware or the software. However there has been comparatively little specific research into techniques for achieving software fault tolerance, and the constraints they impose on computing system design.
It was considerations such as these that led to the establishment at the University of Newcastle upon Tyne of a project on the design of highly reliable computing systems, under the sponsorship of the Science Research Council of the United Kingdom. The aims of the project were and are “to develop, and give a realistic demonstration of the utility of, computer architecture and programming techniques which will enable a system to have a very high probability of continuing to give a trustworthy service in the presence of hardware faults and/or software errors, and during their repair. A major aim will be to develop techniques which are of general utility, rather than limited to specialised environments, and to explore possible tradeoffs between reliability and performance.” A modest number of reports and papers have emanated from the project to date, including a general overview [12], papers concerned with addressing and protection [6], [7], and a preliminary account of our work on error detection and recovery [5]. The present paper endeavors to provide a rather more extensive discussion of our work on system error recovery techniques, and concentrates on techniques for system structuring which facilitate software fault tolerance. A companion paper [1] presents a proof-guided methodology for designing the error detection routines that our method requires.
All fault tolerance must be based on the provision of useful redundancy, both for error detection and error recovery. In software the redundancy required is not simple replication of programs but redundancy of design.
The scheme for facilitating software fault tolerance that we have developed can be regarded as analogous to what hardware designers term “stand-by sparing.” As the system operates, checks are made on the acceptability of the results generated by each component. Should one of these checks fail, a spare component is switched in to take the place of the erroneous component. The spare component is, of course, not merely a copy of the main component. Rather it is of independent design, so that there can be hope that it can cope with the circumstances that caused the main component to fail. (These circumstances will comprise the data the component is provided with and, in the case of errors due to faulty process synchronization, the timing and form of its interactions with other processes.)
In contrast to the normal hardware stand-by sparing scheme, the spare software component is invoked to cope with merely the particular set of circumstances that resulted in the failure of the main component. We assume the failure of this component to be due to residual design inadequacies, and hence that such failures occur only in exceptional circumstances. The number of different sets of circumstances that can arise even with a software component of comparatively modest size is immense. Therefore the system can revert to the use of the main component for subsequent operations; in hardware this would not normally be done until the main component had been repaired. The variety of undetected errors which could have been made in the design of a nontrivial software component is essentially infinite. Due to the complexity of the component, the relationship between any such error and its effect at run time may be very obscure. For these reasons we believe that diagnosis of the original cause of software errors should be left to humans to do, and should be done in comparative leisure. Therefore our scheme for software fault tolerance in no way depends on automated diagnosis of the cause of the error; this would surely result only in greatly increasing the complexity and therefore the error proneness of the system.
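The software form of stand-by sparing described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the names `with_spare`, `primary_isqrt`, `spare_isqrt`, and `is_isqrt` are hypothetical, and the fault in the primary is planted deliberately. The sketch shows the spare being switched in only for the call on which the acceptability check fails, with the primary used again on the next call and no attempt made to diagnose the cause.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
R = TypeVar("R")

log: List[str] = []  # records which component ran, for illustration only


def with_spare(primary: Callable[[A], R],
               spare: Callable[[A], R],
               acceptable: Callable[[A, R], bool]) -> Callable[[A], R]:
    """Wrap `primary` so that `spare` handles only the calls on which it fails."""
    def component(arg: A) -> R:
        try:
            result = primary(arg)
            if acceptable(arg, result):
                return result
        except Exception:
            pass  # an exception counts as a failed check; no diagnosis is attempted
        # Switch in the spare for this call only; later calls start with the primary.
        result = spare(arg)
        if not acceptable(arg, result):
            raise RuntimeError("spare component also failed the acceptability check")
        return result
    return component


def primary_isqrt(n: int) -> int:
    """Main component, with a deliberately planted residual design fault."""
    log.append("primary")
    if n == 16:  # hypothetical fault: one input is mishandled
        return 5
    return int(n ** 0.5)


def spare_isqrt(n: int) -> int:
    """Independently designed spare: a simple linear search."""
    log.append("spare")
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def is_isqrt(n: int, r: int) -> bool:
    """Acceptability check: r must be the integer square root of n."""
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)


isqrt = with_spare(primary_isqrt, spare_isqrt, is_isqrt)
results = [isqrt(n) for n in (9, 16, 25)]
print(results)  # [3, 4, 5]
print(log)      # ['primary', 'primary', 'spare', 'primary']
```

The log shows the spare running once, for the input 16, after which the primary handles the following call as usual.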
The recovery block scheme for achieving software fault tolerance by means of stand-by sparing has two important characteristics.
1) It incorporates a general solution to the problem of switching to the use of the spare component, i.e., of repairing any damage done by the erroneous main component, and of transferring control to the appropriate spare component.
2) It provides a method of explicitly structuring the software system which has the effect of ensuring that the extra software involved in error detection and in the spare components does not add to the complexity of the system, and so reduce rather than increase overall system reliability.
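The first characteristic, repairing damage and then transferring control to a spare, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the mechanism the paper defines: the function `recovery_block`, the exception `RecoveryBlockFailure`, and the example components are hypothetical names, and the recovery point is a naive deep copy of a dictionary chosen for brevity.

```python
import copy
from typing import Callable, Sequence


class RecoveryBlockFailure(Exception):
    """Raised when every alternate has been tried and none passed the test."""


def recovery_block(state: dict,
                   acceptance_test: Callable[[dict], bool],
                   alternates: Sequence[Callable[[dict], None]]) -> None:
    checkpoint = copy.deepcopy(state)  # recovery point taken on entry (assumed: full copy)
    for alternate in alternates:
        try:
            alternate(state)           # may update `state` in place
            if acceptance_test(state):
                return                 # results accepted; checkpoint is discarded
        except Exception:
            pass                       # treated the same as a failed acceptance test
        # Repair any damage done by the failed alternate before going on.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailure("all alternates failed the acceptance test")


def buggy_sort(s: dict) -> None:
    """Primary alternate with a planted fault: it loses an element."""
    s["items"].sort()
    s["items"].pop()


def simple_sort(s: dict) -> None:
    """Spare alternate of independent design."""
    s["items"] = sorted(s["items"])


def sorted_and_complete(s: dict) -> bool:
    """Acceptance test: items are in order and their sum is unchanged."""
    items = s["items"]
    in_order = all(a <= b for a, b in zip(items, items[1:]))
    return in_order and sum(items) == s["total"]


state = {"items": [3, 1, 2], "total": 6}
recovery_block(state, sorted_and_complete, [buggy_sort, simple_sort])
print(state["items"])  # [1, 2, 3]
```

Because the state is restored before each further attempt, the spare starts from the same values the primary was given, and a caller that sees `RecoveryBlockFailure` also finds the state as it was on entry.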
Although the basic recovery block scheme has already been described elsewhere [5], it is convenient to include a brief account of it here. We will then describe several extensions to the scheme directed at more complicated situations than the basic scheme was intended for. Thus we start by considering the problems of fault tolerance, i.e., of error detection and recovery, within a single sequential process in which assignments to stored variables provide the only means of making recognizable progress. Consideration of the problems of communication with other processes, either within the computing system (e.g., by a system of passing messages, or the use of shared storage) or beyond the computing system (e.g., by explicit input-output statements), is deferred until a later section.
The progress of a program is by its execution of sequences of the basic operations of the computer. Clearly, error checking for each basic operation is out of the question. Apart from questions of expense, absence of an awareness of the wider scene would make it difficult to formulate the checks. We must aim at achieving a tolerable quantity of checking and exploit our knowledge of the functional structure of the system to distribute these checks to best advantage. It is standard practice to structure the text of a program of any significant complexity into a set of blocks (by which term we include module, procedure, subroutine, paragraph, etc.) in order to simplify the task of understanding and documenting the program. Such a structure allows one to provi