Foreign Literature and Translation: A Survey of Adaptive Dynamic Programming.docx
- Document ID: 4029603
- Upload date: 2022-11-27
- Format: DOCX
- Pages: 15
- Size: 288.49 KB
Foreign Literature and Translation: A Survey of Adaptive Dynamic Programming
Foreign-language text:
Adaptive Dynamic Programming: An Introduction
Abstract:
In this article, we introduce some recent research trends within the field of adaptive/approximate dynamic programming (ADP), including the variations on the structure of ADP schemes, the development of ADP algorithms, and applications of ADP schemes. For ADP algorithms, the point of focus is that iterative algorithms of ADP can be sorted into two classes: one class is the iterative algorithm with an initial stable policy; the other is the one without the requirement of an initial stable policy. It is generally believed that the latter requires less computation, at the cost of losing the guarantee of system stability during the iteration process. In addition, many recent papers have provided convergence analysis associated with the algorithms developed. Furthermore, we point out some topics for future studies.
Introduction
As is well known, there are many methods for designing stable control for nonlinear systems. However, stability is only a bare minimum requirement in a system design. Ensuring optimality guarantees the stability of the nonlinear system. Dynamic programming is a very useful tool in solving optimization and optimal control problems by employing the principle of optimality. In [16], the principle of optimality is expressed as: “An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” Dynamic programming can be studied along several spectrums. One can consider discrete-time systems or continuous-time systems, linear systems or nonlinear systems, time-invariant systems or time-varying systems, deterministic systems or stochastic systems, etc.
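We first take a look at nonlinear discrete-time (time-varying) dynamical (deterministic) systems. Time-varying nonlinear systems cover most of the application areas, and discrete time is the basic consideration for digital computation. Suppose that one is given a discrete-time nonlinear (time-varying) dynamical system
$$x(k+1) = F[x(k), u(k), k], \quad k = 0, 1, \ldots \tag{1}$$
where $x \in \mathbb{R}^n$ represents the state vector of the system, $u \in \mathbb{R}^m$ denotes the control action, and $F$ is the system function. Suppose that one associates with this system the performance index (or cost)
$$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i}\, U[x(k), u(k), k] \tag{2}$$
where $U$ is called the utility function and $\gamma$ is the discount factor with $0 < \gamma \le 1$. Note that the function $J$ is dependent on the initial time $i$ and the initial state $x(i)$, and it is referred to as the cost-to-go of state $x(i)$. The objective of the dynamic programming problem is to choose a control sequence $u(k)$, $k = i, i+1, \ldots$, so that the function $J$ (i.e., the cost) in (2) is minimized. According to Bellman, the optimal cost from time $k$ is equal to
$$J^*(x(k)) = \min_{u(k)} \left\{ U[x(k), u(k), k] + \gamma J^*(x(k+1)) \right\}. \tag{3}$$
The optimal control $u^*(k)$ at time $k$ is the $u(k)$ which achieves this minimum, i.e.,
$$u^*(k) = \arg\min_{u(k)} \left\{ U[x(k), u(k), k] + \gamma J^*(x(k+1)) \right\}. \tag{4}$$
Equation (3) is the principle of optimality for discrete-time systems. Its importance lies in the fact that it allows one to optimize over only one control vector at a time by working backward in time.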
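The backward-in-time use of the optimality equation can be illustrated with a minimal sketch. It assumes a finite horizon, a zero cost after the last stage, and small finite state and control sets, none of which come from the article; the names `backward_dp`, `toy_F`, and `toy_U` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def backward_dp(
    states: Sequence[int],
    controls: Sequence[int],
    F: Callable[[int, int, int], int],
    U: Callable[[int, int, int], float],
    gamma: float,
    horizon: int,
) -> Tuple[Dict[int, float], List[Dict[int, int]]]:
    """Finite-horizon backward recursion of the optimality equation (3)."""
    J = {x: 0.0 for x in states}  # assumption: zero cost after the last stage
    policy: List[Dict[int, int]] = []
    for k in reversed(range(horizon)):
        J_k: Dict[int, float] = {}
        pi_k: Dict[int, int] = {}
        for x in states:
            # Optimize over a single control at stage k, reusing J*(k+1).
            best_u = min(
                controls, key=lambda u: U(x, u, k) + gamma * J[F(x, u, k)]
            )
            pi_k[x] = best_u
            J_k[x] = U(x, best_u, k) + gamma * J[F(x, best_u, k)]
        J = J_k
        policy.insert(0, pi_k)  # keep the stage policies in forward order
    return J, policy


# Hypothetical toy problem: an integer state on 0..4 that should be driven to 0.
STATES = range(5)
CONTROLS = (-1, 0, 1)


def toy_F(x: int, u: int, k: int) -> int:
    return min(max(x + u, 0), 4)  # saturate at the grid edges


def toy_U(x: int, u: int, k: int) -> float:
    return float(x * x + abs(u))  # penalize distance from 0 and control effort


J0, policy = backward_dp(STATES, CONTROLS, toy_F, toy_U, gamma=1.0, horizon=5)
print("cost-to-go from x=4:", J0[4], "first control:", policy[0][4])
```

The table `J` has one entry per state, so its size grows exponentially with the state dimension once a continuous state is gridded. This exhaustive sweep is therefore practical only for very small problems.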
In the nonlinear continuous-time case, the system can be described by
$$\dot{x}(t) = F[x(t), u(t), t], \quad t \ge t_0. \tag{5}$$
The cost in this case is defined as
$$J(x(t)) = \int_t^{\infty} U[x(\tau), u(\tau), \tau]\, d\tau. \tag{6}$$
For continuous-time systems, Bellman’s principle of optimality can be applied, too. The optimal cost $J^*(x_0) = \min J(x_0, u(t))$ will satisfy the Hamilton-Jacobi-Bellman equation
$$-\frac{\partial J^*(x(t))}{\partial t} = \min_{u(t)} \left\{ U[x(t), u(t), t] + \left( \frac{\partial J^*(x(t))}{\partial x(t)} \right)^{T} F[x(t), u(t), t] \right\}. \tag{7}$$
Equations (3) and (7) are called the optimality equations of dynamic programming, which are the basis for implementation of dynamic programming. In the above, if the function $F$ in (1) or (5) and the cost function $J$ in (2) or (6) are known, the solution of $u(k)$ becomes a simple optimization problem. If the system is modeled by linear dynamics and the cost function to be minimized is quadratic in the state and control, then the optimal control is a linear feedback of the states, where the gains are obtained by solving a standard Riccati equation [47]. On the other hand, if the system is modeled by nonlinear dynamics or the cost function is nonquadratic, the optimal state feedback control will depend upon solutions to the Hamilton-Jacobi-Bellman (HJB) equation [48], which is generally a nonlinear partial differential equation or difference equation. However, it is often computationally untenable to run true dynamic programming due to the backward numerical process required for its solutions, i.e., as a result of the well-known “curse of dimensionality” [16], [28]. In [69], three curses are displayed in resource management and control problems to show that the cost function $J$, which is the theoretical solution of the Hamilton-Jacobi-Bellman equation, is very difficult to obtain, except for systems satisfying some very favorable conditions. Over the years, progress has been made to circumvent the “curse of dimensionality” by building a system, called “critic”, to approximate the cost function in dynamic programming (cf. [10], [60], [61], [63], [70], [78], [92], [94], [95]). The idea is to approximate dynamic programming solutions by using a function approximation structure, such as neural networks, to approximate the cost function.
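The linear-quadratic special case mentioned above can be made concrete with a minimal sketch. It assumes a scalar discrete-time system x(k+1) = a x(k) + b u(k) with undiscounted stage cost q x² + r u²; the function name `scalar_riccati` and all numerical values are hypothetical and not taken from the article.

```python
def scalar_riccati(a: float, b: float, q: float, r: float, iters: int = 200):
    """Iterate the scalar discrete-time Riccati recursion to a fixed point.

    Returns (P, K), where the cost-to-go is J*(x) = P * x**2 and the
    optimal control is the linear state feedback u = -K * x.
    """
    P = 0.0  # start from zero cost-to-go
    for _ in range(iters):
        P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)
    K = a * b * P / (r + b * b * P)
    return P, K


# Hypothetical open-loop unstable plant (|a| > 1).
P, K = scalar_riccati(a=1.2, b=1.0, q=1.0, r=1.0)
print("P =", P, "K =", K, "closed-loop pole =", 1.2 - 1.0 * K)
```

In this example the closed-loop pole a - bK lies inside the unit circle even though the open-loop plant is unstable, which is one instance of optimality bringing stability with it. For nonlinear dynamics or nonquadratic costs no such closed-form recursion is available.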
The Basic Structures of ADP
In recent years, adaptive/approximate dynamic programming (ADP) has gained much attention from many researchers in order to obtain approximate solutions of the HJB equation, cf. [2], [3], [5], [8], [11]–[13], [21], [22], [25], [30], [31], [34], [35], [40], [46], [49], [52], [54], [55], [63], [70], [76], [80], [83], [95], [96], [99], [100]. In 1977, Werbos [91] introduced an approach for ADP that was later called adaptive critic designs (ACDs). ACDs were proposed in [91], [94], [97] as a way for solving dynamic programming problems forward-in-time. In the literature, there are several synonyms used for “Adaptive Critic Designs” [10], [24], [39], [43], [54], [70], [71], [87], including “Approximate Dynamic Programming” [69], [82], [95], “Asymptotic Dynamic Programming” [75], “Adaptive Dynamic Programming” [63], [64], “Heuristic Dynamic Programming” [46], [93], “Neuro-Dynamic Programming” [17], “Neural Dynamic Programming” [82], [101], and “Reinforcement Learning” [84].
Bertsekas and Tsitsiklis gave an overview of neuro-dynamic programming in their book [17]. They provided the background, gave a detailed introduction to dynamic programming, discussed the neural network architectures and methods for training them, and developed general convergence theorems for stochastic approximation methods as the foundation for analysis of various neuro-dynamic programming algorithms. They provided the core neuro-dynamic programming methodology, including many mathematical results and methodological insights. They suggested many useful methodologies for applications of neuro-dynamic programming, like Monte Carlo simulation, on-line and off-line temporal difference methods, the Q-learning algorithm, optimistic policy iteration methods, Bellman error methods, approximate linear programming, approximate dynamic programming with cost-to-go function, etc. A particularly impressive success that greatly motivated subsequent research was the development of a backgammon-playing program by Tesauro [85]. Here a neural network was trained to approximate the optimal cost-to-go function of the game of backgammon by using simulation, that is, by letting the program play against itself. Unlike chess programs, this program did not use lookahead of many steps, so its success can be attributed primarily to the use of a properly trained approximation of the optimal cost-to-go function.
To implement the ADP algorithm, Werbos [95] proposed a means to get around this numerical complexity by using “approximate dynamic programming” formulations. His methods approximate the original problem with a discrete formulation. The solution to the ADP formulation is obtained through a neural-network-based adaptive critic approach. The main idea of ADP is shown in Fig. 1.
He proposed two basic versions, which are heuristic dynamic programming (HDP) and dual heuristic programming (DHP).
HDP is the most basic and widely applied structure of ADP [13], [38], [72], [79], [90], [93], [104], [106]. The structure of HDP is shown in Fig. 2. HDP is a method for estimating the cost function. Estimating the cost function for a given policy only requires samples from the instantaneous utility function U, while models of the environment and the instantaneous reward are needed to find the cost function corresponding to the optimal policy.
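In HDP, the output of the critic network is $\hat{J}$, which is the estimate of $J$ in equation (2). This is done by minimizing the following error measure over time
$$\|E_h\| = \sum_k E_h(k) = \frac{1}{2} \sum_k \left[ \hat{J}(k) - U(k) - \gamma \hat{J}(k+1) \right]^2 \tag{8}$$
where $\hat{J}(k) = \hat{J}[x(k), u(k), k, W_C]$, $U(k) = U[x(k), u(k), k]$, and $W_C$ represents the parameters of the critic network. When $E_h(k) = 0$ for all $k$, (8) implies that
$$\hat{J}(k) = U(k) + \gamma \hat{J}(k+1). \tag{9}$$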
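A critic update of this kind can be sketched in a few lines. The sketch is not the article's neural-network implementation: it assumes a scalar linear plant, a fixed linear policy u = -K x, a quadratic utility, and a one-parameter critic Ĵ(x) = w x². The function name `train_hdp_critic` and all numbers are hypothetical.

```python
def train_hdp_critic(
    a: float, b: float, K: float, q: float, r: float, gamma: float,
    alpha: float = 0.1, sweeps: int = 2000,
) -> float:
    """Fit the critic J_hat(x) = w * x**2 for the fixed policy u = -K * x."""
    samples = (-1.0, -0.5, 0.5, 1.0)  # assumed training states
    w = 0.0
    for _ in range(sweeps):
        for x in samples:
            u = -K * x
            x_next = a * x + b * u            # plant model x(k+1) = a x + b u
            utility = q * x * x + r * u * u   # U(k)
            # Residual inside the square of the HDP error measure:
            # J_hat(k) - U(k) - gamma * J_hat(k+1).
            e = w * x * x - utility - gamma * w * x_next * x_next
            # Gradient step through J_hat(k) only; J_hat(k+1) is held fixed
            # as the training target.
            w -= alpha * e * (x * x)
    return w


a, b, K, q, r, gamma = 0.9, 0.5, 0.8, 1.0, 0.5, 0.9
w = train_hdp_critic(a, b, K, q, r, gamma)
closed_loop = a - b * K
exact = (q + r * K * K) / (1.0 - gamma * closed_loop ** 2)
print("learned w =", w, "exact cost coefficient =", exact)
```

For this toy problem the discounted cost of the fixed policy is exactly quadratic, so the learned weight can be compared with the closed-form coefficient; with a neural-network critic no such check is generally available.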
Dual heuristic programming is a method for estimating the gradient of the cost function, rather than J itself. To do this, a function is needed to describe the gradient of the instantaneous cost function with respect to the state of the system. In the DHP structure, the action network remains the same as the one for HDP, but the second network, which is called the critic network, has the costate as its output and the state variables as its inputs.
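The critic network’s training is more complicated than that in HDP, since we need to take into account all relevant pathways of backpropagation.
This is done by minimizing the following error measure over time
$$\|E_D\| = \sum_k E_D(k) = \frac{1}{2} \sum_k \left[ \frac{\partial \hat{J}(k)}{\partial x(k)} - \frac{\partial U(k)}{\partial x(k)} - \gamma \frac{\partial \hat{J}(k+1)}{\partial x(k)} \right]^2 \tag{10}$$
where $\partial \hat{J}(k) / \partial x(k) = \partial \hat{J}[x(k), u(k), k, W_C] / \partial x(k)$ and $W_C$ represents the parameters of the critic network. When $E_D(k) = 0$ for all $k$, (10) implies that
$$\frac{\partial \hat{J}(k)}{\partial x(k)} = \frac{\partial U(k)}{\partial x(k)} + \gamma \frac{\partial \hat{J}(k+1)}{\partial x(k)}. \tag{11}$$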
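The gradient-based critic can be sketched in the same spirit. The sketch assumes a scalar linear plant under a fixed linear policy u = -K x, a one-parameter costate critic λ̂(x) = w x, and derivatives taken as total derivatives along the closed loop (so that the pathway through the policy is included). These are illustrative assumptions, and `train_dhp_critic` is a hypothetical name.

```python
def train_dhp_critic(
    a: float, b: float, K: float, q: float, r: float, gamma: float,
    alpha: float = 0.1, sweeps: int = 2000,
) -> float:
    """Fit the costate critic lambda_hat(x) = w * x, an estimate of dJ/dx."""
    samples = (-1.0, -0.5, 0.5, 1.0)  # assumed training states
    c = a - b * K                      # dx(k+1)/dx(k) along the closed loop
    w = 0.0
    for _ in range(sweeps):
        for x in samples:
            x_next = c * x
            # Total derivative of U = q x^2 + r u^2 with u = -K x.
            dU_dx = 2.0 * (q + r * K * K) * x
            # Chain rule: dJ_hat(k+1)/dx(k) = (dx(k+1)/dx(k)) * lambda_hat(x(k+1)).
            dJnext_dx = c * (w * x_next)
            # Residual inside the square of the DHP error measure.
            e = w * x - dU_dx - gamma * dJnext_dx
            # Gradient step through lambda_hat(k) only; the k+1 term is the target.
            w -= alpha * e * x
    return w


a, b, K, q, r, gamma = 0.9, 0.5, 0.8, 1.0, 0.5, 0.9
w = train_dhp_critic(a, b, K, q, r, gamma)
exact = 2.0 * (q + r * K * K) / (1.0 - gamma * (a - b * K) ** 2)
print("learned costate slope =", w, "exact dJ/dx slope =", exact)
```

Because the policy's cost here is J(x) = P x², the exact costate is 2 P x, so the learned slope should be twice the cost coefficient. The chain-rule term is where the "relevant pathways of backpropagation" enter: with a nonlinear plant and an action network, it would contain the Jacobians of both.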
2. Theoretical Developments
In [82], Si et al. summarize the cross-disciplinary theoretical developments of ADP, give an overview of DP and ADP, and discuss their relations to artificial intelligence, approximation theory, control theory, operations research, and statistics.
In [69], Powell shows how ADP, when coupled with mathematical programming, can solve (approximately) deterministic or stochastic optimization problems
- Keywords:
- foreign literature, translation, adaptive dynamic programming, survey