马尔可夫决策过程(MDP).pdf — Markov Decision Process (MDP)
- Document ID: 3217025
- Upload date: 2022-11-20
- Format: PDF
- Pages: 60
- Size: 9.50 MB
Markov decision process (MDP)
Email:
Outline
- Markov Process
- Markov Reward Process (MRP)
- Markov Decision Process (MDP)
- Solving an MDP: the Bellman equation
- MDP example
- MDP application
Markov Process
The Markov property:
P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t)
The next state X_{t+1} depends only on the current state X_t; the earlier history X_{t-1}, X_{t-2}, ... does not matter. A Markov process is a stochastic process (X_t, t ∈ I) satisfying this property.
[Slide figure: a transition diagram over states 0-5, annotated with the transition probability P(4 | 3).]
Random walk example: for a process whose next value depends on the last two values,
P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t, X_{t-1}),
X_t alone is not a Markov state. Augmenting the state to S_t = (X_t, X_{t-1}), with S_t ∈ {(s, s), (s, r), (r, s), (r, r)}, makes the augmented process Markov.
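The state-augmentation trick can be sketched in code. This is a minimal sketch: the two values s and r and the second-order transition probabilities below are illustrative assumptions, not numbers from the slides.

```python
import random

# Second-order process: P(X_{t+1} | X_t, X_{t-1}) — the next value depends
# on the last TWO values, so X_t alone is not a Markov state.
# The probabilities are illustrative assumptions.
P2 = {
    ("s", "s"): {"s": 0.8, "r": 0.2},   # key = (X_{t-1}, X_t)
    ("s", "r"): {"s": 0.4, "r": 0.6},
    ("r", "s"): {"s": 0.6, "r": 0.4},
    ("r", "r"): {"s": 0.1, "r": 0.9},
}

def augment(P2):
    """State augmentation: S_t = (X_{t-1}, X_t) makes the process Markov.
    Returns P(S_{t+1} | S_t) over the four augmented states."""
    P = {}
    for (prev, cur), dist in P2.items():
        # From (prev, cur) the chain can only move to a state (cur, nxt).
        P[(prev, cur)] = {(cur, nxt): p for nxt, p in dist.items()}
    return P

P = augment(P2)

# Every augmented state has a proper distribution over successors.
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

# Simulate the augmented (now first-order Markov) chain.
random.seed(0)
state = ("s", "s")
for _ in range(10):
    succ, probs = zip(*P[state].items())
    state = random.choices(succ, probs)[0]
print(state)
```

The augmented chain has four states, exactly the set {(s,s), (s,r), (r,s), (r,r)} from the slide.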
Markov Reward Process (MRP)
Markov reward process (MRP) = Markov process + reward/utility function.
[Slide figure: the chain over states 0-5 with per-state rewards such as u(S=3) and u(S=4), and transition probabilities 0.1/0.9, 0.2/0.8, 1.0, 1.0.]
An MRP is specified by the tuple (S, P, u, γ):
- S: state space
- P: state transition probabilities P(S, S')
- u: reward function
- γ: discount factor
MRP example:
[Slide figure: the chain over states 0-5 with rewards 20, 5, 0, 6, 2, 9 and transition probabilities 0.1, 0.9, 0.2, 0.8, 1.0, 1.0.]
The reward u(S) is immediate: it is received at state S itself. What we actually want is the long-term value H(S): the expected cumulative discounted reward when "starting from here", i.e., from state S.
Backward induction: start from the terminal states, whose value is just their immediate reward:
H(S=4) = u(S=4) = 2,  H(S=5) = u(S=5) = 9.
Then step "backward" to their predecessor:
H(S=3) = u(S=3) + γ [0.2 H(S=4) + 0.8 H(S=5)] = 6 + γ (0.2 · 2 + 0.8 · 9),
where γ ∈ [0, 1) is the discount factor.
Continuing backward yields H(S=2) and H(S=1) in the same way as H(S=3) = u(S=3) + γ [0.2 H(S=4) + 0.8 H(S=5)] = 6 + γ (0.2 · 2 + 0.8 · 9). In general,
H(S_t) = E[ u(S_t) + γ H(S_{t+1}) ],
which for an MRP expands into the recursion
H(S) = u(S) + γ Σ_{S' ∈ S} P(S, S') H(S').
Backward induction needs somewhere to start: an absorbing state, i.e., a state that transitions to itself with probability 1.0.
[Slide figure: the chain redrawn with self-loops of probability 1.0 on the absorbing states.]
MRP value iteration: when backward induction does not apply, iterate instead. Initialize H(S) ← 0 for all S ∈ S, then repeatedly apply
H(S) ← u(S) + γ Σ_{S' ∈ S} P(S, S') H(S')
until H(S) converges.
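The update rule can be implemented directly. A minimal numpy sketch on an assumed 3-state MRP (the numbers are illustrative, not the 6-state chain from the slides); the iterated values are checked against the closed-form fixed point H = (I − γP)⁻¹ u.

```python
import numpy as np

def mrp_value_iteration(P, u, gamma, tol=1e-10):
    """Iterate H <- u + gamma * P @ H until convergence."""
    H = np.zeros(len(u))
    while True:
        H_new = u + gamma * P @ H
        if np.max(np.abs(H_new - H)) < tol:
            return H_new
        H = H_new

# Illustrative 3-state MRP (assumed numbers).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing (self-loop 1.0)
u = np.array([1.0, 6.0, 0.0])
gamma = 0.9

H = mrp_value_iteration(P, u, gamma)
# The fixed point of H = u + gamma P H is H = (I - gamma P)^{-1} u.
H_closed = np.linalg.solve(np.eye(3) - gamma * P, u)
assert np.allclose(H, H_closed)
print(H)
```

Since the absorbing state yields reward 0, its value stays at 0, mirroring H(S=4) = u(S=4) and H(S=5) = u(S=5) in the backward-induction example.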
Markov Decision Process (MDP)
Markov decision process (MDP) = Markov process + actions + reward functions.
[Slide figure: from the current state 1, action A1 (reward 20) and action A2 (reward 5) lead to the possible future states 2 and 3 with probabilities 0.9 and 0.1.]
An MDP is specified by the tuple (S, A, P, u, γ):
- S: state space
- A: action set
- P: state transition probabilities P(S, A, S')
- u: reward function u(S, A)
- γ: discount factor
Common variants include the constrained MDP (CMDP) and the partially observable MDP (POMDP).
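As a data structure, the tuple is straightforward to represent. A minimal sketch; the field names and shape conventions (P as an (A, S, S) tensor, u as an (S, A) matrix) are our own assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    """The tuple (S, A, P, u, gamma); names/shapes are our own conventions."""
    n_states: int
    n_actions: int
    P: np.ndarray        # shape (A, S, S): P(S, A, S')
    u: np.ndarray        # shape (S, A): u(S, A)
    gamma: float

    def __post_init__(self):
        assert self.P.shape == (self.n_actions, self.n_states, self.n_states)
        assert self.u.shape == (self.n_states, self.n_actions)
        assert np.allclose(self.P.sum(axis=2), 1.0)  # each row is a distribution
        assert 0.0 <= self.gamma < 1.0

# A tiny instance (numbers are illustrative assumptions).
m = MDP(
    n_states=2, n_actions=2,
    P=np.array([[[0.9, 0.1], [0.0, 1.0]],
                [[0.5, 0.5], [1.0, 0.0]]]),
    u=np.array([[1.0, 0.0], [0.0, 2.0]]),
    gamma=0.9,
)
print(m.gamma)
```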
What the MDP adds is the decision maker: at each step it observes the current state and chooses an action/decision, which influences both the immediate reward and the distribution over next states. Modeling a problem as an MDP therefore comes down to identifying the states, the actions, the reward, and the transition probabilities P(S, A, S').
Solving an MDP
Solving an MDP means finding the "best" way to act in each state, i.e., an optimal policy.
A policy π is a mapping from states to actions, π : S → A: given the current state S ∈ S, the policy prescribes the action A = π(S) ∈ A.
The Bellman equation. For an MRP:
H(S) = u(S) + γ Σ_{S' ∈ S} P(S, S') H(S').
For an MDP, the chosen action enters the recursion:
H(S, A) = u(S, A) + γ Σ_{S' ∈ S} P(S, A, S') U(S')
U(S) = max_{A ∈ A} H(S, A)
π*(S) = argmax_A H(S, A)
Solving the Bellman equation: the value iteration algorithm. Backward induction only applies when absorbing states exist to anchor the recursion; value iteration works in general. Start from an all-zero value function and repeatedly apply the Bellman equation:

For each state S ∈ S: U_0(S) ← 0
Repeat until convergence:
  For each state S ∈ S:
    For each action A ∈ A:
      compute H_{n+1}(S, A) = u(S, A) + γ Σ_{S' ∈ S} P(S, A, S') U_n(S')
    compute and store U_{n+1}(S) = max_{A ∈ A} H_{n+1}(S, A)
    compute and store π_{n+1}(S) = argmax_A H_{n+1}(S, A)
Return π(S), U(S) for all S ∈ S

[Slide figure: the two-action example again — current state 1, actions A1 (reward 20) and A2 (reward 5), future states with probabilities 0.1/0.9.]
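The pseudocode above translates almost line for line into numpy. A sketch, exercised on a made-up 2-state, 2-action MDP (P, u, and γ = 0.9 are illustrative assumptions):

```python
import numpy as np

def value_iteration(P, u, gamma, tol=1e-10):
    """P: (A, S, S) transition tensor; u: (S, A) rewards.
    Returns the optimal values U and a greedy policy pi."""
    A, S, _ = P.shape
    U = np.zeros(S)                       # U_0(S) <- 0 for all S
    while True:
        # H_{n+1}(S, A) = u(S, A) + gamma * sum_S' P(S, A, S') U_n(S')
        H = u + gamma * np.einsum("ast,t->sa", P, U)
        U_new = H.max(axis=1)             # U_{n+1}(S) = max_A H_{n+1}(S, A)
        if np.max(np.abs(U_new - U)) < tol:
            return U_new, H.argmax(axis=1)
        U = U_new

# Illustrative 2-action, 2-state MDP (numbers are assumptions).
P = np.array([[[0.9, 0.1],     # action 0
               [0.0, 1.0]],
              [[0.1, 0.9],     # action 1
               [0.5, 0.5]]])
u = np.array([[5.0, 1.0],      # u(S, A): rows are states, columns actions
              [0.0, 2.0]])
U, pi = value_iteration(P, u, 0.9)
print(U, pi)
```

On return, U satisfies the Bellman optimality equation U(S) = max_A H(S, A) to within the tolerance.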
Solving the Bellman equation: the policy iteration algorithm. Where value iteration iterates on values, policy iteration iterates on policies:
1. Initialize an arbitrary policy π_0(S), ∀ S ∈ S.
2. Policy evaluation: with the policy fixed, the MDP reduces to an MRP; solve its Bellman equation for U^{π_n}(S).
3. Policy improvement: set π_{n+1}(S) : S → A, ∀ S ∈ S, greedy with respect to U^{π_n}.
4. Repeat until the policy stops changing.
Both algorithms rely on Bellman's principle of optimality. One iteration costs on the order of O(|A| |S|²), versus the |A|^{|S|} possible deterministic policies; convergence is a fixed-point argument, i.e., finding x with f(x) = x.
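The four steps can be sketched as follows, with the policy-evaluation step done exactly by a linear solve (under a fixed policy the MDP is an MRP, so U^π = (I − γ P_π)⁻¹ u_π). The 4-state chain used to exercise it is an illustrative assumption: action 0 moves Left toward the absorbing state 0, action 1 moves Right, reward is −1 per action except in state 0, γ = 0.5.

```python
import numpy as np

def policy_iteration(P, u, gamma):
    """P: (A, S, S); u: (S, A). Returns an optimal policy and its values."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # 1. arbitrary initial policy
    while True:
        # 2. Policy evaluation: a fixed policy turns the MDP into an MRP,
        #    so U^pi solves U = u_pi + gamma * P_pi U exactly.
        P_pi = P[pi, np.arange(S), :]
        u_pi = u[np.arange(S), pi]
        U = np.linalg.solve(np.eye(S) - gamma * P_pi, u_pi)
        # 3. Policy improvement: greedy with respect to U^pi.
        H = u + gamma * np.einsum("ast,t->sa", P, U)
        pi_new = H.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # 4. stop when stable
            return pi, U
        pi = pi_new

# Illustrative 4-state chain (an assumption: action 0 = Left, 1 = Right;
# reward -1 per action except in the absorbing state 0; gamma = 0.5).
P = np.array([
    [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],   # Left
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],   # Right
], dtype=float)
u = np.array([[0, 0], [-1, -1], [-1, -1], [-1, -1]], dtype=float)
pi, U = policy_iteration(P, u, 0.5)
print(pi, U)
```

The stable policy moves Left everywhere, and the evaluated values decrease with distance from the absorbing state.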
MDP Example
A four-state line world. States S = {0, 1, 2, 3} (State0-State3); actions A = {Left, Right}; reward: -1 for every step moved; discount factor: 0.5. The transition matrices (rows = current state, columns = next state) are

P(A=Left) =
  [1 0 0 0]
  [1 0 0 0]
  [0 1 0 0]
  [0 0 1 0]

P(A=Right) =
  [1 0 0 0]
  [0 0 1 0]
  [0 0 0 1]
  [0 0 0 1]

so Left moves one step toward state 0, Right moves one step toward state 3, and state 0 is absorbing under both actions.
Running value iteration:
Period 1: H = [0.0, 0.0, 0.0, 0.0], action: either
Period 2: H = [0.0, -1.0, -1.0, -1.0], action: either
Period 3: H = [0.0, -1.0, -1.5, -1.5], action: Left, toward the absorbing state 0
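The period-by-period values can be reproduced numerically. A minimal sketch, assuming (as the trace above implies) that the reward is -1 for every action taken in states 1-3 and 0 in the absorbing state 0:

```python
import numpy as np

# Transition matrices from the slide: rows = current state, cols = next state.
P = {
    "Left":  np.array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], float),
    "Right": np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]], float),
}
u = np.array([0.0, -1.0, -1.0, -1.0])   # -1 per step moved; 0 in absorbing state 0
gamma = 0.5

H = np.zeros(4)                          # Period 1: H = [0, 0, 0, 0]
trace = [H.copy()]
for _ in range(2):
    # H(S) <- max_A [ u(S) + gamma * sum_S' P(S, A, S') H(S') ]
    H = np.max([u + gamma * P[a] @ H for a in P], axis=0)
    trace.append(H.copy())

print(trace[1])   # Period 2
print(trace[2])   # Period 3
```

Periods 2 and 3 match the slide: [0, -1, -1, -1] and then [0, -1, -1.5, -1.5].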
MDP Application
An application: trading wirelessly transferred RF energy.
[Slides 40-41: market background for wireless charging; image credit: Forbes.]
Enabling technology: RF energy Tx/Rx (e.g., Powercaster Tx and Rx), the Friis formula, and beamforming. Three kinds of entities:
- Charging stations / electricity chargers: at different fixed locations, e.g., power outlets, base stations.
- End users of energy: those who need energy but are not covered by chargers.
- Mobile energy gateway: moving and charging/transferring energy (wirelessly).
Buying/selling energy: the energy gateway buys from chargers (charging); each charger asks a certain price when charging. The gateway sells to end users (transferring): more users mean more payments, and a nearer user gets more energy, thus higher payments. The gateway must decide "when to buy" and "when to sell".
[Slide 45: the mobile energy gateway transferring RF energy to end users.]
MDP formulation.
- State: S = (L, E, N, P), where L is the gateway's location, E its energy level, N the number of end users (which decides the end-user payment), and P the charger's price.
- Action: A ∈ {0, 1, 2}, three actions in the action set.
Reward.
[Slide 47: the distance distribution f(n, l | N) of the n-th nearest end user — a Beta-function expression B(·, ·) over the coverage radius R — and the expected payment R(n, E_S), written as an integral of f(n, l | N) times the per-user revenue r(·) over near and far distances; summing over n gives the overall payment.]
Transition probabilities.
[Slides 48-49: the state transition matrices under each action, with entries such as 0.3/0.7 in P(A=1) and 1.0/0.0 in P(A=0); most entries are elided on the slides.]
Solving the formulated MDP: write down its Bellman equation and apply the value iteration algorithm. Ready-made toolboxes exist: pymdptoolbox for Python and mdptoolbox for Matlab.
Baseline schemes for comparison:
- Greedy scheme (GRDY): maximizing immediate utility.
- Random scheme (RND): randomly taking any action (i.e., 0, 1, 2) from the action set.
- Location-aware scheme (LOCA): charging at charger, transfe...