Fixing a RAC Node That Fails to Start Because of a HAIP Fault
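A reader asked me about a problem with his 11.2.0.2 RAC (on AIX), which had no patches or PSUs installed.
After a reboot, one of the nodes failed to start properly. The ocssd log showed the following: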
2014-08-09 14:21:46.094: [CSSD][5414]clssnmSendingThread: sent 4 join msgs to all nodes
2014-08-09 14:21:46.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0s
2014-08-09 14:21:47.042: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958157, LATS 1518247992, lastSeqNo 255958154, uniqueness 1406064021, timestamp 1407565306/1501758072
2014-08-09 14:21:47.051: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958158, LATS 1518248002, lastSeqNo 255958155, uniqueness 1406064021, timestamp 1407565306/1501758190
2014-08-09 14:21:47.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:48.042: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958160, LATS 1518248993, lastSeqNo 255958157, uniqueness 1406064021, timestamp 1407565307/1501759080
2014-08-09 14:21:48.052: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958161, LATS 1518249002, lastSeqNo 255958158, uniqueness 1406064021, timestamp 1407565307/1501759191
2014-08-09 14:21:48.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:49.043: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958163, LATS 1518249993, lastSeqNo 255958160, uniqueness 1406064021, timestamp 1407565308/1501760082
2014-08-09 14:21:49.056: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958164, LATS 1518250007, lastSeqNo 255958161, uniqueness 1406064021, timestamp 1407565308/1501760193
2014-08-09 14:21:49.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:50.044: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958166, LATS 1518250994, lastSeqNo 255958163, uniqueness 1406064021, timestamp 1407565309/1501761090
2014-08-09 14:21:50.057: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958167, LATS 1518251007, lastSeqNo 255958164, uniqueness 1406064021, timestamp 1407565309/1501761195
2014-08-09 14:21:50.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:51.046: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958169, LATS 1518251996, lastSeqNo 255958166, uniqueness 1406064021, timestamp 1407565310/1501762100
2014-08-09 14:21:51.057: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958170, LATS 1518252008, lastSeqNo 255958167, uniqueness 1406064021, timestamp 1407565310/1501762205
2014-08-09 14:21:51.102: [CSSD][5414]clssnmSendingThread: sending join msg to all nodes
2014-08-09 14:21:51.102: [CSSD][5414]clssnmSendingThread: sent 5 join msgs to all nodes
2014-08-09 14:21:51.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2014-08-09 14:21:52.050: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958172, LATS 1518253000, lastSeqNo 255958169, uniqueness 1406064021, timestamp 1407565311/1501763110
2014-08-09 14:21:52.058: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958173, LATS 1518253008, lastSeqNo 255958170, uniqueness 1406064021, timestamp 1407565311/1501763230
2014-08-09 14:21:52.089: [CSSD][5671]clssnmRcfgMgrThread: Local Join
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: begin on node(2), waittime 193000
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: set curtime (1518253039) for my node
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: scanning 32 nodes
2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: Node rac01, number 1, is in an existing cluster with disk state 3
2014-08-09 14:21:52.090: [CSSD][5671]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2014-08-09 14:21:52.431: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
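The telling entries in a log like this are the ones that report a disk heartbeat but no network heartbeat. As a rough illustration only (this is not an Oracle tool), the sketch below counts such entries per node. The helper name `nodes_missing_network_hb` and the sample lines are assumptions made for the example; the regular expression tolerates missing whitespace because copied logs often lose their spacing.

```python
import re
from collections import Counter

# Assumed message shape, taken from the excerpt in this article:
#   clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, ...
# Whitespace is optional so that space-stripped copies of the log still match.
_PATTERN = re.compile(
    r"clssnmvDHBValidateNCopy:\s*node\s*(\d+),\s*([^,\s]+),\s*"
    r"has\s*a\s*disk\s*HB,\s*but\s*no\s*network\s*HB"
)


def nodes_missing_network_hb(lines):
    """Count, per (node number, node name), the log lines that report
    a disk heartbeat without a matching network heartbeat."""
    counts = Counter()
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Hypothetical sample: two matching lines (one with spaces stripped)
# and one unrelated line.
sample = [
    "2014-08-09 14:21:47.042: [CSSD][4129]clssnmvDHBValidateNCopy: "
    "node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033",
    "2014-08-09 14:21:47.421: [CSSD][4900]clssgmWaitOnEventValue: "
    "after CmInfo State val 3, eval 1 waited 0",
    "[CSSD][3358]clssnmvDHBValidateNCopy:node1,rac01,hasadiskHB,butnonetworkHB",
]

if __name__ == "__main__":
    for (number, name), hits in nodes_missing_network_hb(sample).items():
        print(f"node {number} ({name}): {hits} line(s) without network HB")
```

A steadily growing count for the surviving node, as here, suggests the starting node can see the voting disk but cannot reach its peer over the interconnect.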
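From this output it is easy to conclude that the heartbeat is at fault.
That reading is not wrong, but the heartbeat involved here is not the traditional interconnect network we usually have in mind.
I asked him to run the following query on the node where CRS was healthy, and the result made the cause clear: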
SQL> select name,ip_address from v$cluster_interconnects;
NAME IP_ADDRESS
--------------- ----------------
en0 169.254.116.242
Notice that the interconnect IP is in the 169 range.
That clearly does not match what we configured in /etc/hosts.
Why?
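This is where the HAIP feature introduced in Oracle 11gR2 comes in. Oracle added it to provide interconnect redundancy with its own technology rather than relying on third-party mechanisms such as Linux bonding.
Before Oracle 11.2.0.2, if the interconnect NICs were bonded at the OS level, Oracle simply used the OS bond.
Starting with 11.2.0.2, if no interconnect redundancy is configured at the OS level, Oracle's own HAIP takes over.
So although you configured 192.168.1.100, Oracle actually uses an address in the 169.254 range.
You can confirm this in the alert log, which shows it clearly, so I will not go into detail here.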
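The 169.254.0.0/16 range is the IPv4 link-local block, which is why a HAIP address is easy to tell apart from a manually configured interconnect address. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `describe_interconnect_ip` and the wording of its return values are assumptions made for illustration.

```python
import ipaddress


def describe_interconnect_ip(addr):
    """Classify an interconnect address the way this article reads it."""
    ip = ipaddress.ip_address(addr)
    # Check link-local first: Python also reports 169.254.0.0/16
    # addresses as private, so the order of these tests matters.
    if ip.is_link_local:
        return "link-local"
    if ip.is_private:
        return "private"
    return "other"


if __name__ == "__main__":
    # Addresses quoted in the article.
    for addr in ("169.254.116.242", "192.168.1.100"):
        print(addr, "->", describe_interconnect_ip(addr))
```

Run against the value returned by `v$cluster_interconnects`, a "link-local" result indicates that the instance is using a HAIP address rather than the address configured on the NIC.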
On the healthy node the 169.254 address is visible, whereas the problem node shows no address in that range at all.
Oracle MOS suggests the following fix:
crsctl start res ora.cluster_interconnect.haip -init
In testing, this did not work, even when run as root.
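According to the Oracle MOS documentation, HAIP usually fails to start for one of the following reasons:
1) A faulty interconnect NIC
2) Multicast not working properly
3) A firewall or similar filtering
4) An Oracle bug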
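A faulty interconnect NIC is easy to rule out: if there is only one interconnect NIC, pinging the other node's IP is enough to verify it.
Multicast can be checked with the mcasttest.pl script that Oracle provides (see Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement (ID 1212703.1)). The result in this case was: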
$ ./mcasttest.pl -n rac02,rac01 -i en0
########### Setup for node rac02 ##########
Checking node access 'rac02'
Checking node login 'rac02'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac02'
Distributing mcast2 binary to node 'rac02'
########### Setup for node rac01 ##########
Checking node access 'rac01'
Checking node login 'rac01'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac01'
Distributing mcast2 binary to node 'rac01'
########### testing Multicast on all nodes ##########
Test for Multicast address 230.0.1.0
Aug 11 21:39:39 | Multicast Failed for en0 using address 230.0.1.0:42000
Test for Multicast address 224.0.0.251
Aug 11 21:40:09 | Multicast Failed for en0 using address 224.0.0.251:42001
$
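When the script is run against several interfaces, it helps to pull out just the failures. The sketch below is only an illustration: it assumes the failure lines have the form `Multicast Failed for <interface> using address <ip>:<port>`, as in the output quoted in this article, and the helper name `failed_multicast_tests` is hypothetical.

```python
import re

# Assumed failure-line shape, based on the output quoted in the article:
#   Aug 11 21:39:39 | Multicast Failed for en0 using address 230.0.1.0:42000
_FAILED = re.compile(
    r"Multicast Failed for (\S+) using address (\d+\.\d+\.\d+\.\d+):(\d+)"
)


def failed_multicast_tests(output):
    """Return (interface, address, port) for every failed multicast test."""
    return [
        (iface, address, int(port))
        for iface, address, port in _FAILED.findall(output)
    ]


# Sample text copied from the tail of the run shown in the article.
sample_output = """\
Test for Multicast address 230.0.1.0
Aug 11 21:39:39 | Multicast Failed for en0 using address 230.0.1.0:42000
Test for Multicast address 224.0.0.251
Aug 11 21:40:09 | Multicast Failed for en0 using address 224.0.0.251:42001
"""

if __name__ == "__main__":
    for iface, address, port in failed_multicast_tests(sample_output):
        print(f"{iface}: multicast to {address}:{port} failed")
```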
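The script reports that multicast fails on both the 230 and 224 addresses, but that does not necessarily prove multicast is the cause.
It is true that searching ocssd.log for the keyword mcast turns up related messages.
However, in my own 11.2.0.3 Linux RAC environment, CRS starts normally even when the mcasttest.pl test fails.
Since this reader is on AIX, I ruled out a firewall as well.
That leaves Bug 9974223 as the most likely culprit.
In fact, if you look into HAIP you will find that the feature has quite a few Oracle bugs.
The MOS note on known HAIP issues in 11gR2/12c Grid Infrastructure (1640865.1) alone records 12 HAIP-related bugs.
Because his first node could not be operated on, we had to keep our changes to a minimum for safety's sake.
If you are not using multiple interconnect NICs, I think HAIP can be disabled entirely.
The MOS documentation I checked yesterday, however, states explicitly that it cannot be disabled.
In the end, my testing showed that it can in fact be disabled.
Here is my test procedure: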
[root@rac1 bin]# ./crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
[root@rac1 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.oc4j' on 'rac1'
CRS-2673: Attempting to stop 'ora.cvu' on 'rac1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2673: Attempting to stop 'ora.GRID.dg' on 'rac1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.rac1.vip' on 'rac1'
CRS-2677: Stop of 'ora.rac1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.rac1.vip' on 'rac2'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac1'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac2'
CRS-2676: Start of 'ora.rac1.vip' on 'rac2' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2677: Stop of 'ora.registry.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'rac1' succeeded
CRS-2677: Stop of 'ora.cvu' on 'rac1' succeeded
CRS-2677: Stop of 'ora.GRID.dg' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.ons' on 'rac1' succeeded
CRS-2673: Attempting to stop 'work' on 'rac1'
CRS-2677: Stop of 'work' on 'rac1' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rac1' has completed
CRS-2677: Stop of 'ora.crsd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac1'
CRS-2677: Stop of 'ora.cssd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
CR
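A shutdown transcript this long is tedious to check by eye. As a hedged illustration (not part of the original test), the sketch below pairs each "Attempting to stop" message with its "Stop of ... succeeded" message and reports any resource that was never confirmed stopped. The helper name `unconfirmed_stops` and the three-line sample are assumptions; the patterns allow missing whitespace so they also match transcripts whose spacing was lost in copying.

```python
import re

# Message shapes as they appear in the crsctl stop output above.
_ATTEMPT = re.compile(r"Attempting\s*to\s*stop\s*'([^']+)'\s*on\s*'([^']+)'")
_STOPPED = re.compile(r"Stop\s*of\s*'([^']+)'\s*on\s*'([^']+)'\s*succeeded")


def unconfirmed_stops(output):
    """Return sorted (resource, node) pairs with a stop attempt but no
    matching 'succeeded' message anywhere in the transcript."""
    attempted = set(_ATTEMPT.findall(output))
    stopped = set(_STOPPED.findall(output))
    return sorted(attempted - stopped)


# Hypothetical excerpt: the HAIP resource stops cleanly, while the
# transcript is cut off before ora.cssd is confirmed stopped.
sample_output = """\
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac1'
"""

if __name__ == "__main__":
    for resource, node in unconfirmed_stops(sample_output):
        print(f"no stop confirmation for {resource} on {node}")
```

Because the comparison uses sets, a resource that is stopped twice in one transcript (as ora.asm is here, once under CRS and once under OHAS) is still reported correctly.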