[Linux-ha-jp] Standby server fails over and keeps restarting


mlus mlus****@39596*****
Mon, 3 Jun 2013 22:09:07 JST


Good evening. I may be off the mark here, but...

ーーーーーーーーーーー
IPaddr[4427]: 2013/06/03_13:21:44 INFO:  Resource is stopped
ResourceManager[4378]: 2013/06/03_13:21:50 ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
ーーーーーーーーーーーーー

I think this is where the error is occurring:
/etc/ha.d/resource.d/drbddisk
If you also post your configuration related to this part, you may get some advice.
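
For reference, one way to narrow this down by hand might be something like the following. This is only a rough sketch and assumes the usual drbd-utils commands (drbdadm, /proc/drbd) are available on the node:

# On SERVER2, check the DRBD connection and role state
cat /proc/drbd
drbdadm role all

# Then try the resource script by hand and look at its exit code
/etc/ha.d/resource.d/drbddisk start
echo $?

If DRBD is not Connected/UpToDate, or the resource cannot be promoted to Primary, I would expect the manual start above to fail with a non-zero exit code just like in your log.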

On 2013/6/3, <delta_syste****@yahoo*****> wrote:
>
> My name is O.N.
>
> This is my first post to this list.
> I may be missing some information; I would appreciate it if you could point that out.
>
>  I installed Red Hat Enterprise Linux 5.5 and Heartbeat-2.1.4-1 on physical servers
> and built a cluster configuration running httpd and PostgreSQL.
> When I start the standby server (SERVER2), the cluster seems to come up for a moment,
> but then it fails over and the server keeps restarting.
>  From the logs I can tell that a failover is happening, but I cannot figure out why it keeps restarting.
> Could you tell me what the likely causes are?
>
> 1. Environment
> Red Hat Enterprise Linux 5.5
> heartbeat-2.1.4-1
> SERVER1 (physical: eth0) 192.168.0.120
> SERVER2 (physical: eth0) 192.168.0.121
> VIP 192.168.0.110
> SERVER1 (physical: eth1) 10.10.10.10
> SERVER2 (physical: eth1) 10.10.10.11
>
> [root@SERVER2 ~]# tcpdump -i eth1 port 694
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
> 21:32:37.577851 IP 10.10.10.11.50029 > 10.10.10.10.ha-cluster: UDP, length 194
> 21:32:37.577868 IP 10.10.10.11.50029 > 10.10.10.10.ha-cluster: UDP, length 188
>
> 2. Excerpt from /etc/ha.d/ha.cf
> debugfile /var/log/ha-debug
> logfile   /var/log/ha-log
> logfacility    local0
> keepalive 10
> deadtime 60
> warntime 30
> initdead 120
> udpport    694
> ucast eth1 10.10.10.11
> auto_failback off
> watchdog /dev/watchdog
> node SERVER1.domain SERVER2.domain
> ping 192.168.0.1
> respawn hacluster /usr/lib/heartbeat/ipfail
> respawn root /usr/local/sbin/check_active
> apiauth ipfail gid=haclient uid=hacluster
> debug 3
>
> 3. Excerpt from haresources
> SEVER1 IPaddr::192.168.0.110/24
>
> 4. Excerpt from /var/log/ha-log
> heartbeat[3396]: 2013/06/03_13:19:40 info: AUTH: i=1: key = 0x991a168, auth=0x118c80, authname=crc
> heartbeat[3396]: 2013/06/03_13:19:40 info: Version 2 support: false
> heartbeat[3396]: 2013/06/03_13:19:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> heartbeat[3396]: 2013/06/03_13:19:40 info: **************************
> heartbeat[3396]: 2013/06/03_13:19:40 info: Configuration validated. Starting heartbeat 2.1.4
> heartbeat[3398]: 2013/06/03_13:19:40 info: heartbeat: version 2.1.4
> heartbeat[3398]: 2013/06/03_13:19:40 info: Heartbeat generation: 1369315411
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: bound send socket to device: eth1
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: bound receive socket to device: eth1
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: started on port 694 interface eth1 to 10.10.10.11
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ping heartbeat started.
> heartbeat[3398]: 2013/06/03_13:19:40 info: G_main_add_TriggerHandler: Added signal manual handler
> heartbeat[3398]: 2013/06/03_13:19:40 info: G_main_add_TriggerHandler: Added signal manual handler
> heartbeat[3398]: 2013/06/03_13:19:40 notice: Using watchdog device: /dev/watchdog
> heartbeat[3398]: 2013/06/03_13:19:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> heartbeat[3398]: 2013/06/03_13:19:40 info: Local status now set to: 'up'
> heartbeat[3398]: 2013/06/03_13:19:40 info: Managed write_hostcachedata process 3431 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:41 info: Link 192.168.0.1:192.168.0.1 up.
> heartbeat[3398]: 2013/06/03_13:19:41 info: Status update for node 192.168.0.1: status ping
> heartbeat[3398]: 2013/06/03_13:19:41 info: Link SEVER1.domain:eth1 up.
> heartbeat[3398]: 2013/06/03_13:19:41 info: Managed write_hostcachedata process 3575 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:42 info: Comm_now_up(): updating status to active
> heartbeat[3398]: 2013/06/03_13:19:42 info: Local status now set to: 'active'
> heartbeat[3398]: 2013/06/03_13:19:42 info: Starting child client "/usr/lib/heartbeat/ipfail" (200,200)
> heartbeat[3398]: 2013/06/03_13:19:43 info: Starting child client "/usr/local/sbin/check_active" (0,0)
> heartbeat[3398]: 2013/06/03_13:19:43 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 820 ms (> 50 ms) (GSource: 0x991df98)
> heartbeat[3604]: 2013/06/03_13:19:43 info: Starting "/usr/lib/heartbeat/ipfail" as uid 200  gid 200 (pid 3604)
> heartbeat[3605]: 2013/06/03_13:19:43 info: Starting "/usr/local/sbin/check_active" as uid 0  gid 0 (pid 3605)
> heartbeat[3398]: 2013/06/03_13:19:43 info: Managed write_hostcachedata process 3606 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:43 info: Managed write_delcachedata process 3607 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:43 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 840 ms (> 30 ms) (GSource: 0x9920b20)
> heartbeat[3398]: 2013/06/03_13:19:43 info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
> heartbeat[3398]: 2013/06/03_13:19:43 info: remote resource transition completed.
> heartbeat[3398]: 2013/06/03_13:19:43 info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
> heartbeat[3398]: 2013/06/03_13:19:43 info: STATE 1 => 3
> heartbeat[3398]: 2013/06/03_13:19:43 info: other_holds_resources: 3
> heartbeat[3398]: 2013/06/03_13:19:43 info: remote resource transition completed.
> heartbeat[3398]: 2013/06/03_13:19:43 info: Local Resource acquisition completed. (none)
> heartbeat[3398]: 2013/06/03_13:19:43 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (0))
> heartbeat[3398]: 2013/06/03_13:19:43 info: Initial resource acquisition complete (T_RESOURCES(them))
> heartbeat[3398]: 2013/06/03_13:19:43 info: STATE 3 => 4
> heartbeat[3398]: 2013/06/03_13:19:43 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 850 ms (> 100 ms) before being called (GSource: 0x9920b20)
> heartbeat[3398]: 2013/06/03_13:19:43 info: G_SIG_dispatch: started at 429409420 should have started at 429409335
> heartbeat[3398]: 2013/06/03_13:19:44 info: other_holds_resources: 3
> heartbeat[3398]: 2013/06/03_13:19:44 WARN: G_WC_dispatch: Dispatch function for client registration took too long to execute: 640 ms (> 20 ms) (GSource: 0x992d878)
> heartbeat[3398]: 2013/06/03_13:19:46 info: Status update for node SEVER1.domain: status active
> ipfail[3604]: 2013/06/03_13:19:46 info: Status update: Node SEVER1.domain now has status active
> ipfail[3604]: 2013/06/03_13:19:46 info: Ping node count is balanced.
> harc[3682]: 2013/06/03_13:19:46 info: Running /etc/ha.d/rc.d/status status
> heartbeat[3398]: 2013/06/03_13:19:46 info: Managed status process 3682 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:20:50 WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 100 ms (> 50 ms) (GSource: 0x9924788)
> heartbeat[3398]: 2013/06/03_13:21:37 WARN: node SEVER1.domain: is dead
> ipfail[3604]: 2013/06/03_13:21:37 info: Status update: Node SEVER1.domain now has status dead
> heartbeat[3398]: 2013/06/03_13:21:37 WARN: No STONITH device configured.
> heartbeat[3398]: 2013/06/03_13:21:37 WARN: Shared disks are not protected.
> heartbeat[3398]: 2013/06/03_13:21:37 info: Resources being acquired from SEVER1.domain.
> heartbeat[3398]: 2013/06/03_13:21:37 info: Link SEVER1.domain:eth1 dead.
> heartbeat[4324]: 2013/06/03_13:21:37 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys SERVER2.domain] to acquire.
> harc[4323]: 2013/06/03_13:21:37 info: Running /etc/ha.d/rc.d/status status
> heartbeat[4324]: 2013/06/03_13:21:37 info: Writing type [resource] message to FIFO
> heartbeat[4324]: 2013/06/03_13:21:37 info: FIFO message [type resource] written rc=79
> heartbeat[3398]: 2013/06/03_13:21:37 info: Managed req_our_resources process 4324 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:21:37 info: AnnounceTakeover(local 1, foreign 1, reason 'req_our_resources' (1))
> heartbeat[3398]: 2013/06/03_13:21:37 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
> mach_down[4352]: 2013/06/03_13:21:37 info: Taking over resource group drbddisk
> ipfail[3604]: 2013/06/03_13:21:37 info: NS: We are still alive!
> ipfail[3604]: 2013/06/03_13:21:37 info: Link Status update: Link SEVER1.domain/eth1 now has status dead
> ResourceManager[4378]: 2013/06/03_13:21:37 info: Acquiring resource group: SEVER1.domain drbddisk Filesystem::/dev/drbd0::/usr1::ext3 httpd postgresql 192.168.0.110/24 MailTo::test****@yahoo*****::server_FailOver
> ResourceManager[4378]: 2013/06/03_13:21:37 info: Running /etc/ha.d/resource.d/drbddisk  start
> ipfail[3604]: 2013/06/03_13:21:37 info: Asking other side for ping node count.
> ipfail[3604]: 2013/06/03_13:21:37 info: Checking remote count of ping nodes.
> IPaddr[4427]: 2013/06/03_13:21:44 INFO:  Resource is stopped
> ResourceManager[4378]: 2013/06/03_13:21:50 ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
> ResourceManager[4378]: 2013/06/03_13:21:50 CRIT: Giving up resources due to failure of drbddisk
> ResourceManager[4378]: 2013/06/03_13:21:50 info: Releasing resource group: SEVER1.domain drbddisk Filesystem::/dev/drbd0::/usr1::ext3 httpd postgresql 192.168.0.110/24 MailTo::test****@yahoo*****::server_FailOver
> ResourceManager[4378]: 2013/06/03_13:21:50 info: Running /etc/ha.d/resource.d/MailTo test****@yahoo***** server_FailOver stop
> MailTo[4514]: 2013/06/03_13:21:50 INFO:  Success
> ResourceManager[4378]: 2013/06/03_13:21:51 info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.110/24 stop
> IPaddr[4570]: 2013/06/03_13:21:51 INFO:  Success
> ResourceManager[4378]: 2013/06/03_13:21:51 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:21:52 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:53 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:53 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:21:53 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:54 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:54 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:21:54 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:56 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:56 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:21:56 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:57 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:57 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:21:57 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:58 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:58 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:21:59 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:00 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:00 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:22:00 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:01 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:01 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:22:02 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:03 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:03 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:22:03 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:04 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:04 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:22:05 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:06 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:06 info: Running /etc/init.d/postgresql  stop
> ResourceManager[4378]: 2013/06/03_13:22:06 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:06 CRIT: Resource STOP failure. Reboot required!
> ResourceManager[4378]: 2013/06/03_13:22:06 CRIT: Killing heartbeat ungracefully!
> That is all.
> Thank you very much in advance for your help.
>
> _______________________________________________
> Linux-ha-japan mailing list
> Linux****@lists*****
> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan




