mlus
mlus****@39596*****
Mon Jun 3 22:09:07 JST 2013
Good evening. I may be off the mark here, but...
----------
IPaddr[4427]: 2013/06/03_13:21:44 INFO: Resource is stopped
ResourceManager[4378]: 2013/06/03_13:21:50 ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
----------
I believe this is where the error occurs.

/etc/ha.d/resource.d/drbddisk
If you also post the settings related to this part, you may well receive further advice.

2013/6/3 <delta_syste****@yahoo*****>:
>
> My name is O.N.
>
> This is my first post to this list.
> Some information may be missing; I would be grateful if you could tell me what else you need.
>
> I installed Red Hat Enterprise Linux 5.5 and Heartbeat-2.1.4-1 on physical servers
> and set up a cluster configuration for httpd and PostgreSQL.
> When I start the standby server (SERVER2), the cluster appears to start up for a moment,
> but then it fails over and the server keeps rebooting.
> From the logs I can see that a failover is occurring, but I do not understand why the reboots repeat.
> Could you tell me the likely causes?
>
> 1. Environment
> Red Hat Enterprise Linux 5.5
> heartbeat-2.1.4-1
> SERVER1 (physical: eth0) 192.168.0.120
> SERVER2 (physical: eth0) 192.168.0.121
> VIP 192.168.0.110
> SERVER1 (physical: eth1) 10.10.10.10
> SERVER2 (physical: eth1) 10.10.10.11
>
> [root @ SERVER2 ~]# tcpdump -i eth1 port 694
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
> 21:32:37.577851 IP 10.10.10.11.50029 > 10.10.10.10.ha-cluster: UDP, length 194
> 21:32:37.577868 IP 10.10.10.11.50029 > 10.10.10.10.ha-cluster: UDP, length 188
>
> 2. Excerpt from /etc/ha.d/ha.cf
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> keepalive 10
> deadtime 60
> warntime 30
> initdead 120
> udpport 694
> ucast eth1 10.10.10.11
> auto_failback off
> watchdog /dev/watchdog
> node SERVER1.domain SERVER2.domain
> ping 192.168.0.1
> respawn hacluster /usr/lib/heartbeat/ipfail
> respawn root /usr/local/sbin/check_active
> apiauth ipfail gid=haclient uid=hacluster
> debug 3
>
> 3. Excerpt from haresources
> SEVER1 IPaddr::192.168.0.110/24
>
> 4. Excerpt from /var/log/ha-log
> heartbeat[3396]: 2013/06/03_13:19:40 info: AUTH: i=1: key = 0x991a168, auth=0x118c80, authname=crc
> heartbeat[3396]: 2013/06/03_13:19:40 info: Version 2 support: false
> heartbeat[3396]: 2013/06/03_13:19:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> heartbeat[3396]: 2013/06/03_13:19:40 info: **************************
> heartbeat[3396]: 2013/06/03_13:19:40 info: Configuration validated. Starting heartbeat 2.1.4
> heartbeat[3398]: 2013/06/03_13:19:40 info: heartbeat: version 2.1.4
> heartbeat[3398]: 2013/06/03_13:19:40 info: Heartbeat generation: 1369315411
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: bound send socket to device: eth1
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: bound receive socket to device: eth1
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ucast: started on port 694 interface eth1 to 10.10.10.11
> heartbeat[3398]: 2013/06/03_13:19:40 info: glib: ping heartbeat started.
> heartbeat[3398]: 2013/06/03_13:19:40 info: G_main_add_TriggerHandler: Added signal manual handler
> heartbeat[3398]: 2013/06/03_13:19:40 info: G_main_add_TriggerHandler: Added signal manual handler
> heartbeat[3398]: 2013/06/03_13:19:40 notice: Using watchdog device: /dev/watchdog
> heartbeat[3398]: 2013/06/03_13:19:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> heartbeat[3398]: 2013/06/03_13:19:40 info: Local status now set to: 'up'
> heartbeat[3398]: 2013/06/03_13:19:40 info: Managed write_hostcachedata process 3431 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:41 info: Link 192.168.0.1:192.168.0.1 up.
> heartbeat[3398]: 2013/06/03_13:19:41 info: Status update for node 192.168.0.1: status ping
> heartbeat[3398]: 2013/06/03_13:19:41 info: Link SEVER1.domain:eth1 up.
> heartbeat[3398]: 2013/06/03_13:19:41 info: Managed write_hostcachedata process 3575 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:42 info: Comm_now_up(): updating status to active
> heartbeat[3398]: 2013/06/03_13:19:42 info: Local status now set to: 'active'
> heartbeat[3398]: 2013/06/03_13:19:42 info: Starting child client "/usr/lib/heartbeat/ipfail" (200,200)
> heartbeat[3398]: 2013/06/03_13:19:43 info: Starting child client "/usr/local/sbin/check_active" (0,0)
> heartbeat[3398]: 2013/06/03_13:19:43 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 820 ms (> 50 ms) (GSource: 0x991df98)
> heartbeat[3604]: 2013/06/03_13:19:43 info: Starting "/usr/lib/heartbeat/ipfail" as uid 200 gid 200 (pid 3604)
> heartbeat[3605]: 2013/06/03_13:19:43 info: Starting "/usr/local/sbin/check_active" as uid 0 gid 0 (pid 3605)
> heartbeat[3398]: 2013/06/03_13:19:43 info: Managed write_hostcachedata process 3606 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:43 info: Managed write_delcachedata process 3607 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:19:43 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 840 ms (> 30 ms) (GSource: 0x9920b20)
> heartbeat[3398]: 2013/06/03_13:19:43 info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
> heartbeat[3398]: 2013/06/03_13:19:43 info: remote resource transition completed.
> heartbeat[3398]: 2013/06/03_13:19:43 info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
> heartbeat[3398]: 2013/06/03_13:19:43 info: STATE 1 => 3
> heartbeat[3398]: 2013/06/03_13:19:43 info: other_holds_resources: 3
> heartbeat[3398]: 2013/06/03_13:19:43 info: remote resource transition completed.
> heartbeat[3398]: 2013/06/03_13:19:43 info: Local Resource acquisition completed. (none)
> heartbeat[3398]: 2013/06/03_13:19:43 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(them)' (0))
> heartbeat[3398]: 2013/06/03_13:19:43 info: Initial resource acquisition complete (T_RESOURCES(them))
> heartbeat[3398]: 2013/06/03_13:19:43 info: STATE 3 => 4
> heartbeat[3398]: 2013/06/03_13:19:43 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 850 ms (> 100 ms) before being called (GSource: 0x9920b20)
> heartbeat[3398]: 2013/06/03_13:19:43 info: G_SIG_dispatch: started at 429409420 should have started at 429409335
> heartbeat[3398]: 2013/06/03_13:19:44 info: other_holds_resources: 3
> heartbeat[3398]: 2013/06/03_13:19:44 WARN: G_WC_dispatch: Dispatch function for client registration took too long to execute: 640 ms (> 20 ms) (GSource: 0x992d878)
> heartbeat[3398]: 2013/06/03_13:19:46 info: Status update for node SEVER1.domain: status active
> ipfail[3604]: 2013/06/03_13:19:46 info: Status update: Node SEVER1.domain now has status active
> ipfail[3604]: 2013/06/03_13:19:46 info: Ping node count is balanced.
> harc[3682]: 2013/06/03_13:19:46 info: Running /etc/ha.d/rc.d/status status
> heartbeat[3398]: 2013/06/03_13:19:46 info: Managed status process 3682 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:20:50 WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 100 ms (> 50 ms) (GSource: 0x9924788)
> heartbeat[3398]: 2013/06/03_13:21:37 WARN: node SEVER1.domain: is dead
> ipfail[3604]: 2013/06/03_13:21:37 info: Status update: Node SEVER1.domain now has status dead
> heartbeat[3398]: 2013/06/03_13:21:37 WARN: No STONITH device configured.
> heartbeat[3398]: 2013/06/03_13:21:37 WARN: Shared disks are not protected.
> heartbeat[3398]: 2013/06/03_13:21:37 info: Resources being acquired from SEVER1.domain.
> heartbeat[3398]: 2013/06/03_13:21:37 info: Link SEVER1.domain:eth1 dead.
> heartbeat[4324]: 2013/06/03_13:21:37 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys SERVER2.domain] to acquire.
> harc[4323]: 2013/06/03_13:21:37 info: Running /etc/ha.d/rc.d/status status
> heartbeat[4324]: 2013/06/03_13:21:37 info: Writing type [resource] message to FIFO
> heartbeat[4324]: 2013/06/03_13:21:37 info: FIFO message [type resource] written rc=79
> heartbeat[3398]: 2013/06/03_13:21:37 info: Managed req_our_resources process 4324 exited with return code 0.
> heartbeat[3398]: 2013/06/03_13:21:37 info: AnnounceTakeover(local 1, foreign 1, reason 'req_our_resources' (1))
> heartbeat[3398]: 2013/06/03_13:21:37 info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
> mach_down[4352]: 2013/06/03_13:21:37 info: Taking over resource group drbddisk
> ipfail[3604]: 2013/06/03_13:21:37 info: NS: We are still alive!
> ipfail[3604]: 2013/06/03_13:21:37 info: Link Status update: Link SEVER1.domain/eth1 now has status dead
> ResourceManager[4378]: 2013/06/03_13:21:37 info: Acquiring resource group: SEVER1.domain drbddisk Filesystem::/dev/drbd0::/usr1::ext3 httpd postgresql 192.168.0.110/24 MailTo::test****@yahoo*****::server_FailOver
> ResourceManager[4378]: 2013/06/03_13:21:37 info: Running /etc/ha.d/resource.d/drbddisk start
> ipfail[3604]: 2013/06/03_13:21:37 info: Asking other side for ping node count.
> ipfail[3604]: 2013/06/03_13:21:37 info: Checking remote count of ping nodes.
> IPaddr[4427]: 2013/06/03_13:21:44 INFO: Resource is stopped
> ResourceManager[4378]: 2013/06/03_13:21:50 ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk
> ResourceManager[4378]: 2013/06/03_13:21:50 CRIT: Giving up resources due to failure of drbddisk
> ResourceManager[4378]: 2013/06/03_13:21:50 info: Releasing resource group: SEVER1.domain drbddisk Filesystem::/dev/drbd0::/usr1::ext3 httpd postgresql 192.168.0.110/24 MailTo::test****@yahoo*****::server_FailOver
> ResourceManager[4378]: 2013/06/03_13:21:50 info: Running /etc/ha.d/resource.d/MailTo test****@yahoo***** server_FailOver stop
> MailTo[4514]: 2013/06/03_13:21:50 INFO: Success
> ResourceManager[4378]: 2013/06/03_13:21:51 info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.110/24 stop
> IPaddr[4570]: 2013/06/03_13:21:51 INFO: Success
> ResourceManager[4378]: 2013/06/03_13:21:51 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:21:52 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:53 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:53 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:21:53 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:54 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:54 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:21:54 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:56 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:56 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:21:56 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:57 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:57 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:21:57 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:21:58 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:21:58 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:21:59 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:00 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:00 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:22:00 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:01 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:01 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:22:02 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:03 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:03 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:22:03 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:04 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:04 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:22:05 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:06 info: Retrying failed stop operation [postgresql]
> ResourceManager[4378]: 2013/06/03_13:22:06 info: Running /etc/init.d/postgresql stop
> ResourceManager[4378]: 2013/06/03_13:22:06 ERROR: Return code 1 from /etc/init.d/postgresql
> ResourceManager[4378]: 2013/06/03_13:22:06 CRIT: Resource STOP failure. Reboot required!
> ResourceManager[4378]: 2013/06/03_13:22:06 CRIT: Killing heartbeat ungracefully!
>
> That is all.
> Thank you very much in advance for your help.
>
> _______________________________________________
> Linux-ha-japan mailing list
> Linux****@lists*****
> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
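
P.S. For what it's worth: if I remember correctly, exit code 20 is what the stock /etc/ha.d/resource.d/drbddisk script returns when "drbdadm primary" keeps failing, i.e. when DRBD refuses to promote the resource on SERVER2. A rough sketch of the checks I would run on SERVER2 to narrow that down (the resource name "r0" is only an assumption here; substitute the name from your drbd.conf):

  cat /proc/drbd       # overall DRBD state: cs: (connection), ro: (roles), ds: (disk states)
  drbdadm role r0      # local/peer roles, e.g. "Secondary/Unknown"
  drbdadm cstate r0    # connection state, e.g. "StandAlone" or "WFConnection"
  drbdadm dstate r0    # disk states; "Inconsistent/DUnknown" will block promotion
  drbdadm primary r0   # retry the promotion by hand and read the error it prints

If the peers are not Connected, or the local disk is not UpToDate, "drbdadm primary" fails, drbddisk gives up with exit code 20, and the takeover collapses exactly as in the log above.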