
heartbeat - Which technique should be chosen for IP failover with manual control

I have the following setup on a Linux stack: the front-ends run an nginx proxy and serve static assets, and the back-ends run Ruby on Rails with MySQL in master-master replication:




  • Primary site: front-end.a, back-end.a

  • Secondary site: front-end.b, back-end.b

  • A router sitting on a shared network that can route to both primary and secondary sites




The primary site serves requests most of the time; the secondary site is a redundant standby. back-end.b is in master-master replication with back-end.a but runs read-only.



When the primary site goes down, requests need to be redirected to the secondary site. The secondary site will serve a 503 Service Unavailable page until manual intervention confirms that the primary site isn't coming back, at which point someone hits the big switch that makes the secondary site live and read-write.



The primary site can then be brought back in a controlled fashion, with back-end.a becoming a read-only replication slave of back-end.b. When everything on the primary site is ready again, front-end.b starts serving the 503 page, back-end.b switches to read-only, requests are redirected back to the primary site, and finally the primary site becomes read-write again.
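For reference, the database side of both switches is essentially a pair of read-only flips plus re-pointing replication. A rough sketch of the commands involved, assuming classic binlog-position replication; the replication user, password and log coordinates below are placeholders, and in an intact master-master pair the CHANGE MASTER step may not even be needed:

    # On back-end.b, when it is promoted to the live, writable master:
    mysql -e "SET GLOBAL read_only = OFF"

    # Still on back-end.b, note the binlog coordinates for re-pointing back-end.a:
    mysql -e "SHOW MASTER STATUS"

    # On back-end.a, when it is re-introduced as a read-only slave of back-end.b:
    mysql -e "SET GLOBAL read_only = ON"
    mysql -e "CHANGE MASTER TO MASTER_HOST='back-end.b', MASTER_USER='repl', \
              MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4"
    mysql -e "START SLAVE"

Handing the write role back later is the mirror image: set back-end.b read-only, wait for back-end.a to catch up, then set back-end.a read-write.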



The priorities:




  • The site must not become completely dead and unreachable


  • Switchover to a live working site must be fairly fast

  • Preventing data loss / inconsistency is more important than absolute reliability



Now, the current approach is Linux-HA (Heartbeat + Pacemaker), using a virtual IP shared between front-end.a and front-end.b, with a location preference set for front-end.a.
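For context, the Pacemaker configuration is essentially the stock virtual-IP pattern; something along these lines, where the address, netmask, NIC, resource name and constraint score are placeholders rather than our exact values:

    # Virtual IP managed by Pacemaker via the IPaddr2 resource agent
    crm configure primitive ip_shared ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24 nic=eth0 \
        op monitor interval=10s

    # Location preference so the IP normally lives on the primary front-end
    crm configure location prefer-frontend-a ip_shared 100: front-end.a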



This works excellently for failing over the IP if the primary site disappears. However, the level of manual control thereafter is rather lacking.



For example, after the primary site has failed and the secondary site needs to be brought up, we need to ensure the primary site doesn't steal the IP address back when it comes back up. Linux-HA doesn't seem to support this very well. crm resource move is the documented command to move a resource (it works by adding an infinite-weight location rule), but if the resource has already failed over, the command fails, saying the resource has already been moved. Adding an explicit higher-weight location preference doesn't seem to work reliably either. So far the most reliable approach has been to remove the existing location rule and replace it with a new rule preferring the secondary site (sketched below). This feels like we're fighting the tool and trying to make it do something it wasn't designed for.
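Concretely, the post-failover dance currently looks something like this (resource and constraint names follow the sketch above and are assumptions, not our exact configuration):

    # Documented approach: adds an +INF location rule pinning the resource,
    # but errors out if the IP has already failed over to front-end.b
    crm resource move ip_shared front-end.b

    # What has proven most reliable so far: replace the preference outright
    crm configure delete prefer-frontend-a
    crm configure location prefer-frontend-b ip_shared 100: front-end.b

    # ("crm resource unmove ip_shared" clears any constraint added by move)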




And there are other oddities with Linux-HA. The cluster frequently gets stuck in a split-brain state while booting: both nodes are sending out heartbeat packets (verified with packet sniffing) and can ping one another, yet crm_mon on each reports the other node as offline. The heartbeat service has to be restarted on one node or the other to recover, and sometimes it needs a SIGKILL rather than a SIGTERM to bring it down. Also, crm_mon shows that the CIB (the cluster configuration database) replicates almost instantaneously when the configuration is altered on either front-end.a or front-end.b, but Pacemaker takes its time actually moving the IP resource: it can take several minutes to move across, potentially putting our SLAs at risk.
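To make the boot-time symptom concrete, this is roughly what the wedge looks like and what it currently takes to clear it (command forms are approximate, and the Pacemaker properties at the end are common two-node settings I am only assuming are relevant, not a known fix):

    # Both nodes are exchanging heartbeat packets, yet each shows the peer as offline:
    crm_mon -1                      # one-shot cluster status

    # Workaround that has been needed: bounce heartbeat on one of the nodes
    /etc/init.d/heartbeat restart   # sometimes hangs on stop...
    killall -9 heartbeat            # ...and needs a SIGKILL before it will start cleanly

    # Two-node basics worth double-checking:
    crm configure property no-quorum-policy=ignore
    crm configure property stonith-enabled=false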



So I'm starting to look at other options that are more focused on virtual IPs and IP failover rather than general clustered resources. The two other options I see are ucarp and keepalived.
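As a point of comparison, keepalived's VRRP model seems to map naturally onto the manual-control requirement: with both nodes configured as BACKUP plus nopreempt, a node that recovers will not take the address back until told to. A minimal sketch of /etc/keepalived/keepalived.conf on front-end.a, with the interface, VRID, password and address all made up for illustration:

    vrrp_instance VI_FRONTEND {
        state BACKUP              # both nodes start as BACKUP; the higher priority wins the first election
        nopreempt                 # a recovering node does not steal the IP back automatically
        interface eth0            # placeholder NIC
        virtual_router_id 51      # placeholder VRID
        priority 150              # e.g. 100 on front-end.b
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass examp1e
        }
        virtual_ipaddress {
            192.0.2.10/24         # the shared service address (placeholder)
        }
    }

ucarp works in a similar spirit, running up/down scripts around a shared address, though I have less visibility into how well it handles the deliberate, manual failback described above.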



However, given the amount of time I've spent setting up heartbeat etc. and trying to make it work, I'd like feedback on the best approach for this setup.
