
networking - DNS resolution failing over to secondary DNS - why?



We have a large number of branch offices connected via VPN, but without any kind of server infrastructure. The client machines in each office get their network configuration from an ASA 5505, which is also used for the VPN connection.



The Windows XP client machines are configured to use one of our corporate DNS servers as the primary, with the DNS server of the ISP as the secondary. The idea is that if the VPN connection fails for any reason, staff in the office will still be able to access the internet, and access our webmail and home access portal. In the majority of cases this works fine.



However, for offices based in South America we are seeing DNS resolution on the client machines regularly being done against the ISP DNS server - this results in our corporate resources being effectively unavailable to staff in the offices.




The client machines are able to ping the corporate DNS server without problems. When I run nslookup for a corporate hostname, I get a reply.



I'm thinking one of the following (or a combination) is happening:




  • our corporate DNS server is not always replying to requests in a timely fashion (although why this would only affect clients in one geographic region I don't know)

  • DNS queries from Latin America are somehow delayed, causing the client to treat it as failed (although we have offices at the end of much slower VSAT connections which do not have this issue)

  • a single failure is resulting in a DNS cache entry in Windows that somehow results in the lookups not happening on subsequent tries




Has anyone else come across this issue? Any ideas for resolutions?


Answer



Windows queries DNS in this order:




  • hosts file

  • local DNS cache

  • Preferred DNS servers

  • Other DNS servers




MS also has an article describing how the DNS server list is obtained:




The DNS Client service uses a server search list, ordered by preference. This list includes all preferred and alternate DNS servers configured for each of the active network connections on the system.



The list is arranged based on the following criteria:




  • Preferred DNS servers are given first priority.


  • If no preferred DNS servers are available, then alternate DNS servers are used.

  • Unresponsive servers are removed temporarily from these lists.




Windows has an escalating timeout for DNS requests:



Value      Default value  Attempt
1st limit  1 second       Query the preferred DNS server on a preferred connection.
2nd limit  2 seconds      Query the preferred DNS server on all connections.
3rd limit  2 seconds      Query all DNS servers on all connections (1st attempt).
4th limit  4 seconds      Query all DNS servers on all connections (2nd attempt).
5th limit  8 seconds      Query all DNS servers on all connections (3rd attempt).
6th value  (Must be 0.)
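
To make the schedule above concrete, here is a rough Python sketch (mine, not from the Microsoft article) that adds up the default limits. It assumes a deliberately simplified model in which each attempt is a fresh query that only succeeds if the reply arrives within that attempt's own limit; the function names are made up for illustration.

```python
from itertools import accumulate

# Default per-attempt limits in seconds, copied from the table above.
DEFAULT_LIMITS = [1, 2, 2, 4, 8]


def cumulative_deadlines(limits=DEFAULT_LIMITS):
    """Return the total elapsed seconds at the end of each attempt."""
    return list(accumulate(limits))


def first_attempt_answered(rtt_seconds, limits=DEFAULT_LIMITS):
    """Return the 1-based attempt whose limit this round-trip time fits in.

    Simplifying assumption: every attempt is an independent query, and a
    reply only counts if it arrives within that attempt's own limit.
    Returns None if the reply is slower than every limit.
    """
    for attempt, limit in enumerate(limits, start=1):
        if rtt_seconds <= limit:
            return attempt
    return None
```

Under that model the cumulative deadlines are 1, 3, 5, 9 and 17 seconds, and a reply that takes 1.5 seconds misses the first attempt and is only accepted on the second.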


I could not find a clear answer on this exact point, but it sounds like if it doesn't get a response from your primary DNS in 1 or 2 seconds (1st or 2nd attempt, respectively), then that server will be removed from the DNS server lookup list for 15 minutes, and so it will use the secondary DNS servers. Since those servers have up to an 8 second timeout, they are much more likely to respond. (It's unclear to me if it continues to query the preferred DNS server during the 3rd+ attempt if it's already failed).



I also suspect that you do indeed have a WAN latency issue for this geographical area, as it would explain why the timeouts are being hit.
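
One way to test that suspicion is to time a query against the corporate DNS server with the same 1-second limit the client uses for its first attempt. nslookup may well retry and use longer timeouts of its own, so getting a reply from it does not prove the answer arrived in time. The following is a minimal sketch using only the Python standard library; it builds a plain A-record query by hand and does not check that the reply matches the query, so treat it as a rough probe rather than a proper resolver.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server, hostname, timeout=1.0):
    """Return the round-trip time in seconds, or None if the wait timed out."""
    packet = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, (server, 53))
            sock.recvfrom(512)  # any reply counts; the ID is not verified
        except socket.timeout:
            return None
        return time.perf_counter() - start
```

Calling something like `time_query("10.0.0.53", "intranet.example.com")` in a loop from an affected office (both values are placeholders for your own server and hostname) should show how often the reply takes longer than 1 second or never arrives.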







One solution is to change the DNS query timeouts, using the DNSQueryTimeouts registry parameter. See also http://drewthaler.blogspot.com/2005/09/changing-dns-query-timeout-in-windows.html
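
As a sketch of what setting that parameter might look like from a script: the code below assumes DNSQueryTimeouts is a REG_MULTI_SZ value under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, which is how I remember it being documented but is not stated above, so please check it against Microsoft's documentation for your Windows version first. The helper names are my own.

```python
# Assumed registry location and value name; verify before using.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "DNSQueryTimeouts"


def build_timeouts(limits):
    """Turn per-attempt limits (whole seconds) into the multi-string form.

    A trailing "0" is appended because the last value must be 0.
    """
    if not limits or any(int(s) <= 0 for s in limits):
        raise ValueError("each limit must be a positive number of seconds")
    return [str(int(s)) for s in limits] + ["0"]


def write_timeouts(limits):
    """Write the value to the registry. Windows only, needs admin rights."""
    import winreg  # imported here so the rest loads on non-Windows hosts

    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(
            key, VALUE_NAME, 0, winreg.REG_MULTI_SZ, build_timeouts(limits)
        )
```

For example, `build_timeouts([3, 3, 3, 4, 8])` gives the preferred server 3 seconds on each of its first two attempts instead of 1 and 2. Longer limits also mean users wait longer for a fallback when the VPN really is down, so it is a trade-off.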






Another solution is to put a local caching DNS server on each office network and have the clients use that. You can use a DNS server that may be built into a router, or install something like dnsmasq.

