
hardware - Something is burning in the server room; how can I quickly identify what it is?


The other day, we noticed a terrible burning smell coming out of the server room. Long story short, it ended up being one of the battery modules in the UPS unit that was burning up, but it took a good couple of hours before we were able to figure that out. The main reason we figured it out at all is that the UPS display finally showed that the module needed to be replaced.




Here was the problem: the whole room was filled with the smell. Doing a sniff test was very difficult because the smell had infiltrated everything (not to mention it made us light-headed). We almost mistakenly took our production database server down because that's where the smell was the strongest. The vitals appeared to be OK (CPU temps showed 60 degrees C, and fan speeds were fine), but we weren't sure. It just so happened that the battery module that burnt up was at about the same height as the server in the rack and only 3 ft away. Had this been a real emergency, we would have failed miserably.



Realistically, actual server hardware burning up is a fairly rare occurrence, and most of the time the UPS will be the culprit. But with several racks holding several pieces of equipment each, it can quickly become a guessing game. How does one quickly and accurately determine which piece of equipment is actually burning up? I realize this is highly dependent on environmental variables such as room size, ventilation, location, etc., but any input would be appreciated.


Answer



The
general consensus seems to be that the answer to your question comes in two
parts:



How do we find the source of the funny
burning smell?



You've got the "How" pretty well
nailed down:





  • The "Sniff
    Test"

  • Look for visible
    smoke/haze

  • Walk the room with a thermal (IR) camera to
    find hot spots

  • Check monitoring and device panels for
    alerts
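
If "check monitoring and device panels" would otherwise mean walking rack to rack, a quick remote sweep can be faster. The following is a rough sketch, not part of the original answer: it pulls temperature readings from a handful of server BMCs with ipmitool, assuming IPMI-over-LAN is enabled; the hostnames, credentials, and 70 C threshold are all placeholders.

    # Rough sketch (assumptions: ipmitool installed, IPMI-over-LAN enabled on each BMC;
    # hostnames, credentials, and threshold are placeholders for your environment).
    import re
    import subprocess

    BMC_HOSTS = ["db1-bmc", "web1-bmc", "web2-bmc"]   # placeholder BMC hostnames
    USER, PASSWORD = "admin", "changeme"              # placeholder credentials
    THRESHOLD_C = 70.0                                # placeholder "worth a look" threshold

    def temps_for(host):
        """Return (sensor_name, degrees_C) pairs read from one BMC."""
        out = subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", host, "-U", USER, "-P", PASSWORD,
             "sdr", "type", "Temperature"],
            capture_output=True, text=True, timeout=30,
        ).stdout
        readings = []
        for line in out.splitlines():
            # Typical line: "CPU Temp | 30h | ok | 3.1 | 60 degrees C"
            m = re.match(r"^(.*?)\s*\|.*\|\s*([\d.]+)\s+degrees C", line)
            if m:
                readings.append((m.group(1), float(m.group(2))))
        return readings

    for host in BMC_HOSTS:
        for sensor, celsius in temps_for(host):
            flag = "  <-- check this one" if celsius >= THRESHOLD_C else ""
            print(f"{host:12s} {sensor:25s} {celsius:5.1f} C{flag}")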



You can improve
your chances of finding the problem quickly in a number of ways - improved monitoring is
often the easiest. Some questions to
ask:





  • Do you get temperature and other health alerts from your equipment?

  • Are your UPS systems reporting faults to your monitoring system? (See the sketch below for one way to poll a UPS directly.)

  • Do you get current-draw alarms from your power distribution equipment?

  • Are the room smoke detectors reporting to the monitoring system? (And can they?)
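
On the UPS question specifically: most network-attached UPSes can be polled over SNMP. The sketch below is an illustration only, using Net-SNMP's snmpget and the generic UPS-MIB (RFC 1628) OIDs for battery status, runtime, and temperature; the hostname and community string are placeholders, and many units only expose vendor MIBs, so verify the OIDs against your hardware before relying on this.

    # Rough sketch (assumptions: Net-SNMP's snmpget on the PATH, UPS speaks the
    # generic UPS-MIB from RFC 1628; host, community, and OIDs are placeholders
    # to verify against your hardware's MIB).
    import subprocess

    UPS_HOST = "ups1.example.com"   # placeholder
    COMMUNITY = "public"            # placeholder read community

    OIDS = {
        "battery_status":    "1.3.6.1.2.1.33.1.2.1.0",  # 1=unknown 2=normal 3=low 4=depleted
        "minutes_remaining": "1.3.6.1.2.1.33.1.2.3.0",
        "battery_temp_C":    "1.3.6.1.2.1.33.1.2.7.0",
    }

    def snmp_get(oid):
        """Fetch a single OID value as text (-Oqv prints just the value)."""
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", UPS_HOST, oid],
            capture_output=True, text=True, timeout=10,
        )
        return result.stdout.strip()

    readings = {name: snmp_get(oid) for name, oid in OIDS.items()}
    print(readings)

    # A monitoring check could page someone the moment the battery leaves "normal":
    if readings["battery_status"] not in ("2", "batteryNormal"):
        print("ALERT: UPS battery is not reporting normal -- investigate")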




When should we troubleshoot versus hitting the Big
Red Switch?




This is a more interesting question. Hitting the big red switch can cost your company a huge amount of money in a hurry: clean-agent releases can run into the tens of thousands of dollars, and the outage/recovery costs after an emergency power off (EPO, "dropping the room") can be devastating. You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.



Conversely, a fire in a server room can cost your company its data and equipment, and more importantly your staff's lives. Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.



The guidelines that follow are my personal limits, applied in the absence of (or in addition to) any other clearly defined procedures or rules. They've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.




  1. If
    you see smoke or fire, drop the room

    This should go without
    saying but let's say it anyway: If there is an active fire (or smoke indicating that
    there soon will be) you evacuate the room, cut the power, and discharge the fire
    suppression system.
    Exceptions may exist (exercise some common sense), but
    this is almost always the correct
    action.


  2. If you're
    proceeding to troubleshoot, always have at least one other person
    involved

    This is for two reasons. First, you do not want to be
    wandering around in a datacenter and all of a sudden have a rack go up in the row you're
    walking down and nobody knows you're there. Second, the other person is your sanity
    check on troubleshooting versus dropping the room, and should you make the call to hit
    the Big Red Switch you have the benefit of having a second person concur with the
    decision (helps to avoid the career-limiting aspects of such a decision if someone
    questions it
    later).



  3. Exercise prudent safety measures while troubleshooting

     Make sure you always have an escape path (an open end of a row and a clear path to an exit).
     Keep someone stationed at the EPO / fire suppression release.
     Carry a fire extinguisher with you (Halon or other clean-agent, please).
     Remember rule #1 above.
     When in doubt, leave the room.
     Take care with your breathing: use a respirator or an oxygen mask. This might save your health in the case of a chemical fire.


  4. Set a limit and stick to it

     More accurately, set two limits:

     • Condition ("How much worse will I let this get?"), and

     • Time ("How long will I keep trying to find the problem before it's too risky?").



     The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so when you DO pull power you're not crashing a bunch of active machines and your recovery time will be much shorter. But remember that if the orderly shutdown is taking too long, you may have to let a few systems crash in the name of safety. (For a rough illustration of such a shutdown, see the sketch after this list.)



  5. Trust
    your gut

    If you are concerned about safety at any time, call
    the troubleshooting off and clear the room.
    You may or may not drop the room
    based on a gut feeling, but regrouping outside the room in (relative) safety is
    prudent.
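
As a footnote to the "orderly shutdown" point in #4: the shutdown itself has to respect the time limit. The sketch below is hypothetical and not part of the original answer; the host list, root SSH access, and five-minute budget are placeholder assumptions for illustration.

    # Rough sketch (assumptions: SSH key access as root, a known host list per
    # rack, and a five-minute budget; all placeholders for illustration).
    import subprocess
    import time

    RACK_A_HOSTS = ["db1", "app1", "app2", "backup1"]   # placeholder hostnames
    DEADLINE_SECONDS = 300                              # placeholder time budget

    start = time.monotonic()
    procs = {}

    # Fire off the shutdowns in parallel; don't let one slow host serialize the rest.
    for host in RACK_A_HOSTS:
        procs[host] = subprocess.Popen(
            ["ssh", "-o", "ConnectTimeout=5", f"root@{host}", "shutdown", "-h", "now"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )

    for host, proc in procs.items():
        remaining = DEADLINE_SECONDS - (time.monotonic() - start)
        if remaining <= 0:
            print(f"{host}: out of time -- it will have to crash when the power drops")
            continue
        try:
            proc.wait(timeout=remaining)
            print(f"{host}: shutdown command sent")
        except subprocess.TimeoutExpired:
            print(f"{host}: no response before the deadline")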




If there isn't imminent danger, you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: their mandate is to protect people first, then property, but they're obviously the experts in dealing with fires, so you should do what they say!)




We've addressed this in comments, but it may as well get summarized in an answer too -- @DeerHunter, @Chris, @Sirex, and many others contributed to the discussion.


