The other day, we noticed a terrible
burning smell coming out of the server room. Long story short, it ended up being one of
the battery modules that was burning up in the UPS unit, but it took a good couple of
hours before we were able to figure it out. The main reason we were able to figure it
out is that the UPS display finally showed that the module needed to be
replaced.
Here was the problem: the
whole room was filled with the smell. Doing a sniff test was very difficult because the
smell had infiltrated everything (not to mention it made us light-headed). We almost
mistakenly took our production database server down because that's where the smell was the
strongest. The vitals appeared to be OK (CPU temps showed 60 degrees C, and fan speeds were
OK), but we weren't sure. It just so happened that the battery module that burnt up was
about the same height as the server on the rack and only 3 ft away. Had this been a real
emergency, we would have failed
miserably.
Realistically, actual server hardware
burning up is a fairly rare occurrence, and most of the time
we'll be looking at the UPS as the culprit. But with several racks, each holding several pieces of
equipment, it can quickly become a guessing game. How does one quickly and
accurately determine what piece of equipment is actually burning up? I
realize this question is highly dependent on environmental variables such as room
size, ventilation, location, etc., but any input would be
appreciated.
The
general consensus seems to be that the answer to your question comes in two
parts:
How do we find the source of the funny
burning smell?
You've got the "How" pretty well
nailed down:
- The "Sniff Test"
- Look for visible smoke/haze
- Walk the room with a thermal (IR) camera to find hot spots
- Check monitoring and device panels for alerts
You can improve
your chances of finding the problem quickly in a number of ways - improved monitoring is
often the easiest. Some questions to
ask:
- Do you get temperature and other health alerts from your equipment?
- Are your UPS systems reporting faults to your monitoring system?
- Do you get current-draw alarms from your power distribution equipment?
- Are the room smoke detectors reporting to the monitoring system? (And can they?)
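To illustrate what answering "yes" to those questions buys you, here is a minimal Python sketch that ranks devices by self-reported faults and by how far each is running above its normal temperature. The `Reading` structure, the device names, the baseline values, and the 10-degree threshold are all hypothetical; in practice the data would come from whatever monitoring system you already run.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    device: str          # hypothetical device label
    temp_c: float        # current temperature in degrees C
    baseline_c: float    # assumed "normal" temperature for this device
    fault: bool = False  # True if the device itself reports a fault


def rank_suspects(readings, delta_c=10.0):
    """Return devices that report a fault or run at least delta_c
    above their baseline, most suspicious first."""
    suspects = [
        r for r in readings
        if r.fault or (r.temp_c - r.baseline_c) >= delta_c
    ]
    # Self-reported faults sort first, then by deviation from baseline.
    suspects.sort(key=lambda r: (not r.fault, -(r.temp_c - r.baseline_c)))
    return [r.device for r in suspects]


# Made-up readings for illustration only.
readings = [
    Reading("db-prod-1", temp_c=60.0, baseline_c=58.0),
    Reading("ups-1/battery-3", temp_c=71.0, baseline_c=30.0, fault=True),
    Reading("switch-2", temp_c=55.0, baseline_c=40.0),
]
print(rank_suspects(readings))
```

The point of comparing against a baseline rather than an absolute number is that a server sitting at 60 degrees C may be perfectly normal, while a battery module at the same temperature probably is not.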
When should we troubleshoot versus hitting the Big
Red Switch?
This is a more
interesting question.
Hitting the big red switch can cost your company a huge
amount of money in a hurry: clean-agent releases can run into the tens of thousands of
dollars, and the outage / recovery costs after an emergency power off (EPO, "dropping
the room") can be devastating.
You do not want to drop a datacenter because a
capacitor in a power supply popped and made the room
smell.
Conversely, a fire in a server room can
cost your company its data/equipment, and more importantly your staff's lives.
Troubleshooting "that funny burning smell" should never take
precedence over safety, so it's important to have some clear rules about
troubleshooting "pre-fire" conditions.
The
guidelines that follow are my personal limitations that I apply in the
absence of (or in addition to) any other clearly defined procedure/rules - they've
served me well and they may help you, but they could just as easily get me killed or
fired tomorrow, so apply them at your own
risk.
If
you see smoke or fire, drop the room
This should go without
saying but let's say it anyway: If there is an active fire (or smoke indicating that
there soon will be) you evacuate the room, cut the power, and discharge the fire
suppression system.
Exceptions may exist (exercise some common sense), but
this is almost always the correct
action.
If you're proceeding to troubleshoot, always have at least one other person
involved
This is for two reasons. First, you do not want to be
wandering around in a datacenter and all of a sudden have a rack go up in the row you're
walking down and nobody knows you're there. Second, the other person is your sanity
check on troubleshooting versus dropping the room, and should you make the call to hit
the Big Red Switch you have the benefit of having a second person concur with the
decision (helps to avoid the career-limiting aspects of such a decision if someone
questions it
later).
Exercise prudent safety measures while troubleshooting
Make sure you always have an escape path (an open end of a row and a clear path to an exit).
Keep someone stationed at the EPO / fire suppression release.
Carry a fire extinguisher with you (Halon or other clean-agent, please).
Remember rule #1 above.
When in doubt, leave the room.
Take care with your breathing: use a respirator or an oxygen mask. This might save your health in case of a chemical fire.
Set a limit and stick to it
More accurately, set two limits:
- Condition ("How much worse will I let this get?"), and
- Time ("How long will I keep trying to find the problem before it's too risky?").
The limits you
set can also be used to let your team begin an orderly shutdown of the affected area, so
when you DO pull power you're not crashing a bunch of active
machines, and your recovery time will be much shorter, but remember that if the orderly
shutdown is taking too long you may have to let a few systems crash in the name of
safety.
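If it helps to see the two limits written down before you ever need them, here is a rough Python sketch of that decision logic. The condition levels, the 30-minute limit, and the idea of starting the orderly shutdown at the halfway mark are assumptions of mine for illustration; pick values that fit your own site and procedures.

```python
from enum import IntEnum


class Condition(IntEnum):
    # Hypothetical severity scale; define your own.
    SMELL_ONLY = 1
    SMELL_WORSENING = 2
    VISIBLE_SMOKE = 3
    FIRE = 4


def next_action(condition, elapsed_min,
                max_condition=Condition.SMELL_WORSENING,
                max_minutes=30, shutdown_fraction=0.5):
    """Decide what to do given how bad things are and how long
    troubleshooting has been going on."""
    # Smoke or fire: no troubleshooting, drop the room.
    if condition >= Condition.VISIBLE_SMOKE:
        return "drop the room"
    # Either limit exceeded: stop looking and pull power.
    if condition > max_condition or elapsed_min >= max_minutes:
        return "pull power"
    # Getting close to a limit: start shutting systems down cleanly.
    if condition == max_condition or elapsed_min >= shutdown_fraction * max_minutes:
        return "begin orderly shutdown"
    return "keep troubleshooting"


print(next_action(Condition.SMELL_ONLY, elapsed_min=5))
print(next_action(Condition.SMELL_ONLY, elapsed_min=20))
print(next_action(Condition.VISIBLE_SMOKE, elapsed_min=0))
```

The value of deciding the numbers in advance is that nobody has to argue about them while standing in a room that smells like burning electronics.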
Trust
your gut
If you are concerned about safety at any time, call
the troubleshooting off and clear the room.
You may or may not drop the room
based on a gut feeling, but regrouping outside the room in (relative) safety is
prudent.
If
there isn't imminent danger you may elect to bring in the local fire department before
taking any drastic actions like an EPO or clean-agent release. (They may tell you to do
so anyway: Their mandate is to protect people, then property, but they're obviously the
experts in dealing with fires so you should do what they
say!)
We've
addressed this in comments, but it may as well get summarized in an answer too --
@DeerHunter, @Chris, @Sirex, and many others contributed to the
discussion