[BlueOnyx:16811] Re: Server memory fault

George F. Nemeyer tigerwolf at tigerden.com
Mon Jan 5 04:06:02 -05 2015


On Sun, 4 Jan 2015, Richard Morgan :: Morgan Web wrote:

> I suspect this is hardware, but I'm sure someone has seen this before so
> a spot of guidance would be appreciated.

Memory, CPU, and other logic can get flaky if the temperature goes up.

Can you determine the temperature of the box and CPU?  If the fans have
RPM sensors, are the speeds normal?  If speeds are low, the bearings could
be failing; if high, that may be a hint of dust-clogged filters or vent
openings (since the fans aren't actually pushing air, they speed up).
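As a rough sketch of how you might pull those readings, assuming a Linux
box whose sensor drivers expose the usual hwmon sysfs files (temp*_input in
millidegrees Celsius, fan*_input in RPM).  The function name read_hwmon is
just made up for illustration, and not every board exposes these files:

```python
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Collect temperatures (deg C) and fan speeds (RPM) from hwmon sysfs."""
    temps, fans = {}, {}
    base = Path(root)
    if not base.is_dir():
        # No hwmon support (or not Linux): report nothing rather than fail.
        return {"temps_c": temps, "fans_rpm": fans}
    for chip in sorted(base.glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for f in sorted(chip.glob("temp*_input")):
            try:
                # temp*_input is reported in millidegrees Celsius.
                temps[f"{chip_name}/{f.name}"] = int(f.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor present but unreadable
        for f in sorted(chip.glob("fan*_input")):
            try:
                # fan*_input is reported in RPM.
                fans[f"{chip_name}/{f.name}"] = int(f.read_text())
            except (OSError, ValueError):
                continue
    return {"temps_c": temps, "fans_rpm": fans}
```

What counts as a "normal" fan speed or temperature depends on the
hardware, so compare against the vendor's figures or a known-good box.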

Sometimes you can run smartctl -a to look at the hard drive temps as a
proxy for the rest of the unit if there are no CPU or other general sensors.
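If you want to script that, here is a minimal sketch that picks the
temperature out of the ATA attribute table printed by smartctl -a.  The
function names and the device path in the usage note are my own
assumptions; SCSI/SAS and NVMe drives print temperature in a different
layout that this does not handle:

```python
import re
import subprocess


def parse_drive_temp(smartctl_output):
    """Return the drive temperature in Celsius from `smartctl -a` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # ATA attribute rows have ten columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in (
            "Temperature_Celsius",
            "Airflow_Temperature_Cel",
        ):
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def drive_temp(device):
    """Run smartctl on a device (usually needs root) and parse the result."""
    # smartctl uses its exit status as a bitmask, so a non-zero status is
    # not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True
    )
    return parse_drive_temp(result.stdout)
```

Something like drive_temp("/dev/sda") would be the typical call, with the
device name adjusted for the box in question.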

Check with the hosting company about ambient temp in the area...it could
be the input air is too warm from building AC failures or poor air flow.

If all the temps are nominal, and there's good air flow through the box,
then it could be the memory itself.  If there's real ECC RAM in the machine,
I'd *tend* to see that as a warning that specific memory cells are
failing.
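On a Linux box with ECC, the kernel's EDAC driver (if one is loaded for
your memory controller, which I'm assuming here) keeps running counts of
corrected and uncorrected errors in sysfs.  A small sketch for reading
them; the function name read_edac_counts is just for illustration:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from EDAC sysfs."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC driver loaded, or no ECC support: nothing to report.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key in ("ce_count", "ue_count"):
            try:
                # ce_count = corrected errors, ue_count = uncorrected errors
                entry[key] = int((mc / key).read_text())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts
```

A corrected-error count that keeps climbing between checks is the sort of
early warning I mean.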

Many PC-based machines can run memtest86 as a boot-up option.  This is a
routine that runs without any OS and pounds memory hard with a variety of
patterns that can help spot intermittent issues.  See the documentation,
if it's available, for the various options that can help focus the
testing in different ways.

If the errors tend to show up in one segment or 'block' of memory, that's
likely a bad memory chip.  Try swapping the sticks from one socket to
another and re-run the tests.  If the errors follow one stick, you've found
the problem.  If they stay put at the same overall addresses, there's
likely some addressing issue that may be motherboard related.
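The logic of the swap test can be written down as a small sketch.  It
takes the failing addresses you noted before and after swapping sticks and
compares which coarse blocks of memory they fall in.  The 256 MiB block
size and the function names are arbitrary choices of mine, and on boards
that interleave memory across channels the errors may not cluster this
neatly:

```python
def blocks(addresses, block_size=256 * 1024 * 1024):
    """Map failing addresses to the set of coarse memory blocks they hit."""
    return {addr // block_size for addr in addresses}


def interpret_swap(before, after, block_size=256 * 1024 * 1024):
    """Compare failing addresses from test runs before and after a stick swap."""
    b = blocks(before, block_size)
    a = blocks(after, block_size)
    if not b:
        return "no errors before swap: nothing to compare"
    if not a:
        return "no errors after swap: inconclusive, re-run the test"
    if a.isdisjoint(b):
        # Errors moved to a different region along with the stick.
        return "errors followed the stick: suspect that stick"
    if a <= b or b <= a:
        # Errors stayed in the same region despite the swap.
        return "errors stayed put: suspect socket or motherboard addressing"
    return "mixed result: re-test, possibly more than one fault"
```

For example, interpret_swap([0x10000004], [0x50000000]) reports that the
errors followed the stick, since the failing block moved after the swap.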

Another possibility is that the power supply filtering or the caps on the
motherboard are going bad and letting random voltage spikes corrupt memory
accesses.  This tends to be somewhat intermittent behavior, and most often
happens on older boards with lots of hours on them.  Power supply
noise-induced errors usually start out gradually, but will increase over
time, until the board may not even boot at all.

I'd recommend the steps above to try to prove what's at fault first rather
than just blindly replacing the RAM and hoping it's fixed.

George Nemeyer
Tigerden Internet Services


