[BlueOnyx:21466] Bizarre 5209R loses network config

Tue Oct 3 15:13:47 -05 2017

Hi Gang,

This isn't really a question - more of a report.  I'm posting this here 
in case it comes up again and saves anyone from any extended brain damage.

Earlier today one of our customers on a 5209R server issued a reboot and 
then reported that the system failed to come back online.  We got console 
access on the box and found it sitting at the login prompt, so an 
investigation began.

At first it appeared that the IP addresses were active on the box, but 
there was no network communication.   ifconfig reported the primary IP 
and all of the associated aliases, but there was no link at the NIC.
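
For anyone repeating this check, here is a rough sketch of listing every interface the kernel knows about together with its link state, read from sysfs. The function name list_links is made up for illustration, and the default path /sys/class/net is the usual sysfs location rather than anything specific to this box. An interface that is administratively down (or an entry that is not a real device) shows up as "unknown" because its carrier file cannot be read.

```shell
# Hypothetical helper: print "<interface> <up|down|unknown>" for each entry.
# Pass a different directory as $1 to try it against test data.
list_links() {
    net_dir="${1:-/sys/class/net}"
    for dev in "$net_dir"/*; do
        [ -e "$dev" ] || continue
        # carrier holds 1 when a link is detected and 0 when it is not;
        # the read fails if the interface is not up.
        carrier=$(cat "$dev/carrier" 2>/dev/null) || carrier=""
        case "$carrier" in
            1) state=up ;;
            0) state=down ;;
            *) state=unknown ;;
        esac
        printf '%s %s\n' "${dev##*/}" "$state"
    done
}
```

A listing like this also shows any interface that came up under a name other than ethX.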

After troubleshooting the physical connections and eliminating the 
switch as a problem, we moved on to diagnosing the network connection at 
the OS level.  One curious thing we found was that 
/etc/udev/rules.d/70-persistent-net.rules had two distinct MAC entries 
listed for eth0.  That is two entries, just as you might expect when 
there are 2 interfaces (typically eth0 and eth1), but both were named eth0.
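
A quick way to spot that kind of duplicate is sketched below. It assumes the usual rule layout where each line ends in a NAME="ethX" assignment. The function name dup_iface_names is hypothetical, and grep -o is a GNU/BSD extension rather than strict POSIX.

```shell
# Hypothetical helper: print every NAME="..." assignment that occurs more
# than once in the rules file. No output means no duplicate names.
dup_iface_names() {
    rules_file="${1:-/etc/udev/rules.d/70-persistent-net.rules}"
    # Drop comment lines, pull out the NAME="..." assignments, and keep
    # only the ones that repeat.
    grep -v '^[[:space:]]*#' "$rules_file" \
        | grep -o 'NAME="[^"]*"' \
        | sort | uniq -d
}
```

Run with no argument it reads the file named above. Pass a path to check a backup copy instead.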

Our usual procedure at that point is to empty out 
70-persistent-net.rules and reboot so that udev rebuilds it.  But 
after doing this, there were NO entries in 70-persistent-net.rules. 
Since we knew the MAC address on the interface, we added a manual entry 
and rebooted again.  No joy.

When that happened, I began to wonder why the system did not see an 
eth0.   We searched for other possible interface names but didn't see 
any.   I poked my head into the grub configuration, as on a RHEL7 box we 
typically need to force the ethX naming convention there by specifying 
the net.ifnames=0 variable.   It was missing.
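
Whether the running kernel was actually booted with that parameter can be checked without opening the grub files at all. The sketch below reads /proc/cmdline, which holds the command line the running kernel was started with. The function name has_ifnames_off is invented for this example.

```shell
# Hypothetical helper: succeed (exit 0) if net.ifnames=0 is on the kernel
# command line, fail otherwise. $1 overrides the file for testing.
has_ifnames_off() {
    cmdline_file="${1:-/proc/cmdline}"
    # One parameter per line, then look for an exact, literal match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF 'net.ifnames=0'
}
```

Typical use would be something like: has_ifnames_off || echo "net.ifnames=0 is missing"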

To fix, we edited the kernel line (it begins with GRUB_CMDLINE_LINUX; on 
RHEL7 it normally lives in /etc/default/grub) and appended 
"net.ifnames=0" (no quotes) to the tail end of the value on that line, 
inside the existing closing quote.  After that, the grub configuration 
was regenerated with the grub2-mkconfig command.  (We backed up the 
existing conf file first, just in case.)
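
The edit described above can be sketched as a small shell function. This is an illustration and not the exact commands that were run on this box: the function name add_ifnames_off is made up, /etc/default/grub is assumed as the usual RHEL7 location of the GRUB_CMDLINE_LINUX line, and sed -i assumes GNU sed.

```shell
# Hypothetical helper: append net.ifnames=0 inside the quoted value of
# GRUB_CMDLINE_LINUX, keeping a .bak copy of the file first.
add_ifnames_off() {
    grub_defaults="${1:-/etc/default/grub}"
    # Nothing to do if the parameter is already there.
    if grep -q '^GRUB_CMDLINE_LINUX=.*net\.ifnames=0' "$grub_defaults"; then
        return 0
    fi
    cp -p "$grub_defaults" "$grub_defaults.bak"
    # Insert the parameter just before the closing quote of the value.
    sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"[[:space:]]*$/\1 net.ifnames=0"/' \
        "$grub_defaults"
}
```

After a change like this the boot configuration still has to be regenerated, for example with grub2-mkconfig -o /boot/grub2/grub.cfg on a BIOS-booted system. That output path is an assumption here, and EFI installs keep grub.cfg elsewhere.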

After another reboot, the system came online just like it always did.

It's unknown to us whether the customer or any of the customer's users 
may have made edits to the grub config, or if something got changed in a 
recent yum update, or some other fluke.

FWIW, the customer reported the reboot was initiated because httpd had 
stopped responding.  Rather than tackle that, the box got rebooted in a 
bid to restore service faster than doing an investigation on the initial 
problem.   My guess is the httpd lock-up is related to other reports 
that have been made here and isn't directly connected to the grub issue.

The only thing that gives me concern, though, is whether those boxes 
that had Apache lock up on them late last week / early this week are 
going to do the same thing when they are rebooted.  (Obviously this would 
apply only to physical bare-metal installs and not Aventurin{e} virtuals.)

-- 
Chris Gebhardt
VIRTBIZ Internet Services
Access, Web Hosting, Colocation, Dedicated
www.virtbiz.com | toll-free (866) 4 VIRTBIZ