[BlueOnyx:24258] Newlinq outage - Post Mortem

Greg Kuhnert gkuhnert at compassnetworks.com.au
Tue Sep 8 17:16:22 -05 2020


Now that it’s fixed, I thought I’d share the post mortem for the benefit of others.

I use Aventurine for the systems that run the shop, NewLinQ, and many other bits and pieces. I have a NAS that provides external storage, which is NFS mounted to the Aventurine node CTs. Generally things work fine, but occasionally a kernel NFS bug causes some grief: I notice high CPU, and I am unable to unmount or access some NFS shares, either in the host node or in one or more virtual servers.
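For anyone who wants to see which NFS mounts a host node or CT currently has before poking at them, here is a minimal Python sketch. It only parses text in the /proc/mounts format. The function name is made up for illustration, and it does not try to detect whether a mount is actually hung.

```python
def nfs_mounts(mounts_text):
    """Return (source, mountpoint, fstype) for each NFS entry in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            # Skip blank or malformed lines.
            continue
        source, mountpoint, fstype = fields[:3]
        if fstype in ("nfs", "nfs4"):
            found.append((source, mountpoint, fstype))
    return found
```

On a live system you would feed it the contents of /proc/mounts, for example `nfs_mounts(open("/proc/mounts").read())`.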

If it’s a virtual server, I just reboot it. But sometimes there are problems at the host level, and that requires a physical reboot of the Aventurine node.

First problem: My initial reboot failed. It was unable to unmount the NFS volume, and refused to shut down.

Next problem: At that point I was unaware of the kernel updates that had recently been released. The weird thing about them was that things generally worked fine, but when mounting some virtual servers, the whole host node and all virtual servers would reboot.

And just to add insult to injury, the reboots were going VERY slowly. The node was running fsck on large volumes every time. NTP was also VERY slow, taking about 10-15 minutes on each boot. On top of that, something was borked with the OpenVZ quotas, which caused them to be recalculated, taking about 10-15 minutes per virtual server each time it rebooted.

To get things under control, before I knew about the kernel problem, I disabled auto-start of the VPSs to at least get the host node stable. I manually fixed the OpenVZ quotas and got it all stable. Then I started two virtual servers, and the host node rebooted again.
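As a rough illustration of turning off auto-start in bulk, the sketch below builds `vzctl set <CTID> --onboot no --save` commands for a list of containers. This is an assumption about how it could be scripted, not a record of what was actually run during the outage; the container IDs and function names are hypothetical, and it defaults to a dry run that only prints the commands.

```python
import subprocess


def onboot_commands(ctids, enabled=False):
    """Build one 'vzctl set' argv per container to switch auto-start on or off."""
    value = "yes" if enabled else "no"
    return [["vzctl", "set", str(ctid), "--onboot", value, "--save"] for ctid in ctids]


def apply_commands(commands, dry_run=True):
    """Print each command, or run it on the host node when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling `apply_commands(onboot_commands([101, 102]))` prints the two commands for the hypothetical CTs 101 and 102 without changing anything; passing `enabled=True` later builds the commands to turn auto-start back on.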

I disabled NTP and fsck on reboot to speed things up. I also observed that the OpenVZ quota problem happened even on a virtual machine that was previously clean when the host node rebooted. After a bit of digging, I found that disabling and re-enabling quotas on each VPS before a restart fixed that, and the restart of a virtual server was then instant.

The final piece of the jigsaw was the kernel. I did a rollback to a known stable version and re-enabled NFS on the VPSs where I had turned it off for testing. All was back to normal, and I went to bed at 3:30 in the morning.

Anyway. There it is folks. Gotta love kernel patches.

GK



> On 9 Sep 2020, at 3:58 am, Michael Stauber <mstauber at blueonyx.it> wrote:
> 
> Hi all,
> 
> Dirk wrote:
>> at the moment I cannot reach the NewLinQ server, and it is also not possible
>> to reach https://shop.blueonyx.it/ to file a ticket.
> 
> Yeah, we had an issue there and Greg has been working on it all night.
> At this time we should have everything up again.
> 
> -- 
> With best regards
> 
> Michael Stauber
> _______________________________________________
> Blueonyx mailing list
> Blueonyx at mail.blueonyx.it
> http://mail.blueonyx.it/mailman/listinfo/blueonyx




