03-05-2018, 12:38 AM
John
Join Date: Aug 1999
Location: NJ, USA
Posts: 2,101
Originally Posted by AZTheta
What you do is totally voodoo to me.
My first instinct was to respond that it's not as complicated as it seems... but it probably is. I've just been doing this stuff for so long that much of it has become second nature.

Originally Posted by John
I'll post details regarding these other issues either later tonight or tomorrow.

The forum software we use here at GC, like most forum software, stores its data in a MySQL database. MySQL, at least when the version of the forum software we're on was developed, defaulted to the MyISAM storage engine.

And it turns out that the MyISAM storage engine is not particularly resilient to sudden power loss, which has happened to GC's server quite a few times in the past month.

Essentially, if the database server was in the middle of saving data when the power was cut, only part of that data may have been written, with the rest lost or corrupted. That may or may not leave important data in the database corrupted.
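
To picture what a half-finished save looks like, here is a small Python sketch. It is only an illustration with a made-up record format, not how MyISAM actually stores rows: it saves a record normally, then saves it again but stops partway through as if the power had been cut, and checks whether what is left on disk can still be read.

```python
import json
import os
import tempfile


def write_record(path, record, cut_after=None):
    """Write a record as JSON; stop after cut_after bytes to mimic power loss."""
    data = json.dumps(record).encode("utf-8")
    if cut_after is not None:
        data = data[:cut_after]  # only the first part reaches the disk
    with open(path, "wb") as f:
        f.write(data)


def is_readable(path):
    """Return True if the file still holds a complete, parseable record."""
    try:
        with open(path, "rb") as f:
            json.loads(f.read().decode("utf-8"))
        return True
    except ValueError:
        return False


def demo():
    """Return (readable after a full save, readable after an interrupted save)."""
    record = {"user": "example_member", "posts": 42}  # made-up data
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "record.json")
        write_record(path, record)                # normal save
        complete = is_readable(path)
        write_record(path, record, cut_after=10)  # interrupted save
        torn = is_readable(path)
    return complete, torn


if __name__ == "__main__":
    print(demo())
```

The interrupted save leaves a file that exists but no longer parses, which is the same kind of "part saved, part lost" damage described above.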

Up until March 1st this, as far as I can tell, wasn't a big issue, since the problems always seemed to hit non-essential areas of the database. But on March 1st the two reboots crashed the user database table. According to the forum software developer, this sort of crash (despite being "repaired" using MySQL's repair functions) may have corrupted some GCer account records. Those records may not be recoverable, in which case the affected members would need to start a new account.

I'm definitely not okay with that, so I will be doing everything I can to ensure GC data is minimally impacted once all the server issues are sorted out. Nobody has emailed me so far about problems accessing their GC account, so maybe no accounts have been corrupted.

Also, I don't know for certain that the MySQL repair functions leave healthy data untouched. So there may be data corruption that is currently undetected. This is something that I'll be looking into.


What I'll be doing:

1. Stabilizing the GC hosting environment.

Currently I'm waiting for the datacenter to replace a faulty/failing power strip (power distribution unit, or PDU). After that I'll test the server hardware to determine whether these problems are due to the server going bonkers or whether the datacenter's PDU caused them.

2. I've been researching what changes to make, and I will either reinstall the current server or set up a new server in such a way that GC's database will be resilient (or at least significantly more resilient) to future power disruptions.

3. Possible data corruption. I'll try to determine if there is data corruption. If not, then we should be good from that point. However, if there is data corruption, I might restore the last trusted database backup (which is from just before the first hard reboot back in December) and merge everything new since then into that known good copy of the database.

What that will do is limit any potential resulting data corruption issues to only the past 3 months rather than the entire history of GC.

Unsure about that part but it's something I'm considering.
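
For the curious, the restore-and-merge idea can be sketched roughly like this in Python. This is only a toy under stated assumptions, not the actual procedure or the forum's real schema: rows are plain dicts, and the `id` and `created` fields, the `merge_into_backup` name, and the backup date are all hypothetical.

```python
from datetime import datetime


def merge_into_backup(backup_rows, current_rows, backup_time):
    """Start from the trusted backup and add only rows created after it.

    Rows are dicts with hypothetical 'id' and 'created' keys.
    """
    merged = {row["id"]: row for row in backup_rows}
    for row in current_rows:
        if row["created"] > backup_time:
            merged[row["id"]] = row  # newer than the backup: carry it over
        # older rows are taken from the backup, even if the current copy differs
    return [merged[key] for key in sorted(merged)]


if __name__ == "__main__":
    cutoff = datetime(2017, 12, 1)  # hypothetical backup time
    backup = [{"id": 1, "created": datetime(2017, 11, 1), "text": "good"}]
    current = [
        {"id": 1, "created": datetime(2017, 11, 1), "text": "possibly damaged"},
        {"id": 2, "created": datetime(2018, 1, 5), "text": "new"},
    ]
    for row in merge_into_backup(backup, current, cutoff):
        print(row["id"], row["text"])
```

Anything older than the backup comes from the trusted copy, so only rows created afterwards can carry damage forward. Note that the sketch ignores old rows that were legitimately edited after the backup; a real merge would have to decide how to handle those.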


And one last piece of info in this extra long message:

# ls -f | wc -l
That's a Linux command to count the entries in a directory. The -f flag skips sorting, which helps with this many files, and it also counts the . and .. entries, so the result is two higher than the actual file count. The result, 1,160,443, is the number of database error emails that have been sent to an email account on the server as a result of all these issues, probably just from the times immediately after the reboots and before I repaired the database. I haven't seen that number increase for hours, so chances are there aren't any (or aren't many) lingering database problems due to the reboots.
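
The same kind of count can be done from a script. Here is a minimal Python sketch (the `count_entries` name is just for illustration) that walks the directory one entry at a time instead of building and sorting a full list, which is the same reason for using -f with ls:

```python
import os


def count_entries(directory):
    """Count the entries in a directory without sorting or listing them all."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


if __name__ == "__main__":
    print(count_entries("."))
```

Unlike `ls -f`, `os.scandir` does not report the . and .. entries, so its count is two lower than the shell command's for the same directory.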

Those emails also aren't likely to be unique errors. There may be just a few dozen errors, each repeated thousands of times. If it becomes necessary for me to look through the errors, I'll write a program to sort through them and return just one message for each unique error.
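
A program like that could look something like the following Python sketch. It is not the actual script, and it rests on a few assumptions: each email is stored as its own file in one directory (which the file count above suggests), repeats of the same error differ only in numbers and spacing, and the function names are made up for the example.

```python
import email
import os
import re
from collections import Counter
from email import policy


def error_signature(text):
    """Collapse details that vary between repeats (numbers, extra spaces)."""
    text = re.sub(r"\d+", "N", text)
    return re.sub(r"\s+", " ", text).strip()


def message_body(path):
    """Return the plain-text body of one stored email file."""
    with open(path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def unique_errors(directory):
    """Return (count, sample message) pairs, most frequent error first."""
    counts = Counter()
    samples = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            body = message_body(entry.path)
            sig = error_signature(body)
            counts[sig] += 1
            samples.setdefault(sig, body)  # keep the first message seen
    return [(n, samples[sig]) for sig, n in counts.most_common()]
```

Calling `unique_errors` on the mail directory would return one sample message per distinct error along with how many times it occurred. How well the grouping works depends on how the real error messages vary, so the signature function would likely need tuning.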


That's it for now. Thanks for staying tuned in to GC!
John Hammell
Network Admin,