Blue Screen, Purple Screen, and Kernel Panics: How To Revive Your System

Introduction

Most of us have been there: one minute you are working diligently on an important document or presentation and BAHM!  It happens.  You are presented with a blue screen with “information” that might as well be in Greek.  Even worse, if you manage servers, you may notice one day that your server applications no longer work, and when you investigate you see the dreaded screen.  Now not only can you not work, but the whole office is offline!

This Windows screen, also known as the “Blue Screen of Death” or “BSOD”, indicates a serious system error that could damage your system and data, so the system shuts down abruptly to prevent further problems.  While this means anything you were working on but had not saved is lost, it is a safety mechanism to avoid as much data loss as possible.

Usually the error (if you can manage to decipher it) basically says something happened that should never happen and is therefore strictly prohibited.  You could think of a traffic cop seeing multiple cars entering the intersection contrary to his instructions… the safest thing to do is stop all traffic, then figure out why it happened.  This is what Windows is doing; it doesn’t know exactly what happened, but it knows it is not good.

Image 1: Example blue screen from Windows XP

Other operating systems have similar screens, such as the “Purple Screen of Death” on VMware ESX and ESXi systems and kernel panics on Linux.  While all three error screens are essentially analogous, we will focus on Windows issues here; the same methodology applies to the other systems as well.

Image 2: VMware "Purple Screen of Death"

Image 3: Linux Kernel Panic

 

I have to say that in all my years of managing servers, nothing makes me more uneasy than one of these screens.  Hopefully this article will ease some of these fears by helping you get to the bottom of these errors.

Causes

Lots of different problems can cause blue screens including file system/application corruption, bad/incompatible drivers, or hardware problems.  From my experience the two most common issues are bad drivers and hardware problems.  Here are some suggestions to narrow down the cause of your blue screen.

Recent Changes

Take a deep breath, slow down, and relax.  Then think to yourself:  what was the last thing changed on the computer?  Did you install a new application?  Install the latest patches from Microsoft?  Did you install a new video card, mouse, or other peripheral?  Did you remove a peripheral recently?

If the blue screen happened at boot time, think of every change made since the last time you rebooted your computer; the triggering change could have been made weeks or even months ago.

If you have a suspect (such as a new video driver), try to undo that change.  By pressing F8 at bootup, for example, you can often choose “Last Known Good …” to boot without the most recent change.  You can also try selecting Safe Mode from the F8 menu to get some minimal access to your system and perhaps remedy the issue.

If you are dealing with boot or driver issues, this video may help you resolve them.

Hardware

If you cannot think of changes that were made to your system, the next most likely suspect is faulty hardware.  Before you assume hardware, however, make sure your company’s IT person has not installed something, and that Windows Automatic Updates did not just install something without your knowledge.  If either of those is true, try the suggestions made above.

In my experience with servers, hardware is almost always the issue if nothing has changed recently.  Server software these days is very reliable, and the blue screen or error message is likely telling you some physical server component is going bad.

The number one hardware-related problem seems to be memory.  Try running a memory scanner from a boot CD, such as Memtest86.  If any errors are found while running Memtest86, you’ve found the culprit.  Try reseating your memory (take all the memory out, and put it back in, in different slots); sometimes memory can become loose and reseating it will fix this.  If Memtest86 still shows errors, replace your memory.

Other common problems include failing disk drives, failing disk/RAID controllers, and failing power supplies.  Each of these is harder to put your finger on, but see “Gather More Information” below for some ideas.  If this is a critical system and you have the resources (money), I suggest you replace the system with new or other known good hardware, then repair the original hardware when time is less critical (or send it back to the manufacturer for a replacement).

If there is a drive problem you can run CHKDSK to try and correct disk errors (see the video above).  However, if this becomes a repeat occurrence you need to figure out the root cause, which is likely bad drives or bad disk controller hardware.

Gather More Information

If you’ve made it this far without resolution we are in for more involved work.  The name of the game here is keeping immaculate and contemporaneous notes on each blue screen.

  • What application were you using when the crash happened?
  • What exact error did the screen show?  Is it different each time?
  • When did the error occur (date and time)?
  • Was the system under load or idle during the time of crash (for example were you running an intensive report, was the hard drive light flashing violently, or was all calm)?
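One way to keep these notes consistent is to record the same fields for every crash.  The following is only a minimal sketch in Python, assuming a plain CSV file is good enough for your purposes; the names CrashNote and append_note, and the field names, are made up for illustration and do not come from any tool.

```python
import csv
from dataclasses import dataclass, asdict, fields
from pathlib import Path


@dataclass
class CrashNote:
    """Notes for a single blue screen (field names are illustrative)."""
    occurred_at: str      # date and time of the crash, written as text
    application: str      # what you were using when the crash happened
    error_text: str       # the exact error shown on the screen
    under_load: bool      # True if the system was busy, False if idle
    comments: str = ""    # anything else worth remembering


def append_note(path, note):
    """Append one note to a CSV file, writing a header row if the file is new or empty."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    names = [f.name for f in fields(CrashNote)]
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=names)
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(note))
```

Calling append_note("crash_notes.csv", CrashNote(...)) right after each crash, while the details are fresh, gives you one file you can later sort by date or by error text.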

The information you gather may help you form a hypothesis such as:

The blue screen occurs whenever I start Microsoft Word AND a report is running.

If the system seems to crash under heavy disk usage, for example, the problem may be disk related.  If the problem appears completely random, the system power supply is a likely culprit.  If the crashes happen when power sags or surges in the office, install an Uninterruptible Power Supply (UPS) on the system (a good idea for all servers anyway).
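Once you have notes from several crashes, a simple tally can point to a hypothesis like the one above.  This is a rough Python sketch, assuming you have reduced each crash to a set of short condition labels; the function name tally_conditions, the labels, and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def tally_conditions(crashes):
    """Count how often each condition, and each pair of conditions, was present at a crash.

    `crashes` is a list of sets; each set holds the conditions noted for one crash.
    """
    singles = Counter()
    pairs = Counter()
    for conditions in crashes:
        singles.update(conditions)
        # Sort so that a pair is always counted under the same key.
        pairs.update(combinations(sorted(conditions), 2))
    return singles, pairs


# Made-up notes from four crashes, for illustration only.
crashes = [
    {"word_open", "report_running"},
    {"word_open", "report_running", "heavy_disk"},
    {"report_running", "word_open"},
    {"word_open", "report_running"},
]

singles, pairs = tally_conditions(crashes)
total = len(crashes)
for pair, count in pairs.most_common(3):
    print(f"{pair[0]} AND {pair[1]}: present in {count} of {total} crashes")
```

A pair that shows up in nearly every crash is worth testing, but it is only a lead: if Word is open all day anyway, its presence at every crash proves little, so try to reproduce the crash deliberately before replacing anything.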

Checking the system logs before and after the crash may also give you some insight into the cause.  In Windows you can review logs by going to Start -> Run and typing eventvwr.  In Linux the dmesg command or /var/log/messages is a good place to look for errors.  On VMware ESXi, pressing ALT+F12 on a “crashed” system often gives you insight (see image below).

Image 4: VMware ESXi 4.1 server showing faulty storage system
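On a busy system the logs can be long, so it can help to filter them before reading.  Below is a small Python sketch that pulls out lines containing words that often accompany trouble; the keyword list and the function names are my own assumptions, and the list is deliberately broad, so expect some false positives and tune it to what your own systems print.

```python
import re

# Assumed keywords; adjust these for the messages your own systems produce.
SUSPECT_PATTERN = re.compile(r"panic|oops|i/o error|fail|timeout", re.IGNORECASE)


def find_suspect_lines(lines, pattern=SUSPECT_PATTERN):
    """Return (line_number, text) for every line that matches the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path):
    """Scan a text log, such as saved dmesg output or a copy of /var/log/messages."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_suspect_lines(handle)
```

For example, scan_log_file("/var/log/messages") on a Linux system returns the matching lines with their line numbers, so you can open the full log and read what happened just before and after each one.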

 

Conclusion

I hope this article has given you a few tips to resolve your most recent blue screen or kernel panic.  At the end of the day, most of these situations are different and a solid conclusion is often difficult to find.  In many cases you may end up replacing hardware until the problems stop, so before you start, weigh the cost of your time to fix the problem against the cost of replacing the system (or getting professional help).

For desktops or laptops, often it is not worth the effort to fix blue screens.  Just take your laptop or PC to your technician and have Windows reloaded.  If you continue to experience blue screens, purchase new equipment.

For servers, however, it is often worth the effort.  Servers are expensive to replace, and a replacement can often take days or weeks to arrive.  As servers often run critical applications, most companies cannot afford for a server to be offline for this amount of time waiting for a fix.  If servers are truly critical, you should pre-purchase another server to stand in when failures occur.  Our standard and comprehensive data protection plans solve this problem for our clients; learn more about our data protection plans.

If you need help with a blue screen or other system failure, see our recovery services.

2 Responses to Blue Screen, Purple Screen, and Kernel Panics: How To Revive Your System

  1. After posting this article, I was sent a few software tools that can help administrators more easily figure out the cause of blue screens on Windows systems.  Here are a couple of resource links:

    Blue Screen View
    Resplendence WhoCrashed

    Disclaimer: I have not used any of these tools, so try them at your own risk. Another reader commented that you can simply search on the error and read Microsoft’s KB article on that error. In my experience there is almost never a helpful KB article for blue screens, but it is worth a shot.

  2. Tim Collyer says:

    Great post Nick and excellent tips. I’ve used Blue Screen View before and it’s a nice tool. For quick and dirty troubleshooting though, I usually reference the stop error from the BSOD (just the first group of hex codes) and look it up here: http://www.aumha.org/a/stop.php

    I certainly agree about the RAM being the first point of hardware failure. I’ve seen it many a time. The good news is that it’s a cheap fix!
