Open Heart Surgery Lessons Learned for IT Part 3 – Monitoring Health and Disaster Recovery

18 Jul

One of the biggest issues we have with our health is poor monitoring. The same is true of the health of many of our systems. As a DBA, I’m always concerned about the health of my database servers, mainly because I would like to keep my job and not get the 3am call that something has gone terribly wrong. That said, no matter how careful we are, there is always a chance that something will go wrong.

This is part three of my lessons learned, with more of a focus on database monitoring and disaster recovery.

Observation: First of all, all of us need to monitor our health. Not to sound too much like a doctor, but health problems are generally the biggest reason a patient needs heart surgery. We have so much knowledge about what causes heart attacks, yet we are negligent. Of course not all heart problems are preventable, but there are many aspects one can monitor to stay as healthy as possible. For example, there are cholesterol tests, tests for hypertension, stress tests, echocardiograms and many other tools that can assist in health monitoring.

There are also certain lifestyle choices that one can make to maintain a healthy heart, like exercising, not eating too much junk food, and not smoking. Of course there are factors that are out of our control, such as our genetic makeup. This particular patient had a family history of heart disease (one sibling died of a heart attack), had quit smoking only two years prior, and was in the obese category weight-wise. The patient had also suffered a mild heart attack before going through with this procedure, and a diagnostic cath (diagnostic cardiac catheterization) was done to determine the level of blockage of the coronary arteries. The patient was also a senior citizen.

Take Back: First of all, let’s look at the disaster recovery part of this. When the doctors realized that there was a problem, they immediately prepared the patient for the best possible procedure. In today’s IT world we are plagued with problems like security breaches, hardware failures, missing backups, and many more. We also see the signs of failure, and many choose to ignore them or use the wrong method to keep the disaster from getting worse.

As a DBA, one of the most fundamental parts of disaster recovery is knowing your options very well. Developers do not get into the details of disaster recovery and high availability unless they really have an interest in it, so it is your job to make sure you know the tools that you have at your disposal. I have been told in the past that our disaster recovery response is to restore all backups. However, this does not provide high availability, since it could potentially take hours, days or even weeks to recover all databases.
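To put a rough number on "hours, days or even weeks", here is a minimal back-of-the-envelope sketch in Python. The function name, the database sizes and the restore throughput are all made-up assumptions for illustration; real restore times also depend on log replay, compression, network and storage, so treat the result as a lower bound at best.

```python
def estimate_restore_hours(database_sizes_gb, throughput_mb_per_s):
    """Estimate hours needed to restore every database one after another.

    database_sizes_gb: backup sizes in GB (hypothetical figures).
    throughput_mb_per_s: sustained restore speed in MB/s (an assumption
    you would measure on your own hardware).
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = sum(database_sizes_gb) * 1024  # GB -> MB
    seconds = total_mb / throughput_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Two hypothetical databases (500 GB and 1500 GB) restored at 100 MB/s.
    hours = estimate_restore_hours([500, 1500], 100)
    print(f"Estimated restore time: {hours:.1f} hours")  # about 5.7 hours
```

Even with these generous assumptions the business would be down for most of a working day, which is why "restore all backups" alone is not a high-availability plan.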

Disasters come in all shapes and forms, and the team at Microsoft has made sure we have many different options. Full server failover in the form of failover clustering was significantly improved in SQL Server 2012, which allows multi-subnet clustering so that servers can be separated geographically at different sites. If you lose power at your local data center, your remote data center 200 miles away can pick up the slack. Of course, this is one of the more expensive high-availability options. The next one is AlwaysOn Availability Groups. I like to think of it as a hybrid between clustering and database mirroring, with the advantage that groups of databases can be replicated to other servers and failed over together if necessary. There are many other options available, but I thought I would just mention the newer features and go into some detail on each one in future posts.

The other side of disaster recovery is being proactive and monitoring the health of your database servers. One issue that always catches a DBA is running out of space, since good disk arrays that are both RAID-protected and on a SAN are very expensive. There are many tools available for monitoring disk space across all servers. I’ve only used a handful, but I’m sure a Google search will bring up some good results for you. SQL Server also has many internal alerting systems that can notify you of a possible disaster on your server, and there are regular forms of maintenance that can be done to make sure that all your databases are healthy and functioning.
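As a small illustration of the disk space point, here is a minimal Python sketch that flags volumes running low on free space. It is not one of the monitoring tools mentioned above, just an example built on the standard library's `shutil.disk_usage`. The function name, the 15% threshold and the path being checked are assumptions; in practice you would pass in the drives that hold your data, log and backup files.

```python
import shutil


def low_space_volumes(paths, min_free_fraction=0.15, usage_fn=shutil.disk_usage):
    """Return (path, free_fraction) for each volume below the threshold.

    paths: volume or directory paths to check (hypothetical examples below).
    min_free_fraction: alert threshold, 0.15 means less than 15% free.
    usage_fn: injectable for testing; defaults to shutil.disk_usage.
    """
    alerts = []
    for path in paths:
        usage = usage_fn(path)
        if usage.total == 0:
            continue  # skip pseudo-volumes that report no capacity
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            alerts.append((path, free_fraction))
    return alerts


if __name__ == "__main__":
    # "." checks the volume holding the current directory.
    for path, free in low_space_volumes(["."]):
        print(f"WARNING: {path} has only {free:.0%} free space")
```

Scheduled to run every few minutes and wired to email or paging, even something this simple gives you a warning before the 3am call.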

A last point that I wanted to touch on is finding other options when you feel like your system is at its end. The patient I mentioned at the beginning of this post had the option of bypass surgery, but others may even require a full heart transplant. Similarly, a good DBA knows when to let go and replace old hardware, especially when warranties are about to expire. If you are happy to let warranties and support from hardware manufacturers expire, then start enjoying late nights in the data center. Sometimes we have to do difficult server migrations, which are time-consuming to plan and execute. Late last year, I was privileged to lead a database server migration for our main website. I’m glad that I had finished SQL 2012 with Gethyn Ellis, who gave me good advice on the process. It was a good experience, of course, but what came out of it was excellent for the company: a set of brand new servers! It’s nice to start fresh sometimes, and often it is the only solution. Just remember to take care of the new environment you have!

Finally, the key to all of this is being proactive. Disasters are hardest to avert when one is not prepared for them.

