It started a little more than an hour after the ball dropped on New Year’s Day 2016. We had had a few friends over to celebrate, and we were just saying goodbye to the last of our company when I started getting hit with a wave of texts right at 1:35 AM. I glanced down at my watch:
SEV 20 – The client was unable to reuse a session…
I had seen the error before. It always indicated some momentary saturation of the network. I would usually get a cluster of them, then the moment would pass and everything would be fine again. The condition only ever lasted long enough for the client to have to retry the connection, because connection pooling had failed to keep the session alive. The users wouldn’t even see a blip. It was one of those errors that I knew I should get around to solving, and it was on my backlog, but let’s be honest: with everything else I had to do every day, and the fact that there was zero user impact, this one had slipped pretty far down the list.
Normally, I would get a cluster of these errors and then everything would subside. This time, however, they just kept coming: hundreds of them, my phone chirping and buzzing with each new arrival. Then my custom SQL Mirroring job spat out a hundred or so e-mails telling me that all of my critical database mirrors had just disconnected! Uh-oh! This wasn’t just some blip…
I quickly explained to everyone that something was going on and rushed upstairs to log in and see what was causing this. Everything seemed fine on the Primary site’s servers: no SAN failure, good; no memory pressure, good; no crazy CPU, good. I checked a few databases’ mirroring statuses: DISCONNECTED, DISCONNECTED, DISCONNECTED. I RDPed to the DR servers and was able to connect in, but as I started to inspect more, *blip*, my RDP session dropped. I reconnected, started looking around again and, *blip*, disconnected again.
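You don’t need anything fancy to see those statuses outside of the alert e-mails; on any server hosting mirrored databases, a quick query against sys.database_mirroring will show the state of every mirror. A minimal sketch (the column list is just the handful I find most useful):

    SELECT  DB_NAME(database_id)   AS database_name,
            mirroring_role_desc,          -- PRINCIPAL or MIRROR
            mirroring_state_desc,         -- SYNCHRONIZED, SYNCHRONIZING, DISCONNECTED, ...
            mirroring_partner_instance
    FROM    sys.database_mirroring
    WHERE   mirroring_guid IS NOT NULL;   -- only databases that actually have a mirroring session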
So, it starts to dawn on me that something is going on between the two sites. I disconnect my VPN session from the Primary site and reconnect to our DR VPN. Now my RDP session to the DR servers is nice and stable. I try to connect to a Primary server, just to test my hypothesis: after a few seconds, *blip*, disconnected. This is when I jump on the phone and call our infrastructure team. I tell one of the guys that I think something is going on with our microwave connection between the sites and he starts digging in to see what might be the cause.
Our Beloved Microwave
For a little bit of background, we’ve had our DR site for about 15 years. Around ten years ago, we started to outgrow the 3 Mbps connection we had between our Primary and DR sites. We decided to go for a microwave connection because we could operate it on unlicensed frequencies for no annual cost and get about 100 Mbps full-duplex for just the initial equipment purchase, installation, and an annual service warranty. While our use of this connection has grown dramatically over the past ten years, we have rarely bumped up against that upper limit for more than short bursts. It has been a great solution for us.
Tonight, however, it seemed to be showing its age. We had fought to get a replacement pair put into the budget each year for the past several years. We knew it was past its “best by” date, and our projections showed that before long 100 Mbps wasn’t going to cut it any longer. Upper management, however, had been reluctant to replace it proactively. No new story there.
About an hour later, the microwave started to calm down and become more stable once again. We were still seeing enormous latency, but we were at least maintaining connectivity. My mirrors were getting back in sync, and it looked like we might be OK through the weekend, at least. We had already put in a call to our warranty vendor, but since connectivity had been partially restored, they didn’t want to send anyone out, and our infrastructure team didn’t really want to make too much of an issue of it either. We would handle it on Monday.
Down for the Count
Well, it didn’t last that long. Just 36 hours later my phone starts blowing up again with alerts. I contact infrastructure again and they start scrambling to see what’s going on: they confirm, it’s the microwave. This time it seems it has gone down for good. I go ahead and disable my alerts so I can stop getting buzzed every few seconds and I start talking to the rest of my team about how we’re going to handle what appears to be an extended outage.
We quickly assess our status: the microwave connection is down, our database servers are still being actively hit by production users and working just fine, but our transaction logs are growing and our data at the DR site is getting more stale by the minute. Also, it’s the weekend, so we are running some of our most aggressive index maintenance. We decide to stop the index maintenance for now and see how long it will take infrastructure to get the link back up.
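If you want to put a number on “more stale by the minute,” database mirroring exposes its send queue through the performance counters. A rough sketch (the LIKE filters are just there to cope with named instances and with trailing spaces in counter names):

    SELECT  instance_name AS database_name,
            cntr_value    AS log_send_queue_kb     -- KB of log not yet sent to the mirror
    FROM    sys.dm_os_performance_counters
    WHERE   object_name  LIKE '%Database Mirroring%'
      AND   counter_name LIKE 'Log Send Queue KB%'
      AND   instance_name <> '_Total';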
Then comes the bad news: our warranty vendor is not able to fix the problem and we need new hardware, BUT they are having trouble sourcing a replacement. Now we see that this is going to be more than a 4-hour, or even 24-hour, turnaround. We realize that we are going to have to do something, or our DR site is going to become so stale that it will be essentially useless.
Maintain…
So, first, I mention to infrastructure that we have redundant Internet connections at both sites. There’s a whole 10 Mbps Internet connection at the DR site and a 30 Mbps Internet connection at the Primary site that aren’t usually being used; they’re just there as backups if the primary connections fail. I suggest that we take a couple of routers and set up a VPN between the two sites over the Internet so that we will at least have SOME connection between them. They set it up and route just the database VLAN over the new VPN tunnel. Now we have connectivity for our databases, and they slowly start trying to get back in sync.
This is fine for our smaller, less-active databases. But we have several very large databases, and one in particular is first on my mind. It is a 20 TB critical database, and it’s generating transaction log at a rate of about 12,000 KB/sec. There is no way our 10 Mbps connection is going to be able to catch up. The DR site is already 8 hours stale by this point, and the roughly 240 GB of unsent transaction log is estimated to take over two days to send, and that’s only if we were to stop transacting right now, which, of course, is not going to happen. We do ask a few of our heaviest users if we can delay some processes, but this only helps a bit; they cannot hold off their processes for very long, and we’re not really sure how long it will be before our 100 Mbps link is back up.
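Some rough back-of-the-envelope numbers make the problem obvious: 12,000 KB/sec of log generation is in the neighborhood of 96 Mbps, nearly ten times what a 10 Mbps tunnel can carry, so the send queue only grows while that database is busy. And even if all activity stopped, 10 Mbps moves at best about 4.5 GB per hour, so a 240 GB backlog is a bit over two days of nonstop sending.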
Sneaker-Net Rides Again
With Monday coming up quick and business getting ready to swing into “full-on” mode, we have to have a solution. I grab a 2 TB USB hard drive and plug it into one of the nodes of our cluster. I know that if I can get all of the transaction logs from the point of our last sent transaction up until our most recent transaction log backup, I can apply them at the DR site and keep this party going. Sneaker-Net has taken many forms over my career: it used to be floppy disks, when there either WAS NO network, or the 10Base2 connections had become unplugged for some reason; it then evolved into Zip Disks, packing a 100 MB punch of data at a time; and eventually burnable CDs helped get me through tough times; now, it was 2 TB I could fit in my pocket. I almost got nostalgic as I grabbed the DR site keys and jumped in my car. The robocopy command hadn’t taken very long, and I was on my way with a fistful of data and a plan.
Now, I must admit, breaking the mirror on my 20 TB database gave me pause. Sure, it SHOULD all work as planned, but what if I messed something up? I didn’t have time to consider it for long: I broke the mirror and started applying transaction log backups en masse. When the script finally finished, I re-established the mirror and, after just a few moments, we were “Synchronizing” again. The mirror was still stale by about an hour and a half, but that was, of course, due to the time it took to copy the transaction log backups from the Primary site to the external hard drive, drive them over to the DR site, copy the backups to the DR site server, and apply the transaction log backups there. It wasn’t going to be in sync; that just wasn’t possible. But it is way easier to tell the president of the company that we are a couple hours stale, rather than the alternative.
A Very Long Week
Our Infrastructure team ran into roadblock after roadblock getting the microwave replaced. I won’t go into the details at this time, but suffice it to say that you REALLY need to make sure your warranty vendor can provide the level of service they have signed up to deliver, especially on specialty equipment. Arbitration and compensation after the fact are NOT going to save your data during the fire. My team and I took turns running data between the Primary and DR sites a couple of times a day for the next several days while we waited for a resolution. We all breathed a sigh of relief when, at 10:25 PM on Tuesday, January 12th, I sent out the e-mail to our Business Continuity group: “All Production databases are fully synchronized with the DR site.” Short and simple, but that one sentence represented one crazy outage for us.
Steps Taken, For Reference
I’m going to enumerate the steps here, just in case anyone needs the specifics:
- Temporarily pause your transaction log backups at the Primary site so that new transaction log backups are not taken while you are couriering your data from one site to the other.
- Copy all transaction log backups from your Primary site to your media of choice; I prefer using robocopy for speed and simplicity. In a command prompt, issue something similar to: robocopy C:\TransactionLogBackups\ D:\SneakerNet\
- Physically transport the backups to the DR site and copy them to the DR server.
- Break the mirror from the DR site by issuing ALTER DATABASE dbname SET PARTNER OFF; (in some cases, I had to issue this command twice, otherwise the first transaction log restore would complain that the database was busy)
- Start applying the transaction log backups WITH NORECOVERY, in order, by issuing RESTORE LOG dbname FROM DISK = N'D:\LocationOnDisk\backup051.bak' WITH NORECOVERY; for each backup file (see the sketch after this list)
- After all transaction logs have been restored WITH NORECOVERY, re-connect the mirror from the Primary site.
- Don’t forget to re-enable your transaction log backups at the Primary site.
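For anyone who wants to see the whole round trip in one place, here is a rough T-SQL sketch of those steps. The database name, file paths, job name, and partner endpoints below are placeholders for illustration only, and it assumes the log backups run from a SQL Agent job:

    -- On the Primary: pause the transaction log backups first
    -- (placeholder job name; adjust to however your backups are scheduled).
    EXEC msdb.dbo.sp_update_job @job_name = N'LogBackup - dbname', @enabled = 0;

    -- ...robocopy the backups to the USB drive and drive them to the DR site...

    -- On the DR (mirror) server: remove the mirroring session so the database
    -- stays in RESTORING and will accept transaction log restores.
    ALTER DATABASE dbname SET PARTNER OFF;

    -- Apply each couriered log backup in sequence, always WITH NORECOVERY
    -- so the database remains ready for more restores (and for mirroring).
    RESTORE LOG dbname FROM DISK = N'D:\SneakerNet\backup051.bak' WITH NORECOVERY;
    RESTORE LOG dbname FROM DISK = N'D:\SneakerNet\backup052.bak' WITH NORECOVERY;
    -- ...and so on for every file you carried over...

    -- Re-establish mirroring: point the mirror at the principal's endpoint first
    -- (if the Primary still thinks the old session exists, it may need SET PARTNER OFF too).
    ALTER DATABASE dbname SET PARTNER = N'TCP://primary-server.example.local:5022';

    -- ...then, back on the Primary, point the principal at the mirror's endpoint.
    ALTER DATABASE dbname SET PARTNER = N'TCP://dr-server.example.local:5022';

    -- Finally, re-enable the log backup job on the Primary.
    EXEC msdb.dbo.sp_update_job @job_name = N'LogBackup - dbname', @enabled = 1;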
I hope you’ve enjoyed reading about this adventure. It is way easier to look back on it now that it’s over than it was to live through it. If you’ve never had to do any sort of emergency recovery, I recommend you practice: bring up a test instance, break it, and try to fix it. Working through the stress of figuring out a recovery when there isn’t company data on the line is far better than figuring it out once you’re already in the situation.