How a DDoS attack makes me a better DevOp ;-)


Frequent readers of my blog may know that I like to learn new things. And meandering from being a developer to becoming a DevOp gives you lots of opportunities for learning.

Today I learnt a lot about the things you usually shy away from as a developer. I woke up to the bad news that our server (www.kontolino.de) and application were practically useless. No responses, neither from the WordPress site nor from the application. Whatever you did, nothing happened. Sometimes a request would be answered after minutes; most requests, however, would just go nowhere.

So I decided to skip morning coffee, fire up my Mac and ssh into the server to see what was going on. Most of the time when things seem to stand still, it is time to either restart the application (which happens about once a month if there is no update) or reboot the machine. To my surprise, ssh didn’t get any response; I couldn’t log onto the machine. Ping and traceroute were fine, so the server was there. Our ISP’s monitoring application also said all monitored services were available – nothing unusual.
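For the record, the quick checks from my Mac looked more or less like this (the user name is just a placeholder, and the timeout option is only there so ssh gives up instead of hanging forever):

# basic reachability checks from the local machine
ping -c 5 www.kontolino.de                           # answers came back fine
traceroute www.kontolino.de                          # route looked normal, too
ssh -o ConnectTimeout=10 someuser@www.kontolino.de   # this one just hung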

So I thought I’d best start by rebooting the machine.

The bad news: I couldn’t. Our ISP’s admin interface would send a request to the server to reboot, but the machine wouldn’t go down, even after 20 minutes or so.

Then I had to learn another lesson: if your ISP offers a remote console, make sure you can use it before anything goes wrong – just to be prepared. The remote console, as it turns out, is a Java application. Until this morning, my Mac was Java-free, which I liked and was proud of. Until today.

So while our server was offline or broken somehow, I had the fun of installing Java on my Mac and learning how to fiddle with the Java security settings so that the xvnc viewer would run on my machine. The whole process only took about 20 minutes, but, boy, losing 20 minutes on such bull%&$% while your server out there is going nuts feels really bad 😉

As it turns out, a remote console is pretty similar to an ssh session when your server is at a load of more than 200: it lags just the same as a terminal session over ssh. Useless crap.

There was one more thing to try: our ISP’s management console allows us to switch the server off completely instead of rebooting it (it’s a VServer). So I tried that. After about a ten-minute wait, the server was down. I could only tell from an open terminal that had been constantly pinging the server all morning with 0% packet loss and very good response times: I knew the machine had been up the whole time, and now the pings stopped, so it really was down (which was exactly what I was hoping for).

Restarting was easy; the machine was back up after about 40 seconds.

Uff, restart accomplished – we’re back to business…

Time for some coffee and breakfast (at about 10:30 am) – it seemed all was well. I would just keep an eye on the server and see whether it behaved now.

For the first ten minutes or so, our Seaside application responded nicely, the WordPress pages were served and response times were normal. But then performance degraded again. By that time I had one terminal constantly pinging, one running top and one for poking around in several places like the database and the log files.
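Nothing fancy in those terminals, by the way – roughly this, with the usual Debian/Ubuntu Apache log location (yours may differ):

ping www.kontolino.de                     # terminal 1: watch for packet loss
top                                       # terminal 2: watch load and processes
tail -f /var/log/apache2/access.log       # terminal 3: watch requests coming in
# plus ad-hoc greps through application and database logs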

I also found out that several cron jobs that run at night to save backups onto other servers and such had not finished, but had not logged any return codes or errors either. The DB2 backup had been started hours ago and still had not finished. I even started a backup manually, and it ran for about 30 minutes, while it is usually done in less than 2 minutes.
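To see how long the backup had been running and whether cron had even started the nightly jobs, something along these lines does the trick (the syslog path is the Debian/Ubuntu default, so adjust as needed):

# how long have the DB2 backup processes been running?
ps -eo pid,etime,args | grep -i '[d]b2'
# did cron actually kick off the nightly jobs?
grep CRON /var/log/syslog | tail -n 20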

The best is yet to come…

Suddenly everything was back in the same sad state. None of the web browsers were getting responses any more, and the terminals would not react to key presses like q or Ctrl+C for minutes. top filled up with lots (meaning more than 30) of apache2 processes, all of them in the “D” state, which means they were waiting for disk I/O. The load average was constantly between 190 and 200, while it usually sits between 0.03 and 0.6. Yet the CPU percentages were low… ????
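By the way, instead of scrolling through top you can also ask ps directly which processes are stuck in uninterruptible sleep:

# list all processes in "D" state (uninterruptible sleep, usually waiting for disk I/O)
ps -eo state,pid,user,comm | awk '$1 == "D"'
# count how many of them are apache2
ps -eo state,comm | awk '$1 == "D" && $2 == "apache2"' | wc -l
# and the load average in one line
uptime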

At least there was some hint now.

The first thing to do was to stop Apache. service apache2 stop took about 25 minutes to finish, while it is usually done in a few seconds. Once Apache was down, I could ssh in and work on the machine again immediately.
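To be sure Apache really was gone before digging further, pgrep tells you whether any worker processes are still hanging around:

# check that no apache2 workers are left after the stop
pgrep -l apache2 || echo "no apache2 processes left"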

Interestingly, the Apache access logs were very large compared to the ones from two days ago. We are usually somewhere around 1 MB per site per day; yesterday we had 19 MB for one of the sites on the server alone. So this was the thing to look into: what kind of attack were we getting?
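Comparing the logs is a one-liner; the paths below assume the standard Debian/Ubuntu Apache layout, and the per-site logs on your machine will have their own names:

# which access logs are suspiciously large?
ls -lhS /var/log/apache2/*access*
# how many requests are in today's log?
wc -l /var/log/apache2/access.log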

It turned out we were getting lots of POST requests to /xmlrpc.php. I mean, lots and lots and lots of them. Every single one resulted in a spawned PHP process that mainly does … eat CPU cycles. So now I had something to search for: what is this beast actually used for, and how can I react to this attack?
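For the record, spotting the flood was just a matter of grepping the access log. Assuming the default combined log format (client IP in the first field, the request in fields 6 and 7), something like this shows both the sheer number of hits and the top offending IPs:

# how many POSTs to xmlrpc.php are in this log?
grep -c 'POST /xmlrpc.php' /var/log/apache2/access.log
# which IPs are hammering us the most?
awk '$6 ~ /POST/ && $7 == "/xmlrpc.php" {print $1}' /var/log/apache2/access.log \
  | sort | uniq -c | sort -rn | head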

It turns out there is not much I would need it for. WordPress uses it for pingbacks and for remote posting with certain tools or services, none of which I really use. So the question is: why not remove the file and let the attackers stare at 404 codes? Well, I tend to keep our machines as current as possible, from Linux kernels to WordPress, so chances are the next WordPress update would bring the file back and we’d see the same problems again.

So I found a tip on Perishable Press that sounded better: simply make Apache hide the file from the outside world. I just added this line to the Directory entry in the site’s httpd.conf:

RedirectMatch 403 /xmlrpc.php

And restarted Apache.
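A quick way to check that the block actually works is to fire a POST at the file yourself and look at the status code – it should be a 403 now instead of a 200:

# should print 403 now
curl -s -o /dev/null -w '%{http_code}\n' -X POST http://www.kontolino.de/xmlrpc.php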

The server is back online and seems to work just fine. The access logs are still filling up with entries for those POST requests, but since Apache now simply responds with a 403 instead of kicking off PHP code, the machine is a lot faster. top rarely shows more than 2 or 3 apache2 processes, the load is down to between 0.1 and 0.45, and the application and all sites on the server respond quickly again.

I’ve also learned that I could/should use fail2ban for such things, but I’ve wasted enough time for today and need to get some work done – and keep an eye on our server to see whether this nightmare is really over now.
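For the record, a fail2ban setup for this would roughly consist of a filter that matches those POST requests in the access log and a jail that bans the offending IPs. The following is only a sketch along the lines of the article linked in the update below – I haven’t deployed it yet, and the file names, retry count and ban time are made up:

# /etc/fail2ban/filter.d/apache-xmlrpc.conf (hypothetical filter)
[Definition]
failregex = ^<HOST> .* "POST /xmlrpc\.php
ignoreregex =

# appended to /etc/fail2ban/jail.local (hypothetical jail)
[apache-xmlrpc]
enabled  = true
port     = http,https
filter   = apache-xmlrpc
logpath  = /var/log/apache2/access.log
maxretry = 10
bantime  = 3600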

Today I am glad we’re not using the cheapest ISP, but one that also takes the time to help you with problems that are not really their business – strictly speaking, they are not responsible for what is going on on your server as long as what they provide is running. So thank you, 1&1, for helping me with this today. I would have lost a lot more time without your help.

I’ve learned a lot today – not that I wanted to, but I feel like I climbed another step of my DevOp learning curve.

So back to business.

 

[Update]

I’m now trying to find out what the attackers might have been looking for and what kind of damage they might have done. So far we seem to have been lucky because our WordPress is up to date.

Here is an interesting forum thread about this very same problem (and why our solution might hurt ourselves): http://wordpress.org/support/topic/xmlrpcphp-attack-on-wordpress-38

And here is an article describing a solution using fail2ban: http://xplus3.net/2013/05/09/securing-xmlrpc-wordpress/

[/Update]