A DDoS mitigated, some minor issues as a result

On Friday, November 18th, 2022 we were hit with a massive DDoS (Distributed Denial of Service) attack. We mitigated the bulk of the attack but there were a few minor issues as a result throughout Friday and Saturday.

The attack

On Friday, November 18th, 2022 at approximately 08:30 EET we received an email threatening us with a DDoS¹ attack unless we paid $5000 in Bitcoin. It's not the first and, we suspect, not the last email of that kind that we receive. In 2022 alone we have received a dozen of them. As a matter of principle, we refuse to negotiate with criminals and we made that very clear in our near-immediate reply to the miscreants who sent that email.

Barely two minutes after sending our email reply we observed a massive traffic spike on our server, with the server load² increasing from the typical 0.5 to over 12. At this point it was pretty clear that the attack was the real deal so we did what any network security person in our position would do: we worked on defence, as clearly the mitigations already in place were not enough for this kind of attack. This practically meant that I went into battle mode.

I immediately switched over our traffic to go through CloudFlare, a global CDN which is renowned for its excellent capacity and its DDoS-mitigating services. Since this requires a DNS change —sped up by me directly editing the DNS zone file to set the SOA TTL to just 5 minutes instead of the regular 1 day, if you're interested in the gory details— it created a small time window during which the real IP address of our web server was resolved by the attacker. This will become important later.

To counter the attack I also enabled a lot of “paranoid-level” mitigation options in CloudFlare which I knew would cut some of our legitimate traffic and make life harder for our legitimate site visitors. Inconveniencing some people is better than being completely knocked off-line; as I've been saying for well over a decade: convenience, security, choose one.

Right away, I started observing the nature of the incoming junk traffic and modifying the filtering rules on CloudFlare in real time, fending off the attacks. The attacker responded by escalating the attack and attacking from different networks and types of devices. This cat-and-mouse game lasted for four hours, by which point the attack was waning.

The mitigations I put in place and the constant refining of the security measures fended off 99.49% of the junk traffic. The rest reached our server because of that small window of opportunity I described three paragraphs above. In hindsight, that was the weak spot in our defences, which I've now patched up.

Unfortunately, that 0.51% of the junk traffic which made it through was so bad that it brought down the entire cloud infrastructure of our host for a fairly large geographic area for approximately 40'. The upshot of that was that the host's upstream providers³ were alerted to the DDoS and took mitigation measures on their end to kill that traffic dead in its tracks, meaning that the miscreants cannot attack any other sites.

In the end it was a nightmarish day but we survived it. Our site was very slow, some people had a hard time purchasing or renewing subscriptions, and there were a few problems taking backups with certain remote storage methods. We'll get to that in the next section.

Not everything was rosy

Mitigating a supermassive DDoS is war, and war requires making some tactical decisions. Trying to keep all of our services online during that period would have led to swift defeat. Therefore, I knowingly made some configuration choices to fight the DDoS even though they meant that some issues would occur with our site.

Resolved: cached pages. For a (very brief) period of time you might have seen someone else's username when logging into our site and/or been unable to download the software you are subscribed to from our Downloads page. This lasted about 10' and it was necessary to reduce the server load enough so I could make other, more meaningful changes (no, setting the site completely offline would unfortunately not have had the same result, for reasons that have to do with the internals of how Joomla, the CMS we are using, works when you set the site off-line). Moreover, some pages (notably the home page and the products pages) may not have reflected the fact that you were logged in, something resolved around 21:30 EET.

Resolved: subscriptions. Anyone purchasing or renewing a subscription did not see it reflected on their user account until after November 18th, 22:30 EET. This was on purpose. I had paused the service which processes payment notifications because, with the increased server load, successful processing of payments could not be guaranteed. I re-enabled the service on November 18th, 2022 at 21:30 EET when the attack had died down enough for the server load to return to near-nominal levels. I also had to go into our reseller's dashboard and manually trigger resending the payment notifications for November 18th, because they do an exponential back-off when our site is not responding to the notifications (e.g. if it's down or I have turned off the service processing the payment notifications), hence the delay.
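To illustrate why paused notifications do not simply arrive the moment the service comes back, here is a minimal sketch of an exponential back-off schedule. The base delay, multiplier, cap and number of attempts are assumptions chosen for illustration; the reseller's real retry policy is not described in this post.

```python
from datetime import timedelta


def retry_delays(base_seconds=60, factor=2, max_attempts=6, cap_seconds=3600):
    """Return the wait before each retry of an undelivered notification.

    All default values are hypothetical, not the reseller's actual policy.
    """
    delays = []
    for attempt in range(max_attempts):
        # Each retry waits `factor` times longer than the previous one,
        # up to a fixed ceiling.
        seconds = min(base_seconds * factor ** attempt, cap_seconds)
        delays.append(timedelta(seconds=seconds))
    return delays


for number, delay in enumerate(retry_delays(), start=1):
    print(f"retry {number}: wait {delay}")
```

With a schedule like this the gaps between retries grow quickly, so after a long outage the next automatic attempt may be a long way off. That is the reason for triggering the resend manually.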

Resolved: connecting to, backing up to and downloading existing backups from Dropbox, OneDrive, Google Drive, Google Storage, and Box. All of these remote storage providers use the OAuth2 authentication scheme. Connecting your site to one of these services is a three-party process. Your browser sends a request to our site, which adds a service identifier and sends it to the remote storage provider. The remote storage provider responds with a temporary code, which our site signs with its secret key and sends back to the remote storage provider. The provider then responds with an Access Token and a Refresh Token; that's what you copy into your Akeeba Backup or Akeeba Solo installation. The Access Token is used to upload and download files to/from the remote storage BUT it only lasts for one hour. Whenever you need to do another upload/download operation beyond that time window you need to go through our site's OAuth2 mediation service again. We don't store these tokens, we just relay them to your site, which is why this is called a mediation service. This service was completely unavailable until Sunday, November 20th, 2022 00:30 EET when I allowed traffic to reach it again. This was necessary because the attack had not subsided enough until 21:00 EET and I was carefully re-enabling one service at a time, while monitoring the server load. However, further restrictions which were present to fend off the DDoS did not allow your site to reach the service until they were removed on Sunday, November 20th, 2022 at 11:00 EET.
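The one-hour Access Token lifetime is why backups failed while the mediation service was unreachable. The sketch below shows the general idea of a client that only goes back to the mediation service once its token has expired. It is not the code Akeeba Backup uses: the class, the `refresh_via_mediation` callable and the injected clock are hypothetical names introduced for illustration.

```python
import time

ACCESS_TOKEN_LIFETIME = 3600  # seconds; the one-hour lifetime described above


class TokenCache:
    """Hand out a cached Access Token and renew it only once it has expired."""

    def __init__(self, refresh_via_mediation, clock=time.time):
        # Hypothetical callable standing in for the round trip through
        # the mediation service: refresh token in, new access token out.
        self._refresh = refresh_via_mediation
        self._clock = clock
        self._access_token = None
        self._expires_at = 0.0

    def get(self, refresh_token):
        expired = self._clock() >= self._expires_at
        if self._access_token is None or expired:
            # This is the step that fails while the mediation service is down.
            self._access_token = self._refresh(refresh_token)
            self._expires_at = self._clock() + ACCESS_TOKEN_LIFETIME
        return self._access_token
```

Uploads or downloads that start within the hour can reuse the cached token; anything later depends on the mediation service being reachable.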

Resolved: updates. Updating the Professional versions of our software requires that your site downloads the update package from our site using your Download ID. Due to the increased protections we had enabled to fend off the DDoS this was not possible until these restrictions were completely removed on Sunday, November 20th, 2022 at 11:00 EET.

Resolved: some emails from our site may have ended up in spam. When I switched the DNS over to CloudFlare I forgot to copy two secondary DKIM records which are used by receiving mail servers to verify that the message was really sent from us. Depending on which of our email sending services processed the email you might have received an email with a DKIM signature which cannot be validated, resulting in the email ending up in spam — especially if you are using Gmail or G Suite. This was resolved on November 20th, 2022 at 11:00 EET.
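A forgotten record like this is easy to catch by comparing the old zone with the new one before switching over. The following is a minimal sketch of that comparison, assuming both zones have already been exported into dictionaries keyed by record name and type. The domain, the selector names `s1` and `s2` and the truncated key values are placeholders, not our real records.

```python
def missing_records(old_zone, new_zone):
    """Return the (name, type) pairs present in old_zone but absent from new_zone."""
    return sorted(set(old_zone) - set(new_zone))


# Placeholder data: example.com and the s1/s2 selectors are hypothetical.
old_zone = {
    ("example.com.", "MX"): "10 mail.example.com.",
    ("s1._domainkey.example.com.", "TXT"): "v=DKIM1; k=rsa; p=...",
    ("s2._domainkey.example.com.", "TXT"): "v=DKIM1; k=rsa; p=...",
}
new_zone = {
    ("example.com.", "MX"): "10 mail.example.com.",
}

for name, record_type in missing_records(old_zone, new_zone):
    print(f"missing from new zone: {name} {record_type}")
```

This only detects records that are absent altogether; it does not check whether a copied record has the correct value.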

But why us?

I honestly have no idea. We are a tiny company of two people. It's not like we have $5000 in Bitcoin lying around, all the more so since our products are priced realistically, i.e. we sell at a very small profit.

Maybe the criminals who attacked us never went past the fact that our company is in Cyprus, a country known for having been a tax haven (prior to joining the European Union in 2004 and definitely prior to the banking crash in 2013 which almost killed our company as collateral damage…)? Cyprus is the fourth most DDoS-attacked country in the world.

Maybe they thought we are the typical small business which will bleed cash to get rid of a problem it doesn't understand how to mitigate for a fraction of the price. 42% of attacks affect small to medium businesses for this very reason. Well, we're not the typical small business, are we? Our small company is composed entirely of software engineers who write site security software for a living and understand how something like a DDoS works and how to mitigate it. The only thing they achieved is that it's now far harder for them to extort anyone else.

In any case, this happened.

Future steps

Being engineers, we're not content with patting ourselves on the back for a job well done, saying “oh, it could've been worse”. This attack, like any other problem we have faced over the years, is a chance for retrospection and improvement.

I have identified some key areas which need to be addressed.

Reaction time. Don't get me wrong. From the start of the attack to the first mitigations being in place it was 10'. It took another 20' to apply filters to cut off enough junk traffic to make the server somewhat usable, and another 30' to get to a point where our site was okay-ish. That is 60' in total, which is about 55' too long. The monitoring which is in place is adequate to alert me to these problems, no change required there. I have, however, a new SOP for what to do when under attack which takes about 5' to deploy. This will help fight any future attacks. Furthermore, there are now mitigations in place to ensure that the initial wave of an attack is nowhere near as disruptive as it was this time.

The traffic which went around CloudFlare. The small window of opportunity which allowed a small but significant portion of the DDoS traffic to hit our servers directly, bypassing CloudFlare, was the result of a misconfiguration when changing hosting providers two years ago. I have now closed that and added information about avoiding it in our internal documentation for future hosting or server migrations, if that's ever needed. I have also taken other measures to ensure that reaction times are better next time.

Mitigating DDoS traffic that makes it through. Unfortunately, that's not really in my hands or even really in the hands of our host. By the time it reaches the server, even an aggressive server firewall will still fail to cope with well over 10,000 connections per second (there are only ~50,000 usable TCP/IP ports and the massive ingress of traffic will not necessarily close the junk connections within 5 seconds). Fortunately, the host is really good at what they do and works with the upstream providers to block this kind of traffic in the future. This is an ongoing battle, of course. DDoS attacks are not fun for the site owner, the host, or the upstream provider.
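The figures in the paragraph above can be checked with back-of-the-envelope arithmetic. This is a deliberately simplified model that takes the numbers at face value and assumes every junk connection stays open for the full 5 seconds; real connection tracking is more involved.

```python
def concurrent_connections(arrival_rate_per_second, lifetime_seconds):
    """Steady-state number of open connections: arrival rate times lifetime."""
    return arrival_rate_per_second * lifetime_seconds


USABLE_PORTS = 50_000   # approximate figure quoted in the text
ARRIVAL_RATE = 10_000   # junk connections per second (the text says "well over")
LIFETIME = 5            # seconds each junk connection is assumed to stay open

open_connections = concurrent_connections(ARRIVAL_RATE, LIFETIME)
print(f"open connections: {open_connections}")
print(f"capacity exhausted: {open_connections >= USABLE_PORTS}")
```

Under these assumptions the junk traffic alone accounts for the whole ~50,000 budget, which is why filtering on the server itself comes too late.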

Pausing the payment notifications processing service. This is really dependent on other mitigations working correctly. If the server load is not that bad we don't have to take this kind of action. Therefore, this should be a mostly solved issue, with this action reserved as a Hail Mary measure if all else fails.

The OAuth2 mediation service being off-line for a day. This is one of the toughest problems to solve. On the face of it, we can host it on a different server, put it behind CloudFlare with adequate caching, and call it a day. The problem with that is two-fold:

Using a different server would break older versions of Akeeba Backup. A new server would mean a new endpoint (subdomain) for this service. Trying to proxy the new service from the old location would still run into the same problem if we are under a DDoS attack; the old location would still be effectively inaccessible and backups would fail.
We need to check if the user is allowed access to the service depending on their subscription status. We don't want to have remote access to the MySQL database because a. that's an attack surface we are not comfortable with and b. if the main server is under huge load (as is the case during a DDoS) we have achieved nothing; the service will time out and the backups will still fail.

This is something which we will be addressing over time with a combination of measures. We will host the OAuth2 mediation service on its own separate server, with its own separate subdomain. New versions of Akeeba Backup and Akeeba Solo will be using it, while older versions will still be using the existing service hosted with our main site. We will turn off the old service over a period of several years, ensuring that older versions of Akeeba Backup and Akeeba Solo are not left in the lurch for a meaningful amount of time. Beyond that period of time we will try to see if it's possible to proxy the old service endpoints to the new ones for several more years, essentially letting old versions of our software work until the sites they run on are old enough to be practically unmaintained and unusable.

Of course this leaves us with the question, how are we ever going to check for subscription access? I am experimenting with several methods. Most likely it will be a combination of periodically pushing all subscription data to the OAuth2 mediation service, pushing changes as they happen, and allowing the mediation service to also pull changes if the information appears to be stale (with a fallback to the cached data if the main site does not respond fast enough).
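One possible shape for the push, pull and fallback combination described above is sketched here. Nothing in it is decided: the class name, the `fetch_from_main_site` callable, the 15-minute freshness window and the choice to deny users the service has never heard of are all assumptions made for illustration.

```python
import time


class SubscriptionCache:
    """Local copy of subscription status kept by the mediation service."""

    def __init__(self, fetch_from_main_site, max_age=900, clock=time.time):
        # Hypothetical callable: user id in, True/False out.
        # It may raise if the main site is slow or unreachable.
        self._fetch = fetch_from_main_site
        self._max_age = max_age  # seconds before an entry counts as stale
        self._clock = clock
        self._entries = {}       # user_id -> (is_active, stored_at)

    def push(self, user_id, is_active):
        """Store a status pushed by the main site (full sync or single change)."""
        self._entries[user_id] = (is_active, self._clock())

    def is_active(self, user_id):
        entry = self._entries.get(user_id)
        if entry is not None and self._clock() - entry[1] < self._max_age:
            return entry[0]  # fresh enough, no need to contact the main site
        try:
            self.push(user_id, self._fetch(user_id))  # pull a stale entry
        except Exception:
            # Main site not answering: fall back to whatever we have cached.
            # Denying unknown users is an assumption, not a stated policy.
            return entry[0] if entry is not None else False
        return self._entries[user_id][0]
```

The point of the fallback branch is that a main site under attack degrades the freshness of the data instead of taking the mediation service down with it.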

This was a planned change. The recent attack just bumped it far higher in the endless list of things to do.

I hope you appreciate our commitment to full transparency and the hard work we put into our software and services.

Nicholas K. Dionysopoulos
Director and lead developer,
Akeeba Ltd.

Definitions

Some of the terms used above are a bit technical but necessary to understand what happened. I tried to give a "beginner's" definition of them. The other, even more technical terms, are not necessary to follow along and just link directly to Wikipedia articles for those who are keen to understand a little bit better how things work under the hood.

DDoS: A Distributed Denial of Service attack is when a miscreant is using a large number of compromised or rented servers, computers, or other Internet-connected devices to send a massive amount of traffic to a target server. The idea is that the target server will not be able to cope with the massive influx of traffic, its operating system will be overwhelmed, and the server will be unable to serve any legitimate requests or even be knocked completely offline.
Server load: The server load is a number which roughly represents how many processes (computer programmes) demand execution time compared to the number of available CPU threads. A server with a two-core, hyper-threaded processor can reach a load of around 4.0 before significant delays are experienced in the execution of further programmes, such as handling additional HTTP requests in the case of a web server.
Upstream provider: Internet-connected computers do not exist in a vacuum. They need to be able to route traffic to and from the computer through someone else's network, ultimately reaching the networks which make up the global Internet backbone. These networks which help us route traffic away from our own network are called upstream providers. Hosting companies are typically connected to a multitude of upstream providers who are either operating parts of the Internet backbone or are direct clients of backbone operators.
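For readers who want to relate the server load definition above to numbers, here is a minimal sketch that divides a load figure by the number of CPU threads. The four threads in the worked numbers match the two-core, hyper-threaded example from the definition; that is an illustration, not the specification of our server. The live reading at the end only runs on systems where Python exposes `os.getloadavg()` (Unix-like systems).

```python
import os


def load_per_thread(load_average, cpu_threads):
    """Load relative to capacity; values above 1.0 mean processes are queueing."""
    return load_average / cpu_threads


# Worked numbers from the text, on an illustrative 4-thread machine.
print(load_per_thread(0.5, 4))   # typical load mentioned in the post
print(load_per_thread(12, 4))    # load observed at the start of the attack

# Live reading for the current machine, where available.
if hasattr(os, "getloadavg") and os.cpu_count():
    one_minute_load = os.getloadavg()[0]
    print(load_per_thread(one_minute_load, os.cpu_count()))
```

On such a machine a load of 12 would mean roughly three times more work queued than the processor can run at once.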