Hi,
I know this is quite impossible to diagnose from afar, but I came across the posting from lemmy.world admins talking about the attacks they are facing where the database will get overwhelmed and the server doesn’t respond anymore. And something similar seemed to have happened to my own servers.
Now, I’m running my own self-hosted Lemmy and Mastodon instances (on 2 seperate VPS) and had them become completely unresponsive yesterday. Mastodon and Lemmy both showed the “there is an internal/database error” message and my other services (Nextcloud and Synapse) didn’t load or respond.
Login into my VPS console showed me that both servers ran at 100% CPU load since a couple of hours. I can’t currently SSH into these servers, as I’m away for a couple of days and forgot to bring my private SSH key on my Laptop. So, for now I just switched the servers off.
Anyway, the main question is: what should I look at in troubleshooting when I’m back home? I’m a beginner in selfhosting and I run these instances just for myself and don’t mind if I’d have to roll them back a couple days (I have backups). But I would like to learn from this and get better at running my own services.
For reference: I run everything in docker containers behind Nginx Proxy Manager as my reverse proxy. I have only ports 80, 443 and 22 open to the outside. I have fail2ban set up. The Mastodon and Lemmy instances are not open for registration and just have 2 users each (admin + my account).
Sounds like you were out of resources. That is the goal of a DoS attack, but you’d need connection logs to detect if that was the case.
DDoS attacks are very tricky to defend. (Source: I work in DDoS defence). There’s two sections to defense, detection and mitigation.
Detection is very easy, just look at packets. A very common DDoS attack uses UDP services to amplify your request to a bigger response, but then spoof your src ip to the target. So large amounts of traffic is likely an attack, out of band udp traffic is likely an attack. And large amount of inband traffic could be an attack.
Mitigation is trickier. You need something that can handle a massive amount of packet inspection and black holing. That’s done serious hardware. A script kiddie can buy a 20Gbe/1mpps attack with their moms credit card very easily.
Your defence options are a little limited. If your cloud provider has WAF, use it. You may be able to get rules that block common botnets. Cloudflare is another decent option, they’ll man in the middle your services, and run detection and mitigation on all traffic. They also have a decent WAF.
Best of luck!
A very common DDoS attack uses UDP services to amplify your request to a bigger response, but then spoof your src ip to the target.
Having followed many reports of denial of service activity of Lemmy, I don’t think this is the common mode. Attacks I’d heard of involve:
- Using regular lemmy APIs backed by heavy database queries. I haven’t heard discussion of query rates, but Lemmy instances are typically single-machine deployments on modest 4-core to 32-core hardware. Dozens to thousands of queries per second to the heaviest API endpoints are sufficient to saturate them. There’s no need for distributed attack networks to be involved.
- Uploading garbage images to fill storage.
Essentially the low-hanging fruit is low enough that distributed attacks, amplification, and attacks on bandwidth or the networking stack itself are just unnecessary. A WAF is still a good if indeed OPs instance is getting attacked, but I’d be surprised if wafs has built-in rules for lemmy yet. I somewhat suspect one would have to do the DB query analysis to identify slow queries and then write custom waf rules to rate limit the corresponding API calls. But it’s worth noting that OP has provided no evidence of an attack. It’s at least equally likely that they dos’ed themselves by running too many services on a crappy VPS and running out of ram. The place to start is probably basic capacity analysis.
Some recent sources:
I’ve heard that enabling CloudFlare DDoS protection on Lemmy breaks federation due to the amount of ActivityPub traffic.
You could always add them to the allow list so they don’t get blocked.
I run a lemmy server. If you ban a bot and remove content (even if the bot is from another instance), if you’re removing more than a few comments the think will lock up, the server will error, and you’ll pretty much have to restart it. This could also cause other services to be unresponsive as the CPU will be sitting at 100% for the thread.
If you think it’s genuinely a DDOS (which is unlikely if you’re a small fry, but possible), then try putting cloudflare in front of your service (it’s free) which will mitigate many types of DOS attacks.
Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:
Fewer Letters More Letters DNS Domain Name Service/System IP Internet Protocol SSH Secure Shell for remote terminal access UDP User Datagram Protocol, for real-time communications VPN Virtual Private Network VPS Virtual Private Server (opposed to shared hosting)
6 acronyms in this thread; the most compressed thread commented on today has 9 acronyms.
[Thread #29 for this sub, first seen 12th Aug 2023, 08:45] [FAQ] [Full list] [Contact] [Source code]
Very good bot
Very Very good bot
The best you can do to know if it was an attack is to inspect the logs when you have time. There are a lot of things that can cause a process going wild without being an attack. Sometimes, even filling the RAM can cause the CPU to appear overloaded (and will freeze the system anyway). One simple way to figure out if it’s an attack : reboot. If it’s a bug, everything will get back to normal. If it’s a DDoS, the problem will reappear up to a few minutes after reboot. If it’s a simple DoS (someone exploiting a bug of a software to overload it), it will reappear or not given if the exploit was automated and recurring, or was just a one-shot.
The fact that both your machines fell at the same time would tend to make think it’s an attack. On the other hand, it may just be a surge of activity on the network with VPSes with way not enough resources to handle it. Or it may even be a noisy neighbor problem (the other people sharing with you the real hardware on which your VPSes run who will orverload it).
Don’t copy your private key to your laptop, generate a new one and add its public key to the same user. Also use a different key for every remote host. (or don’t, but that’s kind of like using the same password for all your accounts)
@ErwinLottemann @Solvena …or take it one step further and store your private key, only(except your offline backup), on an hsm/smartcard such as; yubikey.
Lemmy has the disadvantage of being opensource. In the long run this can be good for security but in the short term this gives your enemies a blueprint of your software and they know exactly how to attack you.
The only time I have every been compromised was when I was running 3rd party code open to the internet. I have been running my own code open to the internet for 20+ years and have been safe with it. I don’t think I am some kind of god coder or anything but I am mindful of best practices and most importantly I am a small fish in a big pond.
Long story short is that running popular 3rd party code open to the internet exposes you to unique threats that you should be prepared for. Subnet/vlan it, vpn it, lock it down,
I can’t help much regarding the service denial issue.
However Port 22 should never be open to the outside world. Limiting to key authentication is a really good first step.
To avoid automated scans you should also change the port to a higher number, maybe something above 10,000.
This both saves traffic and CPU. And if a security bug in sshd exists this helps, too.
Moving off from port 22 is effectively just security by obscurity. It will save you some logs but the bandwidth and CPU time saving is negligible - especially with fail2ban.
However Port 22 should never be open to the outside world.
Wat. How do you connect with ssh, then? You can bind openssh to an other port, but the only thing it changes is that you have less noise in your logs. The real most important security measure is to make sure your softwares are always up to date, as old vulnerable software is the first cause of penetration (and yes, it’s better to deactivate password login to only use ssh keys).
As long as you do not allow password logins for ssh you can let the silly idiots beat their heads against it or you could use a script to ban them. They will not brute force a properly secured ssh server.
I would be mostly annoyed about the log entries. That would be my primary motivation to ban script kiddy hack attempts.
One could setup a VPN and expose the SSH port to the VPN network only. It think tailscale operates this way?
The issue with this is that if the VPN breaks, you can’t SSH in to fix it, which is a problem if it’s a remote host.
Instead, disable password authentication, use a strong (Ed25519) key, and configure two-factor auth (TOTP or FIDO2).
I’m not sure about the feasibility of this (my first thought would be that ssh on the host can be accessed directly by IP, unless maybe the VPN software creates its own network interface and sshd binds on it?), but this does not remove the need for frequent updates anyway, as openssh is not the only software that could have bugs : every software that opens a port should be protected as well, and you can’t hide your webserver on port 80 behind a VPN if you want it to be public. And it’s anyway a way more complicated setup than just doing updates weekly. :)
No, this doesn’t remove the need to stay up to date.
However, it works on my server and was very easy to setup: a few ufw rules so that port 22 is blocked everywhere, allowed only on the VPN IP range and my local network range. Nmapping from outside does not show port 22 accessible, and indeed you can’t SSH to it without the VPN.
Security is quite tough to get right eh? I tried my best to harden everything opening ports on my server, having a fail2ban, VPN for maintenance, webserver to expose some personal services…
Oh, ok, you whitelist IPs in your firewall. That certainly works, if a bit brutal. :) (then again, I blacklist everyone who is triggering a 404 on my webserver, maybe I’m not the one to speak about brutality :P ) You don’t even need a VPN, then, unless you travel frequently (or your ISP provides dynamic IP, I guess).
Well that’s a bit of both: I need to be able to get on my server from work (with my phone… Yeah not great but that works), because I often break stuff haha ; also a nice thing to have when I’m on the bus and want to add more music or movies to listen to during the travel!
Are there ISPs that don’t provide dynamic IPs? I had to setup a script and get some API keys for different services to ensure the IP is properly updated on my DNS servers.
Speaking of brutality, I considered doing the same but then I would have banned myself from testing the APIs of my services 🤧
Oh, I see. Totally makes sense. :)
I guess it depends on the country, but here in France, yes, most landline ISPs provide static IPs (maybe all? there are a couple I haven’t try ; mobile IPs are always dynamic, though). It was not always the case, but I haven’t had a dynamic IP since the 2000’. I feel you, dealing with pointing a domain to a dynamic IP is a PITA.
Ahah, yeah, I protected myself against accidentally banning my own IPs. First, my server is a Pi at home, so I can just plug a keyboard and a screen to it in case of problem. But more importantly, as I do that blacklisting through fail2ban, I just whitelisted my IPs and those of my relatives (it’s the
ignore_ip
variable in/etc/fail2ban/jail.conf
)., so we never get banned even if we trigger fail2ban rules (hopefully, grandma won’t try to bruteforce my ssh!). It allowed me to do an other cool stuff : I made a script ran through cron that parses logs for 404 and checks if they were generated by one of the IPs in that list, mailing me if it’s the case. That way, I’m made aware of legit 404 that I should fix in my applications.
The points I made should not be used instead of all other security precautions like prohibited password login, fail2ban and updates, I thought that is common knowledge. It’s additional steps to increase security.
I disagree that changing the port is just security by obscurity. Scanning ips on port 22 is a lot easier than probing thousands of ports for every IP.
The reason people do automated exploit attempts on port 22 is because it is fast, cheap and effective. By changing the port you avoid these automated scans. I agree with you, this does not help if someone knows your IP and is targeting you specifically. But if you’re such a valuable target you hopefully have specialized people protecting your IT infrastructure.
Edit: as soon as your sshd answers on port 22, a potential attacker knows that the IP is currently in use and might try to penetrate. As stated above, this information would most likely not be shared with the automated attacks if you used any random port.
If you do not neglect updates, then by all mean, changing ports does not hurt. :) Sorry if I may have strong reaction on that, but I’ve seen way too many people in the past couple decades counting on such anecdotal measures and not doing the obvious. I’ve seen companies doing that. I’ve seen one changing ports, forcing us to use the company certificate to log in, and then not update their servers in 6 months. I’ve seen sysadmins who considered that rotating servers every year made it useless to update them, but employees should all use Jumpcloud “for security reasons”! Beware, though, mentioning port changing without saying it’s anecdotal and the most important thing is updates, because it will encourage such behaviors. I think the reason is because changing ports sounds cool and smart, while updates just sound boring.
That being said, port scanning is not just about targeted pentesting. You can’t just run nmap on a host anymore, because IDS (intrusion detection systems) will detect it, but nowadays automated pentesting tools do distributed port scanning to bypass them : instead of flooding a host to test all their ports, they test a range of hosts for the same port, then start over with a new port. It’s half-way classic port scanning and the “let’s just test the whole IP range for a single vulnerability” that we more commonly see nowadays. But they are way harder to detect, as they scan smaller sets of hosts, and there can be hours before the same host is tested twice.
To avoid automated scans you should also change the port to a higher number, maybe something above 10,000.
This doesn’t really work any more. Port scanning is trivial with IPv4, and tools like masscan can scan the entire IPv4 internet (all IPv4 addresses) in less than 15 minutes.
Very interesting, thanks for sharing!
I know it’s just anecdotal evidence, however fail2ban in my one machine which does need ssh on port 22 to the open internet bans a lot of IPs every hour. All other ones with ssh on a higher port do not. Also their auth log does not show any failed attempts.
Yeah it’s probably script kiddies that are only trying port 22 because it’s easier than port scanning.
Make sure you disable
PasswordAuthentication
in the OpenSSH configuration (/etc/ssh/sshd_config
on Debian at least). People won’t be able to try to brute force their way in if the only way in is without a password (ie using an SSH key) :)
I don’t see anyone else actually telling you how to figure out if you’re being DoSed, so I’ll start:
Check your logs. Look at what process is eating your CPU in htop and then look at the logs for that process. If it’s a web application, that means the error and access logs for it. If you see a flood of requests to a single URL, or some other suspicious pattern in the log, then you can try blocking the IPs associated with them temporarily and see if it alleviates the load. Repeat until the load goes down.
If your application uses a database, check your database logs too. IIRC postgres logs queries that take longer than 5 seconds by default, which can make it easy to spot a slow query especially during a time of high load.
I don’t think DNS amplification attacks over UDP are likely to be a problem as I think most cloud providers filter traffic with forged src addresses (correct me if I’m wrong). You can also try blocking all inbound UDP traffic if you suspect a UDP flood but this will likely break DNS lookups for you temporarily. (your machine should not have any open UDP ports in any case though if you’re just running Lemmy).
If you want to go next level, you can use “perf” to generate a system-wide profile and flamegraph which will show you where you’re burning CPU cycles. This can be extremely useful for troubleshooting performance or optimizing applications. (you’ll find that even ipfilters takes CPU power, which is why most DDoS protection happens on dedicated hardware upstream)
You could consider running Crowdsec as well on yr firewall that way known bots will be blocked
Besides the CPU how was the other resource usage on your VPS? Like memory, storage, etc