Post-Mortem: The massive lemmy.world -> lemmy.dbzer0.com federation delays.

db0@lemmy.dbzer0.com · 8 months ago


nutomic@lemmy.ml · 8 months ago

As someone hosting a service like this, especially when it has 12K people in it, this is very scary! While 2 lemmy core developers were in the chat, the help they provided was very limited overall and this session mostly relied on my own skills to troubleshoot.

This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community to few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see an option.

I disagree with this conclusion. If you had installed Lemmy according to the official instructions, you would have the database, backend and everything else on the same server and would never have run into this particular issue. Your setup is heavily customized so it is only natural that there are few people who can help with it.

Anyway, it’s an interesting journey, thanks for writing down your experience and for improving the documentation!

taaz@biglemmowski.win · edit-2 8 months ago

I will hop on to this to also point out that there actually were people willing to actively help (me included, see the original post on this community) but, to put it bluntly, we were not “invited in on the show”. Let me expand on that.

The problem is, as @[email protected] points out here, we don’t have the slightest idea how exactly your infrastructure looks. Without that, there is only the most general stuff we can help with.

From my point of view, joining the matrix chat later in the process, I watched you do and post things without knowing where they came from. I didn’t have the full context of what had already been tried and crossed out, or what the current plan was.
You @[email protected] would have to stop chopping and start networking with the people. That is definitely not easy to do effectively, especially if more people join later (and they too have to be updated on the state), but we could have fast-tracked the docker/compilation stuff, ruling lemmy out sooner.

In retrospect, if we had had the full picture of how the infrastructure looks, the chance that someone would go “oh you have split backend and database servers, check the latency” would definitely have been a lot higher, but we didn’t know (hell, I actually assumed your deployment was the same as or close to the lemmy ansible one). I am aware this is easy to say after the solution has been found, but hopefully you get the networking/communication idea.
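To make the “check the latency” point concrete, here is a rough sketch that is not taken from the thread. It times TCP connection setup to a host as a crude proxy for one network round trip, and shows why latency matters when work is processed sequentially. The function names and the figure of 10 round trips per item are assumptions for illustration only; real query counts depend on the software and the workload.

```python
import socket
import time


def tcp_connect_times_ms(host: str, port: int, samples: int = 5,
                         timeout: float = 2.0) -> list[float]:
    """Time TCP connection setup to host:port, in milliseconds, once per sample.

    Connection setup takes roughly one network round trip, so this is only a
    rough proxy for the latency of a single database query.
    """
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connect, then close immediately
        times.append((time.perf_counter() - start) * 1000.0)
    return times


def max_sequential_rate(rtt_ms: float, round_trips_per_item: int) -> float:
    """Upper bound on items per second when each item needs N round trips
    that are issued one after another (no pipelining, no parallelism)."""
    return 1000.0 / (rtt_ms * round_trips_per_item)


def report(host: str, port: int, round_trips_per_item: int = 10) -> None:
    """Print the median connect time and the throughput ceiling it implies.

    round_trips_per_item is a hypothetical number, not a measured one.
    """
    times = sorted(tcp_connect_times_ms(host, port))
    median = times[len(times) // 2]
    ceiling = max_sequential_rate(median, round_trips_per_item)
    print(f"median connect time: {median:.2f} ms")
    print(f"ceiling: {ceiling:.1f} items/s at {round_trips_per_item} round trips each")
```

Calling `report` with the database host and port from the backend machine gives a first impression. As a worked example, a 50 ms round trip with 10 sequential round trips per item caps throughput at 2 items per second, no matter how much CPU either server has.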

db0@lemmy.dbzer0.com · edit-2 8 months ago

Wait, hold on, how was help not accepted? I talked with everyone who replied to me and followed every suggestion. If someone asked for infra information, I gave it.

You know, it’s really frustrating to open myself up and write about my experiences honestly and then have people try to say that it’s actually my fault I didn’t ask for help “the right way”. What kind of effect do you think this might have on other potential lemmy hosters?

taaz@biglemmowski.win · 8 months ago

I didn’t want to devalue your communication. I think I worded my previous comment very badly in that regard, and I am sorry about that. (I also really need to go to sleep so I will be blunt here.)

There is a nuance to the internet communication when it comes to asking OSS community for support, at least speaking from my own experience as someone working in tech.
Getting one or two people to actively bounce ideas off of is already a big success - quality of OSS support is often very spotty across projects, and that’s understandable because people do it in their free time, which is limited (also, if the project is complex, there are often fewer people experienced with it and a smaller total sum of free time for support; I think this currently applies to Lemmy a lot).
With that in mind, when I come asking for support I am mostly prepared to not get any, I am prepared to have to dive into the codebase, debug, deconstruct, debug, swear, swear some more. Maybe this is just me and I had really bad luck mostly, but I don’t know.
Should the devs/owners of any OSS project be ready to provide (some) support for their product if they want it to survive? Probably yes, and how much is enough depends on the project, on you, on anyone.

So

What kind of effect do you think this might have on other potential lemmy hosters?

My opinion is that currently, lemmy is simply not ready for non-tech people.

Also, as someone else has commented here, hosting something for myself is easy, hosting for friends is just slightly harder, but hosting something for the public and getting hundreds to thousands of people makes it an order of magnitude more difficult (now you need active monitoring, durable backups, …).

db0@lemmy.dbzer0.com · 8 months ago

You surely noticed that I was more than prepared to get my hands dirty during this incident. 😉

When I speak about support, I don’t mean having people doing it for me.

But overall you don’t seem to disagree with me that hosting lemmy is not for the non-technical. Which is what nutomic took issue with.

kbotc@lemmy.world · 8 months ago

Tossing stuff on the same server is not great, as I don’t want to pay for fast storage for my image store, but I do want fast storage for my DB. My web server should have extra CPU and network but is otherwise ephemeral. This is the same stuff people have been running for years and is microservices 101.

The correct thing to do here is to build in tracing and profiling hooks. For example, OpenTracing instrumentation would let something like Jaeger consume the traces and show problems, and it would have lit this up like a Christmas tree. Pyroscope can show changes over time in where CPU goes, and logs can get shuffled off into Graylog or some other centralized service for correlation.
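As a minimal illustration of what such tracing hooks surface, here is a standard-library sketch. It is not the OpenTracing or Jaeger API, only a stand-in that records how long each named step takes, and `SpanRecorder` and the span names are made up for this example. With real instrumentation the same idea applies: wrap each step in a span, and the slow step stands out.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SpanRecorder:
    """Toy stand-in for a tracer: collects wall-clock durations per span name."""

    def __init__(self):
        self.durations = defaultdict(list)

    @contextmanager
    def span(self, name):
        """Time the enclosed block and file the duration under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the block raised.
            self.durations[name].append(time.perf_counter() - start)

    def totals(self):
        """Total seconds spent in each span name."""
        return {name: sum(vals) for name, vals in self.durations.items()}

    def slowest(self):
        """Name of the span with the largest total time."""
        totals = self.totals()
        return max(totals, key=totals.get)


def handle_item(recorder):
    """Hypothetical request handler with two instrumented steps."""
    with recorder.span("db_query"):
        time.sleep(0.02)  # simulates waiting on a remote database
    with recorder.span("render"):
        pass  # simulates cheap local work
```

Running `handle_item` a few times and then calling `recorder.slowest()` points at the step that dominates, which is the kind of signal a trace viewer presents graphically.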

nutomic@lemmy.ml · 8 months ago

Images can be stored in S3 so that’s not an issue. And Lemmy has some tracing logs as well as Prometheus stats, not sure if db0 tried looking into those.

db0@lemmy.dbzer0.com · 8 months ago

I don’t think I’ve seen mention of these anywhere, or of how to use them.

db0@lemmy.dbzer0.com · edit-2 8 months ago

The official instructions do not scale, nor do they work for all situations. But besides that, the problem is not that my bad setup caused a problem. Shit happens and I didn’t blame anyone but myself. The problem is that when a problem occurs, one has to get lucky to get support. I don’t even have to prove this. I know for a fact that there are lemmy instances that were decommissioned because they followed the default setup, ran into issues, got no support and gave up.

Edit: Also, man, from one FOSS developer to another: You really have to learn to stop the instinct to say ‘it broke because you did it wrong’. I know it feels unfair, but trust me, this is not the way.

nutomic@lemmy.ml · 8 months ago

I’m not saying you did it wrong, it’s open source so of course you can use it in any way you like. But some ways have a higher risk of breaking than others.

KairuByte@lemmy.dbzer0.com · 8 months ago

I’m curious how you think “everything on the same box” scales? You can’t load balance, you can’t ensure resources are being used efficiently, you can’t even reboot a machine without the entire thing going dark.

nutomic@lemmy.ml · 8 months ago

Lemmy.ml runs on a single server and is much bigger than db0. Sure you can’t get 100% availability this way but no one expects that.

KairuByte@lemmy.dbzer0.com · 8 months ago

Do you have a link to something describing their infrastructure?