Admin Log

Here I will post activities related to admin tasks. They may look like nonsense to you, but I want a public record of anything I do relating to this. Logging what I do also helps me: when I come back to it in several months, I can retrace my own steps.

Todo’s this week (sorry zikzak I keep delaying on this, I have been weirdly intimidated by this):

  • Update the twitter/google API keys
  • Edit /var/discourse/containers/app.yml with SMTP credentials for an email account that can be used for outgoing forum email.

I am feeling very confident about updating the twitter keys, I have made a twitter dev account with the same email as my admin email, and this will be my first change. However - google has me a bit confused. I made a google dev account and it’s asking me all this shit about my application, whether I want it to be “external” or “internal” (no idea), etc. Not sure where to find the right keys for OAuth. I am worried any change I make here may break the google login, but I guess we’ll cross that bridge when we get there. Any chance we can just leave these keys as-is? Also, I see a patreon login. I need to investigate that.

For the second item, we have these fields that need to be changed:

DISCOURSE_SMTP_ADDRESS
DISCOURSE_SMTP_PORT
DISCOURSE_SMTP_USERNAME
DISCOURSE_SMTP_PASSWORD (plain text in a file? yeeeeesh)

Will need to consult the documentation on this one. Brief research is telling me I should NOT use a g-suite account for this, for a lot of reasons I didn’t know. Think I am going with mailgun, as recommended by discourse. Concern I have - what if we go above our allotted 10,000 emails/month and my credit card gets charged? Not likely, but there could be an attack or bug or something.
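
For my own notes, here is roughly what I expect that chunk of /var/discourse/containers/app.yml to look like once mailgun is set up (the hostname, port, and username below are my guesses from mailgun’s docs, not our real values):

    env:
      ## outgoing mail - placeholder values, to be swapped for the real mailgun creds
      DISCOURSE_SMTP_ADDRESS: smtp.mailgun.org
      DISCOURSE_SMTP_PORT: 587
      DISCOURSE_SMTP_USERNAME: postmaster@mg.ourdomain.example
      DISCOURSE_SMTP_PASSWORD: "not-the-real-password"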

Things that are nice to have I want done soon-ish:

  • set up ssh to the server so I can use my mac terminal instead of this godawful browser terminal (rough sketch of what I have in mind right after this list)
  • Create a “preprod” environment where I can load the website and test any changes I make before pushing live - note, this may cost a tiny bit of money, which I will go to the community for, but this is basically gonna be a requirement for me I think. It may also be nice to have for @anon46587892
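
For the ssh item, this is roughly the plan from my mac (the IP and user below are placeholders, not our actual droplet details):

    # generate a key pair locally, if I don't already have one
    ssh-keygen -t ed25519 -C "forum-admin"

    # push the public key to the server (assumes I can still log in with a password once)
    ssh-copy-id root@<droplet-ip>

    # then an entry in ~/.ssh/config so it's just `ssh forum`
    Host forum
        HostName <droplet-ip>
        User root
        IdentityFile ~/.ssh/id_ed25519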

yea, it’ll just function as somewhat of a duplicate of the site, but in a location that is not publicly accessible.

I would also like to not have us display a “502 bad gateway” for several minutes every time we have to update the app or reboot the server, but I’ll deal with that when we get there. Ideally this server will live in a kubernetes environment someday, where this stuff becomes much, much easier to handle.

You can get rid of the 502 during rebuilds by temporarily pointing the reverse proxy to a static page. I started to do that once, then got bored and abandoned it.
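
Something like this in an nginx sitting in front of the container is what I had in mind (this assumes you add a small reverse proxy layer, since the standard install has the container listening on 80/443 directly; the paths here are made up):

    # inside the server block: serve a static page instead of the raw 502 while the app rebuilds
    error_page 502 503 504 /maintenance.html;

    location = /maintenance.html {
        root /var/www/maintenance;   # wherever the static page lives
        internal;
    }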

I’d rather the current google oauth credentials not be used since they’re tied to my personal google account. There’s a guide on Discourse meta for setting it up:


CPU load spike for the inauguration. Take it a little easy guys, I’m busy at work. We should be fine. You may see “you’ve been logged out” messages.


Looks like it settled down.

Leaving this here for me: Good post I found about calculating server load requirements. He mentions a really interesting autoscaling implementation project that could be done.

thread: Performance, Scaling, and HA requirements - #7 by mdekkers - hosting - Discourse Meta

Number of posts made is about the least interesting metric, even more so over such a long time period. Page views is far, far more relevant for determining needed resources. It can be a bit tricky to compare the requests and traffic patterns of a “traditional” mostly-server-generated forum to a Discourse forum, because Discourse is very API-driven, so we often serve multiple HTTP hits per “page view”, but we tend to service each HTTP request a lot quicker, so the forum appears to be a lot more responsive to the user, as shown by this breakdown of just dynamically-generated response times:

Most page-generation-oriented “traditional” forums would, for the same level of user behaviour, probably have a lower volume of requests, but they’d be pushed a lot further to the right; it’s rare for a traditional forum to be generating a majority of responses in under 100 milliseconds.

I’m not putting up that graph to brag about how good Discourse is (although it is rather impressive, IMBO), but rather to highlight that the way you think about provisioning capacity for a Discourse site can be a little different to how you’d figure out how many, say, php-fpm workers to keep in stock.

A typical, say, Magento site (which I dealt with a lot in my previous role) might take 1000ms or more to generate a page (I shit you not; Magento is a dog). You’d factor on having to have at least one php-fpm worker per pageview-per-second, to guarantee no contention. As soon as you have any sort of request rate in excess of your capacity, user experience goes straight to hell because every queued request is going to be adding a full second to the TTFB because it’s waiting behind another request that’s also taking a whole second to process.

Discourse, on the other hand, is making many smaller requests, so even if (and that’s a big “if”) it took a second’s worth of requests to render a page, with each of them taking somewhere around 100ms, the apparent responsiveness of the site is improved, because each request gets serviced quicker. This is the same principle at work as OS multitasking: keep the time slices as small as possible to improve interactive responsiveness, even if it costs a little more in context switch overhead.

Even then, though, most of the requests that a Discourse site processes via unicorn are purely “async”, tracking activity and so on. For example, here’s a relative breakdown of the routes that are most often hit:

(Y-axis scale deliberately filed off, because it isn’t the exact numbers that matter, it’s the relative weightings)

Leading the pack is topic/timings, which is a purely background (async) route that gets POSTed to, to record “this user took this long to read these posts”, which counts towards both the “how long has someone spent reading” (for trust level calcs, amongst other things) and also the little “how long does it take someone to read this topic” data that comes up when you load long topics.

The next route by request volume, showing avatars, is dynamic because avatars come in a ridiculous number of sizes, so we often have to regenerate new ones. Worst case, a single “show me some posts” request could result in 20 requests to the various avatar display routes, but that’s pretty rare because usually most avatars have been seen before and have been cached.

It’s topics/show and topics/posts where we start to get into what would normally be considered “page views”, and even then, performance is pretty solid, with the majority of responses being made in under 100ms, as shown by this graph of the aggregate of response times for topics/show and topics/posts:

(I split the 100-1000 group in half, just to show we weren’t cheating with a lot of nearly-one-second responses or anything)

One thing you’ll note isn’t on the list of frequently hit routes is posts/create. While draft/update gets hit a fair bit (pretty much any time someone updates a draft post they’re working on, they’ll hit that route in the background), actual post creation doesn’t happen very often, relative to reading. So, a metric of “we get N posts per day” doesn’t say much at all about actual site traffic. Attempts to extrapolate from number of posts made to total traffic volume are very sensitive to the read/write ratio used in the calculation, and since the read/write ratio varies greatly between different sites, you end up with some very wide ranges of estimated site traffic. You’re far better off just measuring it for your actual site and using those numbers for your scaling calculations.

The rule of thumb I would apply to figure out how big Discourse app servers needed to be, on a dedicated site, would be as follows:

  1. Determine how many page views per second I wanted to cater for, at absolute peak. My definition for “page view” would be something like “viewing a list of a subset of topics, or viewing a subset of posts in a topic”. How to determine that from an existing forum’s traffic data depends on exactly how the existing forum software works. Completely ignore all other requests, because they work very differently in Discourse, and will be accounted for in the rest of these calculations anyway.
  2. Divide your desired peak “page views per second” by two, to get the total number of unicorns you need to run to service that volume of traffic. Looking at some ratios of “total time spent in Unicorns to page view rate”, they seem to vary between about 0.29 and 0.35, on the ridonkulously fast CPUs we use, so on the slower CPUs you usually see in cloud providers, it’s a reasonable estimate that you can service about two concurrent page views per second worth of requests per unicorn.
  3. Now you know how many unicorns you need, divide that by two to get how many CPU cores you need, and multiply it by 300MB to get your unicorn RAM requirements.
  4. Get as many machines of whatever size you need to satisfy those needs. Tack on maybe half a GB of RAM and half a CPU core per machine for “system overhead” and disk cache.

Et voila! App server capacity calculation done.

Running those calculations, for a site with very peaky load, you’ll probably come out with a number that makes you go a bit pale. It’s probably a lot more droplet than you were expecting to need. That’s because you’re calculating based on absolute peak requests, and you only get those maybe 1% of the time. This is where cloud elasticity comes in handy. You don’t need to be paying for all those droplets all the time, so turn 'em on and off as you need to.

The big cloud players, like AWS, give you shiny autoscaling logic for “free” (which mostly seems to involve making the rulesets easy to screw up so you rapidly cycle instances up and down, which makes your bill bigger), but if you’ve got a sensible monitoring system like Prometheus (big plug: we use it here at CDCK and it is delightful) you can set up your own autoscaling triggers to fire up a new droplet when CPU usage starts to go bananas, and kill off a droplet when things slow down, pretty easily. You need to wire up service discovery and a few other bits and pieces to make it all work, but it can save you a bucketload of money and it’s fun to build. Even if you don’t want to go that wild, if you know when your peaks are likely to be (say you’re running a forum for enthusiasts of a particular sport, and there’s “off-season” traffic levels, “in-season” traffic levels, and “finals” traffic levels) you can set up more droplets when the traffic levels are going to be predictably higher, and then turn 'em off after everyone goes away again. In each case, you work out your droplet requirements based on the peak page views in each group and the above calculations. It won’t save you as much money as doing it dynamically, and if you get a bigger surge of traffic than you expect at finals time you might get overloaded, have a poor user experience for a bit, and need to add some more emergency capacity (assuming your monitoring system let you know that things Went Bad), but it’ll still be a lot cheaper than running peak-capacity droplets 24x7x365.

Or, you just throw your hands in the air, figure you’ve got better things to do than fiddle around with all this stuff yourself, and just drop a small :moneybag: on our doorstep to have me and the rest of the CDCK ops team take care of all this for you. :grinning:
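
Plugging some made-up numbers for us into that rule of thumb, just to get a feel for it (the peak page views/sec figure is a pure guess, not something I have measured):

    #!/usr/bin/env bash
    # back-of-envelope sizing using the rule of thumb quoted above; all inputs are guesses
    peak_pps=8                            # guessed peak page views per second
    unicorns=$(( (peak_pps + 1) / 2 ))    # ~2 page views/sec per unicorn worker
    cores=$(( (unicorns + 1) / 2 ))       # ~2 unicorns per CPU core
    ram_mb=$(( unicorns * 300 + 512 ))    # ~300MB per unicorn plus ~0.5GB system overhead
    echo "unicorns=$unicorns cores=$cores ram=${ram_mb}MB"
    # prints: unicorns=4 cores=2 ram=1712MB

Even with a generous peak estimate, that lands comfortably within a single modest droplet, if I am reading it right.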

A new version of discourse is available. I just saw we are running a beta version of this software. I only want to run stable versions from this point forward - I am reading their forums and see some upgrades to the new beta version have failed.

IMO it’s best to stick to stable releases, so I am probably going to be updating the software less often than before. I believe this makes the user experience a little more predictable and is also less work (it isn’t much to begin with). I see their recommendation is to run the latest and greatest beta version, and that “beta doesn’t mean what you think it does”, but we also run a lot of plugins, and those tend to break between versions because they are not a primary consideration for discourse developers. I think I will just stick with the “tests-passed” branch, but even then, I’m kind of leaning towards the stable branch.

Ditto goes for upgrading the server - doesn’t have to be a frequent thing.
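
If we do end up pinning to stable, my understanding (from skimming the discourse_docker repo, so treat this as an assumption) is that it is a one-line change in the params section of app.yml, followed by a rebuild:

    params:
      version: stable    # instead of the default tests-passed branch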

Discourse releases in perpetual beta and always has. That’s the default and what they recommend tracking.

Also, how are you coming along on the smtp, google and twitter accounts? Those are really the only 3 things keeping me on right now.

It’s fine if they’re always in beta - but I’m seeing stuff that worries me. There’s really no risk to running a few versions behind. I’m just talking about switching to their tests-passed branch or maybe the stable branch rather than their beta branch.

I have it all ready but keep getting derailed. My main laptop broke last weekend and put me off of it. I’ll break stuff in the morning. SMTP is a little more involved and will require a restart. I will post a banner and try to do my restart in the early hours of Sunday so that, if something goes wrong, it won’t be too disruptive.

A lot of this is new to me and I want to understand it completely before I just go mashing buttons. Sorry it is taking longer than I thought. Don’t worry about it though, I got it.


One of the things that really pisses me off about this software is the complete lack of documentation. It’s really obvious to me that they’re just trying to push people into their hosting service. I’ve really never seen anything like this before.

This is the best I can find on when to update, from the co-founder:

What is the right time to update?

It just depends on the time you have available and how close to bleeding edge you want to be. If you have non-official plugins, it is highly advisable to utilize a test/staging site. If you do not have any non-official plugins, you can likely upgrade immediately, but even then, some plugins may break for a couple of days as the team fixes them (there are a lot of them).

What is common practice when updating with many plugins installed?

If you have a lot of plugins, testing locally or on a test server is highly advised. Especially if you have non-official plugins, as something could have broken. If you find something does break, then it is a matter of, do you have time to fix it? Does the original plugin author have time to fix it? Either of those could take weeks. So at least this way, you simply have a broken test site and not a broken production site.

Lol no, sorry, beta is a snapshot of tests passed. Sorry I am tired and had kind of a long week.

But I really think running the stable branch is probably what we want. We run some plugins, and I don’t know enough about them yet to know whether they’re likely to break or not. I don’t even know what they do yet.


Cool, I found more DATA:

[screenshot: Screen Shot 2021-01-22 at 9.59.23 PM]

All of our plugins are widely used official ones. They are very unlikely to break on a routine update.


There will be a reboot tomorrow morning (Sunday, Jan 24) at approximately 10:30AM EST. The website may become unavailable intermittently for a few hours.

The site launcher is a bash script! (my main language). I like to read these things and understand how they work before I use them. I am a little anal about it unfortunately, but it seems pretty well written. I have some issues with the style, but it’s easy to read. Reading other people’s bash can be like reading someone else’s handwriting: it just makes no sense to anyone but the person who wrote it.

Updating twitter/google keys today, and making sure the SMTP stuff is all ready to go before I flip the switch in the morning.
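
For the record, “flipping the switch” tomorrow should just be the standard sequence as I understand it, run from /var/discourse:

    cd /var/discourse
    nano containers/app.yml    # drop in the real SMTP values
    ./launcher rebuild app     # tears down and rebuilds the container; this is the downtime window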

I definitely understand being nervous for the first rebuild, but they become super routine once you have a few under your belt. Whenever I’m uncertain of something I remind myself that Discourse is in active development by a large team, led by a couple of people who are widely known and well respected in the development community. I’ve learned to trust them over the past year and a half (even if all the image resizing/rehosting stuff is still a mess).


I’m more nervous because I’m making changes to a live site. I’ve never really worked that way - there’s always a staging/test site to work with. That’s probably overkill for us with like ~50-100k pageviews and 200-250 daily users, but it’s not nothing.

I am also a little error prone. But you’re right, the more I am reading about this the more comfortable I’m getting.

But this stuff is interesting to me anyway. I’m the type of person who needs to go way down the rabbit hole in tech I use. It makes me a little slower, but I gain a deeper understanding that lets me fix issues I otherwise probably couldn’t.

adding this to my wishlist:

Ok. I have set up mailgun (I think) on mailgun’s end and in the domain settings. Now I have to wait 24-48 hours for the DNS changes to propagate.

I may have done something wrong - but I have really no way of knowing without just waiting. The restart in the AM may not happen. Hopefully everything went well and the dns changes propagate by then. If not, I’ll have to reschedule.
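
While I wait, I can at least poke at the records from my laptop to see when they show up (the domain below is a placeholder, and the exact record names/values are whatever mailgun’s dashboard told me to add):

    # SPF / verification TXT records on the sending subdomain
    dig +short TXT mg.ourdomain.example

    # MX records (mailgun wants a couple of these for bounce/inbound handling)
    dig +short MX mg.ourdomain.example

    # DKIM TXT record - use the exact hostname shown in mailgun's domain settings
    dig +short TXT <dkim-selector>._domainkey.mg.ourdomain.example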

Going to test twitter/google stuff in the AM for sure but that does not require a restart.

Note: I registered for mailgun and provided a payment method. It is free for 3 months. Then it is literally pennies a month after that for our purposes. I’ll pick up that tab, lol.