Fun Poster Data

And then repeating for each ID in that comma separated list? hahaha nightmare.

yes.

Can you export that list of IDs into a single column in SQL?

yes, I think so, although I’ve never worked with a SQL database. Mostly just Redis, which is a K/V store. In my old job we developed a K/V database too, so it’s what I’m most familiar with. Could be a good opportunity to learn, I suppose.

Idk what Redis or K/V are. Best use what you’re familiar with if it gets the job done, unless you want a crash course in SQL Server and SSIS lol (you can download Dev edition of SQL Server for free).
Or just use MySQL if you’re only using it to hold the IDs.

Key/value store. Basically all it holds is a unique key with some object as the value. They’re the simplest and fastest databases you can use: great when you need read/write throughput, not so great when you need to do joins and stuff. Although some K/V databases out there are really trying to bridge that gap.
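(If a concrete picture helps, here’s roughly what the K/V model looks like from Python with the redis-py client. This is only a sketch, assuming a local Redis on the default port; the key name and the stored object are made up.)

import json
import redis

# Sketch only: assumes a local Redis server on the default port.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The whole interface is basically "put a value at a key, get it back".
r.set("post:12345", json.dumps({"username": "ggoreo", "like_count": 6}))
post = json.loads(r.get("post:12345"))
print(post["username"])

# No query language and no joins: anything fancier means fetching keys yourself.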

Ah ok, we call it a reference (or lookup) table. Maybe that will do the job but I’m not getting how you query post contents and do counts etc.

I had the mad idea of running 100k+ GETs and dumping the output of each into a SQL table, populating 100k rows (nothing these days), but I have no idea what the output looks like or whether it’s even suitable for SQL (possibly not).

It’s just JSON and not really very nested, so it would probably work pretty well. I could parse the JSON and just have each value be a column, with the post ID as the unique ID (primary key, or whatever it’s called, I’m forgetting right now).
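(For what it’s worth, the “each JSON value becomes a column” idea is only a few lines. A sketch using Python’s built-in sqlite3 just to keep it self-contained; any SQL database would do, and the field names — id, username, created_at, raw — are guesses at what the post JSON contains.)

import json
import sqlite3

conn = sqlite3.connect("posts.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS posts (
           id INTEGER PRIMARY KEY,   -- the post ID as the unique key
           username TEXT,
           created_at TEXT,
           raw TEXT
       )"""
)

def store_post(post_json: str) -> None:
    # Flatten one post object into one row; fields not listed above are ignored.
    p = json.loads(post_json)
    conn.execute(
        "INSERT OR REPLACE INTO posts (id, username, created_at, raw) "
        "VALUES (?, ?, ?, ?)",
        (p.get("id"), p.get("username"), p.get("created_at"), p.get("raw")),
    )
    conn.commit()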

The thing is, I’m not sure @zikzak would want me spamming the server with that many requests. I could multi-thread it and have it be a lot faster, but I know we’re a small site. If you think it’ll put strain on server resources, let me know. My instinct is that it wouldn’t, but I don’t know what things are like behind the scenes, and I’ve learned the hard way very recently not to make assumptions like that.
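(If it ever gets the green light, the multi-threaded version could be as simple as this sketch: a small worker pool plus a short per-request sleep so it never hammers the server. The pool size, the delay, and the topic IDs are placeholders.)

import time
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    time.sleep(0.5)  # small per-request delay to stay polite; tune as needed
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

topic_ids = [2505]  # whichever topics you're after
urls = [f"https://unstuckpolitics.com/t/{tid}.json" for tid in topic_ids]

with ThreadPoolExecutor(max_workers=4) as pool:  # a handful of workers, not 100
    results = list(pool.map(fetch, urls))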


6 Likes

Programming is so awesome. I wish I knew more about it. #jealous.

You’re basically running up against the lazy loading, which is why you can’t get all the posts in one shot. If you go to a specific post number you also get 20 full posts of data (although 5 of them will be previous posts just to confuse things).

So this:
https://unstuckpolitics.com/t/2505.json
and this:
https://unstuckpolitics.com/t/2505/1.json
both load posts 1 - 20.

But then if you jump up to post 26:
https://unstuckpolitics.com/t/2505/26.json
it will load posts 21 - 40. That cuts the number of requests you’d need to make way way down to get a full thread.
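(In code, that chunking boils down to a short loop, something like this sketch. The post_stream / posts field names are what the payload appears to use, so treat them as assumptions.)

import requests

BASE = "https://unstuckpolitics.com"

def fetch_topic(topic_id):
    # Walk the thread 20 posts at a time; stop once a window adds nothing new.
    posts, seen = [], set()
    post_number = 1
    while True:
        url = f"{BASE}/t/{topic_id}/{post_number}.json"
        data = requests.get(url, timeout=30).json()
        chunk = data.get("post_stream", {}).get("posts", [])
        new = [p for p in chunk if p["id"] not in seen]
        if not new:
            break                      # no new posts: we've walked off the end
        posts.extend(new)
        seen.update(p["id"] for p in new)
        post_number += 20              # jump to the next window of posts
    return posts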

The server is usually nowhere near capacity, especially in the off hours. The only time we put a strain on it is during live events, and I think that’s more from lots of concurrent DB writes than requests.

The forum runs Redis, but only as a caching DB and I really have no idea how that part works. The actual data is stored in a Postgres DB, which is straight SQL.

idk what limitations there are to the plugin ggoreo was using, if any. I can run reasonable queries from the command line if you want to write them. However, the schema is scary big and complicated so it might be difficult figuring out exactly what to query for.
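(For reference, if reasonable queries from the command line are on the table, they could just as easily be run from a script. A sketch with psycopg2 is below; the connection string is a placeholder and the table/column names — posts, topic_id, user_id — are guesses at the Discourse schema.)

import psycopg2

# Placeholder connection details; read-only credentials would be the sane choice.
conn = psycopg2.connect("dbname=discourse user=readonly host=localhost")

with conn, conn.cursor() as cur:
    # Example: post counts per user in one topic.
    cur.execute(
        """SELECT user_id, COUNT(*) AS post_count
             FROM posts
            WHERE topic_id = %s
            GROUP BY user_id
            ORDER BY post_count DESC""",
        (2505,),
    )
    for user_id, post_count in cur.fetchall():
        print(user_id, post_count)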

1 Like

After seeing zz’s post above, it sounds like it might be easier to get the data directly from Postgres into a local Postgres db somehow, if you have rights to connect to it, and then run queries against it if you have the schema (that can always be figured out).

I can probably chip in with some help.

OMG awesome. I don’t think that’s documented or maybe I just missed it. I knew there had to be some hacky way to do what I wanted to do, that’s perfect.

This API I give a 5/10. It seems kind of haphazardly designed, by someone who never really designed something like this before (I’ve worked with several of these). I found a bug in one of the fields earlier: when I’m getting all the topics for the forum, the URL for the next page is broken and I had to modify it to get it to work.

1 Like

In my head right now…

3 Likes

Dunno how well it’s designed since I’ve never looked into it much, but I can confirm that Discourse documentation absolutely blows. That’s probably because their business model is selling managed hosting for it, so they don’t have much incentive to make good docs available to the public.

I’d expect things to actually be built pretty well under the hood, though. The two Discourse founders are pretty well respected developers. One was a co-founder of Stack Exchange and the other was a lead developer there.

Yes, the documentation is pretty bad. Just picking somewhat randomly: the entry on “Liking a post and other actions” specifies there is a parameter you need to pass called post_action_type_id and that it is an integer, but it doesn’t tell you what any of the valid values are.
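(To make that concrete, the call that entry seems to describe looks roughly like this. The endpoint and headers follow the usual Discourse API conventions, and 2 is reportedly the post_action_type_id for a like, which is exactly the sort of value the docs don’t list, so treat it as an assumption.)

import requests

headers = {
    "Api-Key": "YOUR_API_KEY",        # placeholder credentials
    "Api-Username": "your_username",
}

# post_action_type_id=2 is reportedly "like"; the docs don't say so.
resp = requests.post(
    "https://unstuckpolitics.com/post_actions.json",
    headers=headers,
    json={"id": 12345, "post_action_type_id": 2},  # 12345 = some post ID
)
resp.raise_for_status()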

Lots of this type of documentation is kind of bad and you gotta go deep-diving into forums to find the stuff you need, but the reason I say the design is bad is that typically, for these kinds of requests, you get objects back that more or less look the same. And they’re very simple in structure, like:

{
  "values": [
    { .... },
    { .... }
  ],
  "next_page": (some url)
}

This kind of structure is extremely easy to parse and program around.

But for Discourse, it seems completely random what the structure of the object you get back is. The “next_page” field is not documented, can be literally anywhere in the object, and is never called the same thing, or even a thing that makes sense sometimes. And typically in these REST-style APIs, when there is no more data to receive, there’s also a field called “has_more” that will be false. Not so with this one: I have to check that certain fields are empty to know when to stop.
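(So the stop condition ends up looking something like this sketch: page through the topic list and bail when it comes back empty. The latest.json endpoint and the topic_list / topics field names are assumptions about the payload.)

import requests

BASE = "https://unstuckpolitics.com"

def all_topics():
    # No "has_more" flag, so the only reliable signal is an empty page.
    page = 0
    while True:
        resp = requests.get(f"{BASE}/latest.json", params={"page": page}, timeout=30)
        topics = resp.json().get("topic_list", {}).get("topics", [])
        if not topics:
            return
        yield from topics
        page += 1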

I once had to build an application that updated some databases with UPS info using the UPS api. That was kind of sucky too.

1 Like

It’s not that unusual. Often in real world situations you have to migrate data from an obsolete source and have to spend some time analysing it to work out what the relationships and key values are.


Thanks for the offer of help, I’ll let you know.


I’m really more of a devops guy so playing with data isn’t really one of my strong suits. I’m a lot better at retrieving the data, putting it somewhere, automating the process, etc.

Oh good, because although ETL was a large part of my job when I was working, it was the part I least enjoyed.

Really enjoying your coding stories. I’d gobble up a novel where the main character is a coder in an unusual situation. Romantic comedy, maybe. Or a space opera about moon pirates.

1 Like

Pony posts are an art, my dear Scottish brute. I’d sing you a sad lullaby, but I’m understandably feeling a little horse.

1 Like

This might hit the spot for you, if you enjoy 1,000 page doorstops by authors who refuse to edit themselves because they love the smell of their own farts too much.

2 Likes

Probably the simplest solution would be to have a job running to export the Postgres db periodically (weekly? monthly?) to a secure site where you can access it, download it into a local Postgres db, and play around with it there.
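(Something along these lines, maybe, wrapped so it can be scheduled with cron or similar. Database names, hosts, and the dump path are placeholders; pg_dump and pg_restore are the standard Postgres tools for exactly this.)

import subprocess

DUMP_FILE = "discourse.dump"  # placeholder path

def export_dump():
    # Run on the server: custom-format dump of the forum database.
    subprocess.run(["pg_dump", "-Fc", "-d", "discourse", "-f", DUMP_FILE], check=True)

def import_dump():
    # Run locally: load the dump into a scratch database to play with.
    subprocess.run(
        ["pg_restore", "--clean", "--no-owner", "-d", "discourse_local", DUMP_FILE],
        check=True,
    )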