Fun Poster Data

mental pony posts ftw :metal:


8 Likes

Hit a bit of a wall here. The API doesn’t have good support for getting all posts from a topic. Basically, it’ll return the last 20 posts plus a field called “stream”, which is simply the post IDs for every post in the topic. Then you’re supposed to query each of those post IDs to get the post data. You can receive more posts per request, but only up to 1000.

But there’s like 100k+ posts on this site. I don’t want to send a GET request for each one; it’ll take all day. Granted, it could be a rare operation, but still, ain’t nobody got time for that. Maybe I’m missing something. Their discussion forums aren’t being helpful on the matter; they just recommend the method I described above.
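To make that concrete, here’s a rough sketch in Python (assuming the requests library, and the post_stream shape that Discourse topic JSON normally has; the topic URL is the one that comes up later in this thread):

```python
import requests

BASE = "https://unstuckpolitics.com"  # the forum discussed in this thread
TOPIC_ID = 2505                       # example thread ID mentioned below

# One GET returns the topic JSON: ~20 posts inline, plus a "stream"
# listing the ID of every post in the topic.
topic = requests.get(f"{BASE}/t/{TOPIC_ID}.json").json()

loaded_posts = topic["post_stream"]["posts"]   # the posts included inline
all_post_ids = topic["post_stream"]["stream"]  # IDs for every post in the topic

print(f"{len(loaded_posts)} posts loaded, {len(all_post_ids)} IDs in the stream")
```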

3 Likes

Sounds like a SQL join, no?

Or would you have to send the SQL individually, with each of the 100k IDs parameterised?

caveat: I have nfi how the web stuff works

If this were in a SQL database, yeah. Maybe I’d have to use gregorio’s plugin - but I am not an admin here. The problem, if I didn’t state it well, is that I have to submit an HTTP request for each post I want. The round-trip time of one request is, like, half a second? At 100k+ posts, that’s 100,000 × 0.5 s ≈ 14 hours to get the whole site’s post data. Which is theoretically fine; I could just create a DB and then update it once a week or so.

All I am doing, basically, is hitting a specific URL on this Discourse site and it spits data back out at me. Then I iterate through all the URLs that are available for a particular piece of data I’m asking for.
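The naive version is short, which is the frustrating part (a sketch; /posts/{id}.json is the standard Discourse single-post endpoint, and the timing is the rough half-second figure above):

```python
import requests

def fetch_all_posts_naively(base, topic_id):
    """One GET per post ID -- the slow approach described above."""
    stream = requests.get(f"{base}/t/{topic_id}.json").json()["post_stream"]["stream"]
    posts = []
    for post_id in stream:
        # ~0.5 s of round trip per request; at 100k posts that's ~14 hours.
        posts.append(requests.get(f"{base}/posts/{post_id}.json").json())
    return posts
```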

You can probably do that in an SSIS package and dump the output into one large staging SQL table, and work from there.

Once that’s done, it’s easy to create a SQL Server Agent job that runs periodically.

Wouldn’t blame you one bit if you don’t want to take it on, though.

For instance, if you want data on the Donald Trump thread:

go to this link: The Pozzidency of Donald J. Trump: Typhoid Donnie's Slow Hypoxic Demise **Sweat Thread** (updated 100x/minute)

2505 is the thread ID of the Trump thread. I’m just doing GETs of these URLs.

And then repeating for each ID in that comma-separated list? hahaha, nightmare.

Yeah, I feel a weird need to do programming stuff on my days off now. I was hoping the brute-force, naive approach would work, but it seems like it probably won’t in a timely manner. I’ve had to engineer crazier things to get around rate limits and stuff for REST APIs (don’t even get me started on rate limits; I fucking hate them), so I can probably figure something out, but it won’t be the few-hour thing I was hoping for.

Oh well, I like a challenge. I’ve needed a side project for a little while now and I never have any ideas on what to do.

1 Like

Can you export that list of IDs into a single column in SQL?

And then repeating for each ID in that comma-separated list? hahaha, nightmare.

yes.

Can you export that list of IDs into a single column in SQL?

Yes, I think so, although I’ve never worked with a SQL database. Mostly just Redis, which is a K/V store. At my old job we developed a K/V database too, so it’s what I’m most familiar with. Could be a good opportunity to learn, I suppose.
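For illustration, a minimal sketch of that export using Python’s built-in sqlite3 (the DB file and table name here are made up; the stream field is the one described earlier):

```python
import sqlite3

import requests

# Pull the full ID stream for one topic, then store it in a one-column table.
stream = requests.get(
    "https://unstuckpolitics.com/t/2505.json"
).json()["post_stream"]["stream"]

conn = sqlite3.connect("posts.db")  # hypothetical local DB file
conn.execute("CREATE TABLE IF NOT EXISTS post_ids (post_id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT OR IGNORE INTO post_ids (post_id) VALUES (?)",
    [(post_id,) for post_id in stream],
)
conn.commit()
conn.close()
```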

Idk what Redis or K/V are. Best to use what you’re familiar with if it gets the job done, unless you want a crash course in SQL Server and SSIS lol (you can download the Dev edition of SQL Server for free).
Or just use MySQL if you’re only using it to hold the IDs.

Key/value store. Basically all it holds is a unique key with some object as the value. They’re the simplest and fastest databases you can use. Great for when you need read/write throughput, not so great when you need to do joins and stuff, although some K/V databases out there are really trying to bridge that gap.
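In code terms, the whole interface is basically this (a sketch using the redis-py client; it assumes a local Redis server on the default port, and the key and fields are invented for the example):

```python
import json

import redis

r = redis.Redis()  # local Redis on the default port

# A K/V store in a nutshell: one unique key, one opaque value.
r.set("post:12345", json.dumps({"username": "zikzak", "like_count": 6}))
post = json.loads(r.get("post:12345"))

# Lookups by key are fast, but there's no JOIN: relating posts to topics
# or counting per author means doing that work yourself in application code.
```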

Ah ok, we call it a reference (or lookup) table. Maybe that will do the job, but I’m not getting how you query post contents and do counts, etc.

I had the mad idea of running 100k+ GETs and dumping the output of each into a SQL table, populating 100k rows (nothing these days), but I have no idea what the output looks like or whether it’s even suitable for SQL (possibly not).

It’s just JSON and not really very nested, so it would probably work pretty well. I could parse the JSON and have each value be a column, with the post ID as the unique ID (or whatever it’s called; I’m forgetting right now).
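Something like this, sketched with sqlite3 (the columns are a hand-picked subset for illustration; real Discourse post JSON has many more fields):

```python
import sqlite3

# Post ID as the primary key, one column per field we care about.
conn = sqlite3.connect("posts.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        id         INTEGER PRIMARY KEY,
        topic_id   INTEGER,
        username   TEXT,
        created_at TEXT,
        cooked     TEXT  -- the rendered HTML of the post
    )
""")

def store_post(post_json):
    """Flatten one post's JSON into a row; re-running just overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?)",
        (
            post_json["id"],
            post_json["topic_id"],
            post_json["username"],
            post_json["created_at"],
            post_json["cooked"],
        ),
    )
    conn.commit()
```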

The thing is, I’m not sure @zikzak would want me spamming the server with that many requests - I could multi-thread it and make it a lot faster. I know we’re a small site. If you think it’ll put a strain on server resources, let me know. My instinct is that it wouldn’t, but I don’t know what things are like behind the scenes, and I’ve learned the hard way very recently not to make assumptions like that.
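For what it’s worth, the multi-threaded version is short (a sketch with concurrent.futures; the worker count is the knob for how hard it hits the server):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://unstuckpolitics.com"

def fetch_post(post_id):
    return requests.get(f"{BASE}/posts/{post_id}.json").json()

def fetch_posts_threaded(post_ids, workers=4):
    """A few requests in flight at once; workers=4 keeps the load modest."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_post, post_ids))
```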


6 Likes

Programming is so awesome. I wish I knew more about it. #jealous.

You’re basically running up against the lazy loading, which is why you can’t get all the posts in one shot. If you go to a specific post number you also get 20 full posts of data (although 5 of them will be previous posts just to confuse things).

So this:
https://unstuckpolitics.com/t/2505.json
and this:
https://unstuckpolitics.com/t/2505/1.json
both load posts 1 - 20.

But then if you jump up to post 26:
https://unstuckpolitics.com/t/2505/26.json
it will load posts 21 - 40. That cuts the number of requests you’d need to make way way down to get a full thread.
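That trick turns the whole scrape into a simple paging loop, something like this (a sketch; the overlap mentioned above is handled by deduping on post ID, and the stride of 20 follows the 26-loads-21-40 example):

```python
import requests

BASE = "https://unstuckpolitics.com"

def fetch_topic(topic_id):
    """Walk a topic ~20 posts per request by jumping the post number."""
    first = requests.get(f"{BASE}/t/{topic_id}.json").json()
    total = len(first["post_stream"]["stream"])
    posts = {p["id"]: p for p in first["post_stream"]["posts"]}

    post_number = 26  # first jump, per the example above
    while len(posts) < total:
        page = requests.get(f"{BASE}/t/{topic_id}/{post_number}.json").json()
        new = {p["id"]: p for p in page["post_stream"]["posts"]}
        if all(pid in posts for pid in new):
            break  # nothing new came back; stop rather than loop forever
        posts.update(new)  # dedupes the few previous posts each page repeats
        post_number += 20
    return list(posts.values())
```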

The server is usually nowhere near capacity, especially in the off hours. The only time we put a strain on it is during live events, and I think that’s more from lots of concurrent DB writes than requests.

The forum runs Redis, but only as a caching DB and I really have no idea how that part works. The actual data is stored in a Postgres DB, which is straight SQL.

idk what limitations there are to the plugin ggoreo was using, if any. I can run reasonable queries from the command line if you want to write them. However, the schema is scary big and complicated so it might be difficult figuring out exactly what to query for.

1 Like

After seeing zz’s post above, it sounds like it might be easier to get the data directly from Postgres into a local Postgres DB somehow, if you have rights to connect to it, and then run queries against it if you have the schema (that can always be figured out).

I can probably chip in with some help.

OMG awesome. I don’t think that’s documented, or maybe I just missed it. I knew there had to be some hacky way to do what I wanted to do; that’s perfect.

This API I’d give a 5/10. It seems kind of haphazardly designed, by someone who never really designed something like this before, and I’ve worked with several of these. I found a bug in one of the fields earlier - when I’m getting all the topics for the forum, the URL for the next page is broken, and I had to modify it to get it to work.

1 Like

In my head right now…

3 Likes