Programming

Thanks, I am interested

Messaging you the first week’s lesson in case you don’t want to wait for it to show up over there. Let me know if there’s any networking stuff I can help with. I’m ~intermediate on this stuff but more focused on security, and I don’t have a ridiculous amount of firewall experience or anything.

Thanks. Yeah, I’m much stronger in Python than I am in networking, but networking is a huge part of my job. Was hoping the course would focus more on that aspect, but it won’t hurt to do a little brush-up because I’ve been programming in nothing but bash for 3 months now.

I had a chance to use a product called NetBrain at my last digs and my networking knowledge wasn’t really strong enough to use it. The product was a real pain in the ass to configure, so we didn’t have much set up in there; it was basically a $15k piece of network diagramming software. I’ve learned the very basics of Python but used Visual Basic many suns ago and have academic exposure to C++, Java, HTML etc. Honestly network administration sucks and 65% of the job is picking up the phone and listening to someone pissed that they don’t get email on their phone, so I wouldn’t hate transitioning to something more devops/strict networking soon.

I’m on like lesson 1.9 and will hang it up for the night soon probably.

So I’m trying to regularly scrape some JSON data from nba.com, and I got it working and deployed it to our Google cloud server and after a little while it came to my attention that it wasn’t working. Specifically, the NBA server is simply ignoring the requests. This also happens if you don’t have the correct headers in the request (in particular the Referer header has to be in line with what it expects), I know that because I ran into it while developing. I assumed it was an IP ban, so I installed VPN software on the cloud capable of cycling through IPs, but this didn’t do anything. So then I assumed it must be an environment difference (as I’m using Windows and the cloud runs on Ubuntu) but I made test requests to a web page that reports your headers back to you and ran the code on my local machine and on the cloud. The headers are identical.

So I’m stumped now. Anyone have any idea what could be causing the code to work locally and fail on the server? It’s not cookies because my local code also just requests JSON files without any preamble and it works.

By “ignoring” - do you get back an HTTP status code, or do the requests just time out?

How often are you hitting nba.com?

Is there anything uniquely identifiable in your headers?

The requests time out. I’m not using an antihammer so it just makes one request at a time as frequently as it processes them. I did that locally for tens of thousands of requests and they didn’t seem to care and now I’m just making a few requests a day, so the IP ban theory never really held water, I just didn’t know what else it could be. BTW I’m making the same requests of wnba.com (which has the same stats backend running) and it’s the same result, they work locally, fail on the cloud.

There’s nothing identifiable, here are all the headers I’m sending, also a Referer header which is different depending on exactly what is to be requested.

Accept: application/json, text/plain, */*
Accept-Encoding: gzip,deflate
Connection: keep-alive
Host: stats.nba.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36
X-Nba-Stats-Origin: stats
X-Nba-Stats-Token: true

I have no clue what those X-Nba headers are but they’re required and it doesn’t seem to require any sort of actual token. It’s just happy if I say yes, there totally is one. A bit like Building The Wall.
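For anyone wanting to reproduce this outside Java, here is a minimal Python sketch of a request carrying the headers listed above. The function name build_request is made up for illustration, and the sketch only builds the request object; whether stats.nba.com accepts it is exactly the open question in this thread.

```python
import urllib.request

# Headers copied from the list above. Host is filled in by urllib from the URL,
# and urllib sends its own "Connection: close", so both are left out here.
BASE_HEADERS = {
    "Accept": "application/json, text/plain, */*",
    # urllib does not decompress responses itself, so a gzip-encoded body
    # would have to be decoded by the caller.
    "Accept-Encoding": "gzip,deflate",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
    ),
    "X-Nba-Stats-Origin": "stats",
    "X-Nba-Stats-Token": "true",
}


def build_request(url, referer):
    """Return a urllib Request with the base headers plus a per-request Referer."""
    headers = dict(BASE_HEADERS)
    headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)
```

Note that urllib normalises header capitalisation (for example X-nba-stats-token). HTTP header names are case-insensitive, so this should not matter to a compliant server.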

They could be banning a whole block of IPs for some reason. But when we do that with AWS the request gets an instant 503 (or something), not a time out.
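The difference between an immediate error status and a silent timeout is easy to check in code. A rough sketch, assuming Python's standard urllib; the function name probe and its opener parameter (there only so the function can be exercised without a network) are illustrative, not from anyone's actual scraper.

```python
import socket
import urllib.error
import urllib.request


def probe(url, headers=None, timeout=10, opener=urllib.request.urlopen):
    """Return a short label describing how (or whether) the server answered."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(request, timeout=timeout) as response:
            return f"status {response.status}"
    except urllib.error.HTTPError as err:
        # The server did answer, just with an error code such as 403 or 503.
        return f"status {err.code}"
    except urllib.error.URLError as err:
        # Connection-level failure; a timeout while connecting is wrapped here.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return f"connection error: {err.reason}"
    except (socket.timeout, TimeoutError):
        # A timeout while waiting for the response can surface unwrapped.
        return "timeout"
```

A result of "status 503" would point at an explicit block that answers quickly, while "timeout" matches the silently dropped requests described here.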

Makes you wonder if it’s not google cloud that’s killing the request because it detects something it doesn’t like. You might try heroku, digital ocean or azure just to see if it behaves differently.

Referer is supposed to be the previous page that the new request originated from. google cloud might not like that being gamed. It could be stripping off the header. What happens when you send from local w/o a referer header?

There are a lot of weird hops with stuff like cloudflare and akamai that might be doing something weird.

The VPN I was using cycles different IPs in different countries, none of them worked. I sent a gamed referer to the test site that reports back headers and google cloud passed it along just fine. The NBA server just ignoring the request is what happened in development when my headers were incorrect (before I had added the referer, for example) so I’m hesitant to chalk up exactly the same thing happening now to anything other than it not liking my requests for some reason.

It’s possible google cloud is doing something weird to the headers, but only for nba.com (and maybe other mega sites or other sites they have some kind of connection with).

Definitely curious about other cloud providers.

Are you 100% sure that your code that pings your test server is exactly the same as the code that hits nba.com except for the URL?

I think my next step will be to try to browse the site using some sort of browser on the cloud server (it’s Compute Engine so I can install whatever I want) and see what the request looks like when made in normal fashion as part of loading a page.

That’s a good idea.

Can you provide a curl statement of what exactly you’re doing? I’d like to try on my own to duplicate this. This seems right up my alley! Rare for me in this thread.

Yikes, is it true they don’t have documentation? I can’t find any.

You mean the NBA stats thing? It’s not designed as an API.

I don’t speak Linux really but I’ll figure out the curl syntax as it’s probably a good idea to duplicate the call using something other than Java as it’s possibly some oddity in Java’s HTTP implementation on Linux.

curl is super straightforward - you can add headers

The other thing is maybe this is some SSL thing and whatever google client is making the call has some issue with nba.com.

Here is an example curl command:

curl --compressed -H "Accept: application/json, text/plain, */*" -H "Accept-Encoding: gzip,deflate" -H "Connection: keep-alive" -H "X-Nba-Stats-Origin: stats" -H "X-Nba-Stats-Token: true" -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36" -e "https://stats.nba.com/teams/traditional/?sort=W_PCT&dir=-1&SeasonType=Playoffs&Season=2019-20&DateFrom=09%2F11%2F2020&DateTo=09%2F11%2F2020" "https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=09%2F11%2F2020&DateTo=09%2F11%2F2020&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Playoffs&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision="

I installed curl for Windows and again this command succeeds locally, fails on the cloud. I also tried my Amazon cloud server and it fails there too.

I’d try the same requests, but instead of hitting stats.nba.com, change it to a URL where you can inspect the request that arrives (e.g. go to webhook.site and use the URL it gives you).

Then you can see if the requests are getting through at all, or if the headers are getting mangled somehow.
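Once an echo service has reported back what it received, the comparison can be automated. A small sketch, assuming both sides are available as plain dicts; diff_headers is a hypothetical helper name.

```python
def diff_headers(sent, received):
    """Compare two header dicts case-insensitively.

    Returns (missing, changed): header names that never arrived, and a dict of
    header names whose value differs, mapped to (sent_value, received_value).
    """
    sent_l = {k.lower(): v for k, v in sent.items()}
    recv_l = {k.lower(): v for k, v in received.items()}
    missing = sorted(k for k in sent_l if k not in recv_l)
    changed = {
        k: (sent_l[k], recv_l[k])
        for k in sent_l
        if k in recv_l and sent_l[k] != recv_l[k]
    }
    return missing, changed
```

If both results come back empty from the local machine and from the cloud server, header mangling in transit is unlikely to be the cause.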

Edit: I see a few posts up that you might have already done basically that. My best guess is stats.nba.com is blocking cloud provider IP ranges.
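One way to test the cloud-range guess is to check the server's public IP against the address ranges the providers publish (AWS and Google Cloud both publish theirs as JSON, as far as I know). A sketch using Python's ipaddress module; the CIDR blocks below are documentation-only placeholders, not real provider ranges, and in_any_range is a made-up name.

```python
import ipaddress

# Placeholder CIDR blocks reserved for documentation (TEST-NET-2, TEST-NET-3).
# Substitute the ranges actually published by the cloud provider.
EXAMPLE_CLOUD_RANGES = ["198.51.100.0/24", "203.0.113.0/24"]


def in_any_range(ip, cidrs):
    """Return True if the IP address falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)
```

A match would not prove that stats.nba.com blocks the range, only that the server's address is trivially identifiable as belonging to a cloud provider.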

Looks likely.

More info

Yeah that’ll do it. They might also know and block the VPN ChrisV tried.

Also be careful running locally:

I even managed to get my local IP addressed blocked due to intermittent stat-retrieval testing which met their threshold to get the IP banned (so much for fostering app development).

So you can proxy it through a local machine. But be prepared to change IPs when it gets banned.
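Routing the cloud server's requests through a proxy on a home machine could look roughly like this in Python. The proxy address in the usage note is purely hypothetical, and opener_via_proxy is an illustrative name; this assumes an HTTP proxy is already running and reachable from the server.

```python
import urllib.request


def opener_via_proxy(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Usage would be something like opener_via_proxy("http://127.0.0.1:8888").open(request, timeout=10), with the address replaced by wherever the proxy actually listens.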

Ah. Thanks @emmpee. Guess I should have googled more. Didn’t seem like it could be an IP issue after the cycling VPN was also failing. Perhaps my VPN provider uses cloud machines for their VPN servers.

Weird that they didn’t block my IP when I was spamming them with historical requests but who knows, maybe they eventually did. I don’t have a static IP at home so I might have gotten one of my ISP’s IPs blocked and not even realised.
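Given the ban threshold mentioned above, spacing requests out is probably worth doing even locally. A minimal "antihammer" sketch in Python; the class name Throttle is made up, the right interval is a guess since the real threshold is unknown, and the clock and sleep parameters exist only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() right before each request keeps the scraper from firing requests as fast as it can process them.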