I’m having an issue joining two large tables in BigQuery (Google Cloud) and am throwing this out there in the hope that someone might have an idea.
I want to use the following join condition:
table1.attribute = substring(table2.attribute, 4, length(table1.attribute))
I have to specify the length in the substring because table2.attribute has a load of garbage at the end of indeterminate length.
With the above join condition the query does run but it never completes (I’ve let it run for half an hour before killing it), and the problem seems to be that I’m referring to table1.attribute on both sides of the join.
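One pattern that can help with this kind of predicate (a hedged sketch, not tested against your data): give the engine a plain equality to hash on by joining on a fixed-width prefix, then filter down to the exact match afterwards. This assumes every table1.attribute is at least 8 characters long, which may or may not hold for your data:

```sql
-- Sketch only: turn the non-equi predicate into a hashable equality.
-- Assumes every table1.attribute is at least 8 characters long
-- (adjust 8 to whatever your real minimum key length is).
SELECT t1.attribute, t2.attribute AS raw_value
FROM table1 t1
JOIN table2 t2
  ON SUBSTR(t1.attribute, 1, 8) = SUBSTR(t2.attribute, 4, 8)       -- equality the engine can hash
WHERE t1.attribute = SUBSTR(t2.attribute, 4, LENGTH(t1.attribute)); -- exact-match check
```

The prefix equality does the heavy lifting in the join, and the WHERE clause throws out any prefix collisions, so the result should match the original condition as long as the minimum-length assumption holds.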
The only way I can get it to work is to use hardcoded values instead of length(table1.attribute) and union multiple queries together with the known lengths, but I want to avoid this as I can’t guarantee that in the future there won’t be other lengths that I haven’t covered.
I’ve also tried using:
SUBSTRING(table2.attribute, 4) LIKE CONCAT(table1.attribute, '%')
And:
REGEXP_CONTAINS(table2.attribute, table1.attribute) = true
But am hitting the same issue where the query just keeps running.
I’ve also tried joining to a separate subquery/temp table that contains all of the distinct lengths of table1.attribute in the current data, but this takes over 15 minutes to complete, whereas the hardcoding method takes a little over one minute, so it’s not viable.
None of my technical colleagues have been able to come up with an alternative, and it’s driving me mad!
Can you truncate the attributes to the length you want and read them into a temp table(s), then do the join from that?
No because I don’t know how much of table2.attribute I need to retain. It usually has some crap on the end, and the length of that is unknown. Somewhere in the middle of that string is the value I want to join to.
E.g.
Table 1: L56374848
Table 2: ABCL56374848282635
Table 1: L174888
Table 2: ABCL174888648362262578
Is it a number to letter transition that you could try to chop by?
Assuming you can’t get false positives, and you could knock out 90% of table 1 with a temp table for table 2 based on the number/letter split, then maybe the last 10% of table 1 against the original table 2 wouldn’t take forever.
How many different possible lengths are there in table 1? It might still be quicker to make 4 temp tables and do the join on those. Start with the longest length temp table, and remove table 1 matches after each join.
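To make the temp-table suggestion concrete, here’s a rough sketch of what one length bucket could look like (table and column names are taken from the thread; length 8 is just an example, and whether this beats the union approach would need testing):

```sql
-- Sketch: materialize one fixed-width slice of table2 per known length,
-- so each join becomes a plain equality. Shown for length 8 only.
CREATE TEMP TABLE t2_len8 AS
SELECT SUBSTR(attribute, 4, 8) AS join_key, attribute AS raw_value
FROM table2;

SELECT t1.attribute, m.raw_value
FROM table1 t1
JOIN t2_len8 m
  ON t1.attribute = m.join_key
WHERE LENGTH(t1.attribute) = 8;
```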
No, it’s mostly numbers. Sorry, I should have been clearer; I only used letters to make it more visible. I’ve edited the post to use numbers instead.
We can assume no false positives.
There are 5 distinct lengths for table 1, but no guarantee that additional lengths won’t appear in the future, which would mean a code change every time this happens using my hardcoding and union method (even though it runs well).
Where does the data in table 2 come from? I assume you can’t check it against table 1 when it’s entered and make a new column for the table 1 match? You could do the same for table 1 if it’s entered piecemeal and it’s your data.
Also maybe your hardcoded method could be softcoded based on a dynamic list of all the possible lengths in table 1.
Sorry not following what you mean here when you say “check it against table 1 when it’s entered”.
Tried this before, also as a temp table, and ran into the same performance issues.
Well if this was say sales data and you controlled the code where it’s entered into the table, you could do stuff there. So one at a time you’re either adding a column to table 1 or table 2, based on a match with the other, when the data is entered. But if it’s tables you’re getting whole from somewhere else you don’t control, then you can’t do much.
What about running a job on table 1 every night where you get a dynamic list of the lengths?
Ah I see, no, I don’t have any control over that.
Re the dynamic lists, I ran the query just using a hardcoded list and it took 15 mins to run, so I assume a dynamic list wouldn’t improve things?
No it would just mean you don’t have to change your code every time a new length appears. Also if a length disappears you wouldn’t waste time on it.
How long are both of these tables?
Agree, but the 15 min vs 1 min difference isn’t going to fly with the users.
Table 1 - 3 million records
Table 2 - 400 million records
Right - but I’m saying maybe you could run the job to get the lengths every night or something, so it’s ready whenever you want to run the main query.
Is this VIN numbers? I’ve had to work with those before. They’re a lot of fun. Especially all the zeros entered as O, l vs 1, etc.
But haven’t I effectively tested this already by creating a hardcoded list and joining to that, which took 15 mins? Sorry I’m not the most technical so I might be missing something obvious.
Nah they’re bond trade identifiers.
Ok which query was it that only takes a minute?
FWIW a complex join of 3 million into 400 million isn’t the kind of thing you expect to happen super quick. It’s more like the kind of job you put on a schedule and make the results available once a day or something.
The query where I union 5 queries together, each join with the length hardcoded:
table1.attribute = substring(table2.attribute, 4, 8)
Union
table1.attribute = substring(table2.attribute, 4, 9)
Union
table1.attribute = substring(table2.attribute, 4, 10)
Union
table1.attribute = substring(table2.attribute, 4, 13)
Union
table1.attribute = substring(table2.attribute, 4, 16)
Right. So can’t you dynamically generate that query before running it? If the number of lengths has to be absolutely up to the second, then maybe not. But if you can run the length job on a schedule, then use that to dynamically generate the query then run the query - you should get the same results, right?
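For reference, a hedged sketch of how that generation step might look using BigQuery scripting (EXECUTE IMMEDIATE); this assumes you have permission to run scripts, and the table/column names are taken from the thread:

```sql
-- Build the union query from whatever lengths exist in table1 today,
-- then run it. No code change needed when a new length appears.
DECLARE generated_sql STRING;

SET generated_sql = (
  SELECT STRING_AGG(
    FORMAT(
      'SELECT t1.attribute FROM table1 t1 JOIN table2 t2 ON t1.attribute = SUBSTR(t2.attribute, 4, %d)',
      len),
    ' UNION ALL ')
  FROM (SELECT DISTINCT LENGTH(attribute) AS len FROM table1)
);

EXECUTE IMMEDIATE generated_sql;
```

UNION ALL is used here on the assumption that the length buckets are disjoint (a row can only match one length), which also skips the dedup cost of UNION DISTINCT.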
I’m not sure I have the permissions to dynamically create the query, but that certainly seems like the option to explore.
Thanks so much for your time and patience!