Having a table "Transaction" that has:
- id (id auto increment)
- title (text)
- description (text)
- vendor (text)
- gross_amount (a numeric amount, also referenced by the queries below)
The task is to produce a list of the 100 most used words across any of these columns, together with their two-word combinations, where the reverse of a pair is ignored (e.g. the permutations of A and B would be AA, AB, BA and BB, and we want to exclude the cases where A = B and where A > B). For example, if a transaction had:
- title = PayPal payment
- description = (empty)
- vendor = Sony
We would expect to get the distinct list of words [PayPal, payment, Sony]. Note that in some cases a word may carry punctuation, which has to be stripped.
So the expected result would be: [PayPal, payment, Sony, payment PayPal, PayPal Sony, payment Sony]
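To make the word and pair extraction concrete, here is a small standalone Python sketch of the intended behaviour (purely illustrative, not the real code; which word ends up first inside a pair simply depends on the string comparison used, here plain case-sensitive ordering):

import re
from itertools import combinations

# One example transaction, taken from above.
title, description, vendor = "PayPal payment", "", "Sony"

# Combine the three columns, then strip punctuation.
text = " ".join(filter(None, [title, description, vendor]))
text = re.sub(r"[^\w\s]+", "", text)

# Distinct words, ordered so that each pair is generated exactly once (A < B).
words = sorted(set(text.split()))

# Two-word combinations; AA and the reverse BA never appear.
pairs = [" ".join(pair) for pair in combinations(words, 2)]

print(words)   # ['PayPal', 'Sony', 'payment']
print(pairs)   # ['PayPal Sony', 'PayPal payment', 'Sony payment']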
I made a SQL query for Postgres to do this and the performance was terrible:
WITH
oneWord AS (
    SELECT t.id, a.word, t.gross_amount
    FROM (SELECT * FROM transaction t) t,
         unnest(string_to_array(
             regexp_replace(
                 regexp_replace(
                     concat(t.vendor, ' ', t.title, ' ', t.description),
                     '\s+', ' ', 'g'),
                 '[[:punct:]]', '', 'g'),
             ' ', '')) AS a(word)
    WHERE a.word NOT IN (SELECT word FROM wordcloudexclusion)
),
oneWordDistinct AS (
    SELECT id, word, gross_amount FROM oneWord
),
twoWord AS (
    SELECT a.id, concat(a.word, ' ', b.word) AS word, a.gross_amount
    FROM oneWord a, oneWord b
    WHERE a.id = b.id AND a < b
),
allWord AS (
    SELECT id, word, gross_amount FROM oneWordDistinct
    UNION ALL
    SELECT id, word, gross_amount FROM twoWord
)
SELECT a.word, count(a.id)
FROM allWord a
GROUP BY a.word
ORDER BY 2 DESC
LIMIT 100;
And I did the same in Python as follows:
import re

text_stats = {}

# Rows come from the database (pseudocode for the fetch):
#   SELECT id, title, description, vendor, gross_amount FROM transactions
transactions = ...

# excluded_words mirrors the wordcloudexclusion table used in the SQL.
for [id, title, description, vendor, amount] in list(transactions):
    # Combine the three text columns, strip punctuation and whitespace noise.
    text = " ".join(filter(None, [title, description, vendor]))
    text_without_punctuation = re.sub(r"[.!?,]+", "", text)
    text_without_tabs = re.sub(
        r"[\n\t\r]+", " ", text_without_punctuation
    ).strip(" ")
    # Distinct words of this transaction.
    words = list(set(filter(None, text_without_tabs.split(" "))))
    for a_word in words:
        if a_word not in excluded_words:
            # Count the single word (and sum its amounts).
            if not text_stats.get(a_word):
                text_stats[a_word] = {
                    "count": 1,
                    "amount": amount,
                    "word": a_word,
                }
            else:
                text_stats[a_word]["count"] += 1
                text_stats[a_word]["amount"] += amount
            # Count each pair a_word < b_word exactly once.
            for b_word in words:
                if b_word > a_word:
                    sentence = a_word + " " + b_word
                    if not text_stats.get(sentence):
                        text_stats[sentence] = {
                            "count": 1,
                            "amount": amount,
                            "word": sentence,
                        }
                    else:
                        text_stats[sentence]["count"] += 1
                        text_stats[sentence]["amount"] += amount
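For completeness, the top 100 is then read out of text_stats roughly like this (a minimal sketch of the final step, mirroring the ORDER BY 2 DESC LIMIT 100 in the SQL):

top_100 = sorted(
    text_stats.values(), key=lambda stat: stat["count"], reverse=True
)[:100]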
My question is: is there a way to improve the performance of the SQL so that it isn't completely obliterated by the Python? Currently, on a 20k-record transaction table, the Python takes ~6-8 seconds and the SQL query takes 1 minute and 10 seconds.
Here is the EXPLAIN ANALYZE output for the SQL query:
Limit (cost=260096.60..260096.85 rows=100 width=40) (actual time=63928.627..63928.639 rows=100 loops=1)
CTE oneword
-> Nested Loop (cost=16.76..2467.36 rows=44080 width=44) (actual time=1.875..126.778 rows=132851 loops=1)
-> Seq Scan on gc_api_transaction t (cost=0.00..907.80 rows=8816 width=110) (actual time=0.018..4.176 rows=8816 loops=1)
Filter: (company_id = 2)
Rows Removed by Filter: 5648
-> Function Scan on unnest a_2 (cost=16.76..16.89 rows=5 width=32) (actual time=0.010..0.013 rows=15 loops=8816)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 2
SubPlan 1
-> Seq Scan on gc_api_wordcloudexclusion (cost=0.00..15.40 rows=540 width=118) (actual time=1.498..1.500 rows=7 loops=1)
-> Sort (cost=257629.24..257629.74 rows=200 width=40) (actual time=63911.588..63911.594 rows=100 loops=1)
Sort Key: (count(oneword.id)) DESC
Sort Method: top-N heapsort Memory: 36kB
-> HashAggregate (cost=257619.60..257621.60 rows=200 width=40) (actual time=23000.982..63803.962 rows=1194618 loops=1)
Group Key: oneword.word
Batches: 85 Memory Usage: 4265kB Disk Usage: 113344kB
-> Append (cost=0.00..241207.14 rows=3282491 width=36) (actual time=1.879..5443.143 rows=2868282 loops=1)
-> CTE Scan on oneword (cost=0.00..881.60 rows=44080 width=36) (actual time=1.878..579.936 rows=132851 loops=1)
" -> Subquery Scan on ""*SELECT* 2"" (cost=13085.79..223913.09 rows=3238411 width=36) (actual time=2096.116..4698.727 rows=2735431 loops=1)"
-> Merge Join (cost=13085.79..191528.98 rows=3238411 width=44) (actual time=2096.114..4492.451 rows=2735431 loops=1)
Merge Cond: (a_1.id = b.id)
Join Filter: (a_1.* < b.*)
Rows Removed by Join Filter: 2879000
-> Sort (cost=6542.90..6653.10 rows=44080 width=96) (actual time=1088.083..1202.200 rows=132851 loops=1)
Sort Key: a_1.id
Sort Method: external merge Disk: 8512kB
-> CTE Scan on oneword a_1 (cost=0.00..881.60 rows=44080 width=96) (actual time=3.904..101.754 rows=132851 loops=1)
-> Materialize (cost=6542.90..6763.30 rows=44080 width=96) (actual time=1007.989..1348.317 rows=5614422 loops=1)
-> Sort (cost=6542.90..6653.10 rows=44080 width=96) (actual time=1007.984..1116.011 rows=132851 loops=1)
Sort Key: b.id
Sort Method: external merge Disk: 8712kB
-> CTE Scan on oneword b (cost=0.00..881.60 rows=44080 width=96) (actual time=0.014..20.998 rows=132851 loops=1)
Planning Time: 0.537 ms
JIT:
Functions: 49
" Options: Inlining false, Optimization false, Expressions true, Deforming true"
" Timing: Generation 6.119 ms, Inlining 0.000 ms, Optimization 2.416 ms, Emission 17.764 ms, Total 26.299 ms"
Execution Time: 63945.718 ms
PostgreSQL version: PostgreSQL 14.5 (Debian 14.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit