amazon web services - Redshift Query taking too much time


In Redshift, my queries are taking too much time to execute. They keep on running, or get aborted after some time.

I have limited knowledge of Redshift, and I am finding it difficult to understand the query plan and optimise the query.

I am sharing one of the queries that I run, along with its query plan. The query takes around 20 seconds to execute.

Query

    select date_trunc('day', ti) as date,
           count(distinct deviceid) as count
    from live_events
    where brandid = 3927
      and ti >= '2017-08-02T00:00:00+00:00'
      and ti <= '2017-09-02T00:00:00+00:00'
    group by 1

Primary key
brandid

Interleaved sort keys
I have set the following columns as interleaved sort keys:
brandid, ti, event_name
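
For reference, the setup above corresponds to a table definition roughly like the following. This is only a sketch: the column types are assumptions, and only the key choices come from the description.

    create table live_events (
        brandid    integer,         -- assumed type
        deviceid   varchar(64),     -- assumed type
        event_name varchar(128),    -- assumed type
        ti         timestamp,       -- assumed type
        primary key (brandid)       -- informational only in Redshift
    )
    interleaved sortkey (brandid, ti, event_name);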

Query plan

(query plan screenshots attached in the original question)

You have 126 million rows in the table. It's going to take more than a second on a single dc1.large node.

Here are some ways to improve performance:

More nodes

Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.

SORTKEY

For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.

For example, your query has WHERE brandid = 3927, so having brandid as the SORTKEY would make this extremely efficient because only a few disk blocks would contain data for the one brand.
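
One way to check whether this block skipping is actually happening is to inspect the scan step of the query in SVL_QUERY_SUMMARY. A rough sketch, assuming you run it in the same session immediately after the slow query:

    -- rows_pre_filter much larger than rows means Redshift read
    -- many blocks and discarded most rows after scanning them
    select query, seg, step, rows_pre_filter, rows, is_rrscan
    from svl_query_summary
    where query = pg_last_query_id()
    order by seg, step;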

Interleaved sorting is rarely the best sorting method to use, because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, use a compound sort key of brandid, ti or ti, brandid. It will be much more efficient.
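
Sort keys are part of the table definition, so the usual way to change them is a deep copy into a new table. A sketch, reusing the live_events name from the question (the intermediate table name is illustrative):

    -- deep copy into a table with a compound sort key
    create table live_events_new
        compound sortkey (brandid, ti)
    as
    select * from live_events;

    -- swap the tables once the copy is verified
    alter table live_events rename to live_events_old;
    alter table live_events_new rename to live_events;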

SORTKEYs are typically the date column, since it is often found in the WHERE clause and the table stays automatically sorted if data is always appended in time order.

The interleaved sort is causing Redshift to read many more disk blocks to find your data, thereby increasing the query time.

DISTKEY

The DISTKEY should typically be set to the field that is most commonly used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.

Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTSTYLE EVEN so that all slices participate in the query. (It is the default distribution style if no specific DISTKEY is selected.) Alternatively, set the DISTKEY to a field that is not shown, but certainly don't use brandid as the DISTKEY, otherwise only one slice would participate in the query shown.
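
The distribution style can be set in the same deep copy shown above, for example:

    -- spread rows evenly across all slices so that every
    -- slice participates in the scan
    create table live_events_new
        diststyle even
        compound sortkey (brandid, ti)
    as
    select * from live_events;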

VACUUM

VACUUM your tables regularly so that data is stored in SORTKEY order and deleted data is removed from storage.
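
For example, on the table from the question (VACUUM FULL both re-sorts rows and reclaims space from deleted rows):

    vacuum full live_events;

    -- check how much of the table is unsorted before and after
    select "table", unsorted, tbl_rows
    from svv_table_info
    where "table" = 'live_events';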

Experiment!

The optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform best. Then, test again in three months to see whether your queries or data have changed enough to make other settings more efficient.
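
A sketch of such a test: build two copies of the table with different compound sort keys, run the same query against each, and compare the elapsed times (the table names are illustrative):

    create table live_events_a compound sortkey (brandid, ti)
    as select * from live_events;

    create table live_events_b compound sortkey (ti, brandid)
    as select * from live_events;

    -- run the real query against each copy, then compare timings
    select query, substring, elapsed / 1000000.0 as seconds
    from svl_qlog
    order by starttime desc
    limit 10;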

