amazon web services - Redshift query taking too much time
In Redshift, queries are taking too much time to execute. Some queries keep on running or are aborted after some time.
I have very limited knowledge of Redshift, and I am finding it difficult to understand the query plan well enough to optimise the query.
I am sharing one of the queries we run, along with its query plan. The query takes 20 seconds to execute.
query
select date_trunc('day', ti) as date, count(distinct deviceid) as count
from live_events
where brandid = 3927
  and ti >= '2017-08-02T00:00:00+00:00'
  and ti <= '2017-09-02T00:00:00+00:00'
group by 1
primary key
brandid
interleaved sort keys
We have set the following columns as interleaved sort keys:
brandid, ti, event_name
query plan
You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here are some ways you could improve performance:
more nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
sortkey
For the right type of query, the sortkey can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.
For example, your query has where brandid = 3927, so having brandid as the sortkey would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to vacuum. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandid, ti or ti, brandid. It will be much more efficient.
Sortkeys are typically a date column, since dates are often found in a where clause and the table will be automatically sorted if data is always appended in time order.
The interleaved sort would be causing Redshift to read many more disk blocks to find your data, thereby increasing query time.
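As a rough sketch of switching to a compound sort key, one option is to rebuild the table with a create-table-as statement and then swap the names. The table names live_events_compound and live_events_interleaved are made up for illustration, and this assumes you have enough free disk space for a second copy of the data:

```sql
-- Build a copy of the table sorted by a compound key of (brandid, ti).
-- live_events_compound is a hypothetical name chosen for this example.
create table live_events_compound
compound sortkey (brandid, ti)
as
select * from live_events;

-- Once the copy has been checked, swap the tables.
-- Keep the old one around until you are happy with the results.
alter table live_events rename to live_events_interleaved;
alter table live_events_compound rename to live_events;
```

Note that a table created this way does not carry over the primary key declaration, so add it again if you rely on it.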
distkey
The distkey should typically be set to the field that is most used in a join statement on the table. This is because data relating to the same distkey value is stored on the same slice. This won't have such a large impact on a single-node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a distkey. Based on this query alone, I would recommend diststyle even so that all slices participate in the query. (It is also the default distribution if no specific distkey is selected.) Alternatively, set the distkey to a field not shown, but certainly don't use brandid as the distkey, otherwise only one slice will participate in the query shown.
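To see how the table is currently distributed, something like the following could be used. It reads the svv_table_info and stv_slices system views, and it assumes the table is visible to the user running the query:

```sql
-- Current distribution style, first sort key column and row skew.
-- A skew_rows value far above 1 suggests rows are unevenly spread across slices.
select "table", diststyle, sortkey1, tbl_rows, skew_rows
from svv_table_info
where "table" = 'live_events';

-- Number of slices on each node, i.e. how much parallelism is available.
select node, count(*) as slices
from stv_slices
group by node
order by node;
```

If you rebuild the table with create table ... as, adding diststyle even before the as keyword is one way to apply the even distribution suggested above.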
vacuum
Vacuum your tables regularly so that data is stored in sortkey order and deleted data is removed from storage.
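A minimal example, assuming the table is still called live_events:

```sql
-- Re-sort rows and reclaim space left behind by deleted rows.
vacuum full live_events;

-- Only relevant while the table still has an interleaved sort key:
-- re-analyzes the sort key distribution, then runs a full vacuum.
-- This is the slow variant mentioned above.
vacuum reindex live_events;

-- Percentage of rows that are currently unsorted.
select "table", unsorted
from svv_table_info
where "table" = 'live_events';
```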
experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare sortkey and distkey values and choose the settings that perform best. Then, test again in 3 months to see if your queries or data have changed enough to make other settings more efficient.
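One possible way to run such a comparison is sketched below. It assumes a test copy of the table called live_events_compound (a hypothetical name) built with the candidate settings, and a cluster version that supports the result cache setting:

```sql
-- Stop cached results from hiding the real execution time
-- (only applies on clusters that have result caching).
set enable_result_cache_for_session to off;

-- Run the same query against the candidate table.
select date_trunc('day', ti) as date, count(distinct deviceid) as count
from live_events_compound
where brandid = 3927
  and ti >= '2017-08-02T00:00:00+00:00'
  and ti <= '2017-09-02T00:00:00+00:00'
group by 1;

-- Compare elapsed times of recent runs against either table.
select query,
       trim(querytxt) as sql_text,
       datediff(ms, starttime, endtime) as elapsed_ms
from stl_query
where querytxt ilike '%live_events%'
order by starttime desc
limit 10;
```

Run each variant a few times before comparing, since the first execution of a query includes compilation time.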