Amazon Redshift Time Series -
for simplicities sake lets have array following arr[time]=[price,asset] (prices may repeat same asset 12 times per second).
i able view data @ different scales e.g. 5 years 1 sample per 1 hour, 2.5 years 1 sample per 0.5 hours being able access specific ticks in between points.
would more economical/convenient use multiple tables in dynamodb or single database in redshift? schema/tables use redshift?
edit: have entities looking {'time':1505128343,'price':3.141708,'asset':1} when searching between time periods where time>=1504118343 , time<=1505128343 don't want output every single price large ranges (so client doesn't crash) instead output every nth price.
in implementation have multiple tables different scales once range big enough fallback, if when have series of tables prices, prices30s, prices1h, prices1day etc prices30s table have series sampled once per 30 seconds.
if want access strictly every nth record can use window function provides row number according sorting condition can quite bad in terms of performance. work first price every 60 sec interval:
with sorted_entities ( select time ,price ,asset ,row_number() on (partition asset,time/60 order time) entities_table ) select time, price, asset sorted_entities row_number=1 that query break table rows small groups using asset id , integer division of unix timestamp column , give every row 1,2,3 etc. based on precise time, can further select rows number 1. if table's sort key asset,time work faster because rows same asset closer each other on disk.
if accuracy of sampling not priority can easy this:
select * entities_table random()<0.0006944 where 0.0006944 1/24/60 (one day considered 100% probability, , dividing 24 hours , 60 minutes probability of 1 minute). output 1/1440th of rows randomly , taking number of rows, you'll semi-evenly distributed set of values on time.
another way calculate average advised above.
Comments
Post a Comment