R: rolling self-joins on subsets in data.table -
here mwe of problem.
data:
library(data.table) #dates in %y-%m-%d df <- data.table(date=as.date(c("2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02")), dtm=c(18l, 18l, 18l, 18l, 18l, 18l,46l,46l,74l, 74l,74l,74l,165l, 165l,165l,165l), cval=c(1275l, 1300l, 1300l, 1320l, 1325l, 1325l, 1300l, 1300l, 1300l, 1300l, 1325l, 1325l, 1300l, 1300l, 1325l, 1325l), price_in=c(24.125, 24.625, 35.750, 16.250, 14.500, 50.250, 43.625, 49.125, 58.250, 58.250, 45.375, 70.125, 90.750, 74.750, 77.875, 85.500), price_out=c(26.125, 26.625, 36.625, 17.500, 15.500, 52.250, 45.625, 51.125, 60.000, 60.250, 47.375, 72.125, 92.750, 76.750, 79.875, 87.500), type=c("p", "c", "p", "c", "c", "p", "c", "p", "c", "p", "c", "p", "c", "p", "c", "p")) df date dtm cval price_in price_out type 1: 2001-01-02 18 1275 24.125 26.125 p 2: 2001-01-02 18 1300 24.625 26.625 c 3: 2001-01-02 18 1300 35.750 36.625 p 4: 2001-01-02 18 1320 16.250 17.500 c 5: 2001-01-02 18 1325 14.500 15.500 c 6: 2001-01-02 18 1325 50.250 52.250 p 7: 2001-01-02 46 1300 43.625 45.625 c 8: 2001-01-02 46 1300 49.125 51.125 p 9: 2001-01-02 74 1300 58.250 60.000 c 10: 2001-01-02 74 1300 58.250 60.250 p 11: 2001-01-02 74 1325 45.375 47.375 c 12: 2001-01-02 74 1325 70.125 72.125 p 13: 2001-01-02 165 1300 90.750 92.750 c 14: 2001-01-02 165 1300 74.750 76.750 p 15: 2001-01-02 165 1325 77.875 79.875 c 16: 2001-01-02 165 1325 85.500 87.500 p
what want do:
- for each date, want obtain items, divided in type
p
's ,c
's have samedtm
largercval
. second item in example data set, be:
date dtm cval price_in price_out type 2001-01-02 18 1300 24.625 26.625 c #the item 2001-01-02 18 1320 16.250 17.500 c #same dtm, higher cval 2001-01-02 18 1325 14.500 15.500 c #same dtm, higher cval
- now, let
cval1
cval
of current item, i.e. herecval1 = 1300
,cval2
largercval
's of items in subset, i.e. herecval2 = c(1320l, 1325l)
. then, want apply custom exclusion function, example let'sprice_in[cval %in% cval2]-price_out[cval==cval1]-0.5*(cval1-cval2) < 0
- i want exclude item pairs exclusion function returned
true
.
similarly (same procedure, different exclusion criteria) applies p
items.
expected output: original data.table, df
, minus rows excluded in procedure described above. example, using example function above evaluating items 2 , 4 returns true: 16.25-26.625-0.5*(1300-1320) = -0.375 < 0
. thus, expected output df
without rows 2 , 4 (note pair 2 , 5 not return true: 14.5-26.625-0.5*(1300-1325) = 0.375 >= 0
, hence 5 not excluded):
date dtm cval price_in price_out type 1: 2001-01-02 18 1275 24.125 26.125 p 3: 2001-01-02 18 1300 35.750 36.625 p 5: 2001-01-02 18 1325 14.500 15.500 c 6: 2001-01-02 18 1325 50.250 52.250 p 7: 2001-01-02 46 1300 43.625 45.625 c 8: 2001-01-02 46 1300 49.125 51.125 p ... ... ... ...
and on. obviously, in case of items 7 , 8, if there no other item same characteristics (same date, dtm , type), cannot excluded.
what have tried far:
- i have created id each item, i.e.
df[,id:=seq_along(date)]
, iterated viafor loop
through dates , used vectors check custom functions. if result vector containedtrue
's, removed corresponding indices data.table. clearly, approach works runs forever given size of data. - i experimenting data.table methods, because of extreme speed advantages. if understand correctly, want many rolling self-joins each
date/dtm
subset, along lines"df[df,roll=inf,by=.(date,dtm)]"
(since total, rolling self-join, believe, not applicable in case). don't quite work.
question: there way of implementing exclusion procedure via data.table methods? possibly (but not necessarily) via multiple rolling self-joins?
any highly appreciated!
Comments
Post a Comment