r - Find overlapping dates for each ID and create a new row for the overlap -
i find overlapping dates each id , create new row overlapping dates , combine characters (char) lines. possible data have >2 overlaps , need >2 combinations of characters. eg. erm
data:
id date1 date2 char 15 2003-04-05 2003-05-06 e 15 2003-04-20 2003-06-20 r 16 2001-01-02 2002-03-04 m 17 2003-03-05 2007-02-22 17 2005-04-15 2014-05-19 c 17 2007-05-15 2008-02-05 17 2008-02-05 2012-02-14 m 17 2010-06-07 2011-02-14 v 17 2010-09-22 2014-05-19 p 17 2012-02-28 2013-03-04 r output like:
id date1 date2 char 15 2003-04-05 2003-04-20 e 15 2003-04-20 2003-05-06 er 15 2003-05-06 2003-06-20 r 16 2001-01-02 2002-03-04 m 17 2003-03-05 2005-04-15 17 2005-04-15 2007-02-22 ic 17 2005-04-15 2007-05-15 c 17 2007-05-15 2008-02-05 ci 17 2008-02-05 2012-02-14 cm 17 2010-06-07 2011-02-14 cv 17 2010-09-22 2014-05-19 cp 17 2012-02-28 2013-03-04 cr 17 2014-05-19 2014-05-19 p 17 2010-06-07 2012-02-14 mv 17 2010-09-22 2011-02-14 vp 17 2012-02-28 2013-03-04 rp what have tried: have tried using subtracting date 2 current row row below using:
df$diff <- c(na,df[2:nrow(tdf), "date1"] - df[1:(nrow(df)-1), "date2"]) then determine overlaps between rows:
df$overlap[which(df$diff<1)] <-1 df$overlap.up <- c(df$overlap[2:(nrow(df))], "na") df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1 i selected had overlap.final==1 , put them dataframe , found overlaps each id.
however, have realized way simplistic , flawed, because selects overlaps occur sequentially (using difference in dates in first step). need take series of dates each id , loop through each combination determine if there overlap , then, if so, record start , end date , create new character “char” signalling combined during 2 dates. think need loop this.
i tried create loop find overlap intervals between date1 , date 2
df <- df[which(!duplicated(df$ id)),] (i in 1:nrow(df)) { tmp <- length(which(df $id[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2& df$date2[i]> df$date1))) >0 df$int[i]<- tmp } however not work.
after identifying overlapping intervals, need create new rows each new start , end date , new character represents overlap.
another version of loop have tried identify overlaps:
for (i in 1:nrow(df)) { if (df$id[i]==ids$id){ tmp <- length(df, df$ id[i]==ids$ & (df$date1[i]> df$date1 & df$date1 [i]< df$date2 | df$date2[i] < df$date2 & df$date2[i]> df$date1)) >0 df$int[i]<- tmp } }
first, create data.table of possible intervals each id.
all possible intervals means take start , end dates of id , combine them in sorted vector tmp. unique values indicate possible intersections (or breaks) of given intervals of id on time axis. later joining, breaks re-arranged in 1 interval per row start , end column:
library(data.table) options(datatable.print.class = true) breaks <- dt[, { tmp <- unique(sort(c(date1, date2))) .(start = head(tmp, -1l), end = tail(tmp, -1l)) }, = id] breaks id start end <int> <idat> <idat> 1: 15 2003-04-05 2003-04-20 2: 15 2003-04-20 2003-05-06 3: 15 2003-05-06 2003-06-20 4: 16 2001-01-02 2002-03-04 5: 17 2003-03-05 2005-04-15 6: 17 2005-04-15 2007-02-22 7: 17 2007-02-22 2007-05-15 8: 17 2007-05-15 2008-02-05 9: 17 2008-02-05 2010-06-07 10: 17 2010-06-07 2010-09-22 11: 17 2010-09-22 2011-02-14 12: 17 2011-02-14 2012-02-14 13: 17 2012-02-14 2012-02-28 14: 17 2012-02-28 2013-03-04 15: 17 2013-03-04 2014-05-19
then, non-equi join performed whereby values aggregated simultaneously on join conditions:
dt[breaks, on = .(id, date1 <= start, date2 >= end), paste(char, collapse = ""), = .eachi, allow.cartesian = true] id date1 date2 v1 <int> <idat> <idat> <char> 1: 15 2003-04-05 2003-04-20 e 2: 15 2003-04-20 2003-05-06 er 3: 15 2003-05-06 2003-06-20 r 4: 16 2001-01-02 2002-03-04 m 5: 17 2003-03-05 2005-04-15 6: 17 2005-04-15 2007-02-22 ic 7: 17 2007-02-22 2007-05-15 c 8: 17 2007-05-15 2008-02-05 ci 9: 17 2008-02-05 2010-06-07 cm 10: 17 2010-06-07 2010-09-22 cmv 11: 17 2010-09-22 2011-02-14 cmvp 12: 17 2011-02-14 2012-02-14 cmp 13: 17 2012-02-14 2012-02-28 cp 14: 17 2012-02-28 2013-03-04 cpr 15: 17 2013-03-04 2014-05-19 cp
the result differs expected result posted op plotting data convinces above result shows possible overlaps:
library(ggplot2) ggplot(dt) + aes(y = char, yend = char, x = date1, xend = date2) + geom_segment() + facet_wrap("id", ncol = 1l) data
library(data.table) dt <- fread( "id date1 date2 char 15 2003-04-05 2003-05-06 e 15 2003-04-20 2003-06-20 r 16 2001-01-02 2002-03-04 m 17 2003-03-05 2007-02-22 17 2005-04-15 2014-05-19 c 17 2007-05-15 2008-02-05 17 2008-02-05 2012-02-14 m 17 2010-06-07 2011-02-14 v 17 2010-09-22 2014-05-19 p 17 2012-02-28 2013-03-04 r" ) cols <- c("date1", "date2") dt[, (cols) := lapply(.sd, as.idate), .sdcols = cols] 
Comments
Post a Comment