r - Find overlapping dates for each ID and create a new row for the overlap -


i find overlapping dates each id , create new row overlapping dates , combine characters (char) lines. possible data have >2 overlaps , need >2 combinations of characters. eg. erm

data:

id    date1         date2       char 15  2003-04-05  2003-05-06      e 15  2003-04-20  2003-06-20      r 16  2001-01-02  2002-03-04      m 17  2003-03-05  2007-02-22         17  2005-04-15  2014-05-19      c 17  2007-05-15  2008-02-05      17  2008-02-05  2012-02-14      m 17  2010-06-07  2011-02-14      v 17  2010-09-22  2014-05-19      p 17  2012-02-28  2013-03-04      r 

output like:

id  date1       date2           char 15  2003-04-05  2003-04-20      e 15  2003-04-20  2003-05-06      er 15  2003-05-06  2003-06-20      r 16  2001-01-02  2002-03-04      m 17  2003-03-05  2005-04-15      17  2005-04-15  2007-02-22      ic 17  2005-04-15  2007-05-15      c    17  2007-05-15  2008-02-05      ci 17  2008-02-05  2012-02-14      cm 17  2010-06-07  2011-02-14      cv 17  2010-09-22  2014-05-19      cp 17  2012-02-28  2013-03-04      cr 17  2014-05-19  2014-05-19      p 17  2010-06-07  2012-02-14      mv 17  2010-09-22  2011-02-14      vp 17  2012-02-28  2013-03-04      rp 

what have tried: have tried using subtracting date 2 current row row below using:

df$diff <- c(na,df[2:nrow(tdf), "date1"] - df[1:(nrow(df)-1), "date2"]) 

then determine overlaps between rows:

df$overlap[which(df$diff<1)] <-1 df$overlap.up <- c(df$overlap[2:(nrow(df))], "na") df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1 

i selected had overlap.final==1 , put them dataframe , found overlaps each id.

however, have realized way simplistic , flawed, because selects overlaps occur sequentially (using difference in dates in first step). need take series of dates each id , loop through each combination determine if there overlap , then, if so, record start , end date , create new character “char” signalling combined during 2 dates. think need loop this.

i tried create loop find overlap intervals between date1 , date 2

df <- df[which(!duplicated(df$ id)),]  (i in 1:nrow(df)) {   tmp <- length(which(df $id[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2&  df$date2[i]> df$date1))) >0   df$int[i]<- tmp  } 

however not work.

after identifying overlapping intervals, need create new rows each new start , end date , new character represents overlap.

another version of loop have tried identify overlaps:

for (i in 1:nrow(df)) {   if (df$id[i]==ids$id){   tmp <- length(df, df$ id[i]==ids$ & (df$date1[i]> df$date1 & df$date1 [i]< df$date2 | df$date2[i] < df$date2 &  df$date2[i]> df$date1)) >0   df$int[i]<- tmp   } } 

first, create data.table of possible intervals each id.

all possible intervals means take start , end dates of id , combine them in sorted vector tmp. unique values indicate possible intersections (or breaks) of given intervals of id on time axis. later joining, breaks re-arranged in 1 interval per row start , end column:

library(data.table) options(datatable.print.class = true) breaks <- dt[, {   tmp <- unique(sort(c(date1, date2)))   .(start = head(tmp, -1l), end = tail(tmp, -1l))   }, = id] breaks 
       id      start        end     <int>     <idat>     <idat>  1:    15 2003-04-05 2003-04-20  2:    15 2003-04-20 2003-05-06  3:    15 2003-05-06 2003-06-20  4:    16 2001-01-02 2002-03-04  5:    17 2003-03-05 2005-04-15  6:    17 2005-04-15 2007-02-22  7:    17 2007-02-22 2007-05-15  8:    17 2007-05-15 2008-02-05  9:    17 2008-02-05 2010-06-07 10:    17 2010-06-07 2010-09-22 11:    17 2010-09-22 2011-02-14 12:    17 2011-02-14 2012-02-14 13:    17 2012-02-14 2012-02-28 14:    17 2012-02-28 2013-03-04 15:    17 2013-03-04 2014-05-19 

then, non-equi join performed whereby values aggregated simultaneously on join conditions:

dt[breaks, on = .(id, date1 <= start, date2 >= end), paste(char, collapse = ""),      = .eachi, allow.cartesian = true] 
       id      date1      date2     v1     <int>     <idat>     <idat> <char>  1:    15 2003-04-05 2003-04-20      e  2:    15 2003-04-20 2003-05-06     er  3:    15 2003-05-06 2003-06-20      r  4:    16 2001-01-02 2002-03-04      m  5:    17 2003-03-05 2005-04-15       6:    17 2005-04-15 2007-02-22     ic  7:    17 2007-02-22 2007-05-15      c  8:    17 2007-05-15 2008-02-05     ci  9:    17 2008-02-05 2010-06-07     cm 10:    17 2010-06-07 2010-09-22    cmv 11:    17 2010-09-22 2011-02-14   cmvp 12:    17 2011-02-14 2012-02-14    cmp 13:    17 2012-02-14 2012-02-28     cp 14:    17 2012-02-28 2013-03-04    cpr 15:    17 2013-03-04 2014-05-19     cp 

the result differs expected result posted op plotting data convinces above result shows possible overlaps:

library(ggplot2) ggplot(dt) + aes(y = char, yend = char, x = date1, xend = date2) +    geom_segment() + facet_wrap("id", ncol = 1l)  

enter image description here

data

library(data.table) dt <- fread(   "id    date1         date2       char 15  2003-04-05  2003-05-06      e 15  2003-04-20  2003-06-20      r 16  2001-01-02  2002-03-04      m 17  2003-03-05  2007-02-22         17  2005-04-15  2014-05-19      c 17  2007-05-15  2008-02-05      17  2008-02-05  2012-02-14      m 17  2010-06-07  2011-02-14      v 17  2010-09-22  2014-05-19      p 17  2012-02-28  2013-03-04      r" ) cols <- c("date1", "date2") dt[, (cols) := lapply(.sd, as.idate), .sdcols = cols] 

Comments