text processing - R: Cleaning and reordering names/serial numbers in a data frame
Let's say I have a data frame as follows in R:
    data <- data.frame("serialnum" = character(), "year" = integer(), "name" = character(), stringsAsFactors = FALSE)
    data[1,] <- c("983\n837\n424\n ", 2015, "michael\nlewis\npaul\n ")
    data[2,] <- c("123\n456\n789\n136", 2014, "elaine\njerry\ngeorge\nkramer")
    data[3,] <- c("987\n654\n321\n975\n ", 2010, "john\npaul\ngeorge\nringo\nna")
    data[4,] <- c("424\n983\n837", 2015, "paul\nmichael\nlewis")
    data[5,] <- c("456\n789\n123\n136", 2014, "jerry\ngeorge\nelaine\nkramer")
What I want to do is the following:
- Split each string of names and each string of serial numbers into their own vectors (or a list of string vectors).
- Eliminate any "na" entries in either set of vectors, as well as the blank entries left by a trailing "...\n ".
- Reorder each list of names alphabetically, and reorder the corresponding serial numbers according to the same permutation.
- Concatenate each vector back together in the same fashion (i.e. paste(., collapse = "\n")).
My issue is how to do this without using a loop. Is there a more object-oriented way to do this? My first attempt in this direction was to make a list with the command list <- strsplit(data$name, split = "\n"), but here I still need a loop in order to find the permutations of the names, and it seems this process won't scale to my actual data. Additionally, once I make the list, I'm not sure how to go about removing the "na" symbols or the blank spaces. Any help is appreciated!
Using lapply, we take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
    seinfeld = lapply(1:nrow(data), function(i) {

      # Turn the strings into a data frame with one name per row
      dat = data.frame(serialnum=unlist(strsplit(data[i,"serialnum"], split="\n")),
                       year=data[i,"year"],
                       name=unlist(strsplit(data[i,"name"], split="\n")))

      # Get rid of empty strings and "na" values
      dat = dat[!(dat$name %in% c(""," ","na")), ]

      # Order alphabetically by name
      dat = dat[order(dat$name), ]

      # Return the cleaned, sorted data frame for this row
      dat
    })
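To make the reordering step concrete, here is a minimal sketch of what happens to a single row, using the strings from row 1 of the example data. The variable names (serials, person_names, keep, perm) are only illustrative and are not part of the answer's code; the sketch also uses trimws() to catch blank entries, which is a small variation on the %in% c("", " ", "na") test above.

```r
# One row's worth of data, copied from row 1 of the example data frame
serials      <- strsplit("983\n837\n424\n ", split = "\n", fixed = TRUE)[[1]]
person_names <- strsplit("michael\nlewis\npaul\n ", split = "\n", fixed = TRUE)[[1]]

# Drop placeholder entries: empty or whitespace-only strings and the literal "na"
keep <- !(trimws(person_names) %in% c("", "na"))
serials      <- serials[keep]
person_names <- person_names[keep]

# order() returns the permutation that sorts the names;
# indexing both vectors with it keeps each serial number next to its name
perm <- order(person_names)
sorted_names   <- person_names[perm]
sorted_serials <- serials[perm]
```

The key point is that order() gives back index positions rather than sorted values, so the same perm can be reused on any vector of the same length. This assumes the serial numbers and names line up one-to-one after splitting, as they do in the example data.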
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
    seinfeld = lapply(1:nrow(data), function(i) {

      # Turn the strings into a data frame with one name per row
      dat = data.frame(serialnum=unlist(strsplit(data[i,"serialnum"], split="\n")),
                       name=unlist(strsplit(data[i,"name"], split="\n")))

      # Get rid of empty strings and "na" values
      dat = dat[!(dat$name %in% c(""," ","na")), ]

      # Order alphabetically by name
      dat = dat[order(dat$name), ]

      # Collapse back into a single row with the new sort order
      dat = data.frame(serialnum=paste(dat[, "serialnum"], collapse="\n"),
                       year=data[i, "year"],
                       name=paste(dat[, "name"], collapse="\n"))
    })

    do.call(rbind, seinfeld)

              serialnum year                         name
    1     837\n983\n424 2015        lewis\nmichael\npaul
    2 123\n789\n456\n136 2014 elaine\ngeorge\njerry\nkramer
    3 321\n987\n654\n975 2010   george\njohn\npaul\nringo
    4     837\n983\n424 2015        lewis\nmichael\npaul
    5 123\n789\n456\n136 2014 elaine\ngeorge\njerry\nkramer
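As a possible variation (not the approach used above), the same clean-and-sort logic can be wrapped in a small helper and applied to the two columns in parallel with mapply(), which avoids indexing rows by i. The helper name sort_pair and the two-row stand-in data frame below are assumptions made for illustration. Note that mapply() still iterates internally, so this is mainly a readability change rather than a guaranteed speed-up.

```r
# Small stand-in for the question's data frame (rows 1 and 3 of the example)
data <- data.frame(
  serialnum = c("983\n837\n424\n ", "987\n654\n321\n975\n "),
  year = c(2015, 2010),
  name = c("michael\nlewis\npaul\n ", "john\npaul\ngeorge\nringo\nna"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: clean and reorder one (serialnum, name) pair of strings.
# Assumes both strings split into the same number of pieces.
sort_pair <- function(serial_str, name_str) {
  serials <- strsplit(serial_str, "\n", fixed = TRUE)[[1]]
  nms     <- strsplit(name_str, "\n", fixed = TRUE)[[1]]
  keep <- !(trimws(nms) %in% c("", "na"))   # drop blanks and "na"
  perm <- order(nms[keep])                  # permutation that sorts the names
  c(serialnum = paste(serials[keep][perm], collapse = "\n"),
    name      = paste(nms[keep][perm], collapse = "\n"))
}

# mapply() walks the two columns in parallel; with two named values returned
# per row, the result simplifies to a 2 x nrow(data) character matrix
sorted <- mapply(sort_pair, data$serialnum, data$name, USE.NAMES = FALSE)

result <- data
result$serialnum <- sorted["serialnum", ]
result$name      <- sorted["name", ]
```

Here result keeps the original year column untouched and only overwrites the two string columns, so no rbind step is needed.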