r - How to use fread() as readLines() without auto column detection?

June 15, 2013

I have a 5 GB .dat file (> 10 million lines). Each line has a format like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly on a file this large, I chose fread() to read it, but an error occurred:

    library("data.table")
    x <- fread("test.dat")
    Error in fread("test.dat") :
      Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
    In addition: Warning message:
    In fread("test.dat") :
      Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?

Here's a trick. Use a sep value that you know is not in the file. Doing that forces fread() to read each whole line as a single column. You can then drop that column to an atomic vector (shown as [[1L]] below). Here's an example on a csv file where I use ? as the sep. This way it acts much like readLines(), only a lot faster.

    f <- fread("batting.csv", sep = "?", header = FALSE)[[1L]]
    head(f)
    # [1] "playerid,yearid,stint,teamid,lgid,g,ab,r,h,2b,3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp"
    # [2] "abercda01,1871,1,tro,na,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
    # [3] "addybo01,1871,1,rc1,na,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
    # [4] "allisar01,1871,1,cl1,na,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
    # [5] "allisdo01,1871,1,ws3,na,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
    # [6] "ansonca01,1871,1,rc1,na,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

Other uncommon characters you can try for sep are \ ^ @ # = and others. We can see that this produces the same output as readLines(). It's just a matter of finding a sep value that is not present in the file.

    head(readLines("batting.csv"))
    # [1] "playerid,yearid,stint,teamid,lgid,g,ab,r,h,2b,3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp"
    # [2] "abercda01,1871,1,tro,na,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
    # [3] "addybo01,1871,1,rc1,na,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
    # [4] "allisar01,1871,1,cl1,na,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
    # [5] "allisdo01,1871,1,ws3,na,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
    # [6] "ansonca01,1871,1,rc1,na,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
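
Finding a sep value that is absent from the file could be sketched roughly as follows. The helper pick_unused_sep() is hypothetical (it is not part of data.table), the temporary file is invented for the illustration, and the check only looks at the first n lines, so a character that first appears later in a large file would slip through.

```r
library(data.table)

# Hypothetical helper (not part of data.table): return the first candidate
# character that does not occur in any of the first `n` lines of the file.
pick_unused_sep <- function(path,
                            candidates = c("?", "^", "@", "#", "="),
                            n = 1000L) {
  sample_lines <- readLines(path, n = n)
  for (ch in candidates) {
    # fixed = TRUE treats the candidate literally, not as a regex
    if (!any(grepl(ch, sample_lines, fixed = TRUE))) return(ch)
  }
  stop("every candidate separator occurs in the sampled lines")
}

# Small temporary file standing in for the real data (made up for this sketch).
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkk",
             "what? a question mark",
             "x=y"), tmp)

# "?" occurs in the sample, so the helper moves on to the next unused candidate.
sep_char <- pick_unused_sep(tmp)

# Read every line as a single column, then drop it to a character vector.
lines <- fread(tmp, sep = sep_char, header = FALSE)[[1L]]

# Compare against readLines() on the same file.
identical(lines, readLines(tmp))
```

On a multi-gigabyte file the sample-based check is only a heuristic. Comparing length(lines) with the expected line count afterwards is a cheap way to notice that something was split or merged.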

Note: As @Cath has mentioned in the comments, you could also simply use the line break character \n as the sep value.
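
A minimal sketch of the \n variant, run on a small temporary file invented for the illustration. The colClasses = "character" argument is an addition here, not part of the original answer; it is meant to stop lines that look numeric from being converted. Handling of blank lines, quotes and surrounding whitespace may differ from readLines(), so it is worth checking on a sample of the real file first.

```r
library(data.table)

# Temporary file with lines shaped like the ones in the question,
# plus one line that looks numeric.
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk",
             "12345"), tmp)

# sep = "\n" asks fread() not to split lines into columns at all;
# colClasses keeps the single column as character.
lines <- fread(tmp, sep = "\n", header = FALSE,
               colClasses = "character")[[1L]]

# Compare against readLines() on the same file.
identical(lines, readLines(tmp))
```

The fread() documentation in recent data.table versions also describes sep = "" (or NULL) as meaning "no separator", with each line read as a single character column in the way readLines() does. Whether that is available depends on the installed version.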
