r - How to use fread() as readLines() without auto column detection?


I have a 5 GB .dat file (more than 10 million lines). Each line has a format like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly when reading a big file, I chose fread() to read it, but an error occurred:

library("data.table")
x <- fread("test.dat")
Error in fread("test.dat") :
  Expecting 5 cols, but line 5 contains text after processing all 5 cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.dat") :
  Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?

Here's a trick. You can use a sep value that you know is not in the file. Doing that forces fread() to read each whole line as a single column. You can then drop that column to an atomic vector (shown as [[1L]] below). Here's an example on a CSV file where I use ? as the sep. This way it acts like readLines(), only a lot faster.

f <- fread("batting.csv", sep = "?", header = FALSE)[[1L]]
head(f)
# [1] "playerid,yearid,stint,teamid,lgid,g,ab,r,h,2b,3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp"
# [2] "abercda01,1871,1,tro,na,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,rc1,na,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,cl1,na,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,ws3,na,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,rc1,na,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

Other uncommon characters you can try as the sep are \ ^ @ # = and others. We can see that this produces the same output as readLines(). It's just a matter of finding a sep value that is not present in the file.

head(readLines("batting.csv"))
# [1] "playerid,yearid,stint,teamid,lgid,g,ab,r,h,2b,3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp"
# [2] "abercda01,1871,1,tro,na,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,rc1,na,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,cl1,na,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,ws3,na,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,rc1,na,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
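
One way to pick a sep value that is absent from the file is to test a few candidates against a sample of lines first. The sketch below is only an illustration: the helper name unused_seps, its candidate list, and the demo file are assumptions, not part of the original answer. It checks just the first n lines, so a character it reports as unused could still appear later in a large file.

```r
# Hypothetical helper: return the candidate characters that never occur
# in the first n lines of the file at `path`.
unused_seps <- function(path,
                        candidates = c("?", "^", "@", "#", "=", "\\"),
                        n = 1000L) {
  sample_lines <- readLines(path, n = n)
  # fixed = TRUE matches each candidate literally instead of as a regex
  present <- vapply(candidates,
                    function(ch) any(grepl(ch, sample_lines, fixed = TRUE)),
                    logical(1))
  candidates[!present]
}

# Small demo file (made up for illustration): it contains "?" and "="
demo_path <- tempfile(fileext = ".dat")
writeLines(c("a?b", "c=d"), demo_path)

safe <- unused_seps(demo_path)
print(safe)
```

Any character returned here is a reasonable first guess for sep. If fread() still splits some line, the character occurs further down the file and another candidate is needed.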

Note: As @cath has mentioned in the comments, you could also simply use the line break character \n as the sep value.
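
A minimal sketch of the \n variant follows, on a small temporary file made up for the demonstration. The extra arguments quote = "", strip.white = FALSE and colClasses = "character" are my additions, not part of the original answer. They are meant to stop fread() from interpreting quotes, trimming whitespace, or converting numeric-looking lines, so the result stays closer to what readLines() returns.

```r
library(data.table)

# Demo file (made up): lines with spaces and no consistent delimiter
demo_path <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk"),
           demo_path)

# With sep = "\n" there is no column separator inside a line,
# so every line is read as one field of a single column.
lines <- fread(demo_path,
               sep = "\n",
               header = FALSE,
               quote = "",                 # treat quote characters as plain text
               strip.white = FALSE,        # keep leading/trailing spaces
               colClasses = "character")[[1L]]

print(lines)
```

With this approach there is no need to search for a character that is missing from the file, because a line break can never occur inside a single line.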

