R - How to use fread() as readLines() without auto column detection?
I have a 5 GB .dat file (more than 10 million lines). Each line has a format like
    aaaa bb cccc0123 xxx kkkkkkkkkkkkkk
or
    aaaaabbbcccc01234xxxkkkkkkkkkkkkkk
for example. Because readLines() performs poorly when reading a big file, I chose fread() to read it, but an error occurred:
    library("data.table")
    x <- fread("test.dat")
    Error in fread("test.dat") :
      Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
    In addition: Warning message:
    In fread("test.dat") :
      Unable to find 5 lines with expected number of columns (+ middle)
How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?
Here's a trick. You can use a sep value that you know is not in the file. Doing that forces fread() to read each whole line as a single column. You can then drop that column to an atomic vector (shown as [[1L]] below). Here's an example on a csv where I use ? as the sep. This way it acts similarly to readLines(), only a lot faster.
    f <- fread("batting.csv", sep = "?", header = FALSE)[[1L]]
    head(f)
    # [1] "playerid,yearid,stint,teamid,lgid,g,ab,r,h,2b,3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp"
    # [2] "abercda01,1871,1,tro,na,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
    # [3] "addybo01,1871,1,rc1,na,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
    # [4] "allisar01,1871,1,cl1,na,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
    # [5] "allisdo01,1871,1,ws3,na,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
    # [6] "ansonca01,1871,1,rc1,na,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
Other uncommon characters you can try in sep are \ ^ @ # = and others. We can see that this produces the same output as readLines(). It's just a matter of finding a sep value that is not present in the file.
    head(readLines("batting.csv"))
    # [1] "playerid,yearid,stint,teamid,lgid,g,ab,r,h,2b,3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp"
    # [2] "abercda01,1871,1,tro,na,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
    # [3] "addybo01,1871,1,rc1,na,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
    # [4] "allisar01,1871,1,cl1,na,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
    # [5] "allisdo01,1871,1,ws3,na,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
    # [6] "ansonca01,1871,1,rc1,na,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
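One possible way to look for a sep value that is not present in the file is to test a few candidate characters against a sample of lines before calling fread(). The sketch below is only an illustration: pick_unused_sep() is a hypothetical helper (not part of data.table), the demo file contents are made up, and because it only inspects the first n lines it cannot guarantee that the character is absent from the rest of a large file.

```r
# Hypothetical helper: return the first candidate character that does not
# occur in the first `n` lines of the file, or NA if every candidate occurs.
pick_unused_sep <- function(path, candidates = c("?", "^", "@", "#", "="), n = 10000L) {
  sample_lines <- readLines(path, n = n, warn = FALSE)
  for (ch in candidates) {
    # fixed = TRUE treats `ch` as a literal character, not a regular expression
    if (!any(grepl(ch, sample_lines, fixed = TRUE))) {
      return(ch)
    }
  }
  NA_character_
}

# Demo on a small temporary file (made-up contents) in which "?" occurs
demo_path <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx ?kkkk",
             "aaaaabbbcccc01234xxxkkkk"), demo_path)
chosen_sep <- pick_unused_sep(demo_path)
```

The returned character could then be passed as the sep argument of fread(). If the helper returns NA, none of the candidates was free in the sample and another character would have to be tried.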
Note: As @cath has mentioned in the comments, you could also simply use the line break character \n as the sep value.
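As a rough sketch of the \n variant, the example below writes a tiny temporary file in the question's line format and compares the fread() result with readLines(). The file contents and variable names are made up for illustration. The extra arguments colClasses, quote and strip.white are my own additions, not from the answer, and are meant to keep fread() from guessing types, interpreting quotes, or trimming whitespace. How this behaves on the real 5 GB file is untested here.

```r
library(data.table)

# Small temporary file standing in for the real data (contents are made up)
demo_path <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk",
             "bbbb cc dddd4567 yyy llllllllllllll"), demo_path)

# sep = "\n" makes each line a single field; [[1L]] drops the one-column
# data.table down to a plain character vector
lines_fread <- fread(demo_path, sep = "\n", header = FALSE,
                     colClasses = "character", quote = "",
                     strip.white = FALSE)[[1L]]

# Reference result from base R for comparison
lines_base <- readLines(demo_path)
same <- identical(lines_fread, lines_base)
```

On real data it may be worth running the same comparison on the first few thousand lines before reading the whole file.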