Automated encoding repair in R for foreign characters
I have a .csv data frame, df, with 100,000+ rows and 2 columns representing city and country names (scraped with permission from a website). A subset of the data is below:
    df <- read.csv("country_dat.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
    df
      city country
    1 huntsville, alabama
    2 nyn_shamn sweden
    3 j__li finland

The file includes multiple encodings. I tried the following to fix the encoding errors in post-processing:
    guess_encoding(df$city[2])
         encoding language confidence
    1  ISO-8859-1       en       0.30
    2  ISO-8859-2       hu       0.20
    3       UTF-8                0.15
    4    UTF-16BE                0.10
    5    UTF-16LE                0.10
    6   Shift_JIS       ja       0.10
    7     GB18030       zh       0.10
    8      EUC-JP       ja       0.10
    9      EUC-KR       ko       0.10
    10       Big5       zh       0.10

    repair_encoding(df$city[2])
    Best guess: ISO-8859-1 (56% confident)
    [1] "nyn_shamn"

which is not working. Is it possible to automate the encoding repair process without having to scrape the website again?
Edit: the desired output is below:

      city country city_fixed
    1 huntsville, alabama huntsville, alabama
    2 nyn_shamn sweden nynäshamn
    3 j__li finland jääli
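One way to get from city to city_fixed could be sketched as follows. This is only a sketch under two assumptions that are not in the question: that a list of correctly spelled names is available (the `reference` vector here is hypothetical, for example taken from a gazetteer), and that each "_" stands for exactly one lost character, as it appears to in the sample rows. The helper name `fix_city` is made up for illustration.

```r
# Hypothetical reference list of correctly spelled names (an assumption).
# "\u00e4" is the escape for "ä", so the script does not depend on file encoding.
reference <- c("nyn\u00e4shamn", "j\u00e4\u00e4li", "oslo")

# Replace each damaged name by the single reference name it matches,
# treating every "_" as "any one character".
fix_city <- function(city, reference) {
  vapply(city, function(x) {
    if (!grepl("_", x, fixed = TRUE)) return(x)        # nothing to repair
    pattern <- utils::glob2rx(chartr("_", "?", x))     # "nyn_shamn" -> "^nyn.shamn$"
    hits <- grep(pattern, reference, value = TRUE)
    if (length(hits) == 1L) hits else NA_character_    # accept only unambiguous matches
  }, character(1), USE.NAMES = FALSE)
}

df <- data.frame(city = c("nyn_shamn", "j__li"),
                 country = c("sweden", "finland"),
                 stringsAsFactors = FALSE)
df$city_fixed <- fix_city(df$city, reference)
print(df)
```

Names with zero or several candidates come back as NA, so they can be checked by hand instead of being silently replaced.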
Could you provide the desired output? "'s-hertogenbusch" is a valid Dutch city name, so I do not understand why it is "obviously not working".
Another option might be to use enc2utf8(), which should harmonize the input. However, I'm not sure whether you might lose information then.
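To illustrate what a conversion can and cannot do, here is a minimal sketch. It assumes the bytes of one name were stored as ISO-8859-1 (called "latin1" in iconv()), which is a guess based on the guess_encoding() output and not something confirmed for the whole file. It uses only base R.

```r
# Bytes of "nynäshamn" as they would be stored in ISO-8859-1 (0xE4 is "ä").
x <- rawToChar(as.raw(c(0x6e, 0x79, 0x6e, 0xe4, 0x73, 0x68, 0x61, 0x6d, 0x6e)))
Encoding(x)  # "unknown": R has no declared encoding for these bytes

# iconv() converts from an explicitly stated source encoding to UTF-8.
fixed <- iconv(x, from = "latin1", to = "UTF-8")

# A name whose accented letter was already replaced by "_" is plain ASCII,
# so enc2utf8() returns it unchanged.
unchanged <- enc2utf8("nyn_shamn")
```

If the underscores are already present in the .csv itself, the original character is gone from the data, and neither enc2utf8() nor iconv() can bring it back; conversion only helps where the original bytes are still there.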