Automated encoding repair in R for foreign characters


I have a .csv file read into a data frame df with 100,000+ rows and 2 columns representing city and country names (scraped from a website with permission). A subset of the data is below:

    df <- read.csv("country_dat.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
    df
      city                     country
    1 huntsville, alabama
    2 nyn_shamn                sweden
    3 j__li                    finland

The file includes multiple encodings. I tried the following to fix the encoding errors in post-processing:

    guess_encoding(df$city[2])
         encoding language confidence
    1  iso-8859-1       en       0.30
    2  iso-8859-2       hu       0.20
    3       utf-8                0.15
    4    utf-16be                0.10
    5    utf-16le                0.10
    6   shift_jis       ja       0.10
    7     gb18030       zh       0.10
    8      euc-jp       ja       0.10
    9      euc-kr       ko       0.10
    10       big5       zh       0.10

    repair_encoding(df$city[2])
    best guess: iso-8859-1 (56% confident)
    [1] "nyn_shamn"

This is not working. Is it possible to automate the encoding repair process without having to scrape the website again?

Edit: the desired output is below:

      city                     country           city_fixed
    1 huntsville, alabama                        huntsville, alabama
    2 nyn_shamn                sweden            nynäshamn
    3 j__li                    finland           jääli

Could you provide the desired output? "'s-hertogenbusch" is a valid Dutch city name, so I do not understand why it is "obviously not working".

Another option might be to use enc2utf8(), which should harmonize the input. However, I'm not sure whether you might lose information that way.
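To make the enc2utf8() suggestion concrete, here is a minimal sketch in base R. The byte strings and the `city` vector are made up for illustration, and treating every non-UTF-8 value as ISO-8859-1 is an assumption taken from the top guess_encoding() result, not something the data confirms. It shows that enc2utf8() depends on the declared encoding, and that a column with mixed encodings can be handled by converting only the values that are not already valid UTF-8.

```r
# Bytes for "Nynäshamn" in ISO-8859-1 (0xE4 is "ä") and in UTF-8 (0xC3 0xA4).
latin1_bytes <- rawToChar(as.raw(c(0x4e, 0x79, 0x6e, 0xe4,
                                   0x73, 0x68, 0x61, 0x6d, 0x6e)))
utf8_bytes   <- rawToChar(as.raw(c(0x4e, 0x79, 0x6e, 0xc3, 0xa4,
                                   0x73, 0x68, 0x61, 0x6d, 0x6e)))

# Hypothetical column: one latin1 value, one UTF-8 value, and one value
# where the special character has already been replaced by "_".
city <- c(latin1_bytes, utf8_bytes, "nyn_shamn")

# enc2utf8() converts according to the declared encoding, so the latin1
# string has to be marked as such before the call.
marked <- latin1_bytes
Encoding(marked) <- "latin1"
converted <- enc2utf8(marked)

# Mixed input: leave valid UTF-8 alone and convert the rest, ASSUMING
# (this is a guess) that everything else is ISO-8859-1.
is_utf8 <- validUTF8(city)
city_fixed <- city
city_fixed[!is_utf8] <- iconv(city[!is_utf8], from = "latin1", to = "UTF-8")
Encoding(city_fixed[is_utf8]) <- "UTF-8"  # declare the untouched values

# The third value is plain ASCII and comes back unchanged: if the original
# byte was already replaced by "_", no re-encoding can restore it.
print(city_fixed)
```

If the underscores are literally stored in the file, the information is already lost and the names would have to be recovered some other way. If they only appear when printing, a conversion along these lines may be enough.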

