regex - Hadoop Hive SerDe Row Format for String Quoted Space delimited file -


i'm trying create hive table log file following format.

log file:

#software: 1 #version: 1 #start-date: xx #date: xx #fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(referer) sc-status s-action cs-method rs(content-type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(user-agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation #remark: 3215330049 "shprod24a" "10.0.16.162" "main" 2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed denied "unavailable" -  407 tcp_denied connect - tcp psoc.ebayc3.com 443 / - - "mozilla/5.0 (windows nt 6.1; wow64; rv:45.0) gecko/20100101 firefox/45.0" 10.0.10.198 529 296 - "unavailable" "unavailable" 

note:

  • the first 6 lines of every log file comment lines, (starting '#')
  • there 27 fields in every line not comment. fields space delimited strings. other fields space delimited quoted strings, spaces within field. ex: "str ing"

simple space delimited files break because of quirk of having quoted string fields. reason, i'm trying use serde regex pattern in row format.

hiveql create table query:

create table test (date_field string,  time_field  string,  time_taken  string,  c_ip  string,  cs_username  string,  cs_auth_group  string,  x_exception_id  string,  sc_filter_result  string,  cs_categories  string,  csreferer  string,  sc_status  string,  s_action  string,  cs_method  string,  rscontent_type  string,  cs_uri_scheme  string,  cs_host  string,  cs_uri_port  string,  cs_uri_path  string,  cs_uri_query  string,  cs_uri_extension  string,  csuser_agent  string,  s_ip  string,  sc_bytes  string,  cs_bytes  string,  x_virus_id  string,  x_bluecoat_application_name  string,  x_bluecoat_application_operation  string) row format serde 'org.apache.hadoop.hive.serde2.regexserde' serdeproperties (   "input.regex" = "(\"[^\"]*\"|'[^']*'|[\s]+)+" ) stored inputformat    'org.apache.hadoop.mapred.textinputformat' outputformat    'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' tblproperties ("skip.header.line.count"="6"); 

results: running

select * test limit 10;

gives me error: failed exception

java.io.ioexception:org.apache.hadoop.hive.serde2.serdeexception: number of matching groups doesn't match number of columns

i'm confused because i've got 27 fields in table (verified describe), , i've got 27 matches on regex. i've got table property ignore first 6 lines, comments shouldn't problem here. error message doesn't make sense logic.

i've tested regex on https://regex101.com/ positive results. matches break down fields want them:

regex output

i've tried switching regex pattern various other configurations, without luck.

any suggestions or hints around going wrong here?

thanks in advance!

i tried below regex,i'm able 26 fields not sure field missing

^(\s+\s\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s*(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(".*")\s(\s+)\s(\s+)\s(\s+)\s(\s)\s(".*")\s(\s*) 

Comments

Popular posts from this blog

Sort a complex associative array in PHP -

vb.net - How to ignore if a cell is empty nothing -

recursion - Can every recursive algorithm be improved with dynamic programming? -