regex - Hadoop Hive SerDe Row Format for String Quoted Space delimited file -
i'm trying create hive table log file following format.
log file:
#software: 1 #version: 1 #start-date: xx #date: xx #fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(referer) sc-status s-action cs-method rs(content-type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(user-agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation #remark: 3215330049 "shprod24a" "10.0.16.162" "main" 2016-08-12 00:35:31 2 172.28.212.88 - - authentication_failed denied "unavailable" - 407 tcp_denied connect - tcp psoc.ebayc3.com 443 / - - "mozilla/5.0 (windows nt 6.1; wow64; rv:45.0) gecko/20100101 firefox/45.0" 10.0.10.198 529 296 - "unavailable" "unavailable" note:
- the first 6 lines of every log file comment lines, (starting '#')
- there 27 fields in every line not comment. fields space delimited strings. other fields space delimited quoted strings, spaces within field. ex: "str ing"
simple space delimited files break because of quirk of having quoted string fields. reason, i'm trying use serde regex pattern in row format.
hiveql create table query:
create table test (date_field string, time_field string, time_taken string, c_ip string, cs_username string, cs_auth_group string, x_exception_id string, sc_filter_result string, cs_categories string, csreferer string, sc_status string, s_action string, cs_method string, rscontent_type string, cs_uri_scheme string, cs_host string, cs_uri_port string, cs_uri_path string, cs_uri_query string, cs_uri_extension string, csuser_agent string, s_ip string, sc_bytes string, cs_bytes string, x_virus_id string, x_bluecoat_application_name string, x_bluecoat_application_operation string) row format serde 'org.apache.hadoop.hive.serde2.regexserde' serdeproperties ( "input.regex" = "(\"[^\"]*\"|'[^']*'|[\s]+)+" ) stored inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' tblproperties ("skip.header.line.count"="6"); results: running
select * test limit 10;
gives me error: failed exception
java.io.ioexception:org.apache.hadoop.hive.serde2.serdeexception: number of matching groups doesn't match number of columns
i'm confused because i've got 27 fields in table (verified describe), , i've got 27 matches on regex. i've got table property ignore first 6 lines, comments shouldn't problem here. error message doesn't make sense logic.
i've tested regex on https://regex101.com/ positive results. matches break down fields want them:
i've tried switching regex pattern various other configurations, without luck.
any suggestions or hints around going wrong here?
thanks in advance!
i tried below regex,i'm able 26 fields not sure field missing
^(\s+\s\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s*(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(\s+)\s(".*")\s(\s+)\s(\s+)\s(\s+)\s(\s)\s(".*")\s(\s*) 
Comments
Post a Comment