How should I configure Spark to correctly prune Hive Metastore partitions? -


i'm having issues when applying partition filters spark (v2.0.2/2.1.1) dataframes read hive (v2.1.0) table on 30,000 partitions. know recommended approach , what, if anything, i'm doing incorrectly current behaviour source of large performance reliability issues.

to enable pruning, using following spark/hive property:

--conf spark.sql.hive.metastorepartitionpruning=true 

when running query in spark-shell can see partition fetch take place invocation thrifthivemetastore.iface.get_partitions, unexpectedly occurs without filtering:

val mytable = spark.table("db.table") val mytabledata = mytable   .filter("local_date = '2017-09-01' or local_date = '2017-09-02'")   .cache  // hms call invoked is: // #get_partitions('db', 'table', -1) 

if use more simplistic filter, partitions filtered desired:

val mytabledata = mytable   .filter("local_date = '2017-09-01'")   .cache  // hms call invoked is: // #get_partitions_by_filter( //   'db', 'table', //   'local_date = "2017-09-01"', //   -1 // ) 

the filtering works correctly if rewrite filter use range operators instead of checking equality:

val mytabledata = mytable   .filter("local_date >= '2017-09-01' , local_date <= '2017-09-02'")   .cache  // hms call invoked is: // #get_partitions_by_filter( //   'db', 'table', //   'local_date >= '2017-09-01' , local_date <= '2017-09-02'', //   -1 // ) 

in our case, behaviour problematic performance perspective; call times in region of 4 minutes versus 1 second when correctly filtered. additionally, routinely loading large volumes ofpartition objects onto heap per query leads memory issues in metastore service.

it seems though there bug around parsing , interpretation of types of filter constructs, i've not been able find relevant issue in spark jira. there preferred approach or specific spark version filters correctly applied filter variants? or must use specific forms (e.g. range operators) when constructing filters? if so, limitation documented anywhere?


Comments

Popular posts from this blog

Sort a complex associative array in PHP -

vb.net - How to ignore if a cell is empty nothing -

recursion - Can every recursive algorithm be improved with dynamic programming? -