caching - How to lazily build a cache from Spark Streaming data
I am running a streaming job and want to build a lookup map incrementally (to track unique items and filter out duplicated incoming records, for example). My idea is to keep one DataFrame in cache and union it with the new DataFrame created in each batch:
items.foreachRDD((rdd: RDD[String]) => {
  ...
  val uf = rdd.toDF
  cached_df = cached_df.unionAll(uf)
  cached_df.cache
  cached_df.count // materialize
  ...
})
My concern is that cached_df seems to remember the lineage of every RDD appended in previous batch iterations. In my case I don't care about recomputing the cached RDD if the job crashes, so is maintaining the ever-growing DAG just overhead?
As an alternative, at the beginning of each batch I load the lookup from a Parquet file instead of keeping it in memory, and at the end of each batch I append the new RDD to the same Parquet file:
noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
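
For context, a minimal per-batch sketch of this Parquet-based approach could look like the following. The column name "item", the noDuplicatedDF variable, and the assumption that the "lookup" table already exists on disk are illustrative, not from the original job:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SaveMode

items.foreachRDD((rdd: RDD[String]) => {
  import sqlContext.implicits._
  val incoming = rdd.toDF("item")

  // Load the current lookup table from Parquet at the start of the batch
  // (assumes the "lookup" table has already been created at least once).
  val lookup = sqlContext.read.parquet("lookup")

  // Keep only the items not already present in the lookup table.
  val noDuplicatedDF = incoming
    .join(lookup, incoming("item") === lookup("item"), "left_outer")
    .where(lookup("item").isNull)
    .select(incoming("item"))

  // ... process the de-duplicated batch as needed ...

  // Append the newly seen items to the same Parquet table.
  noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
})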
This works as expected, but is there a more straightforward way to keep the lookup in memory?
Thanks, Wanchun
Appending to Parquet is the right approach. To optimize the lookups: if you are okay with the in-memory cache being slightly delayed (that is, not having the latest up-to-the-second data), then periodically (say, every 5 minutes) load the current "lookup" Parquet table into memory (assuming it fits). Lookup queries then run against that latest 5-minute snapshot.
You can also pipeline the loading into memory and the serving of queries in different threads.
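
A rough sketch of that idea, assuming a SQLContext named sqlContext, a 5-minute refresh interval, and a simple @volatile reference swapped by a background thread (these details are assumptions, not from the original answer):

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.DataFrame

// Holds the current in-memory snapshot of the lookup table.
@volatile var lookupSnapshot: DataFrame = sqlContext.read.parquet("lookup").cache()
lookupSnapshot.count() // materialize the first snapshot

val refresher = Executors.newSingleThreadScheduledExecutor()
refresher.scheduleAtFixedRate(new Runnable {
  def run(): Unit = {
    // Load and cache a fresh snapshot, swap it in, then release the old one.
    val fresh = sqlContext.read.parquet("lookup").cache()
    fresh.count()
    val old = lookupSnapshot
    lookupSnapshot = fresh
    old.unpersist()
  }
}, 5, 5, TimeUnit.MINUTES)

// Lookup queries (or the per-batch de-duplication join) simply read
// whatever snapshot lookupSnapshot currently points to.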