caching - How to lazily build a cache from Spark Streaming data


I am running a streaming job and want to build a lookup map incrementally (for example, to track unique items and filter out duplicate incoming records). I was thinking of keeping one DataFrame in cache and unioning it with the new DataFrame created in each batch:

    items.foreachRDD((rdd: RDD[String]) => {
      ...
      val uf = rdd.toDF
      cached_df = cached_df.unionAll(uf)
      cached_df.cache
      cached_df.count   // materialize
      ...
    })
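For reference, a minimal self-contained sketch of this union-and-cache pattern, assuming Spark 1.x (hence unionAll), an incoming DStream[String] called items as in the snippet above, and an active sqlContext; the distinct() and unpersist() calls are additions here, to keep the cached table deduplicated and to release the previous snapshot:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.streaming.dstream.DStream

    // items and sqlContext are assumed to come from the surrounding streaming job.
    def buildLookupInMemory(items: DStream[String], sqlContext: SQLContext): Unit = {
      import sqlContext.implicits._

      // Start with an empty single-column DataFrame holding the unique items.
      var cached_df: DataFrame = sqlContext.createDataFrame(
        sqlContext.sparkContext.emptyRDD[Row],
        StructType(Seq(StructField("item", StringType))))

      items.foreachRDD { (rdd: RDD[String]) =>
        val uf = rdd.map(Tuple1(_)).toDF("item")
        val previous = cached_df

        // Union the new batch in, drop duplicates, and re-cache.
        cached_df = previous.unionAll(uf).distinct()
        cached_df.cache()
        cached_df.count()        // materialize the new cache eagerly

        previous.unpersist()     // release the previous snapshot
      }
    }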

My concern is that cached_df seems to remember the lineage of all the previous RDDs appended in earlier batch iterations. In my case, if I don't care about recomputing the cached RDD after a crash, is it just overhead to maintain that growing DAG?

As an alternative, at the beginning of each batch I could load the lookup from a Parquet file instead of keeping it in memory, and at the end of each batch append the new RDD to the same Parquet file:

    noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
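To make the duplicate filtering concrete, here is a hedged sketch of what one batch might look like around that append, assuming the "lookup" Parquet table already exists, the incoming DStream[String] is called items as above, and an active sqlContext is in scope; except() is used here to keep only items not already present in the lookup:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SaveMode

    items.foreachRDD { (rdd: RDD[String]) =>
      import sqlContext.implicits._

      // New items from this batch, deduplicated within the batch itself.
      val incoming = rdd.map(Tuple1(_)).toDF("item").distinct()

      // Load the current lookup table and keep only previously unseen items.
      val lookup = sqlContext.read.parquet("lookup")
      val noDuplicatedDF = incoming.except(lookup)

      // Append only the genuinely new items to the same Parquet file.
      noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
    }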

This works as expected, but is there a straightforward way to keep the lookup in memory?

Thanks, Wanchun

Appending to Parquet is the right approach. To optimize the lookup: if you are okay with the in-memory cache being slightly stale (that is, not having the very latest data), then periodically (say, every 5 minutes) load the current "lookup" Parquet table into memory, assuming it fits. Lookup queries then run against the latest 5-minute snapshot.

You can also pipeline the loading into memory and the serving of queries in different threads.
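A minimal sketch of that idea, assuming Spark 1.x, the "lookup" Parquet path from the question, and a 5-minute refresh interval; the LookupRefresher class and its method names are hypothetical, and the point is just a background thread that swaps in a freshly cached snapshot while queries keep reading the previous one:

    import java.util.concurrent.{Executors, TimeUnit}
    import java.util.concurrent.atomic.AtomicReference
    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Hypothetical helper: reloads the Parquet lookup on a schedule and
    // exposes the most recently cached snapshot to query-serving code.
    class LookupRefresher(sqlContext: SQLContext, path: String) {
      private val snapshot = new AtomicReference[DataFrame]()
      private val scheduler = Executors.newSingleThreadScheduledExecutor()

      private def reload(): Unit = {
        val fresh = sqlContext.read.parquet(path).cache()
        fresh.count()                          // materialize before swapping in
        val old = snapshot.getAndSet(fresh)
        if (old != null) old.unpersist()       // free the previous snapshot
      }

      def start(): Unit = {
        reload()                               // initial load
        scheduler.scheduleAtFixedRate(new Runnable {
          def run(): Unit = reload()
        }, 5, 5, TimeUnit.MINUTES)
      }

      // Queries read whichever snapshot is current, at most ~5 minutes stale.
      def current: DataFrame = snapshot.get()

      def stop(): Unit = scheduler.shutdown()
    }

With this shape, the query-serving thread only ever calls current, so staleness is bounded by the refresh interval and loading never blocks lookups.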

