caching - How to lazily build a cache from Spark Streaming data
I am running a streaming job and want to build a lookup map incrementally (to track unique items and filter out duplicated incoming records, for example). My idea is to keep one DataFrame in cache and union it with the new DataFrame created in each batch:
items.foreachRDD((rdd: RDD[String]) => {
  ...
  val uf = rdd.toDF
  cached_df = cached_df.unionAll(uf)
  cached_df.cache
  cached_df.count // materialize
  ...
})
My concern is that cached_df seems to remember the lineage of every RDD appended in previous batch iterations. In my case I don't care about recomputing the cached RDD if the job crashes, so is maintaining the ever-growing DAG just overhead?
As an alternative, at the beginning of each batch I load the lookup from a Parquet file instead of keeping it in memory, and at the end of each batch I append the new RDD to the same Parquet file:
noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
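
For context, a minimal per-batch sketch of this Parquet-based approach could look like the following. The column name "item", the noDuplicatedDF variable, and the assumption that the "lookup" table already exists on disk are illustrative, not from the original job:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SaveMode

items.foreachRDD((rdd: RDD[String]) => {
  import sqlContext.implicits._
  val incoming = rdd.toDF("item")

  // Load the current lookup table from Parquet at the start of the batch
  // (assumes the "lookup" table has already been created at least once).
  val lookup = sqlContext.read.parquet("lookup")

  // Keep only the items not already present in the lookup table.
  val noDuplicatedDF = incoming
    .join(lookup, incoming("item") === lookup("item"), "left_outer")
    .where(lookup("item").isNull)
    .select(incoming("item"))

  // ... process the de-duplicated batch as needed ...

  // Append the newly seen items to the same Parquet table.
  noDuplicatedDF.write.mode(SaveMode.Append).parquet("lookup")
})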
This works as expected, but is there a more straightforward way to keep the lookup in memory?
Thanks, Wanchun
Appending to Parquet is the right approach. To optimize the lookups: if you are okay with the in-memory cache being slightly delayed (that is, not having the latest up-to-the-second data), then periodically (say, every 5 minutes) load the current "lookup" Parquet table into memory (assuming it fits). Lookup queries then run against that latest 5-minute snapshot.
You can also pipeline the loading into memory and the serving of queries in different threads.
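
A rough sketch of that idea, assuming a SQLContext named sqlContext, a 5-minute refresh interval, and a simple @volatile reference swapped by a background thread (these details are assumptions, not from the original answer):

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.DataFrame

// Holds the current in-memory snapshot of the lookup table.
@volatile var lookupSnapshot: DataFrame = sqlContext.read.parquet("lookup").cache()
lookupSnapshot.count() // materialize the first snapshot

val refresher = Executors.newSingleThreadScheduledExecutor()
refresher.scheduleAtFixedRate(new Runnable {
  def run(): Unit = {
    // Load and cache a fresh snapshot, swap it in, then release the old one.
    val fresh = sqlContext.read.parquet("lookup").cache()
    fresh.count()
    val old = lookupSnapshot
    lookupSnapshot = fresh
    old.unpersist()
  }
}, 5, 5, TimeUnit.MINUTES)

// Lookup queries (or the per-batch de-duplication join) simply read
// whatever snapshot lookupSnapshot currently points to.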