design patterns - Apache Spark distributed SQL
I use Spark's DataFrameReader to perform SQL queries against a database. Each query requires a SparkSession. That is: for each of my JavaPairRDDs I perform a map and invoke a SQL query with parameters taken from the RDD. This means I need to pass the SparkSession into each lambda, which seems like bad design. What is the common approach to such problems?
It looks like this:
roots.map(r -> dbLoader.getData(sparkSession, r._1));
This is how I load the data now:
JavaRDD<Row> javaRDD = sparkSession.read()
    .format("jdbc")
    .options(options)
    .load()
    .javaRDD();
The purpose of big data processing is data locality: being able to execute code where the data resides. It is fine to do one big load of a table into memory or onto local disk (cache/persist), but continuous remote JDBC queries defeat that purpose.
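To make the round-trip argument concrete, here is a minimal plain-Java sketch. It does not use Spark at all. `RemoteTable`, `queryOne`, `loadAll` and `RoundTripDemo` are hypothetical names invented for this illustration: `RemoteTable` stands in for the JDBC source and simply counts how often it is contacted. Under those assumptions, querying once per key costs one round trip per record, while one bulk load followed by local lookups costs a single round trip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a remote JDBC table; it only counts round trips.
class RemoteTable {
    private final Map<String, String> rows = new HashMap<>();
    int roundTrips = 0;

    RemoteTable(Map<String, String> rows) {
        this.rows.putAll(rows);
    }

    // One remote query per key: roughly what a query inside map() amounts to.
    String queryOne(String key) {
        roundTrips++;
        return rows.get(key);
    }

    // One bulk load of the whole table: roughly what a single load() amounts to.
    Map<String, String> loadAll() {
        roundTrips++;
        return new HashMap<>(rows);
    }
}

class RoundTripDemo {
    // Pattern from the question: contact the database once for every key.
    static List<String> perKey(RemoteTable table, List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            out.add(table.queryOne(key));
        }
        return out;
    }

    // Pattern from the answer: load once, then resolve every key locally.
    static List<String> bulkThenJoin(RemoteTable table, List<String> keys) {
        Map<String, String> local = table.loadAll();
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            out.add(local.get(key));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> data = new HashMap<>();
        data.put("a", "1");
        data.put("b", "2");
        data.put("c", "3");
        List<String> keys = Arrays.asList("a", "b", "c");

        RemoteTable first = new RemoteTable(data);
        perKey(first, keys);
        System.out.println("per-key round trips: " + first.roundTrips);

        RemoteTable second = new RemoteTable(data);
        bulkThenJoin(second, keys);
        System.out.println("bulk-load round trips: " + second.roundTrips);
    }
}
```

In Spark terms, the bulk variant would correspond to loading the table once through the JDBC reader, optionally caching it, and joining it with the keys, so that no SparkSession has to be passed into a lambda. Whether loading the whole table is affordable depends on its size, which the question does not state.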