scala - update dataframe if not equal spark dataframe -
i have 2 dataframes , in want know whether different based on column key other wisei update 1 different become equal
val tmp_site = spark.load("jdbc", map("url" -> "jdbc:oracle:thin:system/maher@//localhost:1521/xe", "dbtable" -> "iptech.tmp_site")) .withcolumn("site",'site.cast(longtype)) val local_pos = spark.load("jdbc", map("url" -> url, "dbtable" -> "pos")).select("id","name") tmp_site.printschema() local_pos.printschema() val join = tmp_site.join(local_pos, 'site === 'id, "inner") root |-- site: long (nullable = true) |-- libelle: string (nullable = false) root |-- id: long (nullable = false) |-- name: string (nullable = true)
the result of joining
|id |name |site|libelle | +---+----------------------+----+----------------------+ |51 |ezzahra |51 |ezzahra | |7 |benikhalled |7 |benikhalled | |15 |kram |15 |kram | |54 |el mourouj |54 |el mourouj | |11 |le bardo |11 |le bardo | |29 |mini m ksar said |29 |mini m ksar said | |69 |zaghouan |69 |zaghouan | |42 |beb el khadhra |42 |beb el khadhra | |73 |zaouit kontech |73 |zaouit kontech | |87 |aouina |87 |aouina | |64 |sousse i |64 |sousse i | |3 |sahra confort : korba |3 |sahra confort : korba | |34 |soukra square |34 |soukra square | |59 |sahra confort : zarzis|59 |sahra confort : zarzis| |8 |jerba |8 |jerba | |22 |moknine |22 |moknine | |28 |rdayef |28 |rdayef | |85 |monastir absorba |85 |monastir absorba | |16 |bardo hanaya |16 |bardo hanaya | |35 |mini m agba |35 |mini m agba | +---+----------------------+----+----------------------+
i did
val temp = join.withcolumn("changes", when($"libelle" === $"name", lit("nothing")).otherwise("need update")) got
|id |name |site|libelle |changes | +---+----------------------+----+----------------------+--------------+ |51 |ezzahra |51 |ezzahra |nothing | |7 |benikhalled |7 |benikhalled |nothing | |15 |kram |15 |kram |nothing | |54 |el mourouj |54 |el mourouj |nothing | |11 |le bardo |11 |le bardo |nothing | |29 |mini m ksar said |29 |mini m ksar said |nothing | |69 |zaghouan |69 |zaghouan |nothing | |42 |beb el khadhra |42 |beb el khadhra |nothing | |73 |zaouit kontech |73 |zaouit kontech |need update| |87 |aouina |87 |aouina |nothing | |64 |sousse i |64 |sousse i |nothing | |3 |sahra confort : korba |3 |sahra confort : korba |nothing | |34 |soukra square |34 |soukra square |nothing | |59 |sahra confort : zarzis|59 |sahra confort : zarzis|nothing | |8 |jerba |8 |jerba |nothing | |22 |moknine |22 |moknine |need update| |28 |rdayef |28 |rdayef |nothing | |85 |monastir absorba |85 |monastir absorba |nothing | |16 |bardo hanaya |16 |bardo hanaya |nothing | |35 |mini m agba |35 |mini m agba |nothing | +---+----------------------+----+----------------------+--------------+
i not why said need updated because same. altough should nothing of them , beause equal . should appreciated
once have dataframe
, quite easy play columns
, rows
.
so have following dataframe
after join
+----+---------------------+----+---------------------+ |site|libelle |id |name | +----+---------------------+----+---------------------+ |48 |mini m boumhel |48 |mini m boumhel | |67 |lac |67 |lac | |992 |test2 |992 |test | |44 |kairouan |44 |kairouan | |61 |tunis |61 |tunis | |9001|monoprix |9001|monoprix | |3 |sahra confort : korba|3 |sahra confort : korba| |37 |mini m borj lozir |37 |mini m borj lozir | |83 |jendouba |83 |jendouba | |12 |bigro |12 |bigro | +----+---------------------+----+---------------------+
you can create column logic have written using when
function
import org.apache.spark.sql.functions._ val temp = join.withcolumn("changes", when($"libelle" === $"name", lit("nothing")).otherwise("need update"))
temp
dataframe
be
+----+---------------------+----+---------------------+--------------+ |site|libelle |id |name |changes | +----+---------------------+----+---------------------+--------------+ |48 |mini m boumhel |48 |mini m boumhel |nothing | |67 |lac |67 |lac |nothing | |992 |test2 |992 |test |need update| |44 |kairouan |44 |kairouan |nothing | |61 |tunis |61 |tunis |nothing | |9001|monoprix |9001|monoprix |nothing | |3 |sahra confort : korba|3 |sahra confort : korba|nothing | |37 |mini m borj lozir |37 |mini m borj lozir |nothing | |83 |jendouba |83 |jendouba |nothing | |12 |bigro |12 |bigro |nothing | +----+---------------------+----+---------------------+--------------+
now can use filter
method on dataframe
temp.filter($"changes" === "need update").show(false)
which should give
+----+-------+---+----+--------------+ |site|libelle|id |name|changes | +----+-------+---+----+--------------+ |992 |test2 |992|test|need update| +----+-------+---+----+--------------+
you need play columns using select
, groupby
, aggregations
, filters
, other inbuilt functions or using udf
functions etc etc. can convert rdd
, tuples
did in example.
Comments
Post a Comment