Google Cloud Dataflow - How to sample per key in a PCollection using different sampling rates per key?
I am in the process of switching a few Spark jobs over to Cloud Dataflow / Apache Beam 2.0. One of these jobs uses pairRDD.sampleByKey(sampleRates), where sampleRates is a map whose keys match the keys in the pairRDD and whose values are the rates at which each key should be sampled.

The closest equivalent I have found in Beam is Sample.fixedSizePerKey(sampleCount), but as the method name implies, it samples a fixed number of elements for every key. I have dug through the Sample class a bit to see whether it could be modified to accept a map and use a different count per key, but I can't find a way to access the key inside a PCollection<KV<K, V>>.

How can I access the key inside a PCollection in a PTransform in order to do this?
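For reference, the per-key Bernoulli sampling that sampleByKey performs can be sketched outside Beam in plain Python. The function name sample_by_key and the sample_rates map are illustrative, not part of any API; in Beam this logic would live in a DoFn over KV<K, V> elements, where c.element().getKey() exposes the key inside the transform.

```python
import random

def sample_by_key(pairs, sample_rates, seed=None):
    """Bernoulli-sample (key, value) pairs: keep each pair with the
    probability listed for its key in sample_rates (default 0.0).
    Hypothetical sketch of what Spark's sampleByKey does per element."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs
            if rng.random() < sample_rates.get(k, 0.0)]

pairs = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]
# Rate 1.0 always keeps, rate 0.0 always drops, so this run is deterministic.
print(sample_by_key(pairs, {"a": 1.0, "b": 0.0}))  # → [('a', 1), ('a', 2)]
```

The key point is that the sampling decision only needs the element's own key to look up its rate, so it fits a stateless per-element transform rather than a per-key aggregation like Sample.fixedSizePerKey.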