Google Cloud Dataflow - How to sample per key in a PCollection using different sampling rates per key?
I am going through the process of switching a few Spark jobs over to Cloud Dataflow / Apache Beam 2.0.

One of these jobs uses PairRDD.sampleByKey(sampleRates), where sampleRates is a map whose keys match the keys in the PairRDD and whose values are the rates at which each key should be sampled.

I've found that Beam has Sample.fixedSizePerKey(sampleCount), which seems to be the closest equivalent. However, it samples a fixed amount (as the method name implies), and that amount is the same for every key.

I've dug into the Sample class a bit to see if it could be modified to accept a map and sample a different count per key, but I can't find a way to access the key inside a PCollection<KV<K, V>>.

How can I access the key inside a PCollection within a PTransform in order to do this?