python - Create list containing proportionate number of data from n other lists
I'm new to PySpark programming. I have been working on a problem, and I want to know if there's a more efficient way to solve it.
I have 15 dataframes, each containing 2 columns: website name and hits. Each of these dataframes has a different number of records. I want a final dataframe in the end, for which I have taken the top records from each dataframe (based on hits) and added them there. The catch is that I'm taking a proportionate number of records from each. For example, if I want 1500 records in the end, and I have 2 dataframes of size 10,000 and 5,000, I take 1000 from the first and 500 from the second.
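The proportional split described above can be sketched in plain Python. This is only an illustration of the arithmetic: the function name `proportional_quotas` is hypothetical, and the largest-remainder rounding is an assumption about how to handle shares that are not whole numbers (the question does not say). It also assumes the requested total does not exceed the combined size.

```python
def proportional_quotas(sizes, total):
    """Split `total` across groups in proportion to `sizes`.

    Hypothetical helper; uses largest-remainder rounding so the
    quotas always add up to `total`.
    """
    grand_total = sum(sizes)
    # Exact (possibly fractional) share of each group.
    exact = [total * size / grand_total for size in sizes]
    # Round every share down first.
    quotas = [int(share) for share in exact]
    # Give the leftover records to the groups with the largest remainders.
    leftover = total - sum(quotas)
    by_remainder = sorted(
        range(len(sizes)), key=lambda i: exact[i] - quotas[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        quotas[i] += 1
    return quotas


# The example from the question: 1500 records from 10,000 and 5,000.
print(proportional_quotas([10_000, 5_000], 1_500))  # [1000, 500]
```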
So currently, I have an implementation like this:
1. Call count() on each dataframe. Now I have the length of each, so I can determine how many records I want from each dataframe.
2. Call orderBy() on each dataframe, based on hits. Then call limit() on the ordered dataframe so I can limit the total number of records I need per dataframe.
The above implementation works, but it's pretty slow. It sounds like a greedy approach, so I would appreciate any hint to make it better. Thank you!
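For reference, the two steps in the question can be sketched with plain Python lists standing in for the dataframes. This is not PySpark code: `top_proportional` is a hypothetical name, each row is assumed to be a `(name, hits)` tuple, and `heapq.nlargest` plays the role of orderBy() followed by limit().

```python
import heapq


def top_proportional(frames, total):
    """Take the top rows by hits from each list, proportionally to its length.

    Hypothetical helper; `frames` is a list of lists of (name, hits) tuples.
    """
    # Step 1: count each list (the analogue of count() on each dataframe).
    sizes = [len(rows) for rows in frames]
    grand_total = sum(sizes)
    result = []
    for rows, size in zip(frames, sizes):
        # Floor division keeps this simple; the overall result can fall
        # slightly short of `total` when shares are not whole numbers.
        quota = total * size // grand_total
        # Step 2: keep the `quota` rows with the most hits
        # (the analogue of orderBy() followed by limit()).
        result.extend(heapq.nlargest(quota, rows, key=lambda row: row[1]))
    return result


site_a = [("a.com", 50), ("b.com", 90), ("c.com", 10), ("d.com", 70)]
site_b = [("e.com", 30), ("f.com", 80)]
print(top_proportional([site_a, site_b], 3))
# [('b.com', 90), ('d.com', 70), ('f.com', 80)]
```

Like the approach in the question, this loops over the inputs one at a time, which is the part that gets slow when every iteration is a separate Spark job.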
Your approach seems correct, although you still have to iterate through the list of dataframes. You can try this approach to parallelize the computations:
First, let's create sample dataframes of varying lengths:

    import random

    # Assumes an active SparkSession `spark` and SparkContext `sc`,
    # as provided by the pyspark shell.
    length_list = [10, 15, 20, 30]
    df_list = []
    for l in length_list:
        df = spark.createDataFrame(
            sc.parallelize(
                [[chr(ord("a") + i), random.randint(0, 100), l] for i in range(l)]
            ),
            ["name", "hits", "df_name"],
        )
        df_list.append(df)

Note that I created a column called df_name, which uses the length of the dataframe as its name.
We'll create a union of the dataframes so that we have a single work table:

    from functools import reduce
    from pyspark.sql import DataFrame

    # Stack all dataframes on top of each other.
    df = reduce(DataFrame.unionAll, df_list)

Now we'll compute the percent_rank within each df_name group using a window:
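
    from pyspark.sql import Window
    import pyspark.sql.functions as psf

    # Rank rows inside each original dataframe by hits, highest first.
    w = Window.partitionBy("df_name").orderBy(psf.desc("hits"))
    df = df.withColumn("pct_rn", psf.percent_rank().over(w))

You can then filter the dataframe for whatever proportion of each group you want, 1/3 for instance: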

    res = df.filter(df.pct_rn < 1/3.)

Finally, we check that the final lengths are roughly 1/3 of the original ones:

    res.groupBy("df_name").count().sort("df_name").show()

    +-------+-----+
    |df_name|count|
    +-------+-----+
    |     10|    3|
    |     15|    6|
    |     20|    7|
    |     30|   10|
    +-------+-----+
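
The counts are not always exactly a third of each group. A plain-Python sketch of the ranking may help explain why, assuming percent_rank follows the standard SQL definition (rank - 1) / (rows - 1), with tied values sharing a rank. The helper names `percent_rank_desc` and `rows_kept` are hypothetical and exist only for this illustration.

```python
def percent_rank_desc(hits):
    """Percent rank of each value when ordered by hits, highest first.

    Hypothetical helper mirroring (rank - 1) / (rows - 1),
    where rank - 1 is the number of rows with strictly more hits.
    """
    n = len(hits)
    if n == 1:
        return [0.0]
    return [sum(other > h for other in hits) / (n - 1) for h in hits]


def rows_kept(hits, fraction):
    """Number of rows whose percent rank is strictly below `fraction`."""
    return sum(pr < fraction for pr in percent_rank_desc(hits))


# 15 distinct values: ranks 1 to 5 pass the filter.
print(rows_kept(list(range(15)), 1 / 3))  # 5

# 15 values with a tie at rank 5: both tied rows pass, so 6 rows are kept.
tied = [100, 90, 80, 70, 60, 60, 50, 40, 30, 20, 10, 5, 4, 3, 2]
print(rows_kept(tied, 1 / 3))  # 6
```

Since the sample hits are random integers between 0 and 100, ties are likely, which is one plausible reason a group of 15 can yield 6 rows instead of 5.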