python - Pandas: How to filter for items that occur more than once in a dataframe
I have a pandas dataframe that contains duplicate entries; some items are listed twice or three times. I would like to filter it so that it shows only items that are listed at least n times, and in the final table each item should be shown only once. The dataframe contains 3 columns: [cola, colb, colc]. Only colb should be considered when determining whether an item is listed multiple times. Note: this is not drop_duplicates. It's the opposite: I want to drop items that appear in the dataframe fewer than n times.
The end result should list each item once.
You can use value_counts to get the item counts, construct a boolean mask from them, reference the index of the result, and test membership using isin:
In [3]: df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
        df
Out[3]:
    a
0   0
1   0
2   0
3   1
4   2
5   2
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

In [8]: df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]
Out[8]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4
Breaking the above down:
In [9]: df['a'].value_counts() > 2
Out[9]:
3     True
4     True
0     True
2    False
1    False
Name: a, dtype: bool

In [10]: # construct a boolean mask
         df['a'].value_counts()[df['a'].value_counts()>2]
Out[10]:
3    6
4    3
0    3
Name: a, dtype: int64

In [11]: # we're interested in the index here, pass this to isin
         df['a'].value_counts()[df['a'].value_counts()>2].index
Out[11]: Int64Index([3, 4, 0], dtype='int64')
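The same steps can also be written as a short script that computes value_counts only once. This is a minimal sketch using the sample frame from above; the variable names counts, frequent and result are chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Count the occurrences of each value once and reuse the result.
counts = df['a'].value_counts()

# Keep the values that occur more than twice; the index of the
# filtered counts holds the values themselves.
frequent = counts[counts > 2].index

# Select only the rows whose value is one of the frequent values.
result = df[df['a'].isin(frequent)]
print(result)
```

Storing the counts in a variable avoids calling value_counts twice, which the one-liner does.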
Edit
As user @jonclements suggested, a simpler and faster method is to groupby on the column of interest and filter it:
In [4]: df.groupby('a').filter(lambda x: len(x) > 2)
Out[4]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4
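A related variant, not from the original answer, avoids the Python-level lambda by using transform('size') to attach each row's group size and then filtering with an ordinary boolean mask. This is a sketch on the same sample frame; group_sizes and filtered are illustrative names, and whether it is faster will depend on the data.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# For every row, the number of rows that share its value of 'a',
# aligned with the original index.
group_sizes = df.groupby('a')['a'].transform('size')

# Boolean mask: keep rows whose group has more than two members.
filtered = df[group_sizes > 2]
print(filtered)
```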
Edit 2
To get just a single entry for each repeated item, call drop_duplicates and pass the param subset='a':
In [2]: df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')
Out[2]:
    a
0   0
6   3
12  4
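Applied to the three-column frame described in the question, the approach might look like the sketch below. The helper name items_listed_at_least and the sample values are made up for illustration; only the column names cola, colb and colc come from the question. Note that it uses >= n, since the question asks for items listed at least n times, whereas the examples above use > 2.

```python
import pandas as pd


def items_listed_at_least(df, column, n):
    """Return one row per value of `column` that occurs at least n times.

    Hypothetical helper for illustration, not a pandas function.
    """
    counts = df[column].value_counts()
    frequent = counts[counts >= n].index
    # Keep rows with frequent values, then keep the first row of each value.
    return df[df[column].isin(frequent)].drop_duplicates(subset=column)


# Made-up sample data; only the column names come from the question.
sample = pd.DataFrame({
    'cola': [1, 2, 3, 4, 5, 6],
    'colb': ['x', 'y', 'x', 'z', 'x', 'y'],
    'colc': [10, 20, 30, 40, 50, 60],
})

once_each = items_listed_at_least(sample, 'colb', 2)
print(once_each)
```

drop_duplicates keeps the first occurrence by default, so the cola and colc values shown for each item are those of its first row.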