python - Pandas: How to filter for items that occur more than once in a dataframe

April 15, 2015

I have a pandas dataframe that contains duplicate entries; some items are listed twice or 3 times. I need to filter it so that it shows only items that are listed at least n times. In the final table, each item should be shown only once. The dataframe contains 3 columns: [cola, colb, colc]. Only colb should be considered in determining whether an item is listed multiple times. Note: this is not drop_duplicates. It's the opposite: I want to drop items that are in the dataframe fewer than n times.

The end result should list each item only once.

You can use value_counts to get the item count, construct a boolean mask from it, reference the index of the result, and test membership using isin:

In [3]: df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
        df

Out[3]:
    a
0   0
1   0
2   0
3   1
4   2
5   2
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

In [8]: df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]

Out[8]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

So, breaking the above down:

In [9]: df['a'].value_counts() > 2

Out[9]:
3     True
4     True
0     True
2    False
1    False
Name: a, dtype: bool

In [10]: # construct a boolean mask
         df['a'].value_counts()[df['a'].value_counts()>2]

Out[10]:
3    6
4    3
0    3
Name: a, dtype: int64

In [11]: # we're interested in the index here, pass this to isin
         df['a'].value_counts()[df['a'].value_counts()>2].index

Out[11]: Int64Index([3, 4, 0], dtype='int64')
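The same steps can be written as a standalone script. This is only a sketch of the approach above: it stores the counts in a variable so value_counts is computed once, and the names counts, frequent_values and result are mine, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Count how often each distinct value occurs; compute this once and reuse it.
counts = df['a'].value_counts()

# Boolean mask on the counts, then take the index: the values seen more than twice.
frequent_values = counts[counts > 2].index

# Keep only the rows whose value is one of the frequent values.
result = df[df['a'].isin(frequent_values)]
print(result)
```

The rows keep their original order and index labels, since isin only builds a row mask on df.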

Edit

As user @jonclements suggested, a simpler and faster method is to groupby on the col of interest and filter it:

In [4]: df.groupby('a').filter(lambda x: len(x) > 2)

Out[4]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4
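For comparison, here is a sketch that runs the groupby/filter version next to a variant using transform('size'). The transform variant is not from the answer above; it is an alternative I'm adding for illustration, and it avoids calling a Python lambda for every group. The variable names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# groupby + filter: the lambda is called once per group and keeps or drops it whole.
by_filter = df.groupby('a').filter(lambda g: len(g) > 2)

# Alternative (my addition): broadcast each group's size back onto its rows,
# then use an ordinary boolean mask.
group_sizes = df.groupby('a')['a'].transform('size')
by_transform = df[group_sizes > 2]

# Both approaches should select the same rows.
print(by_filter.equals(by_transform))
```

Which one is faster depends on the number of groups, so it is worth timing both on your own data.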

Edit 2

To get just a single entry for each repeated item, call drop_duplicates and pass the param subset='a':

In [2]: df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')

Out[2]:
    a
0   0
6   3
12  4
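Applied to the three-column frame from the question, this could look roughly like the sketch below. The helper items_listed_at_least and the sample data are made up for illustration; only the column names cola, colb and colc come from the question. Note that "at least n times" means >= n, whereas the examples above use > 2.

```python
import pandas as pd


def items_listed_at_least(df, column, n):
    """Hypothetical helper: one row per value of `column` that occurs at least n times."""
    # Keep the groups with n or more rows, then reduce each to its first row.
    repeated = df.groupby(column).filter(lambda g: len(g) >= n)
    return repeated.drop_duplicates(subset=column)


# Made-up sample data; only colb decides whether an item counts as repeated.
df = pd.DataFrame({
    'cola': [1, 2, 3, 4, 5, 6],
    'colb': ['x', 'y', 'x', 'z', 'y', 'x'],
    'colc': [10, 20, 30, 40, 50, 60],
})

result = items_listed_at_least(df, 'colb', 2)
print(result)
```

drop_duplicates keeps the first occurrence by default, so cola and colc in the result come from the first row of each repeated colb value.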
