Identifying Duplicates in a Data Frame using Pandas

Jun 23, 2019

When pre-processing data, it’s essential to deal with duplicate rows in a data frame. In pandas, they can be identified using pandas.DataFrame.duplicated()

Now, let’s see how it works:

import pandas as pd

# create a data frame
df = pd.DataFrame({'Name': ['Jack', 'Rose', 'Kate', 'Daniel', 'David', 'Rose', 'Kate'], 'Age': [11, 21, 19, 34, 22, 24, 19]})

df.duplicated() identifies the duplicate rows and returns a boolean Series, with one True/False value per row.

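In the above data frame, the rows at index 2 and index 6 (Kate, 19) are identical. The two Rose rows differ in Age, so they are not duplicates. Hence df.duplicated() returns True for the row at index 6 and False for the row at index 2, because by default the first occurrence is not marked as a duplicate.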
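As a quick check, the sketch below rebuilds the same data frame and prints the boolean Series. The variable name mask is introduced here only for illustration and is not part of the original example.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({
    'Name': ['Jack', 'Rose', 'Kate', 'Daniel', 'David', 'Rose', 'Kate'],
    'Age': [11, 21, 19, 34, 22, 24, 19],
})

# True marks a row that repeats an earlier row across all columns
mask = df.duplicated()
print(mask)
```

Only the last row (index 6) should come out as True; every other row is False.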

Using this boolean Series to index the data frame, df[df.duplicated()], returns only the duplicate rows.
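A minimal sketch of that filtering step follows, together with two optional parameters of duplicated(), keep and subset, which the example above does not use. The variable names dupes, all_copies and by_name are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({
    'Name': ['Jack', 'Rose', 'Kate', 'Daniel', 'David', 'Rose', 'Kate'],
    'Age': [11, 21, 19, 34, 22, 24, 19],
})

# Default (keep='first'): only the later copy is selected
dupes = df[df.duplicated()]

# keep=False marks every copy, including the first occurrence
all_copies = df[df.duplicated(keep=False)]

# subset limits the comparison to the given columns (here: Name only)
by_name = df[df.duplicated(subset=['Name'])]

print(dupes)
print(all_copies)
print(by_name)
```

With this data, dupes holds the row at index 6, all_copies holds the rows at index 2 and 6, and by_name holds the rows at index 5 and 6, since Rose and Kate each appear twice when Age is ignored.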

More on this function can be found here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

Written by David Gladson