Identifying Duplicates in a Data Frame using Pandas
Jun 23, 2019
When pre-processing a data frame, it is essential to deal with duplicate rows. pandas provides pandas.DataFrame.duplicated() for identifying them.
Now, let’s see how it works:
import pandas as pd
# create a data frame
df = pd.DataFrame({'Name': ['Jack', 'Rose', 'Kate', 'Daniel', 'David', 'Rose', 'Kate'], 'Age': [11, 21, 19, 34, 22, 24, 19]})
df.duplicated() identifies duplicate rows and returns a boolean Series, with True marking each row that repeats an earlier one.
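As a minimal sketch, calling the method on the data frame above produces one boolean per row. The variable name `mask` is only an illustrative choice.

```python
import pandas as pd

# Same data frame as above
df = pd.DataFrame({
    'Name': ['Jack', 'Rose', 'Kate', 'Daniel', 'David', 'Rose', 'Kate'],
    'Age': [11, 21, 19, 34, 22, 24, 19],
})

# One boolean per row; True marks a row identical to an earlier row
mask = df.duplicated()
print(mask)
```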
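In the data frame above, the rows at index 2 and index 6 are duplicates (both are Kate, 19). Hence df.duplicated() returns True at index 6 and False at index 2, because by default the first occurrence is kept as it is. The two Rose rows are not flagged, since their ages differ.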
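Which occurrence gets flagged, and which columns are compared, can be adjusted with the `keep` and `subset` parameters of `DataFrame.duplicated()`. The sketch below is only an illustration on the same data; the variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jack', 'Rose', 'Kate', 'Daniel', 'David', 'Rose', 'Kate'],
    'Age': [11, 21, 19, 34, 22, 24, 19],
})

# Default (keep='first'): only the later copy is flagged -> True at index 6
flag_later = df.duplicated()

# keep='last': the earlier copy is flagged instead -> True at index 2
flag_earlier = df.duplicated(keep='last')

# keep=False: every copy is flagged -> True at index 2 and index 6
flag_all = df.duplicated(keep=False)

# subset=['Name']: compare on Name only, so the second Rose is flagged too
flag_by_name = df.duplicated(subset=['Name'])

print(flag_by_name)
```

Comparing on `Name` alone treats both Rose rows as duplicates even though their ages differ, so `subset` is worth setting deliberately.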
Using this boolean Series as a mask on the data frame, df[df.duplicated()], returns the duplicate rows.
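A short sketch of this filtering step on the same data follows. Negating the mask with `~` to keep only the non-duplicate rows is an addition for illustration, not something the text above covers.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jack', 'Rose', 'Kate', 'Daniel', 'David', 'Rose', 'Kate'],
    'Age': [11, 21, 19, 34, 22, 24, 19],
})

# Boolean indexing keeps only the rows where the mask is True
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)  # the single row at index 6 (Kate, 19)

# Inverting the mask keeps everything except the flagged duplicates
unique_rows = df[~df.duplicated()]
print(unique_rows)
```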
More on this function can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html