Duplicate Data

Duplicate rows are handled mainly through two DataFrame methods:

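  • duplicated(): returns a boolean Series indicating, for each row, whether it is a duplicate
  • drop_duplicates(): removes duplicate rows; it is equivalent to dropping every row for which duplicated() is True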

Both methods accept a keep parameter that determines which of the duplicated rows is kept:

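  • keep = 'first': keep the first occurrence (the default)
  • keep = 'last': keep the last occurrence
  • keep = False: keep none, i.e. treat every row that has a duplicate as a duplicate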
Python
In [298]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....:

In [299]: df2
Out[299]:
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [300]: df2.duplicated('a')
Out[300]:
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [301]: df2.duplicated('a', keep='last')
Out[301]:
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [302]: df2.duplicated('a', keep=False)
Out[302]:
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [303]: df2.drop_duplicates('a')
Out[303]:
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329

In [304]: df2.drop_duplicates('a', keep='last')
Out[304]:
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [305]: df2.drop_duplicates('a', keep=False)
Out[305]:
       a  b         c
5  three  x -1.964475
6   four  x  1.298329
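
The equivalence between drop_duplicates() and filtering on duplicated() can be checked directly. The following is a minimal sketch, not part of the original session: the frame df is a stand-in for df2 with fixed integers in column 'c' instead of random values, and kept_index is a hypothetical helper dict used only to record which row labels survive for each keep setting.

```python
import pandas as pd

# Stand-in for df2: same 'a' and 'b' columns, but deterministic values in 'c'
# (an assumption made so the result is reproducible).
df = pd.DataFrame({
    'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
    'c': range(7),
})

kept_index = {}  # hypothetical helper: keep setting -> surviving row labels
for keep in ('first', 'last', False):
    # Boolean Series: True marks the rows that drop_duplicates would remove.
    mask = df.duplicated(subset='a', keep=keep)
    dropped = df.drop_duplicates(subset='a', keep=keep)
    # Filtering out the marked rows should give the same frame.
    filtered = df[~mask]
    assert dropped.equals(filtered)
    kept_index[keep] = list(dropped.index)

print(kept_index)
```

The surviving row labels should match the indexes shown in the drop_duplicates outputs above, since only column 'a' is used to decide what counts as a duplicate.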

References