Duplicate Data
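Duplicate rows are handled mainly through two methods:
duplicated(): returns a boolean Series that marks whether each row is a duplicate
drop_duplicates(): removes duplicate rows; it is equivalent to dropping every row for which duplicated() is True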
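Both methods accept a keep parameter that controls which occurrence in each group of duplicates is kept (for duplicated(), which occurrence is left unmarked):
keep='first' (the default): keep the first occurrence
keep='last': keep the last occurrence
keep=False: drop all duplicated rows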
In [298]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....:
In [299]: df2
Out[299]:
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329
In [300]: df2.duplicated('a')
Out[300]:
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool
In [301]: df2.duplicated('a', keep='last')
Out[301]:
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool
In [302]: df2.duplicated('a', keep=False)
Out[302]:
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool
In [303]: df2.drop_duplicates('a')
Out[303]:
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329
In [304]: df2.drop_duplicates('a', keep='last')
Out[304]:
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329
In [305]: df2.drop_duplicates('a', keep=False)
Out[305]:
       a  b         c
5  three  x -1.964475
6   four  x  1.298329
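
The equivalence between the two methods described above can be checked directly. The following is a minimal sketch, not part of the original example: the helper drop_by_mask is a hypothetical name introduced for illustration, and column 'c' uses fixed values instead of random ones so the result is reproducible.

```python
import pandas as pd

# Sample frame modeled on df2 above; 'c' holds fixed values (an assumption
# made here for reproducibility, the original uses np.random.randn).
df = pd.DataFrame({
    'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
    'c': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})


def drop_by_mask(frame, subset, keep='first'):
    """Hypothetical helper: drop the rows that duplicated() flags as True."""
    mask = frame.duplicated(subset=subset, keep=keep)
    return frame[~mask]


for keep in ('first', 'last', False):
    via_mask = drop_by_mask(df, 'a', keep=keep)
    via_method = df.drop_duplicates(subset='a', keep=keep)
    # Both approaches should keep exactly the same rows.
    assert via_mask.equals(via_method)
    print(keep, list(via_mask.index))
```

The surviving index labels for each keep setting should match those in Out[303] through Out[305]. Filtering with the boolean mask is mainly useful when the mask needs to be combined with other conditions before dropping rows.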