pyspark.pandas.DataFrame.drop_duplicates

DataFrame.drop_duplicates(subset: Union[Any, Tuple[Any, ...], List[Union[Any, Tuple[Any, ...]]], None] = None, keep: str = 'first', inplace: bool = False) → Optional[pyspark.pandas.frame.DataFrame]

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Parameters
    subset : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates; by default, use all of the columns.
    keep : {'first', 'last', False}, default 'first'
        Determines which duplicates (if any) to keep.
        - 'first' : Drop duplicates except for the first occurrence.
        - 'last' : Drop duplicates except for the last occurrence.
        - False : Drop all duplicates.
    inplace : boolean, default False
        Whether to drop duplicates in place or to return a copy.

Returns
    DataFrame
        DataFrame with duplicates removed, or None if inplace=True.

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame(
...     {'a': [1, 2, 2, 2, 3], 'b': ['a', 'a', 'a', 'c', 'd']}, columns=['a', 'b'])
>>> df
   a  b
0  1  a
1  2  a
2  2  a
3  2  c
4  3  d

>>> df.drop_duplicates().sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d

>>> df.drop_duplicates('a').sort_index()
   a  b
0  1  a
1  2  a
4  3  d

>>> df.drop_duplicates(['a', 'b']).sort_index()
   a  b
0  1  a
1  2  a
3  2  c
4  3  d

>>> df.drop_duplicates(keep='last').sort_index()
   a  b
0  1  a
2  2  a
3  2  c
4  3  d

>>> df.drop_duplicates(keep=False).sort_index()
   a  b
0  1  a
3  2  c
4  3  d
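
The effect of subset and keep on which rows survive can be pictured with a small plain-Python sketch. This is only an illustration of the selection rule described under Parameters, not how pandas-on-Spark computes the result. The function kept_positions and the rows list are hypothetical names introduced here and are not part of pyspark.pandas.

```python
from collections import Counter


def kept_positions(rows, subset=None, keep="first"):
    """Return positions of rows that would survive de-duplication.

    Hypothetical helper for illustration only (not a pyspark.pandas API).
    rows   : list of dicts, one per row, standing in for a DataFrame
    subset : a column name or list of column names; None means all columns
    keep   : 'first', 'last', or False, as in the parameter list above
    """
    if not rows:
        return []

    # Decide which columns form the comparison key.
    if subset is None:
        columns = list(rows[0])
    elif isinstance(subset, str):
        columns = [subset]
    else:
        columns = list(subset)
    keys = [tuple(row[c] for c in columns) for row in rows]

    # keep=False: drop every row whose key occurs more than once.
    if keep is False:
        counts = Counter(keys)
        return [i for i, k in enumerate(keys) if counts[k] == 1]

    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last' or False")

    chosen = {}
    for i, k in enumerate(keys):
        if keep == "first":
            chosen.setdefault(k, i)  # remember only the first position
        else:
            chosen[k] = i            # overwrite, so the last position wins
    return sorted(chosen.values())


# Same data as the DataFrame in the examples above.
rows = [
    {"a": 1, "b": "a"},
    {"a": 2, "b": "a"},
    {"a": 2, "b": "a"},
    {"a": 2, "b": "c"},
    {"a": 3, "b": "d"},
]

print(kept_positions(rows))               # [0, 1, 3, 4]
print(kept_positions(rows, subset="a"))   # [0, 1, 4]
print(kept_positions(rows, keep="last"))  # [0, 2, 3, 4]
print(kept_positions(rows, keep=False))   # [0, 3, 4]
```

The printed positions match the index labels that remain in the corresponding drop_duplicates examples above.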