2882. 删去重复的行
2882. 删去重复的行
题目
DataFrame
customers
+-------------+--------+ | Column Name | Type | +-------------+--------+ | customer_id | int | | name | object | | email | object | +-------------+--------+
There are some duplicate rows in the DataFrame based on the email
column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
The result format is in the following example.
Example 1:
Input:
+-------------+---------+---------------------+ | customer_id | name | email | +-------------+---------+---------------------+ | 1 | Ella | emily@example.com | | 2 | David | michael@example.com | | 3 | Zachary | sarah@example.com | | 4 | Alice | john@example.com | | 5 | Finn | john@example.com | | 6 | Violet | alice@example.com | +-------------+---------+---------------------+
Output:
+-------------+---------+---------------------+ | customer_id | name | email | +-------------+---------+---------------------+ | 1 | Ella | emily@example.com | | 2 | David | michael@example.com | | 3 | Zachary | sarah@example.com | | 4 | Alice | john@example.com | | 6 | Violet | alice@example.com | +-------------+---------+---------------------+
Explanation:
Alic (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.
题目大意
DataFrame
customers
+-------------+--------+ | Column Name | Type | +-------------+--------+ | customer_id | int | | name | object | | email | object | +-------------+--------+
在 DataFrame 中基于 email
列存在一些重复行。
编写一个解决方案,删除这些重复行,仅保留第一次出现的行。
返回结果格式如下例所示。
示例 1:
输入:
+-------------+---------+---------------------+ | customer_id | name | email | +-------------+---------+---------------------+ | 1 | Ella | emily@example.com | | 2 | David | michael@example.com | | 3 | Zachary | sarah@example.com | | 4 | Alice | john@example.com | | 5 | Finn | john@example.com | | 6 | Violet | alice@example.com | +-------------+---------+---------------------+
输出:
+-------------+---------+---------------------+ | customer_id | name | email | +-------------+---------+---------------------+ | 1 | Ella | emily@example.com | | 2 | David | michael@example.com | | 3 | Zachary | sarah@example.com | | 4 | Alice | john@example.com | | 6 | Violet | alice@example.com | +-------------+---------+---------------------+
解释:
Alice (customer_id = 4) 和 Finn (customer_id = 5) 都使用 john@example.com,因此只保留该邮箱地址的第一次出现。
解题思路
去重操作:
- Pandas 提供了
drop_duplicates
方法,支持根据指定列去重。 - 调用
customers.drop_duplicates(subset='email')
,表示以email
列为基准,自动检测并去掉email
列中的重复记录,保留每组重复值的第一条记录。
- Pandas 提供了
返回新 DataFrame:
drop_duplicates
不修改原 DataFrame,而是返回去重后的新 DataFrame。- 如果希望对原 DataFrame 就地修改,需显式传递
inplace=True
参数。
复杂度分析
- 时间复杂度:
O(n)
,其中n
是行数,去重操作需要对email
列进行遍历和哈希检查。 - 空间复杂度:
O(n)
,去重操作需要一个临时哈希。
代码
import pandas as pd
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
return customers.drop_duplicates('email')