跳至主要內容

2882. 删去重复的行


2882. 删去重复的行

🟢   🔗 力扣open in new window LeetCodeopen in new window

题目

DataFrame customers

+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+

There are some duplicate rows in the DataFrame based on the email column.

Write a solution to remove these duplicate rows and keep only the first occurrence.

The result format is in the following example.

Example 1:

Input:

+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Output:

+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Explanation:

Alic (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.

题目大意

DataFrame customers

+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+

在 DataFrame 中基于 email 列存在一些重复行。

编写一个解决方案,删除这些重复行,仅保留第一次出现的行。

返回结果格式如下例所示。

示例 1:

输入:

+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

输出:

+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

解释:

Alice (customer_id = 4) 和 Finn (customer_id = 5) 都使用 john@example.com,因此只保留该邮箱地址的第一次出现。

解题思路

  1. 去重操作

    • Pandas 提供了 drop_duplicates 方法,支持根据指定列去重。
    • 调用 customers.drop_duplicates(subset='email'),表示以 email 列为基准,自动检测并去掉 email 列中的重复记录,保留每组重复值的第一条记录。
  2. 返回新 DataFrame

    • drop_duplicates 不修改原 DataFrame,而是返回去重后的新 DataFrame。
    • 如果希望对原 DataFrame 就地修改,需显式传递 inplace=True 参数。

复杂度分析

  • 时间复杂度O(n),其中 n 是行数,去重操作需要对 email 列进行遍历和哈希检查。
  • 空间复杂度O(n),去重操作需要一个临时哈希。

代码

import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
  return customers.drop_duplicates('email')