819. 最常见的单词
819. 最常见的单词
🟢 🔖 数组
哈希表
字符串
计数
🔗 力扣
LeetCode
题目
Given a string paragraph
and a string array of the banned words banned
, return the most frequent word that is not banned. It is guaranteed there is at least one word that is not banned, and that the answer is unique.
The words in paragraph
are case-insensitive and the answer should be returned in lowercase.
Example 1:
Input: paragraph = "Bob hit a ball, the hit BALL flew far after it was hit.", banned = ["hit"]
Output: "ball"
Explanation:
"hit" occurs 3 times, but it is a banned word.
"ball" occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph.
Note that words in the paragraph are not case sensitive,
that punctuation is ignored (even if adjacent to words, such as "ball,"),
and that "hit" isn't the answer even though it occurs more because it is banned.
Example 2:
Input: paragraph = "a.", banned = []
Output: "a"
Constraints:
1 <= paragraph.length <= 1000
- paragraph consists of English letters, space
' '
, or one of the symbols:"!?',;."
. 0 <= banned.length <= 100
1 <= banned[i].length <= 10
banned[i]
consists of only lowercase English letters.
题目大意
给你一个字符串 paragraph
和一个表示禁用词的字符串数组 banned
,返回出现频率最高的非禁用词。题目数据 保证 至少存在一个非禁用词,且答案唯一 。
paragraph
中的单词 不区分大小写 ,答案应以 小写 形式返回。
示例 1:
输入: paragraph = "Bob hit a ball, the hit BALL flew far after it was hit.", banned = ["hit"]
输出: "ball"
解释:
"hit" 出现了 3 次,但它是禁用词。
"ball" 出现了两次(没有其他单词出现这么多次),因此它是段落中出现频率最高的非禁用词。
请注意,段落中的单词不区分大小写,
标点符号会被忽略(即使它们紧挨着单词,如 "ball,"),
并且尽管 "hit" 出现的次数更多,但它不能作为答案,因为它是禁用词。
示例 2:
输入: paragraph = "a.", banned = []
输出: "a"
提示:
1 <= paragraph.length <= 1000
paragraph
由英文字母、空格' '
、和以下符号组成:"!?',;."
0 <= banned.length <= 100
1 <= banned[i].length <= 10
banned[i]
仅由小写英文字母组成
解题思路
预处理输入:
- 首先将字符串
paragraph
中的所有标点符号替换为空格,以便提取出所有单词。 - 将
paragraph
中的所有字符转换为小写。 - 使用空格将处理后的
paragraph
分割成单词列表。
- 首先将字符串
建立单词计数:
- 将禁用词存入一个集合(
bannedSet
)中以加速查询。 - 遍历单词列表,跳过出现在
banned
数组中的单词。 - 使用一个哈希表(
wordCount
)记录每个单词出现的次数。 - 同时更新目前最高的频率,和频率最高的非禁用词。
- 将禁用词存入一个集合(
返回结果:返回频率最高的单词作为结果。
复杂度分析
时间复杂度:
O(n + m + k)
,通常可以近似为O(n)
。- 预处理替换标点符号的时间复杂度为
O(n)
,其中n
是字符串的长度。 - 分割单词的时间复杂度为
O(n)
。 - 统计频率遍历单词列表的时间复杂度为
O(m)
,其中m
是单词数量。 - 寻找最高频率遍历哈希表的时间复杂度为
O(k)
,其中k
是唯一单词的数量。
- 预处理替换标点符号的时间复杂度为
空间复杂度:
O(k + b)
,其中k
是唯一单词数量,b
是禁用词数量,使用的额外空间包括存储单词频率的哈希表和禁用词集合。
代码
/**
* @param {string} paragraph
* @param {string[]} banned
* @return {string}
*/
var mostCommonWord = function (paragraph, banned) {
// 预处理:过滤掉标点符号并转换为小写
const cleanParagraph = paragraph.toLowerCase().replace(/[!?',;.\"]/g, '');
const words = cleanParagraph.split(/\s+/);
// 将禁用词存入 Set 以加速查询
const bannedSet = new Set(banned);
// 统计单词频率
let wordCount = new Map();
let maxCount = 0,
res = null;
for (let word of words) {
if (word && !bannedSet.has(word)) {
const count = (wordCount.get(word) || 0) + 1;
wordCount.set(word, count);
// 更新出现次数最多的单词
if (count > maxCount) {
maxCount = count;
res = word;
}
}
}
return res;
};