393. UTF-8 编码验证

🟠 🔖 位运算 数组 🔗 力扣open in new window LeetCodeopen in new window

题目

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

    Number of Bytes   |        UTF-8 Octet Sequence
                      |              (binary)
  --------------------+-----------------------------------------
           1          |   0xxxxxxx
           2          |   110xxxxx 10xxxxxx
           3          |   1110xxxx 10xxxxxx 10xxxxxx
           4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:

Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

Input: data = [235,140,4]
Output: false
Explanation: data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Constraints:

1 <= data.length <= 2 * 10^4
0 <= data[i] <= 255

题目大意

给定一个表示数据的整数数组 data ，返回它是否为有效的 UTF-8 编码。

UTF-8 中的一个字符可能的长度为 1 到 4 字节 ，遵循以下的规则：

对于 1 字节 的字符，字节的第一位设为 0 ，后面 7 位为这个符号的 unicode 码。
对于 n 字节 的字符 (n > 1)，第一个字节的前 n 位都设为 1，第 n+1 位设为 0 ，后面字节的前两位一律设为 10 。剩下的没有提及的二进制位，全部为这个符号的 unicode 码。

这是 UTF-8 编码的工作方式：

    Number of Bytes   |        UTF-8 Octet Sequence
                      |              (binary)
  --------------------+-----------------------------------------
           1          |   0xxxxxxx
           2          |   110xxxxx 10xxxxxx
           3          |   1110xxxx 10xxxxxx 10xxxxxx
           4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x 表示二进制形式的一位，可以是 0 或 1。

注意： 输入是整数数组。只有每个整数的 最低 8 个有效位 用来存储数据。这意味着每个整数只表示 1 字节的数据。

示例 1：

输入： data = [197,130,1]
输出： true
解释： 数据表示字节序列:11000101 10000010 00000001 。
这是有效的 utf-8 编码，为一个 2 字节字符，跟着一个 1 字节字符。

示例 2：

输入： data = [235,140,4]
输出： false
解释： 数据表示 8 位的序列: 11101011 10001100 00000100.
前 3 位都是 1 ，第 4 位为 0 表示它是一个 3 字节字符。
下一个字节是开头为 10 的延续字节，这是正确的。
但第二个延续字节不以 10 开头，所以是不符合规则的。

提示:

1 <= data.length <= 2 * 10^4
0 <= data[i] <= 255

解题思路

初始化变量 count 表示当前字符所需的后续字节数。
遍历 data 数组，根据当前 count 判断字节类型：
- 若 count == 0，判断是否是单字节字符或多字节字符的起始字节：
  - 0xxxxxxx: 单字节字符。
  - 110xxxxx: 两字节字符起始，设置 count = 1。
  - 1110xxxx: 三字节字符起始，设置 count = 2。
  - 11110xxx: 四字节字符起始，设置 count = 3。
  - 其他情况直接返回 false。
- 若 count > 0，验证当前字节是否以 10 开头。如果不满足，返回 false。
遍历结束时，若 count == 0 则验证成功，否则返回 false。

复杂度分析

时间复杂度：O(n)，其中 n 是数据长度，需要遍历每个字节，进行前缀检查和验证。
空间复杂度：O(1)，使用常数空间存储变量。

代码

/**
 * @param {number[]} data
 * @return {boolean}
 */
var validUtf8 = function (data) {
	let count = 0;
	for (let byte of data) {
		if (count == 0) {
			if (byte >> 7 == 0) continue;
			else if (byte >> 5 == 0b110) count = 1;
			else if (byte >> 4 == 0b1110) count = 2;
			else if (byte >> 3 == 0b11110) count = 3;
			else return false;
		} else {
			if (byte >> 6 != 0b10) return false;
			count--;
		}
	}
	return count == 0;
};

393. UTF-8 编码验证

# 393. UTF-8 编码验证

# 题目

# 题目大意

# 解题思路

# 复杂度分析

# 代码

393. UTF-8 编码验证

题目

题目大意

解题思路

复杂度分析

代码