最近遇到几个项目被MySQL的utf8编码坑,想起之前编码问题被坑的惨痛教训,记录一下,警示自己。
曾几何时,每次建库都选utf8,觉得自己比那些用乱七八糟编码的人不知道酷到哪里去了。直到好多年前的某次课程设计做项目的时候,愉快的建了个用户表:
CREATE TABLE `test_user` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(32) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
然后愉快的新增用户:INSERT INTO test_user(name) VALUES("我是"),接着愉快的反思人生:
Incorrect string value: '\xF0\x9F\x98\x81' for column 'name' at row 1
我是谁?我来自哪里?我在干嘛?难道是我代码里面的字符集用错了?不对啊我所有地方都用的utf8啊……
# MySQL的UTF8编码是什么?
首先来看官方文档:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplementary characters:
For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it. Because utf8 cannot store the character at all, you have no supplementary characters in utf8 columns and need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
我们再看看维基百科对UTF8编码的解释:
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
可以看出,MySQL中的utf8实质上不是标准的UTF8。MySQL中,utf8对每个字符最多使用三个字节来表示,所以一些emoji甚至是一些生僻汉字就存不下来了,比如“