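<aside> 📘 Some terms used below
character: a single character in the natural, human sense. Synonyms: grapheme, glyph.
grapheme / grapheme cluster: specifically, a character made up of multiple code points.
glyph: specifically, a single symbol as rendered.
code point: the unit of Unicode encoding, an integer conventionally written in hexadecimal. A character may consist of one or more code points.
rune: can be understood as a code point.
byte: encodings such as ASCII and UTF-8 encode code points into binary, as bytes.
</aside>
Twenty years ago, Joel Spolsky wrote: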
There Ain’t No Such Thing As Plain Text.
It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.
A lot has changed in 20 years. In 2003, the main question was: what encoding is this?
In 2023, it’s no longer a question: with a 98% probability, it’s UTF-8. Finally! We can stick our heads in the sand again!
The question now becomes: how do we use UTF-8 correctly? Let’s see!
Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.
In practice, Unicode is a table that assigns unique numbers to different characters.
For example:
A is assigned the number 65.
س is 1587.
ツ is 12484.
𝄞 is 119070.
💩 is 128169.
Unicode refers to these numbers as code points.
Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.
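The character-to-number mapping above can be checked directly in most languages. Here is a minimal Python sketch using the built-in `ord` and `chr` functions; the `samples` table simply repeats the values listed earlier, and `code_point_label` is a hypothetical helper introduced only for illustration.

```python
# Characters and code point values copied from the list above.
samples = {"A": 65, "س": 1587, "ツ": 12484, "𝄞": 119070, "💩": 128169}


def code_point_label(ch: str) -> str:
    """Format a single code point in the conventional U+XXXX notation.

    Hypothetical helper for this example, not part of any library.
    """
    return f"U+{ord(ch):04X}"


for ch, expected in samples.items():
    assert ord(ch) == expected  # character -> code point
    assert chr(expected) == ch  # code point -> character
    print(ch, ord(ch), code_point_label(ch))
```

Note that `ord` accepts exactly one code point, so this only works for characters that consist of a single code point, as all five samples do.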