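<aside> 📘 Some terms used below

character: a single character in the everyday sense, what a reader perceives as one unit of writing. Related terms: grapheme, glyph.
- grapheme / grapheme cluster: used when stressing that a single character may be made up of several code points
- glyph: the visual shape of a character once it is rendered
code point: the basic unit of Unicode, an integer usually written in hexadecimal (for example U+0041). A character consists of one or more code points.
- In Go, a rune can be thought of as a code point
byte: the unit of storage. Encodings such as ASCII and UTF-8 turn code points into sequences of bytes. </aside>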
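
To make the three levels in the glossary concrete, here is a minimal Python sketch. The helper name `describe` is made up for this note. It assumes Python 3, where `len()` on a `str` counts code points and `str.encode` produces bytes.

```python
def describe(text: str) -> dict:
    """Summarize a string at the code point level and the byte level."""
    return {
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
        "code_point_count": len(text),                 # str length counts code points
        "utf8_byte_count": len(text.encode("utf-8")),  # bytes after UTF-8 encoding
    }

# Both strings render as the same character, an "e" with an acute accent.
precomposed = "\u00e9"   # a single code point
decomposed = "e\u0301"   # the letter e followed by a combining acute accent

print(describe(precomposed))
print(describe(decomposed))
```

A reader sees one character in both cases, yet the second string holds two code points and three UTF-8 bytes. This is why character, code point, and byte are kept apart in the rest of the text.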

There Ain’t No Such Thing As Plain Text.

It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.
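
A small Python illustration of why the encoding matters: the same two bytes turn into different text depending on which encoding you assume. The byte values and variable names here are chosen only for illustration.

```python
# The same bytes, interpreted under three different assumed encodings.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: e with an acute accent
as_latin1 = raw.decode("latin-1")  # two characters: the classic garbled result

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False  # bytes above 0x7F are not valid ASCII

print(repr(as_utf8), repr(as_latin1), ascii_ok)
```

None of the three interpretations is recoverable from the bytes alone. The encoding has to be known from somewhere else.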

A lot has changed in 20 years. In 2003, the main question was: what encoding is this?

In 2023, it’s no longer a question: with a 98% probability, it’s UTF-8. Finally! We can stick our heads in the sand again!

The question now becomes: how do we use UTF-8 correctly? Let’s see!

What is Unicode?

Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.

In practice, Unicode is a table that assigns unique numbers to different characters.

For example:

The Latin letter A is assigned the number 65.
The Arabic Letter Seen س is 1587.
The Katakana Letter Tu ツ is 12484.
The Musical Symbol G Clef 𝄞 is 119070.
💩 is 128169.

Unicode refers to these numbers as code points.
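
The numbers above can be checked from Python, which exposes the Unicode table through the built-ins `chr` and `ord` and the standard `unicodedata` module. This is only a sketch for verifying the examples. The `U+XXXX` column is the conventional hexadecimal spelling of the same integers.

```python
import unicodedata

# Code points quoted in the text, as decimal integers.
code_points = [65, 1587, 12484, 119070, 128169]

rows = []
for cp in code_points:
    ch = chr(cp)  # code point -> character
    rows.append((ch, cp, f"U+{cp:04X}", unicodedata.name(ch)))

for ch, cp, hex_form, name in rows:
    assert ord(ch) == cp  # character -> code point round-trips
    print(f"{ch}\t{cp}\t{hex_form}\t{name}")
```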

Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.