<aside> 📘 下文中会提到的一些术语

Twenty years ago, Joel Spolsky wrote:

There Ain’t No Such Thing As Plain Text.

It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

A lot has changed in 20 years. In 2003, the main question was: what encoding is this?

In 2023, it’s no longer a question: with a 98% probability, it’s UTF-8. Finally! We can stick our heads in the sand again!

The question now becomes: how do we use UTF-8 correctly? Let’s see!

What is Unicode?

Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.

In practice, Unicode is a table that assigns unique numbers to different characters.

For example:

Unicode refers to these numbers as code points.

Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.