<aside> 📘 Series:

Beginner’s Guide on Recurrent Neural Networks with PyTorch

A Brief Introduction to Recurrent Neural Networks

Attention Mechanism

Attention? Attention!

Illustrated Guide to Transformers- Step by Step Explanation

How to code The Transformer in PyTorch

Tokenizers: How machines read

</aside>


The world of Deep Learning (DL) Natural Language Processing (NLP) is evolving at a rapid pace. We tried to capture some of these trends in an earlier post, which you can check out if you want more background on these developments. Two of the most important trends are the Transformer architecture (2017) and the BERT language model (2018), the most famous model to take advantage of that architecture. These two developments, in particular, have been pivotal in helping machines perform much better at a wide range of language-reading tasks.

A vital part of these new developments is how they consume the text they need in order to perform well at complex linguistic tasks. We often skip this part of the process in order to get to the “meatier” core of the cool new model. It turns out that these input steps are a whole research field of their own, with an abundance of complex algorithms all working simply to enable the larger language models to learn higher-level linguistic tasks. Think of it like the sorting algorithms needed to order vast arrays of numbers: while you only call the sort function in your Python script, there is a whole industry whose sole goal is to design better and better sorting algorithms. But before we dive into the specifics of these algorithms, we first need to understand why reading text is a difficult task for a machine.

Why is reading difficult for machines?

You could understand language before you learned to read. When you started school you could already talk to your classmates even though you didn’t know the difference between a noun and a verb. After that, you learned to turn your phonetic language into a written language so that you could read and write. Once you had learned to turn text into sounds, you were able to access your previously learned bank of word meanings.

Computers, whether language models (LMs) or lookup programs such as WordNet, do not learn to speak before they learn to read, so they cannot lean on a previously acquired bank of word meanings. They need to find another way of discovering what words mean.

Machines don’t have this phonetic head start. They know nothing about language, so we need to develop systems that enable them to process text without being able, as humans are, to associate sounds with the meanings of words. It’s the classic “chicken and egg” problem: how can a machine start processing text if it knows nothing about grammar, sounds, words or sentences? You can write rules that tell a machine how to perform a dictionary-type lookup over text. In that scenario, however, the machine isn’t learning anything, and you would need a static dataset covering every possible combination of words and all their grammatical variants.
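To make “dictionary-type lookup” concrete, here is a minimal sketch of what such a static lookup looks like, assuming the NLTK interface to WordNet (the lookup program mentioned above); the word queried is just an example:

```python
# A static, rule-based lookup: the program retrieves stored, human-written
# definitions, but it learns nothing from performing the lookup.
# Assumes NLTK is installed and the WordNet data has been fetched with
# nltk.download("wordnet").
from nltk.corpus import wordnet as wn

for synset in wn.synsets("read"):
    # Each synset is a pre-compiled dictionary entry, not something the
    # machine inferred from reading text.
    print(synset.name(), "->", synset.definition())
```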

Instead of training a machine to look up fixed dictionaries, we want to teach machines to recognize and “read” text in such a way that the act of reading is itself something they can learn from. In other words, the more a machine reads, the more it learns. Humans do this by leveraging the phonetic sounds they learned earlier. Machines don’t have this knowledge to leverage, so they need to be told how to break text into standard units in order to process it. They do this using a system called “tokenization”, in which sequences of text are broken into smaller parts, or “tokens”, and then fed as input into a DL NLP model like BERT. But before we look at the different ways we can tokenize text, let’s first see if we really need to use tokenization at all.
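As a quick preview before that question, here is roughly what tokenizing a sentence looks like with an off-the-shelf subword tokenizer. This is a minimal sketch assuming the Hugging Face transformers library and its pretrained bert-base-uncased tokenizer; the choice of tool here is illustrative, not prescribed by anything above.

```python
# Break a sentence into subword tokens and the integer IDs that a model
# like BERT actually consumes. Assumes the `transformers` package is installed.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers help machines read."
tokens = tokenizer.tokenize(text)              # subword pieces; '##' marks a
                                               # piece that continues a word
ids = tokenizer.convert_tokens_to_ids(tokens)  # integer IDs from BERT's vocabulary

print(tokens)
print(ids)
```

In BERT’s WordPiece scheme the “##” prefix marks a fragment that continues the previous token, which is how a fixed-size vocabulary can still represent rare or unseen words.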

Do We Need Tokenizers?

To teach a DL model like BERT or GPT-2 to perform well at NLP tasks we need to feed it lots and lots of text. The hope is that, through the specific design of the architecture, the model will learn some level of syntactic or semantic understanding. Exactly what level of semantic understanding these models acquire is still an area of active research. It is thought that they learn syntactic knowledge at the lower levels of the neural network and semantic knowledge at the higher levels, as they begin to home in on more specific language-domain signals, e.g. medical vs. technical training texts.

The specific type of architecture used will have a significant impact on which tasks the model can deal with, how quickly it can learn and how well it performs. GPT-2 uses a decoder-style architecture, for example, since its task is to predict the next word in a sequence. In contrast, BERT uses an encoder-style architecture, since it is trained for a larger range of NLP tasks like next-sentence prediction, question answering and classification. Regardless of how they are designed, they all need to be fed text via their input layers to perform any type of learning.

The above diagram shows that we can tokenize input text in different ways.
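For example, the same text can be split at the word level, the character level, or somewhere in between. The sketch below is plain Python; the subword split shown is written by hand for illustration rather than produced by a trained tokenizer.

```python
text = "unbelievably fast"

# Word-level: split on whitespace.
word_tokens = text.split()            # ['unbelievably', 'fast']

# Character-level: every character becomes a token.
char_tokens = list(text)              # ['u', 'n', 'b', 'e', ...]

# Subword-level: frequent fragments become tokens. A real subword tokenizer
# learns these splits from data; this one is hand-written to illustrate.
subword_tokens = ["un", "believ", "ably", "fast"]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```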

One simple way to do this would be to feed the text into the model exactly as it appears in your training dataset. This sounds easy, but there is a problem: we need to find a way to represent the words mathematically so that the neural network can process them.
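One straightforward (if naive) way to do that is to give every distinct word in the training data its own integer index in a vocabulary. The sketch below is an assumed, minimal illustration; real models go further, for example by turning these indices into one-hot vectors or learned embeddings.

```python
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

# Build a vocabulary: every distinct word gets an integer ID.
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Represent a new sentence as the sequence of IDs a network could consume.
sentence = "the cat sat on the rug"
ids = [vocab[word] for word in sentence.split()]

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'rug': 6}
print(ids)    # [0, 1, 2, 3, 0, 6]
```

Even this tiny scheme hints at the problem: any word that never appeared in the training data has no ID at all, which is one reason more flexible tokenization schemes are needed.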