Modal.com and NanoGPT continued: producing output; using tiktoken for bigger tokens

In the previous post we explored how to get NanoGPT training on Modal. There was quite a bit to that, so I left the text generation part for this post to cap things off. Let's do that now, and then try some more things out with NanoGPT.

Let’s make some Shakespam

With all the setup work done in the first post, generating text on Modal will be much easier.

The repo code that generates text is sample.py, and we just need a script to hook into that and run it in Modal, which is this (sample_modal.py):

import modal

# Make sure we have access to the data we prepared earlier:
volume = modal.NetworkFileSystem.new().persisted("nano-gpt-volume")

# Set up the container for running the training, and make sure it has the necessary
# python packages installed.
stub = modal.Stub("nano-gpt-sample",
    image=modal.Image.debian_slim().pip_install(
        ["torch", "numpy", "transformers", "datasets", "tiktoken", "wandb", "tqdm"]
    )
)

# This stub.function allows sample_modal to be called remotely on Modal's servers. We will
# now specify how we want that set up...
@stub.function(
        # Ensure that the function runs with a GPU. I have picked out a cheap one, but you can
        # replace this with "any" in the future if this GPU is no longer available.
        gpu=modal.gpu.T4(), 

        # Increase the timeout to allow long training times.
        timeout=3600, 

        # This tells modal to upload the entire nanogpt package we created. Without doing
        # this it won't be able to locate sample.py, model.py etc.
        mounts=[modal.Mount.from_local_python_packages("nanogpt")],
        
        # Mount the data we prepared earlier
        network_file_systems={"/root/data": volume}
        )
def sample_modal():
    # This import is a cheeky and quick way to run nanogpt with minimal changes to Andrej's code. Ideally we would change
    # the `sample` module to expose a function, then import `sample` and call that function.
    import nanogpt.sample

# This is what gets called locally when running `modal run sample_modal.py`, and it just calls the 
# remote function.
@stub.local_entrypoint()
def main():
    sample_modal.call()

Then to run it:

modal run sample_modal.py

The result of this is long, and is shown in the expander below. I think this is really impressive:

Shakespeare Output (click to expand)

It amazes me that we can get computers, which are purely logical, to do stuff like this at all. For perspective, my first computer was an Acorn Electron with 32 KB of RAM (a millionth of a decent laptop nowadays).

Another reason this is amazing is the step-change that the transformer model (the T in GPT) gives you over the other models shown in the zero-to-hero course. It is not just computing power that does this, but the research into new models that has happened over the last 20 years or so.

Turning up the temperature

Andrej included a temperature setting, which allows you to adjust the “randomness” of the output:

  • If set very close to zero, it will produce the same output each time. This is the output it considers “most likely”.
  • If set to 1, it will produce output with the probabilities it predicts; for example, if it decides, based on training, that there is an 80% chance of an o coming next and a 15% chance of a d, then it will produce an o 80% of the time.
  • If set higher, the probabilities will move closer together, giving less likely characters more chance of appearing.

The chart below (link to Google sheet) shows how increasing the temperature makes the probabilities of 3 potential “next characters” move closer together, while decreasing it causes the preferred outcome to always be picked as the winner:
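
To make that concrete, here is a small sketch (my own illustration, not code from the nanoGPT repo) of how temperature typically scales the raw scores (logits) before sampling:

import numpy as np

def sample_next(logits, temperature=1.0):
    # Temperature < 1 sharpens the distribution (the favourite wins nearly every
    # time); temperature > 1 flattens it, giving less likely tokens more of a chance.
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Three candidate "next characters", with raw scores favouring the first one:
logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    counts = np.bincount([sample_next(logits, t) for _ in range(1000)], minlength=3)
    print(f"temperature={t}: candidates picked {counts.tolist()} times out of 1000")

At a temperature of 0.1 the first candidate wins almost every time; at 2.0 the counts are much closer together.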

Let’s try a temperature of 2 by adding this line to train_shakespeare_char.py:

temperature = 2

Here is a small sample of the output I got. It is definitely more chaotic!

HASTINMBSABURY:
Stir-3 Sleep, haugs:
Warthy, usquick..tWarwiXl!
Hatensworn my feans?
You know,
Young, tof it is!
BAmilind!

A low temperature of 0.1 gives us this, which seems more coherent, but much more “stuck like a record”:

CORIOLANUS:
I will be so so much a part of the people,
And then the way of the common of the court,
And then the way of the people of the court,
And the prince of the people of the court,
Which we have stood of the prince of the people,
And the princely of the streets of the state,
Which we have stood to the body of the sea,

I think the default temperature of 0.8 was probably “just right” like the porridge!

Using tiktoken for better encoding of the text

Tiktoken is a tokenizer library used by OpenAI. Its job is to turn a sentence into a sequence of numbers (tokens), which can then be used to train the model. It does this using an algorithm that encodes the most frequent words as single tokens, while less frequent words are broken into multiple tokens, each representing a word part.

Until now we have been training by converting each character to a number. The problem with this is that we are not making good use of the structure already present in English: words and parts of words would let us work with numbers that carry more meaning.

Tiktoken offers a choice of the pre-built tokenizers OpenAI uses in their models, and Andrej uses the gpt2 one. To give an idea of what this does, here is some code that encodes a sentence with tiktoken and then prints the resulting tokens:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for tok in enc.encode("Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'"):
    print(f'{str(tok).ljust(5)} : {enc.decode([tok])}')

Here is the result:

Tiktoken encoding output (click to expand)

What I find interesting here is that "and" and " and" are different tokens: 392 and 290. It is also interesting that most tokens here are whole words. “Peeped” is the odd one out that got split up.
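
You can check that difference directly with a couple of lines (the token ids are the ones shown above):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("and"))    # [392] - "and" at the start of a string
print(enc.encode(" and"))   # [290] - " and" with a leading space, as it appears mid-sentence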

To train the model using the tiktoken encoding we need to run the prepare.py file in the shakespeare folder (as opposed to the shakespeare_char folder we used last time).
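
For a rough idea of what that prepare step does, here is a simplified sketch (not the exact repo code): read the text, encode it with the gpt2 tokenizer, and write the token ids out as binary files for train.py to memory-map.

import numpy as np
import tiktoken

with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

enc = tiktoken.get_encoding("gpt2")
n = len(data)
train_ids = enc.encode_ordinary(data[: int(n * 0.9)])  # first 90% for training
val_ids = enc.encode_ordinary(data[int(n * 0.9):])     # last 10% for validation

# GPT-2's vocabulary (50257 tokens) fits in uint16, keeping the files compact.
np.array(train_ids, dtype=np.uint16).tofile("train.bin")
np.array(val_ids, dtype=np.uint16).tofile("val.bin")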

Training with the Tiktoken encoding

There are a few things I had to do to get this to work. It got a bit messy, so I won’t share the code here, but I aim to put something better up on GitHub eventually. In short, I had to:

  • Change the GPU to an A100 (20 GB) to have a chance of training it in a reasonable time
  • Because Modal has “regions”, this also meant changing the volume name, so it could create a new volume near that GPU’s region
  • And this meant changing all the Modal calls to specify the A100 20 GB GPU so they would be in the same region
  • I also changed the parameters: I reduced the batch size from 256 to 64, since the tokens now carry more meaning than before, so we can do with fewer, but I increased the embedding size from 384 to 384 * 4, since we might need more dimensions to represent a word (see the sketch just below this list)
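
In nanoGPT config terms, those parameter changes look something like this (a sketch of the values described above, not the exact, messier config I actually ran):

batch_size = 64    # reduced from 256: each token now carries more meaning, so fewer are needed
n_embd = 384 * 4   # increased from 384: more dimensions to represent a whole word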

With all of that done, here are the results I got. There is a lot more text than last time because the number of tokens generated is the same as before, but each token is now a word or word part rather than a single character:

Tiktoken-trained Shakespeare output (click to expand)

Training costs were $0.71 for GPU and $0.09 for other stuff. It took almost bang on 1 hour to train. Inference (generating text) took a few seconds.
