Running local Python code on a remote GPU (using Modal and lob.py)

In a previous post I explored how to run nanoGPT on Modal – both the training and the sampling. It was successful, but tiresome: there were a lot of changes to the downloaded code, which made me unhappy. If I want to try out different projects from GitHub and elsewhere, I don’t want to be doing a lot of coding and fiddling just to get the existing code to run. I just want to run it as if I were running it locally. This is where my script lob.py comes in!

What is lob.py?

This is a Python script that provides a fairly easy way (all things considered!) to run your local code on a cloud GPU. It does this by running your code on Modal, handling some of the logistics for you, such as uploading your source code.

Let’s explore what it does by doing in one blog post what previously took one and a half: training and sampling nanoGPT on Modal.

Using lob.py to train and run nanoGPT

First, clone nanoGPT, and download lob.py from a public Gist (I may turn this into a full GitHub repo later, we will see).

git clone https://github.com/karpathy/nanoGPT
cd nanoGPT
wget -O lob.py https://gist.githubusercontent.com/mcapodici/eaa39861affe75be13badefbe9e05079/raw/bbc9e3cbb692277ffcf18406c61805685bf70d25/lob.py

Now set up a Python environment your favourite way. I will use venv in this example:

python3 -m venv .
source bin/activate

Now you might want to add the following to .gitignore to avoid having lots of changes show up (the exact entries might differ if you use another Python environment tool):

bin
lib
lib64
share

Now install modal and log in, using their standard instructions:

pip install modal-client
modal token new

We will now set up lob.py for our requirements. The version we downloaded is already set up for nanoGPT, but let’s review its parameters.

Setting up lob.py run parameters

The first one just selects which GPU you want to use. For nanoGPT, the cheapest one, t4, is plenty for the task:

# Choose one of: "t4", "a10g", "inf2", "a100-20g", "a100" or None
gpu="t4"

Next we define the commands to run. These are executed after copying all the local code files to the server and changing directory into that folder. Here we have a single command for each stage, but you can have multiple (see the example after the block below).

commands={
    'prepare': ['python data/shakespeare_char/prepare.py'],
    'train': ['python train.py config/train_shakespeare_char.py'],
    'sample': ['python sample.py --out_dir=out-shakespeare-char'],
}
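
For example (a hypothetical variant, not what the downloaded file contains), the prepare stage could run a quick environment check before building the dataset:

commands={
    'prepare': [
        'pip list',                                  # e.g. inspect the remote environment first
        'python data/shakespeare_char/prepare.py',   # then build the dataset
    ],
    'train': ['python train.py config/train_shakespeare_char.py'],
    'sample': ['python sample.py --out_dir=out-shakespeare-char'],
}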

Now we set verbose, which tells us what files are being uploaded; the name of the volume (so that we can keep this project’s files separate); a timeout of 60 minutes, after which Modal will terminate the job; and a list of paths not to upload:

verbose=True
volume_name_prefix="2023-07-27-10-45"
timeout_mins=60
exclude_paths_starting_with=["./.git", "./.github", "./bin", "./lib", "./share"]

Finally we define the image, which describes how the container that runs the program will be set up. rsync is needed because it is used to copy up the right files (without losing generated files on the server). In addition, we need the pip install defined in the README.md of the nanoGPT project:

image = modal.Image \
    .debian_slim() \
    .apt_install("rsync") \
    .pip_install("torch numpy transformers datasets tiktoken wandb tqdm".split(" "))
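
For a different project you would swap these installs for whatever that project’s README asks for, keeping rsync. As a purely hypothetical example, a project needing ffmpeg and a couple of audio libraries might use something like:

image = modal.Image \
    .debian_slim() \
    .apt_install("rsync", "ffmpeg") \
    .pip_install("librosa", "soundfile")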

Train and run using lob.py

With all the setup done, running is very simple. Just run these commands one after the other. They correspond to the instructions in nanoGPT’s README.md:

modal run lob.py --command prepare
modal run lob.py --command train
modal run lob.py --command sample

Here is some output from the final phase:

ISABELLA:
This is the day of this is your land;
But I have been call'd up him been your tent?

DUKE VINCENTIO:
How far of the solemnity? who is wrong'd?
Why should we shame an arms stoop of life?
They will prove his like offence with life
And to be crave a happy model's guilty of his cheeks;
For all his foes, that are gone of me.


Notes about how lob.py works

The script works by doing the following:

  • There is a function called copy (I should probably have called it copy_and_run) that runs on the remote machine and copies all of the changed files from your local file system onto the remote machine.
  • To this function we bind 2 directories that appear on the remote system:
    • /source/code is a mount that is a copy of your local folder, except for the folders mentioned in exclude_paths_starting_with. This is a (I think) temporary and (for sure) read-only folder.
    • /root/code which is a “network file system” which has been set up as persistent and read/write.
  • The copy function uses rsync to copy everything that has changed from the mount to the persistent file system. This means that future runs are quicker (they only need to copy changed files) and the running code can save data, such as model snapshots, and recover them on future runs.
  • Once copy is done with the rsync, it changes directory into the /root/code folder and then runs your commands. A rough sketch of this mechanism follows below.
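
To make that concrete, here is a rough sketch of how such a script can be wired together with Modal. This is not the actual source of lob.py; the names (copy_and_run, main, "lob-sketch") are my own, and the exact Modal API calls vary between client versions, so treat it as an illustration of the mechanism rather than a drop-in replacement:

import subprocess
import modal

stub = modal.Stub("lob-sketch")

# Read-only snapshot of the local folder, mounted at /source/code on the container.
code_mount = modal.Mount.from_local_dir(
    ".",
    remote_path="/source/code",
    condition=lambda path: not path.startswith(("./.git", "./bin", "./lib", "./share")),
)

# Persistent, writable network file system, mounted at /root/code.
persistent = modal.NetworkFileSystem.persisted("lob-sketch-code")

@stub.function(
    gpu="t4",
    timeout=60 * 60,
    mounts=[code_mount],
    network_file_systems={"/root/code": persistent},
    image=modal.Image.debian_slim().apt_install("rsync"),
)
def copy_and_run(cmds: list):
    # Sync only the changed files from the read-only mount into the persistent
    # volume, so generated files (datasets, checkpoints) survive between runs.
    subprocess.run(["rsync", "-a", "/source/code/", "/root/code/"], check=True)
    # Then run the requested commands from inside the persistent copy.
    for cmd in cmds:
        subprocess.run(cmd, shell=True, cwd="/root/code", check=True)

@stub.local_entrypoint()
def main(command: str = "prepare"):
    # `commands` would be the dict shown earlier in this post.
    commands = {"prepare": ["python data/shakespeare_char/prepare.py"]}
    copy_and_run.remote(commands[command])  # older modal-client versions used .call()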

Final note

Hope this was useful and interesting. I'd appreciate it if you drop your name/email below so I can keep you up to date on new posts. I am aiming for approximately 1-2 a fortnight, but that's not a promise! It's a similar idea to subscribing to a Substack.
