Production tests: a guidebook for better systems and more sleep

Your customers expect your site to be fully working whenever they need it. This means you need to aim for near-perfect uptime not just for the site itself, but for all features customers may use.

Modern software engineering uses quality control measures such as automated test suites and observability tools (tracing, metrics, and logs) to ensure availability. Often overlooked in this landscape are production tests (also known as synthetics), which can give you immediate notification of failures in production.

Production tests can be set up with minimal fuss—usually within one sprint—and can provide a high return on investment. In this post, I will cover how best to set up production tests and how they can help with reliability, deployments, and observability.

While I have always liked production tests, I gained a real appreciation for them at Atlassian, where they are used extensively and are called “pollinators”. I have seen first hand how they can give early warnings of problems, which can then be fixed before they become incidents.

What are production tests?

A production test is any automated test that runs against the production environment. The test runs on a frequent schedule so that an on-call engineer can respond quickly; typically, every minute. The test might use a headless browser to emulate user actions, or it may call an API directly to emulate the actions of browser code or a backend service.

The production test should run in a reasonable time. I suggest 30 seconds or less, so that you can comfortably run the test once per minute. A test that takes longer than 30 seconds is probably too complex for a production test anyway. How the test deals with failure is up to your team. It could integrate with your on-call paging system, send a Slack notification, or just log an error to your logging system.
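
To make this concrete, here is a minimal sketch of what such a test might look like in Python. Everything specific in it is a placeholder: the base URL, the /api/me endpoint, the test token, and the alerting hook at the bottom would all be swapped for whatever your service and on-call tooling actually use.

import requests

BASE_URL = "https://example.com"   # placeholder: your real service
TEST_TOKEN = "replace-me"          # placeholder: credentials for a dedicated test account

def production_test() -> None:
    # Hypothetical check: the logged-in user's profile comes back with the expected name.
    response = requests.get(
        f"{BASE_URL}/api/me",
        headers={"Authorization": f"Bearer {TEST_TOKEN}"},
        timeout=10,
    )
    assert response.status_code == 200, f"unexpected status {response.status_code}"
    assert response.json().get("name") == "Production Test User", "unexpected profile name"

if __name__ == "__main__":
    try:
        production_test()
        print("production test passed")
    except Exception as exc:
        # Hook in your alerting of choice here: page the on-call engineer,
        # post to a Slack channel, or just write to your logging system.
        print(f"PRODUCTION TEST FAILED: {exc}")
        raise

Whatever runs the test on a schedule (cron, a CI job, or a synthetic-monitoring tool) takes care of the once-per-minute cadence.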

How do production tests help?

Production tests help make your production environment more reliable by giving immediate warning of a regression. This means you can potentially fix issues before a customer discovers them.

In addition, the production test can be used as a canary before deployments, and as such acts as an integration test. The test can detect regressions that are caused by mismatches with other services, such as API shape or error-handling issues. You can run the test when deploying to development and staging environments too, to get a warning of an issue during development.

You also make it easier to debug production issues by having production tests. If you are investigating an issue, knowing which production tests are passing and which are failing gives you insight into what the issue might be. If you rely on other teams’ services, and they also have production tests, you can look at their tests too to help with diagnostics.

Having production tests will reduce the time to recovery for any incident that requires a human to fix it, as the engineer learns about the issue sooner and has more information at hand to resolve it.

Important design considerations for production tests

For production tests to be worth having, they need to be well thought out. If a test keeps failing, it will probably get silenced and ignored. Even if the test is reliable, it can cause other problems, such as resource usage impacting downstream systems. You may even have to change your systems to be a bit more testable. Here are some tips to consider when setting up your production tests:

Keep your production tests basic

Your production tests should cover less ground than your automated test suites. You want the tests to be reliable enough that they don’t waste your time with false alerts, which lead to frustration, lost time, and possibly disabled alerts. The goal of a production test is to be the canary that tells you something has gone badly wrong. It gives you a head start on fixing it, hopefully before a customer is affected.

To illustrate the required simplicity, here are some examples of what I think are good candidates for a production test:

  • Log in, and confirm you are on the home page, showing that user’s name.
  • Load the main editor of your app and type in Hello. Confirm Hello is there. Reload the browser and confirm Hello is still there.
  • Call 4 API endpoints to do CRUD operations for your microservice, possibly using some fake data in the API.
  • Ping /health and check for a 200 response code within 250ms (a sketch of this one follows below).
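
To make the last of these concrete, here is a minimal sketch in Python. The URL is a placeholder and the requests library is assumed to be available.

import time
import requests

start = time.monotonic()
response = requests.get("https://example.com/health", timeout=5)  # placeholder URL
elapsed_ms = (time.monotonic() - start) * 1000

assert response.status_code == 200, f"health check returned {response.status_code}"
assert elapsed_ms < 250, f"health check took {elapsed_ms:.0f}ms, budget is 250ms"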

Contrast this with these more complex and problematic production tests:

  • Run through a 25-step test that checks all the main functionality of the editor, asserting that elements are the correct size and are in the correct position.
  • Test that if you make calls in quick succession to the API, they will be added to the database in the same order.
  • Check that the credit card page can differentiate between Visa and Mastercard based on the card number.
  • Ping /health and check for a 200 response code within 1.5ms

The first examples are good because they are quick, reliable, and simple. They are not too functionality-specific, and so are unlikely to be broken by feature changes.

The bad examples on the other hand have problems:

The 25-step test will likely be flaky due to browser automation quirks and feature changes. You will spend a lot of time investigating failures of this test, only to conclude “It was just a one-off” or “Oh! It’s because we moved a button”.

The API calls in quick succession will sometimes fail due to network conditions changing the order in which the requests are received. It is not necessary to do something sophisticated in a production test—your goal is to know if something is horribly broken, not detect subtle timing issues.

The credit card test is probably OK in terms of not being flaky, but is a little too specific. You can almost entirely de-risk a bug in credit card UI behavior with tests that run before deployment, so do that instead.

The health check test expecting a fast response is likely to fail often enough to cause alarms that have no meaningful response. There are better ways to monitor latency in production that I will cover in a future post.

But… try to get some decent coverage

You are not aiming for anything near 100% code coverage; in fact, code coverage doesn’t really matter here. That said, production tests should cover more than just “load the home page”. This is an art, but a guiding question is “if there is a serious problem, how likely are my production tests to detect it?” You are balancing that against “how frequently will I get false reports of an outage?”

You also may want to think about the value of the things you are testing. Is it a minor feature used by 1% of users, or is it the page where new customers sign up? The latter is particularly important for growth companies that rely heavily on customer self-service to try out their software.

You don’t have to get this completely right from the beginning. You can add more tests next week. Furthermore, you can edit your existing tests, or even remove them too. Err on the side of too little coverage to begin with and consider adding more later. This way your team will get used to owning, tweaking and responding to production tests, without a cacophony of alarm bells!

As a rough guide, I would say test 3-5 simple things to begin with, with the goal to eventually move towards a reasonable amount of coverage. What is reasonable? You’ll find out after trying things out, and discussing outcomes with your team. There is no correct answer here.

Production tests are not health checks, but may overlap with them

A health check is usually a simple check that the server is running. For example, if using Node.js and Koa or Express, you might add a health route that just returns a success response on any invocation.

The purpose of this check is to assess basic server health. It allows load balancers to know which nodes to send traffic to, and deployment pipelines to know when a deployment has completed (or failed).

We expect these health checks to fail in production from time to time, possibly without affecting the customer. For example, if a machine goes offline, the load balancer will detect this and stop sending traffic to that node. If there is no adverse effect on the consuming service or customer, then there is no urgent problem. The system has self-healed.

Calling this health check on a node as a production test is not advisable, as it will cause false alarms. Even if it didn’t, a health check is not a good indicator of user experience.

The term “health check” is also sometimes used to refer to checks that do more than just verify the server is up. For example, they may check dependencies such as storage systems, caches, queues and so on. A production test that calls such a health check could give early warning of an issue before it becomes noticeable to customers.
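
To illustrate the difference, here is a rough sketch of both kinds of check. The post mentions Node with Koa or Express; this version uses Python and Flask purely for illustration, and check_database/check_cache are hypothetical stand-ins for pings against your real dependencies.

from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # e.g. run "SELECT 1" against your primary database
    return True

def check_cache() -> bool:
    # e.g. send a PING to your cache
    return True

@app.route("/health")
def basic_health():
    # The simple "server is up" check used by load balancers and deploy pipelines.
    return jsonify({"ok": True}), 200

@app.route("/health/deep")
def deep_health():
    # The deeper variant: also verify that key dependencies respond.
    checks = {"database": check_database(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    return jsonify(checks), status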

Be mindful of how your production tests affect observability

Running a test every minute, or 1440 times a day, will show up quite a lot in logs, metrics, and traces. This is often a good thing, because regions or services with very low traffic are now a bit more “observable” than they would be without such tests.

The downside is that it may add costs by keeping resources spun up in those regions. Another downside is that synthetic traffic, which sometimes requires fake data (such as a fake user ID), can add noise to your logs that you will sometimes need to filter out.

Fake data considerations

Consider a monolithic application — a single server that serves a website and all of its functionality. You need to write a test that logs in, enters some data in a field, saves it and checks it got saved. Such a test has a few challenges.

Firstly, you need to decide how the test logs in. Does it have a real account? If it does, what stops that account from expiring (e.g., when a free trial ends)? You may need a discount code or a special “fake credit card” that the system knows is for testing. If it is a “fake” account, then how does that work? Is it hard-coded somewhere?

The test is now generating data on each run. Will the success or failure of a previous run affect the current run of the test? Will running the test thousands of times a week use up storage space?

Think about these issues when planning the tests.

The temptation is to do the easiest thing to get it done: create an account using your staff email, worry about storage later, just get it working. This can come back to haunt you if the test breaks later and you need to deal with upgrading the account, or the member of staff whose email was used is on leave.

Making a production system testable may take some work and require special switches, such as user fields or feature flags, so that behavior can differ slightly to facilitate the test.

If you use microservices, or a monolith with a separate auth service, there are ways to avoid needing a test user account. For example, if your service uses a JSON Web Token (JWT) to authenticate, the authentication server could support logins for test systems, and your service can declare which test systems it authorizes to call it.
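
As a sketch of that idea, assuming PyJWT and made-up claim names (synthetic and test_suite); your auth server’s actual token shape will differ.

import jwt  # PyJWT

SHARED_SECRET = "replace-me"  # placeholder: key or public key from your auth server

def is_authorized_synthetic_caller(token: str) -> bool:
    """Return True if the JWT comes from a test system this service trusts."""
    try:
        claims = jwt.decode(token, SHARED_SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return False
    # "synthetic" and "test_suite" are illustrative claim names, not a standard.
    return claims.get("synthetic") is True and claims.get("test_suite") in {"checkout-pollinator"}

Requests carrying such a token can then skip the need for a real user account, while ordinary traffic is unaffected.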

Three strikes before an alarm

If you have enough production tests, and run them on enough systems for long enough, you will get false alerts. These can be caused by all sorts of things: network problems that only affect the test side, quirks in browser automation, or a genuine problem that heals itself two seconds later.

The simple way to avoid getting these kinds of false alerts is to wait for three consecutive failures before raising any alarm.
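
A sketch of that logic, with the alarm hook left as a placeholder for your paging integration:

ALARM_THRESHOLD = 3
consecutive_failures = 0

def raise_alarm() -> None:
    # Placeholder: page the on-call engineer, post to Slack, etc.
    print("paging the on-call engineer")

def record_result(passed: bool) -> None:
    """Call this after every test run; only alarm after three failures in a row."""
    global consecutive_failures
    if passed:
        consecutive_failures = 0
        return
    consecutive_failures += 1
    if consecutive_failures >= ALARM_THRESHOLD:
        raise_alarm()

Many synthetic-monitoring tools have an equivalent built-in setting, so you may not need to write this yourself.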

You might be giving me a side eye right now: should we just be ignoring these false alerts? Probably not. If you have a team ritual of looking through operations data, you can incorporate non-alerting production test failures into that ritual. The idea is that you prioritize them like any regular work, rather than making them urgent things that require people to work out of hours or stop their regular work.

Pros and cons of production tests

What is good about testing in production? I have hinted at a few of these already, however let’s get into the details:

  • Real world testing: Test suites—including unit tests and locally-run integration tests—that run on every build are definitely needed. However, even with all of those, there is nothing quite like test-driving the finished car, and that is what production tests are for. The proof of the pudding is in the eating.
  • Quality Control: The main reason for running tests is to know when something is broken. Despite all the automated and manual testing you do on your machine and in staging environments, production is always unique. Some things only happen the way they do in prod. And even if your staging environment is precisely the same, production will generally have more traffic. Running tests in production lets you know whether certain things are working in production.
  • Troubleshooting: If there is an incident where customers are impacted by a bug or outage, the tests are very useful. They may be failing or passing, but either way, their current state tells you something about the system, and can help narrow down the problem.
  • Observability for low traffic regions: If you deploy your service to multiple regions and some of those regions don’t get much traffic, the production tests will create synthetic traffic. Your observability metrics such as latency and reliability percentiles will be more meaningful with even just 1000 calls per day, than with close to zero traffic.
  • Safer Deployments: The same tests you use to monitor production can double up as tests you run to accept a blue/green deployment. These tests are then acting as continuous integration tests of last resort, adding to the assurance that your deployment won’t cause big issues.
  • Reuse in other environments: Just as you can use these tests for safer deployments, you can also use them in dev and staging environments, to get early detection of issues that your build pipeline cannot detect, such as integration issues with other services.

There are, of course, some disadvantages to using production tests:

  • Setup and teardown challenges: Tests run on the real system, so you cannot wipe the database before each run. You need to figure out how to best set up the specific scenario you want to test for, and how to clean up after the test.
  • They sometimes need setup of scenarios: For example, you have a test that upgrades an account, but you need an account set up that is ready to be upgraded, and a fake payment method that will be accepted. These scenarios may mean changing production code to support them too.
  • They can be flaky: Any test that runs on a real system can fail from time to time for various reasons. For example network issues between the test and the service or occasional timing issues in the browser the test uses. Flakiness of tests needs to be taken into account if you intend to wake someone up when the test fails.
  • They cause resource usage and costs: Running tests costs money. Tests are often run across different geographic regions, hundreds of times a day, to ensure good coverage. That compute cost adds up.
  • Human cost in maintaining tests: Not every test failure is a real issue, but each one needs to be investigated. Too many tests could leave you with a full-time job monitoring the tests and making them more robust.

Production tests vs. observability

You can also “test production” through monitoring of real traffic and looking for problems there. This deserves a post or series of its own, but in short you can check for things like:

  • Latency, e.g., alert me when the 99th percentile of latency for a particular endpoint is > 200ms for 3 periods in a row.
  • Reliability, e.g., alert me when more than 0.1% of requests fail with a 5xx code.
  • Assertions, e.g., alert me when a code path that is considered unexpected or impossible has been triggered.
  • Failures, e.g., alert me when a customer couldn’t perform a task because they got an error.

The good thing about observability-based alerts is that they are very simple to set up. They may need some minor code changes, and then setting up detectors in your observability tools that look for the condition and take an action when it is met.

These alerts pair well with production tests. They have a different purpose though, since they detect problems after they have happened, and they piggyback on existing system use. Production tests can help them by creating additional synthetic traffic to monitor, which lets you keep checking things when natural traffic levels fall off.

There is no “versus” though! You would do well to have both. For a particular need it might be better to use a production test or better to use observability. A rule of thumb is if observability can meet the requirements of alerting you to a problem, then it will be easier to set up and tune to your needs than a production test.

Summary

Phew! We covered a lot of ground there. In a nutshell, adding well-designed production tests to your systems will help in a number of ways. You’ll get earlier warnings of issues, you will be able to fix them faster, you will get observability benefits, and you can also use production tests as a deployment rollout test.

It is worth adding production tests if you are not doing so already—it is a relatively small task to fit in to your planning, and you will reap rewards. If you are already using them, keep reviewing them, and see what you need to add, remove and tweak to get the most value from them as your systems evolve.

Happy testing!

LanguageTool — Check your writing the open-source way

If you write anything (so that’s a yes) then an in-browser spelling and grammar checker can be handy. Even if all you write is emails!

A checker that is free, good, and easy to use is LanguageTool. As a bonus, it is open source too. The open-source version is a bit more technical to set up, and if you don’t want to do that, they have a paid cloud version. However, it isn’t too hard to get going, as we will see below.

1. Install the browser extension

Install the LanguageTool Chrome Extension or Firefox Extension.

This will work as-is, and you can start using the tool to check your spelling on WordPress, Google Docs, Gmail and most other apps.

However, it will be limited to a certain number of words, and certain features.

You can either 1. live with that, 2. upgrade to the paid cloud version, or 3. perform the following steps to run it locally for free.

2. Run a local LanguageTool server

The instructions to do this are here: https://dev.languagetool.org/http-server

If you are lucky enough to use a Mac, there is a simple brew installation. Otherwise, I found the Docker method quite convenient if you already have Docker installed. (If you don’t, it is worth installing Docker Desktop or Rancher, as it is pretty handy for many things, including ephemeral “installations” of software like LanguageTool.)

The Docker images are maintained by different people, so there are a few options. One of these is https://github.com/meyayl/docker-languagetool, which I got working well locally. The command to run it is in that repo, but I will repeat it here:

docker run -d \
  --name languagetool \
  --restart unless-stopped \
  --cap-drop ALL \
  --cap-add CAP_CHOWN \
  --cap-add CAP_DAC_OVERRIDE \
  --cap-add CAP_SETUID \
  --cap-add CAP_SETGID \
  --security-opt no-new-privileges \
  --publish 8081:8081 \
  --env download_ngrams_for_langs=en \
  --env MAP_UID=783 \
  --env MAP_GID=783 \
  --read-only \
  --tmpfs /tmp \
  --volume $PWD/ngrams:/ngrams \
  --volume $PWD/fasttext:/fasttext \
  meyay/languagetool:latest
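
Once the container is up, you can sanity-check it by posting some text to the server’s /v2/check endpoint (assuming port 8081 as published above). For example, in Python:

import requests

response = requests.post(
    "http://localhost:8081/v2/check",
    data={"text": "This is a sentense with a mistake.", "language": "en-US"},
)
response.raise_for_status()
for match in response.json()["matches"]:
    print(match["message"])

If the server is running, you should see at least one spelling suggestion printed.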

3. Tell your extension to use the local server

Click the LT logo on your browser. It is either on the toolbar, or nested inside the extensions menu (jigsaw icon).

Click the settings (gear icon), and then scroll down to Advanced settings (only for professional users) and under LanguageTool server, choose Local server and click Save.

4. Try it out

Open your favorite digital parchment, such as Google Docs, Confluence, or Notion, or anything else. For a quick try-out, it works here too: https://www.editpad.org/. Whatever you decide, when you open the page you should see a small blue circle with a tick, somewhere in the bottom right of the editing space.

Type something with an intentional mistake, e.g., “Helo”, and you should see the circle go red, and the mistake is highlighted. You are now being watched! In a good way.

Coding Resources

A list of resources I have found useful for programming. Mainly for my own reference later:

Running local Python code on a remote GPU (using modal and lob.py)

I explored in a previous post how to run nanoGPT on Modal – both the training and sampling. It was successful, but tiresome. There were a lot of changes to the downloaded code, which made me unhappy. If I want to try out different projects that are on GitHub etc., I don’t want to be doing a lot of coding and fiddling just to get the existing code to run. I just want to run it as if I were running it locally. This is where my script lob.py comes in!

What is lob.py?

This is a Python script that provides a fairly easy way (all things considered!) to run your local code on a cloud GPU. It does this by running your code on Modal, and handles some of the logistics of doing so, such as uploading source code.

Let’s explore what it does by doing in one blog post what previously took one and a half: training and running nanoGPT on Modal.

Using lob.py to train and run nanoGPT

First, clone nanoGPT, and download lob.py from a public Gist (I may make this a full Github repo later, we will see).

git clone https://github.com/karpathy/nanoGPT
cd nanoGPT
wget -O lob.py https://gist.githubusercontent.com/mcapodici/eaa39861affe75be13badefbe9e05079/raw/bbc9e3cbb692277ffcf18406c61805685bf70d25/lob.py

Now set up a Python environment your favourite way. I will use venv in this example:

python3 -m venv .
source bin/activate

Now you might want to add the following to .gitignore to avoid having lots of changes show up (this might differ if you used another Python environment tool):

bin
lib
lib64
share

Now install modal and log in, using their standard instructions:

pip install modal-client
modal token new

We will now set up lob.py for our requirements. The version we downloaded is already set up for nanoGPT, but let’s review its parameters.

Setting up lob.py run parameters

The first one just selects which GPU you want to use. For nanoGPT, the cheapest one, t4, is plenty for the task:

# Choose one of: "t4", "a10g", "inf2", "a100-20g", "a100" or None
gpu="t4"

Next we define the commands that run. These are run after copying all the local code files to the server and changing directory into that folder. We have a single command for each stage, but you can have multiple.

commands={
    'prepare': ['python data/shakespeare_char/prepare.py'],
    'train': ['python train.py config/train_shakespeare_char.py'],
    'sample': ['python sample.py --out_dir=out-shakespeare-char'],
}

Now we set verbose, which tells us which files are being uploaded; the volume name prefix (so that we can keep this project’s files separate); a timeout of 60 minutes, after which Modal will terminate the job; and a list of paths not to upload:

verbose=True
volume_name_prefix="2023-07-27-10-45"
timeout_mins=60
exclude_paths_starting_with=["./.git", "./.github", "./bin", "./lib", "./share"]

Finally we define the image, which describes how the container that runs the program will be set up. rsync is needed because it is used to copy up the right files (without losing generated files on the server). In addition, we need to do the pip install defined in the README.md of the nanoGPT project:

image = modal.Image \
    .debian_slim() \
    .apt_install("rsync") \
    .pip_install("torch numpy transformers datasets tiktoken wandb tqdm".split(" "))

Train and run using lob.py

With all the setup done, running is very simple. Just run these commands one after the other. They correspond to the instructions in the nanoGPT README.md:

modal run lob.py --command prepare
modal run lob.py --command train
modal run lob.py --command sample

Here is some output from the final phase:

ISABELLA:
This is the day of this is your land;
But I have been call'd up him been your tent?

DUKE VINCENTIO:
How far of the solemnity? who is wrong'd?
Why should we shame an arms stoop of life?
They will prove his like offence with life
And to be crave a happy model's guilty of his cheeks;
For all his foes, that are gone of me.

Here is the entire output of the 3 commands (click to expand):


Notes about how lob.py works

The script works by doing the following:

  • There is a function called copy (I should probably have called it copy_and_run) that runs on the remote machine, and copies all of the changed files from the local file system to the remote machine.
  • To this function we bind 2 directories that appear on the remote system:
    • /source/code is a mount that is a copy of your local folder, except for the folders mentioned in exclude_paths_starting_with. This is a (I think) temporary and (for sure) read-only folder.
    • /root/code which is a “network file system” which has been set up as persistent and read/write.
  • The copy function uses rsync to copy everything that has changed from the mount to the persistent file system. This means that future runs are quicker (they only need to copy changed files) and the running code can save data, such as model snapshots, and recover it on future runs. (A rough sketch of this idea follows after this list.)
  • Once copy is done with rsync, it changes directory into the /root/code folder, then runs your commands.
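
This is not the actual lob.py code, but the core idea looks roughly like this (paths as described above):

import subprocess

def copy_and_run(commands: list[str]) -> None:
    # Copy only what changed from the read-only mount to the persistent volume...
    subprocess.run(["rsync", "-a", "/source/code/", "/root/code/"], check=True)
    # ...then run each configured command from the persistent copy.
    for command in commands:
        subprocess.run(command, shell=True, check=True, cwd="/root/code")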

Modal.com and NanoGPT continued: producing output; using tiktoken for bigger tokens

In the previous post we explored how to get NanoGPT training on Modal. There was quite a bit to that, so I left the text generation part until now, just to cap that post off. Let’s do that now and then try some more stuff out with NanoGPT.

Let’s make some Shakespam

With all the setup work done in the first post, generating text on Modal will be much easier.

The repo code that generates text is sample.py, and we just need a script to hook into that and run it in Modal, which is this (sample_modal.py):

import modal

# Make sure we have access to the data we prepared earlier:
volume = modal.NetworkFileSystem.new().persisted("nano-gpt-volume")

# Set up the container for running the sampling, and make sure it has the necessary
# Python packages installed.
stub = modal.Stub("nano-gpt-sample",
    image=modal.Image.debian_slim().pip_install(
        ["torch", "numpy", "transformers", "datasets", "tiktoken", "wandb", "tqdm"]
    )
)

# This stub.function allows sample_modal to be called remotely on their servers. We will
# now specify how we want that set up...
@stub.function(
        # Ensure that the function runs with a GPU, I have picked out a cheap one, but you can replace
        # this with "any" in the future if this GPU is no longer available.
        gpu=modal.gpu.T4(), 

        # Increase the timeout to allow long training times.
        timeout=3600, 

        # This tells modal to upload the entire nanogpt package we created. Without doing
        # this it won't be able to locate train.py, model.py etc.
        mounts=[modal.Mount.from_local_python_packages("nanogpt")],
        
        # Mount the data we prepared earlier
        network_file_systems={"/root/data": volume}
        )
def sample_modal():
    # This import is a cheeky and quick way to run nanogpt with minimal changes to Andrej's code. Ideally we would change
    # the `sample` module to expose a function. Then import `sample` and call that function.
    import nanogpt.sample

# This is what gets called locally when running `modal run sample_modal.py`, and it just calls the
# remote function.
@stub.local_entrypoint()
def main():
    sample_modal.call()

Then to run it:

modal run sample_modal.py

The result of this is long, and is shown in the expander below. I think this is really impressive:

Shakespeare Output (click to expand)

It amazes me that we can get computers, which are purely logical, to do stuff like this at all. For perspective, my first computer was an Acorn Electron – 32KB of RAM (a millionth of a decent laptop nowadays).

Another reason this is amazing is the step-change that using the transformer model (which is the T in GPT) gives you over the other models shown in the Zero to Hero course. It is not just computing power that does this, but the research into new models that has happened in the last 20 years or so.

Turning up the temperature

Andrej included a temperature setting, which allows you to adjust the “randomness” of the output:

  • If set to very close to zero, it will produce the same output each time. This is the output it considers “most likely”.
  • If set to 1, it will produce output with the probabilities it predicts. For example, if it decides, based on training, that there is an 80% chance of an o coming next, and a 15% chance of a d, then it will produce an o 80% of the time.
  • If set higher, the probabilities will move closer together, giving less likely characters more chance of appearing.

The chart below (link to Google sheet) shows how increasing the temperature makes the probabilities of 3 potential “next characters” close up to each other, while decreasing it causes the preferred outcome to always be picked as the winner.
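
For the curious, the underlying trick is just dividing the model’s output scores (logits) by the temperature before the softmax. A quick numpy sketch with three made-up scores shows the effect:

import numpy as np

def probabilities(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])  # scores for three hypothetical next characters
for t in (0.1, 1.0, 2.0):
    print(t, probabilities(logits, t).round(3))

At 0.1 nearly all the probability lands on the top-scoring character, at 1.0 you get the model’s raw probabilities, and at 2.0 the options flatten out.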

Let’s try a temperature of 2 by adding this line to train_shakespeare_char.py:

temperature = 2

Here is a small sample of the output I got. It is definitely more chaotic!

HASTINMBSABURY:
Stir-3 Sleep, haugs:
Warthy, usquick..tWarwiXl!
Hatensworn my feans?
You know,
Young, tof it is!
BAmilind!

A low temperature of 0.1 gives us this, which seems more coherent, but much more “stuck like a record”:

CORIOLANUS:
I will be so so much a part of the people,
And then the way of the common of the court,
And then the way of the people of the court,
And the prince of the people of the court,
Which we have stood of the prince of the people,
And the princely of the streets of the state,
Which we have stood to the body of the sea,

I think the default temperature of 0.8 was probably “just right” like the porridge!

Using tiktoken for better encoding of the text

Tiktoken is a tokenizer library used by OpenAI. Its job is to turn a sentence into a sequence of number representations, which can then be used to train the model. It does this using an algorithm that encodes the most frequent words as single tokens, while less frequent words are represented by multiple tokens, each representing a word part.

Until now, we have been training by converting each character to a number. The problem with this is that we are not making good use of the structure already in English: words and parts of words carry more meaning than individual characters.

Tiktoken offers a choice of the pre-built tokenizers they use in their models, and Andrej uses the gpt2 one. To give an idea of what this does, here is some code that encodes a sentence using tiktoken, then shows the resulting encoding:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for tok in enc.encode("Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'"):
   print(f'{str(tok).ljust(5)} : {enc.decode([tok])}')

Here is the result:

Click to expand

What I find interesting here is "and" & " and" are different tokens: 392 & 290. It is also interesting that most tokens are whole words here. “Peeped” is the odd one out that got split up.

To train the model using tiktoken, we need to run the prepare.py file in the shakespeare folder (as opposed to the shakespeare_char folder we used last time).

Training with the Tiktoken encoding

There are a few things I had to do to get this to work. It got a bit messy, so I won’t share the code here, but I aim to put something better up on GitHub eventually. In short, I had to:

  • Change the GPU to an A100 – 20GB to have a chance of training it in a reasonable time
  • Because Modal has “regions”, this meant also changing the volume name, so it could create a new volume near that GPU’s region
  • And this meant changing all the Modal calls to specify the A100 – 20GB GPU so they would be in the same region
  • I also changed the parameters: I reduced the batch size from 256 to 64, since the tokens now carry more meaning than before so we can do with fewer, but I increased the embedding size from 384 to 384 * 4, since we might need more dimensions to represent a word.

With all of that done, here are the results I got. There is a lot more text because the number of tokens generated is the same as before, but each token now represents a whole word or word part:

Click to expand

Training costs were $0.71 for GPU and $0.09 for other stuff. It took almost bang on 1 hour to train. Inference (generating text) took a few seconds.

No local GPU? No Problem! Running Andrej Karpathy’s NanoGPT on Modal.com

Andrej Karpathy released a series of timeless lectures teaching us mortal 9-5 programmers, from scratch, how to train an “AI” language model, a bit like the GPT-4 or ChatGPT you may have heard of.

He goes into a deep dive that includes building your own tiny PyTorch from scratch, setting up bigram models and simple neural nets, before moving over to the real PyTorch later. He then explains how transformers (the T in GPT) work, and codes one up to generate some dubious Shakespeare. He calls this final model “NanoGPT”, because of the similarity between its model and that of the early GPT models that led to ChatGPT.

So why this post?

Well, while I absolutely loved the series, I don’t enjoy working with Colab or Jupyter Notebooks. It is easy to forget which code blocks have run, and I am forever scrolling up and down because the code is mixed up with results in one giant page. Not only that, but if you are using Google Colab it will time out fairly quickly, so you need to waste time running everything again.

⚠️Warning: I don’t think I recommend doing what I do here anymore. It works but is super fiddly. I am working on a much easier way to do this with a single Python file you download and run. So please read bearing that in mind…

I’d run it on my machine instead, but…

I want to run NanoGPT locally but I don’t have a good GPU. To save buying one for $2000+, I would like to rent one in the cloud if possible. If I use cloud GPUs I can experiment quickly with different chips as needed. An A100 GPU for example costs maybe $7000 – $15000 USD, but grabbing one for an hour for $4 is much more in my budget.

modal.com provides this service, and they take care of all of the “devops” as we will see soon. There is some housekeeping Python code to write, but no bash, Terraform or Ansible, which is great because I don’t want to do that.

Their GPU prices are not the cheapest. I would say they charge fair (average) prices, though. And they charge for the milliseconds of actual usage and nothing else, which means I don’t pay extra because I forgot to shut down a server. They also include $40/month of free credit anyway, so it is costing me nothing to learn.

In this post I will show you how I used Modal to quickly train and run the NanoGPT model, while having the creature comforts of developing in VSCode.

What is NanoGPT anyway?

NanoGPT is nothing but a text producing bot!

When trained on some text it will learn how to predict the next character. So for example if you feed it “Hello ” it might predict W. You then feed it “Hello W” and it might predict o and so on. By repeating this you get text generation.
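
In code, the generation loop is just this; predict_next_char is a stand-in for the trained model:

def generate(predict_next_char, prompt: str, length: int = 100) -> str:
    text = prompt
    for _ in range(length):
        text += predict_next_char(text)  # ask the model for one more character
    return text

# Dummy "model" that always predicts an exclamation mark, just to show the loop:
print(generate(lambda s: "!", "Hello", length=5))  # Hello!!!!!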

When trained on Shakespeare it makes muddled text that looks quite a bit like Shakespeare.

Example of NanoGPT generated text:

FlY BOLINGLO: Them thrumply towiter arts the muscue rike begatt the sea it What satell in rowers that some than othis Marrity.

LUCENTVO: But userman these that, where can is not diesty rege; What and see to not. But’s eyes. What?

JOHN MARGARET: Than up I wark, what out, I ever of and love, one these do sponce, vois I me; But my pray sape to ries all to the not erralied in may.

If you want to know more, you can check out Andrej’s Neural Networks: Zero to Hero series and the nanoGPT repo on GitHub.

Now let’s get started, and get NanoGPT trained and running with local code, and a cloud GPU from Modal.

Step 1: Learn how to run code on Modal

I won’t parrot too much of what Modal have in their tutorials, as that is the best place to go, but in a nutshell you can decorate the Python functions that you want to run on their servers.

For example, here is a function you want to run in their cloud:

@stub.function()
def f(i):
    if i % 2 == 0:
        print("hello", i)
    else:
        print("world", i, file=sys.stderr)

    return i * i

And then you can call this from a local function either as-is (to run locally) or with .call (to run on the server):

@stub.local_entrypoint()
def main():
    # Call the function locally.
    print(f(1000))

    # Call the function remotely.
    print(f.call(1000))

To run this from the command line:

modal run example.py

Step 2: Fork the NanoGPT repo, and check it works on your local computer

The next step is to make a fork of https://github.com/karpathy/nanoGPT and clone that fork to my computer, so that I can make some changes to adapt it to use Modal.

Note: If using Windows, you will need to use a Linux distribution installed in WSL2 to do this successfully, as Windows is not supported for torch.compile.

It is a good idea to check that we can get it to run locally. I just want a quick check that the code works, so I will reduce the number of iterations in train_shakespeare_char.py to 5, and dumb down the model size to ridiculously small so it completes in a few seconds on a crap laptop. Here are the changed lines in train_shakespeare_char.py:

...
max_iters = 5
...
# baby GPT model :)
n_layer = 2
n_head = 4
n_embd = 16
dropout = 0.2
...

In addition, I uncomment these 2 lines in the same file (train_shakespeare_char.py) to make it possible to run on an average laptop with no GPU:


# on macbook also add
device = 'cpu'  # run on cpu only
compile = False # do not torch compile the model

To check that it works, I set up a Python environment, and run similar commands as shown in the NanoGPT README.md:

python -m venv .
source bin/activate
pip install torch numpy transformers datasets tiktoken wandb tqdm
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py

From this we get a confirmation that this training loop is running correctly:

step 0: train loss 4.1783, val loss 4.1771
iter 0: loss 4.1791, time 47896.67ms, mfu -100.00%

Knowing that it works on my computer makes me more confident about getting it working on Modal.

Step 3: Upload the training data to Modal

3.1 Authenticate with modal

First, let’s do the basic setup for Modal and get authenticated:

pip install modal-client
modal token new

3.2 Change the prepare.py to upload to Modal

Now edit data/shakespeare_char/prepare.py, and nest the existing code inside a main function. Add a @stub.local_entrypoint() decorator, so that Modal knows to run this locally.

@stub.local_entrypoint()
def main():
    """     
    Prepare the Shakespeare dataset for character-level language modeling.
    So instead of encoding with GPT-2 BPE tokens, we just map characters to ints.
    Will save train.bin, val.bin containing the ids, and meta.pkl containing the
    encoder and decoder and some other related info.
    """
    import os
    import pickle
    ...

Add the following lines at the top of the file to define the volume and app name:

import modal

volume = modal.NetworkFileSystem.new().persisted("nano-gpt-volume")
stub = modal.Stub("nano-gpt-code")

Now add this function at the bottom, which will run on the remote server. All it does is copy the files over, with some prints to check whether it was successful. It keeps the folder structure on the server the same (the working directory is /root there) so that there is less code to change in train.py when we get to it.


dataset = "shakespeare_char"

@stub.function(
        mounts=[modal.Mount.from_local_dir("data", remote_path="/source/data")],
        network_file_systems={"/root/data": volume})
def copy():
    import shutil          
    import os


    source_dataset_path = os.path.join("/source/data", dataset)
    dest_dataset_path = os.path.join("/root/data", dataset)

    def check():        
        if os.path.exists(dest_dataset_path):
            files = os.listdir(dest_dataset_path)
            print(f"Files: {str.join(', ', files)}")
        else:
            print(f"Path doesn't exist")

    check()
    shutil.copytree(source_dataset_path, dest_dataset_path, dirs_exist_ok=True)
    print("files copied")
    check()

Now make the call to copy from main:

...
    # val has 111540 tokens

    copy.call()

3.3 Run the upload

You can now run this to perform the upload:

modal run data/shakespeare_char/prepare.py

You should get an output like this:

Path doesn't exist
files copied
Files: meta.pkl, val.bin, prepare.py, input.txt, __pycache__, train.bin, readme.md

If you run it again, it should show that the files exist before it is copied, proving that the data was persisted. Now the remote machine has access to the training data.

Step 4: Adapt the training code to run on Modal

4.1 Make the training code into a Python package

As far as I can tell, in order for Modal to see all of your Python code it must be organised in a package.

Making the code into a Python package is quite simple. First, move the Python files for the model training and text generation into a new folder:

mkdir nanogpt
mv config *.py nanogpt

Find all instances of from model in these files, and replace them with from .model (add a period). For example, in train.py:

from .model import GPTConfig, GPT

Adding a period to these local imports says “this is from the current directory’s package”. This allows the code to work when called from another package or location, which is what we will be doing when using Modal.

4.2 Remove the configurator

There is a line in train.py that needs to be commented out because it won’t work in Modal (which doesn’t have the source files in the same place). Comment it out, and add a hard-coded line that does the equivalent thing for the Shakespeare model:

# exec(open('nanogpt/configurator.py').read()) # overrides from command line or config file
from .config.train_shakespeare_char import *

This is perhaps not the ideal way to do it, but it is a quick change that keeps this blog post from getting too long.

4.3 Add a python script to run the code in Modal

Create a new file called train_modal.py in the root of the project (one level up from the nanogpt folder) and add the code below. I have put some comments in there to explain it.

import modal

# Make sure we have access to the data we prepared earlier:
volume = modal.NetworkFileSystem.new().persisted("nano-gpt-volume")

# Set up the container for running the training, and make sure it has the necessary
# Python packages installed.
stub = modal.Stub("nano-gpt-train",
    image=modal.Image.debian_slim().pip_install(
        ["torch", "numpy", "transformers", "datasets", "tiktoken", "wandb", "tqdm"]
    )
)

# This stub.function allows train_modal to be called remotely on their servers. We will
# now specify how we want that set up...
@stub.function(
        # Ensure that the function runs with a GPU, I have picked out a cheap one, but you can replace
        # this with "any" in the future if this GPU is no longer available.
        gpu=modal.gpu.T4(), 

        # Increase the timeout to allow long training times.
        timeout=3600, 

        # This tells modal to upload the entire nanogpt package we created. Without doing
        # this it won't be able to locate train.py, model.py etc.
        mounts=[modal.Mount.from_local_python_packages("nanogpt")],
        
        # Mount the data we prepared earlier
        network_file_systems={"/root/data": volume}
        )
def train_modal():
    # This import is a cheeky and quick way to run nanogpt with minimal changes to Andrej's code. Ideally we would change
    # the `train` module to expose a function. Then import `train` and call that function.
    import nanogpt.train

# This is what gets called locally when running `modal run train_modal.py`, and it just calls the 
# remote function.
@stub.local_entrypoint()
def main():
    train_modal.call()

With a GPU available, we can comment these 2 lines back out in train_shakespeare_char.py:

# on macbook also add
# device = 'cpu'  # run on cpu only
# compile = False # do not torch compile the model

We also want the checkpoint saving to work (which saves progress so we can resume on error, and also lets us run the model later). Because we mounted a folder called data, make the following change, otherwise the checkpoints won’t be saved:


out_dir = 'data/out-shakespeare-char'

4.4 Run the script

Now we can run this from the command line: modal run train_modal.py, and here is the result:

(nanoGPTonModal) martin@Capo:~/nanoGPTonModal$ modal run train_modal.py
✓ Initialized. View app at https://modal.com/apps/ap-k9Oehw5IpXCxmt3yNBUNds
✓ Created objects.
├── 🔨 Created train_modal.
├── 🔨 Created mount /home/martin/nanoGPTonModal/nanogpt
└── 🔨 Created mount /home/martin/nanoGPTonModal/train_modal.py
tokens per iteration will be: 16,384
found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
Initializing a new model from scratch
number of parameters: 0.01M
num decayed parameter tensors: 10, with 11,280 parameters
num non-decayed parameter tensors: 5, with 80 parameters
using fused AdamW: True
step 0: train loss 4.1783, val loss 4.1771
iter 0: loss 4.1791, time 3620.00ms, mfu -100.00%
✓ App completed.

4.5 Revert to the proper-sized hyperparameters

Revert the values in train_shakespeare_char.py to the bigger model values, with more iterations. Now that we are using Modal, this will run in a reasonable time.

...
max_iters = 5000
...
# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
...

Tip: the next step takes about 15 minutes. If training makes progress (it says a checkpoint has been created) but then gets stopped, you can resume it by setting init_from = 'resume' in the parameters above.

Running modal run train_modal.py again:

(nanoGPTonModal) martin@Capo:~/nanoGPTonModal$ modal run train_modal.py
✓ Initialized. View app at https://modal.com/apps/ap-HU6D2SRnxOv1OsJpmlb3Fj
✓ Created objects.
├── 🔨 Created train_modal.
├── 🔨 Created mount /home/martin/nanoGPTonModal/nanogpt
└── 🔨 Created mount /home/martin/nanoGPTonModal/train_modal.py
tokens per iteration will be: 16,384
found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
Initializing a new model from scratch
number of parameters: 10.65M
num decayed parameter tensors: 26, with 10,740,096 parameters
num non-decayed parameter tensors: 13, with 4,992 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
step 0: train loss 4.2874, val loss 4.2823
iter 0: loss 4.2649, time 29573.95ms, mfu -100.00%
iter 10: loss 3.2438, time 101.76ms, mfu 3.66%
iter 20: loss 2.7899, time 103.62ms, mfu 3.66%
iter 30: loss 2.6383, time 104.10ms, mfu 3.65%
iter 40: loss 2.5763, time 101.83ms, mfu 3.65%
iter 50: loss 2.5261, time 104.54ms, mfu 3.64%
iter 60: loss 2.5136, time 103.90ms, mfu 3.64%
...
iter 4980: loss 1.2050, time 117.62ms, mfu 3.16%
iter 4990: loss 1.2493, time 114.90ms, mfu 3.17%
step 5000: train loss 1.1405, val loss 1.4969
iter 5000: loss 1.2446, time 12044.48ms, mfu 2.86%
✓ App completed.

Costs

It took about 14 minutes and cost $0.21 to train the model. I think $0.14 was for the GPU and the rest was for CPU/memory.

Conclusion

First, it took a little more work than expected to get some local Python code running on Modal.

The combination of design choices in the nanoGPT repo and the fairly narrow happy path for getting code to run in Modal meant that a lot of changes had to be made. To summarize, these are the things that required code changes:

  • Modal will only upload a bunch of Python files if they are specified as a package. NanoGPT didn’t do this.
  • Modal will put the files “somewhere”, so using exec() on relative paths to local scripts like NanoGPT does won’t work.
  • Modal requires additional functions and decorations, so a new file is needed.
  • Modal requires specification of mounts etc. so this new file has quite a bit to it.

I think if you build a Python project with Modal in mind, then the experience will be easier. You will know how to organize files, what not to do, etc. So there will be less work to do.

Next, it is worth saying that once you get this working, it works really well. Running modal run train_modal.py, it gets going and chugs along; you almost forget it is doing a whole bunch of ops work in the cloud for you. Then you can iterate and change things up, and Modal gets out of your way a bit.

With Modal set up, I can now code with an IDE, IDE Plugins, file structure, git, etc. It is more what I am used to than the Jupyter experience where you have to remember what state things are in, there is effectively one big file, and output and code are all mixed up. This is much better.

Therefore, overall I think Modal is worth learning and experimenting with, and worth putting in that initial effort to get set up. Or if money is no object, just go buy a big GPU :-).

Next

In the next blog post I run the text generation to see what kind of Shakespeare this model can produce. This will require some code changes to get it to work on Modal, but I expect them to be a lot smaller, as much of the work has been done.

I will also explore what other features are in NanoGPT and try them out using Modal too.

NextJS – Undocumented Features

I have recently been playing around with the app router in NextJS. This is a “new paradigm for building applications using React’s latest features”, and was introduced in v13. Personally I find it a real headache to work with, mainly because the documentation is a bit scant and only covers the happy path. It feels a bit “early” to be using this, so I would probably stick to Page Routing for anything critical.

NEXT_PRIVATE_DEBUG_CACHE

One problem is that it will refuse to cache any fetch over 2MB (well, looking at their code, anything with more than 2 * 1024 * 1024 string length, which is an approximation). However, 2MB and a bit is an annoyingly low limit. I tried to hack it to be higher, but it seems Vercel refuses to cache it anyway. Along the way I discovered a nice hidden feature in their library code. Dotted around it are statements like this:

                if (res.status === 404) {
                    if (this.debug) {
                        console.log(`no fetch cache entry for ${key}, duration: ${Date.now() - start}ms`);
                    }

We have a debug flag which, if set, logs more stuff. And more logs lead to, well, more knowing what the hell is going on! How do you turn this debug flag on? Simple. Set the environment variable NEXT_PRIVATE_DEBUG_CACHE to true (or any truthy value). If you are using Vercel to host, you can set this up in your project settings, under environment variables, then redeploy.

Here are some example logs, after doing that:

no fetch cache entry for 7710c8185037cf970b4bbd65edf9625c9f8acd1a32b78bc76ec6fc2314ff8273, duration: 83ms
no fetch cache entry for 7710c8185037cf970b4bbd65edf9625c9f8acd1a32b78bc76ec6fc2314ff8273, duration: 44ms
set cache 7710c8185037cf970b4bbd65edf9625c9f8acd1a32b78bc76ec6fc2314ff8273 { tags: '/deals/[UniqueId]/page' }
set cache 7710c8185037cf970b4bbd65edf9625c9f8acd1a32b78bc76ec6fc2314ff8273 { tags: '/deals/[UniqueId]/page' }

Repeats of this told me what I expected: my stuff ain’t getting cached. Now I need to figure out why.

How to make Windows 11 livable

Bring back the old File Explorer right-click menu

Enable the old menu for the explorer shell. The one that had everything on it rather than just 6 or so things!

To do this, run these from the “cmd” command prompt. (This will close any explorer windows by the way).

reg.exe add "HKCU\Software\Classes\CLSID\{86ca1aa0-34aa-4e8b-a509-50c905bae2a2}\InprocServer32" /f /ve
taskkill /f /im explorer.exe
explorer

Add VSCode to the context menu

Want VSCode on the File Explorer right-click menu? The easiest way to get it is to check the two Explorer context menu options in the VSCode installer.

If you have already installed VSCode, no problem. Just download the installer and install again!

Install 7z

Install 7z to extract zip, 7z, tar and gz files: https://www.7-zip.org/download.html

More to come…

Learning List

This is the stuff I want to read and learn at the moment:

General Learning Stuff

Machine Learning

  • CSC 321 Winter 2018 – Intro to Neural Networks and Machine Learning – https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/ – so I can keep up when studying Neural Networks: Zero to Hero
  • Neural Networks: Zero to Hero – https://github.com/karpathy/nn-zero-to-hero
  • Blog Post: LLM Engineering https://huyenchip.com/2023/04/11/llm-engineering.html
  • https://www.deeplearningbook.org/
  • https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3
  • https://towardsdatascience.com/the-intuition-behind-shannons-entropy-e74820fe9800
  • TBC Some course on calculus / linear algebra to brush up
  • TBC Some course on PyTorch

Computers in general

  • cpu.land – a small course about learning how Linux works in more detail

Firebase Firestore Rules Recipes and Tips

Er… Why?

Having built a complete web app using Firebase, I found there are a few non-obvious things about it. One thing that kept tripping me up is Firestore rules.

If you want users to directly read and write the data in Firestore, without needing a function or a separate server (like Node.js & Express), then Firebase Rules are what will enforce security and data validation for you.

What I found is that some of the nuances of Firebase rules aren’t too obviously documented. I have created this guide to hopefully save you some time with them if you use them.

Here I am talking about rules just for Firestore. This doesn’t cover Storage or Realtime Database. (Storage rules are quite similar though).

Before I talk about rules, let’s cover some basics about the data itself in Firestore and what Rules are. Feel free to skip over these sections if you know this already.

Firestore Structure

Firestore is a document-oriented database. This means the structure is one of collections of documents. Documents are identified by a key, and contain key/value pairs, which can contain further key/value pairs similar to a JSON structure.

You can get documents by the collection name and key, and you can also run queries to get all documents in a collection where the document meets conditions you specify.

Firestore can be interacted with via the regular client API. This is for apps and sites, and is used by both anonymous and authenticated users of the database. It can also be interacted with via the Admin API, which is, for example, used by cloud functions that need unfettered access to all of the data.

Rules only apply to the regular API.

For further details, see the Firestore Data Model documentation.

Rules basics

Firestore rules allow you to decide whether to accept or deny any request to read or write data, based on:

  • The request
  • The user
  • The record being amended
  • The updated data
  • Other records in Firestore

This allows rules to serve at least two main functions:

  • Authorization – i.e. what can this user (or unauthenticated request) do?
  • Validation – i.e. is the new record or update valid?

All a rule can do is accept or reject a request. It cannot filter a request.

If a user makes a request that would return 10 records, but the rules deny 1 of those records, then the request simply fails, rather than returning the 9 records they can access.

Rules can be defined for the operations of get, list, create, update and delete, with the shortcut keywords:

  • “read” means both get & list
  • “write” means create, update & delete.

If you want to understand the syntax of rules or the operations, you can read the documentation here. Or just read the recipes and hopefully it should make sense intuitively.

Rules Recipes

With the basics out of the way, I want to provide lots of example of rules.

The thing I struggled with when using rules is “how to do this?”. I got stuck on the difference between request.resource and resource, and why some rules didn’t seem to have any effect, or just kept giving errors.

Therefore I think these recipes should help you by reducing the time it takes to get something working and debuggable. You can then customize the recipe for your needs using the Firebase documentation. Please give me feedback in the comments if something does not make sense, is not right, or isn’t covered. I am happy to help with your specific problem, and to add to my knowledge too.

Allow nothing

The allow nothing recipe is the rule that doesn’t allow any operations on a collection.

To do this, do nothing! By default, with no rules you cannot access the data in any way. Which is a good thing.

Allow everyone to read

To allow everyone, both authenticated and unauthenticated to read all documents in the collection – i.e. both list the documents and read their contents:

match /collection/{item} {
    allow read: if true;
}

Allow authenticated users only to read

This allows anyone who is logged in to read all documents in the collection:

match /collection/{item} {
    allow read: if request.auth != null;
}

Allow authenticated users with a specific token to read

Firebase authentication allows you to set up extra details for a user when they are created (and these can be updated later), which can be queried in your rules:

match /collection/{item} {
    allow read: if request.auth != null &&
        request.auth.token.isAdmin == true;
}

Having used tokens, I am not too keen on them, because they are invisible in admin interfaces (so you have to write script code to see what tokens a user has), and you will probably need to keep the token information in sync with the database data anyway. The next example is the way I prefer to do it…

Allow authenticated users based on profile criteria

You can use the incoming authentication uid to look up details about the user, and then check if they should have access:

match /collection/{item} {
    allow read: if request.auth != null &&
        get(/databases/$(database)/documents/users/$(request.auth.uid)).data.isAdmin == true
}

Allow only the “owner” of an object to read

You can store the id of the user who “owns” this object as a string, and then compare this to the requesting user:

match /collection/{item} {
    allow read: if request.auth != null &&
        request.auth.uid == resource.data.uid;
}

Social Network User Profile Pattern

The following shows a basic pattern for a user profile, so that someone can only create or update their own profile, but anyone can read other people’s profiles, even anonymously.

match /users/{uid} {
    allow create, update: if request.auth != null && request.auth.uid == uid;
    allow read: if true;
}

Of course, you can change this to a more private social network by only allowing authenticated users to read profiles:

match /users/{uid} {
    //...
    allow read: if request.auth != null;
}

Multitenant

Multitenancy is the idea of having different organisations (e.g. companies) that have users, where from a security point of view they are siloed, so you can only see data within your own company.

There are two ways to do this: Firstly, you can use a tenantId field on each record to distinguish them, or secondly create subcollections within a tenant collection.

In my experience, for maximum future flexibility and ease of client programming, I recommend adding a tenantId field and avoiding subcollections, because they make calling code more complicated.

Here is an example of changing the user rules from the Social Network example to allow reading only for users in the same company:

match /users/{uid} {
    //...
    allow read: if request.auth != null &&
        get(/databases/$(database)/documents/users/$(request.auth.uid)).data.companyId == resource.data.companyId
}

Note that something needs to set the companyId initially, so probably you would want that to be trusted code, for example running in a function.

Subcollection matching

The syntax for matching a subcollection item is as follows, giving you access to the IDs of both the parent and child documents:

match /collection/{collectionId}/subcollection/{subCollectionId} {
    //...
}

That said, my experience with subcollections leads me to think you should default to single-level collections for everything.

To do this, reference the parent collection by having a parentId string field (there is no need for a reference field; they are tricky to work with inside rules, and I don’t see the benefit of them over a string).

It makes client code much simpler, while (from what I see) offering no real disadvantage. Please comment if I am wrong about this!

Data Validations

With Firebase by default accepting any shape of data, you might want to validate that fields exist, don’t exist, or have certain types or constraints. Remember with rules you don’t get much error information if they fail, so you also want to check these on the client before sending any data for a good user experience.

For this I would split the validations out into a function, so you can call it from different rules. A validation may look something like this:


function validateEvent(event) {    
  return event.keys().hasAll(['title', 'description', 'imageName', 'dateAndTimeOfFirstRecurrance',
    'timeZone', 'recurrenceType', 'numberOfRecurrences', 'questions', 'invitees',
    'breakTimeMs', 'rules' ]) &&
    event.title is string &&
    event.description is string &&
    (event.imageName is string || event.imageName == null) &&
    event.dateAndTimeOfFirstRecurrance is number &&
    event.timeZone is string &&
    event.recurrenceType is string &&
    event.numberOfRecurrences is number &&
    event.questions is list &&
    event.invitees is list &&
    event.breakTimeMs is number &&
    event.rules is string &&
    event.title.size() > 0 &&
    event.title.size() < 100 &&
    event.description.size() > 0 &&
    event.description.size() < 10000 &&
    event.rules.size() < 1000 &&
    event.dateAndTimeOfFirstRecurrance > 0 &&
    event.timeZone.size() > 0 &&
    event.timeZone.size() < 1000 &&
    event.recurrenceType.size() > 0 &&
    event.recurrenceType.size() < 1000 &&
    event.questions.size() > 0 &&
    event.breakTimeMs >= 10000 &&
    event.breakTimeMs <= 3600000
//... etc ...
}

match /events/{e} {
//...
    allow create, update: if request.auth != null && 
        validateEvent(request.resource.data)
}

This is a living blog post!

I will add more examples to this as I encounter them, and I am sure I have forgotten a few. Watch this space.
