---
date: "2024-10-29T20:08:12.000Z"
title: "2024-10-29"
draft: false
---

The following code allowed me to successfully download the IMDB dataset with fastai to a Modal volume:

```python
import os

os.environ["FASTAI_HOME"] = "/data/fastai"

from fastai.text.all import *

app = modal.App("imdb-dataset-train")
vol = modal.Volume.from_name("modal-llm-data", create_if_missing=True)


@app.function(
    gpu="any",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
)
def download():
    path = untar_data(URLs.IMDB)
    print(f"Data downloaded to: {path}")
    return path

```

run with

```sh
modal run train.py::download
```

Next, I tried to run one epoch of training of the language model

```python
@app.function(
    gpu="h100",
    image=modal.Image.debian_slim().pip_install("fastai"),
    volumes={"/data": vol},
    timeout=20 * 60,
)
def train():
    path = untar_data(URLs.IMDB)
    print(f"Training with data from: {path}")
    get_imdb = partial(get_text_files, folders=["train", "test", "unsup"])

    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb,
        splitter=RandomSplitter(0.1),
    ).dataloaders(path, path=path, bs=128, seq_len=80)

    print("Sample from datablock:")
    print(dls_lm.show_batch(max_n=2))

    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]
    ).to_fp16()

    learn.fit_one_cycle(1, 2e-2)
    learn.save("1epoch")
```

I was waiting around for a long time, having never seen "Sample from datablock:" print.
Looking into the volume with the Modal UI, I noticed the `/fastai/data/imdb_tok/unsup` folder has been modified recently.
It seemed like the tokenization of the dataset was talking a long time.
I was able to do this tokenization quite fast locally, so I am going to chalk this up to the Modal volume not being as performant as a local file system.
While I'm not 100%, I think the need to train with so many little files may undermine my ability to train this model on Modal.