--- title: "Practical Deep Learning, Lesson 4, Language Model Blog Post Imitator" createdAt: "2024-11-04T17:57:00.000Z" updatedAt: "2024-11-04T17:57:00.000Z" publishedAt: "2024-11-04T17:57:00.000Z" tags: ["language_models","course.fast.ai"] draft: false githubUrl: "https://github.com/danielcorin/fastbook_projects/tree/main/blog_post_imitator" --- In this notebook/post, we're going to be using the markdown content from my [blog](https://github.com/danielcorin/blog) to try a language model. From this, we'll attempt to prompt the model to generate a post for a topic I might write about. Let's import `fastai` and disable warnings since these pollute the notebook a lot when I'm trying to convert these notebooks into posts (I am writing this as a notebook and converting it to a markdown file with [this script](https://github.com/danielcorin/blog/blob/8181116943a7e4a8583edcf9d64c2b08b41cbf34/scripts/convert_notebook.py)). ```python from fastai.text.all import * from pathlib import Path ``` ```python import warnings warnings.filterwarnings('ignore') ``` ## Loading the data The written content in my blog is markdown (.md) files. You can see the raw contents of any of these posts by appending `/index.md` to the end of the URL on any post on this site. They look something like ```md --- --- ``` To get started, I modeled my approach after the one used in [chapter 10](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb), which looks something like ```python get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup']) dls_lm = DataBlock( blocks=TextBlock.from_folder(path, is_lm=True), get_items=get_imdb, splitter=RandomSplitter(0.1) ).dataloaders(path, path=path, bs=128, seq_len=80) ``` There were a few modifications I needed to make with the approach. For starters, we were loading `.md` files rather than text files, so initially I tried to do this with ```python path = Path("./data/content") files = get_files(path, extensions='.md', recurse=True) for f in files[:3]: print(f) ``` data/content/posts/2013/2013-07-05-qc.md data/content/posts/2024/models-writing-about-coding-with-models.md data/content/posts/2024/vlms-hallucinate.md However, these seemed to cause opaque and confusing issues with `DataBlock` or `DataLoaders` that manifested something like this ``` --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[209], line 10 3 # First, create a tokenizer 4 text_processor = TextBlock.from_folder(path, is_lm=True) 6 dls_lm = DataBlock( 7 blocks=text_processor, 8 get_items=get, 9 splitter=RandomSplitter(0.1) ---> 10 ).dataloaders( 11 path, 12 path=path, 13 bs=128, 14 seq_len=80, 15 ) File ~/dev/lab/fastbook_projects/blog_post_generator/.venv/lib/python3.12/site-packages/fastai/data/block.py:157, in DataBlock.dataloaders(self, source, path, verbose, **kwargs) 151 def dataloaders(self, 152 source, # The data source 153 path:str='.', # Data source and default `Learner` path 154 verbose:bool=False, # Show verbose messages 155 **kwargs 156 ) -> DataLoaders: --> 157 dsets = self.datasets(source, verbose=verbose) 158 kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose} ... 387 self.types.append(type(x)) --> 388 types = L(t if is_listy(t) else [t] for t in self.types).concat().unique() 389 self.pretty_types = '\n'.join([f' - {t}' for t in types]) TypeError: 'NoneType' object is not iterable ``` To workaround this challenge, I changed all the file extensions to `.txt`. This allowed the model to load and tokenize the dataset. Next, I had an issue with encoding ``` return f.read() ^^^^^^^^ File "", line 322, in decode UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte ``` I solved this in the `copy_and_rename_files` function with ```python src_file.read_text(encoding=encoding) ``` There is also a function called `clean_content` which I will address later. ````python import re def clean_content(text): # Handle code blocks with language specifiers text = re.sub(r'```(\w+)?\n(.*?)```', lambda m: f'{m.group(2)}', text, flags=re.DOTALL) # Replace single backticks text = re.sub(r'`([^`]+)`', r'\1', text) return text def copy_and_rename_files(): src_dir = Path("./data/content") dst_dir = Path("./data/cleaned_content") if not dst_dir.exists(): dst_dir.mkdir(parents=True) for src_file in src_dir.rglob("*.md"): try: content = None for encoding in ['utf-8', 'latin-1', 'cp1252']: try: content = src_file.read_text(encoding=encoding) if content.startswith('---'): # Remove markdown frontmatter parts = content.split('---', 2) if len(parts) >= 3: content = parts[2] break except UnicodeDecodeError: continue if content is None: print(f"Skipping {src_file}: Unable to decode with supported encodings") continue content = clean_content(content) rel_path = src_file.relative_to(src_dir) dst_file = dst_dir / rel_path.with_suffix('.txt') dst_file.parent.mkdir(parents=True, exist_ok=True) dst_file.write_text(content, encoding='utf-8') except Exception as e: print(f"Error processing {src_file}: {str(e)}") copy_and_rename_files() ```` ## Training the model With the needed adjustments to the dataset made, and the model-ready content now in the `data/cleaned_content` folder, we can load and tokenize that data with `fastai`. We validate that we can read the files paths with our code ```python path = Path("./data/cleaned_content") files = get_text_files(path) for f in files[:3]: print(f) ``` data/cleaned_content/posts/2013/2013-07-05-qc.txt data/cleaned_content/posts/2024/language-model-based-aggregators.txt data/cleaned_content/posts/2024/making-your-vision-real.txt Then we create a `DataBlock` and view the loaded, tokenized content ```python get = partial(get_text_files, folders=['posts', 'til', 'logs', 'projects']) text_processor = TextBlock.from_folder(path, is_lm=True) dls_lm = DataBlock( blocks=text_processor, get_items=get, splitter=RandomSplitter(0.1) ).dataloaders( path, path=path, bs=128, seq_len=512, ) ``` ```python dls_lm.show_batch(max_n=2) ```

	text	text_
0	xxbos xxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \n▁ { \n▁ " role " : " system " , \n▁ " content " : [ \n▁ { \n▁ " type " : " text " , \n▁ " text " : " you should respond with the understanding they are an experienced software	xxmaj i 've been wanting to create a chat component for this site for a while , because i really do n't like quoting conversations and manually formatting them each time . \n xxmaj when using a model playground , usually there is a code snippet option that generates xxmaj python code you can copy out intro a script . \n xxmaj using that feature , i can now copy the message list and paste it as xxup json into a xxmaj hugo shortcode and get results like this : \n\n\n { { < chat model="gpt-4o - mini " > } } \n [ \n▁ { \n▁ " role " : " system " , \n▁ " content " : [ \n▁ { \n▁ " type " : " text " , \n▁ " text " : " you should respond with the understanding they are an experienced software engineer
1	xxunk / aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ’s book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use	/ aka \n [ sublime xxup cli ] : https : / / xxrep 3 w xxunk / docs / 2 / xxunk xxbos i built a [ site](https : / / xxunk / ) to host a language model generated kid ’s book i built using chatgpt and xxmaj midjourney . xxmaj the plot was sourced from my book xxunk to see how a language model would perform writing a xxunk with an unusual plot . \n xxmaj it prompted some interesting conversations about the role of xxunk and art in culture on the [ xxunk xxunk : / / news.ycombinator.com / xxunk ) . \n\n xxmaj tech : xxmaj react , xxmaj next.js , openai , xxmaj midjourney , xxmaj vercel \n\n [ source code](https : / / github.com / danielcorin / adventure - of - xxunk / ) xxbos i wrote a tiny site to use to

That looks good, so we create a learner then run the same approach as is done in Chapter 10, checkpointing as we go. We don't _have_ to do it this way -- we could just call `fit_one_cycle` the number of times we want -- but it was helpful for me to validate the process end-to-end once more. ```python learn = language_model_learner( dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()] ).to_fp16() ``` ```python learn.fit_one_cycle(1, 2e-2) ```

epoch	train_loss	valid_loss	accuracy	perplexity	time
0	4.923746	4.835233	0.222222	125.867935	00:12

```python learn.save('1epoch') ``` Path('data/cleaned_content/models/1epoch.pth') ```python learn = learn.load('1epoch') ``` Now, we do the bulk of the fine-tuning. ```python learn.unfreeze() learn.fit_one_cycle(10, 2e-3) ```

epoch	train_loss	valid_loss	accuracy	perplexity	time
0	3.721324	4.359152	0.278559	78.190788	00:12
1	3.582399	3.972712	0.338368	53.128395	00:13
2	3.452389	3.608671	0.379557	36.916973	00:13
3	3.307291	3.372669	0.413889	29.156248	00:13
4	3.221604	3.291163	0.422917	26.874105	00:13
5	3.131132	3.213343	0.436892	24.862070	00:13
6	3.047300	3.147552	0.447179	23.278996	00:13
7	2.978492	3.120242	0.455946	22.651857	00:14
8	2.921093	3.109544	0.458030	22.410818	00:14
9	2.876113	3.107162	0.458464	22.357492	00:12

```python learn.save_encoder('finetuned') ``` ## Experimenting with the Result With the fine-tuned model, we can run inference! We define the beginning of the content we want the model to output in `TEXT`, then we call `learn.predict` for the number of tokens we want the model to output and set temperature to determine the randomness/creativity of the output. ```python TEXT = "I am interacting with a language model as a thought partner to" N_WORDS = 256 TEMP = 0.75 pred = learn.predict(TEXT, N_WORDS, temperature=TEMP) ``` ```python print(pred) ``` i am interacting with a language model as a thought partner to train . I think it should be more difficult to use Sonnet but it is often difficult to get to because i am a very familiar language model . In this way , i wanted to use the phrase Sonnet > rather than just see the phrase as a word . He 's also using the word " prompt " , which is a word word that is used in art , then as a title to describe a language model that extract structured data from an individual 's experience . I 've used this idea to describe how a language model has a structure with a single structure and i can try and solve this with the following following Sonnet code : i n't have such a thead for a language model . Being like this is an easy way to design an [ initial Sonnet ] . In another example , where you would have written a single word to explicitly describe a language model , i had to use the following word usage : < inline_code > < / INLINE_CODE > : The word i used to read the script in a language model and then describe the word as an JSON object . This working was quite different from my JSON code and Python CODE > the most recent model to use this PYTHON code The output is a little wild but it _kinda sorta_ makes sense and isn't too bad for a model I trained in a couple minutes on my MacBook. I tweaked several parts of the approach in effort to improve the model's output quality. Jumping back to the `clean_content` function, I found that removing the markdown frontmatter and replacing triple backticks with single tokens seemed to make the output make a bit more sense. When this fine-tuned model tries to generate code, it makes little sense and does strange things like emit tokens with triple words like `importimportimport`. I have a feeling this deficiency may be because the base model wasn't trained on much source code. So there we have it. A simple language model fine-tuned on my blog posts. This was a helpful experience for getting a feel for some feature engineering. If you liked this post, be sure to check out some of my other notebooks I've built while working through the FastAI Course linked below.