---
title: "Practical Deep Learning, Lesson 3, Stochastic Gradient Descent on the Titanic Dataset"
createdAt: "2024-10-18T00:00:00.000Z"
updatedAt: "2024-10-18T00:00:00.000Z"
publishedAt: "2024-10-18T00:00:00.000Z"
tags: ["course.fast.ai"]
draft: false
githubUrl: "https://github.com/danielcorin/fastbook_projects/tree/main/sgd_titanic"
---

In this notebook, we train two similar neural nets on the classic [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) using techniques from `fastbook` [chapter 1](https://github.com/fastai/fastbook/blob/master/01_intro.ipynb) and [chapter 4](https://github.com/fastai/fastbook/blob/master/04_mnist_basics.ipynb).
We train the first using mostly PyTorch APIs and the second using `fastai` APIs.
There are a few cells that output warnings.
I kept those because I wanted to preserve the printouts of the models' accuracy.

The Titanic dataset can be downloaded from the link above or with:

```python
!kaggle competitions download -c titanic
```

To start, we install and import the dependencies we'll need:

```python
%pip install torch pandas scikit-learn fastai
```

```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from fastai.tabular.all import *
from sklearn.preprocessing import StandardScaler
```

Next, we import the training data:

```python
df = pd.read_csv('titanic/train.csv')
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].copy()
y = df['Survived'].copy()
X.head(5)
```

	Pclass	Sex	Age	SibSp	Fare
0	3	male	22.0	1	7.2500
1	1	female	38.0	1	71.2833
2	3	female	26.0	0	7.9250
3	1	female	35.0	1	53.1000
4	3	male	35.0	0	8.0500

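Before deciding how to fill gaps, it can help to count the missing values per column. Here is a minimal sketch on a tiny hand-made frame. The rows are made up for illustration and are not taken from `train.csv`, and `toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not from train.csv) with two gaps in Age and one in Fare
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# median() skips NaN by default, which is why it works as a fill value
age_median = toy['Age'].median()
fare_median = toy['Fare'].median()
print(age_median, fare_median)
```

Running the same `isna().sum()` on `X` would show which of the six feature columns need filling.
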
Now, we define two functions that encode the `Sex` column as a number and fill in missing values so we can train on the data.

```python
def process_training_data(X):
    # Encode Sex as 0/1 and fill missing Age/Fare values with the column median
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    X['Age'] = X['Age'].fillna(X['Age'].median())
    X['Fare'] = X['Fare'].fillna(X['Fare'].median())
    return X

def process_test_data(X):
    # Only encodes Sex; missing Age/Fare values in the test set are left as NaN
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    return X
```

```python
X = process_training_data(X)
X.head(5)
```

	Pclass	Sex	Age	SibSp	Fare
0	3	0	22.0	1	7.2500
1	1	1	38.0	1	71.2833
2	3	1	26.0	0	7.9250
3	1	1	35.0	1	53.1000
4	3	0	35.0	0	8.0500
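
The helpers above leave open what to do with gaps in the test set. A common approach is to compute the medians on the training data once and reuse them for the test data. This is a minimal sketch of that idea, not the notebook's method: the names `fit_medians` and `apply_processing` are hypothetical, and the rows are made up.

```python
import numpy as np
import pandas as pd

def fit_medians(X_train, columns=('Age', 'Fare')):
    # Hypothetical helper: learn fill values from the training data only
    return {col: X_train[col].median() for col in columns}

def apply_processing(X, medians):
    # Hypothetical helper: encode Sex and fill gaps with the stored medians
    X = X.copy()  # leave the caller's frame untouched
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    for col, value in medians.items():
        X[col] = X[col].fillna(value)
    return X

# Made-up rows for illustration
train = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, np.nan],
    'Fare': [7.25, 71.28, 7.92],
})
test = pd.DataFrame({
    'Sex': ['female', 'male'],
    'Age': [np.nan, 30.0],
    'Fare': [np.nan, 8.05],
})

medians = fit_medians(train)
train_clean = apply_processing(train, medians)
test_clean = apply_processing(test, medians)
print(test_clean)
```

Reusing the training medians keeps the test set from influencing its own preprocessing, and it means no `NaN` values reach the scaler or the model.
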

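Since the next step rescales these columns, here is a small sketch of what standardization does, using the five `Fare` values shown in the table above. It applies z = (x - mean) / std with the population standard deviation, which is what `StandardScaler` computes by default. The `standardize` helper is a hypothetical name, and the notebook fits the scaler on all rows rather than five.

```python
from statistics import fmean, pstdev

def standardize(values):
    # Hypothetical helper: z = (x - mean) / population standard deviation
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# The five Fare values shown in the table above
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]
scaled = standardize(fares)
print([round(z, 3) for z in scaled])

# Standardized values are centered on zero, so some are negative
# and some exceed 1: they are not confined to [0, 1]
print(min(scaled) < 0, max(scaled) > 1)
```
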
We need to scale the numeric values, otherwise we'll get

```
RuntimeError: all elements of input should be between 0 and 1
```

This error is raised by `nn.BCELoss`, which requires the model's outputs (not the input features) to be between 0 and 1, so a likely cause is that training on the unscaled features produced `NaN` outputs.
We'll scale both the training and test data with `StandardScaler`, per Sonnet's recommendation.
`StandardScaler` doesn't actually constrain the data to be between 0 and 1 (it standardizes each column to zero mean and unit variance), but it seems to get the job done for the model architecture I selected.

```python
# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

test_df = pd.read_csv('titanic/test.csv')
X_test = test_df[features].copy()
X_test = process_test_data(X_test)
X_test_scaled = scaler.transform(X_test)

y_test_df = pd.read_csv('titanic/gender_submission.csv')
y_test = y_test_df['Survived']
```

Turn these `numpy` arrays into PyTorch tensors and define the model architecture.

```python
X_train_tensor = torch.FloatTensor(X_scaled)
y_train_tensor = torch.FloatTensor(y.values)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test.values)

# 6 input features -> 8 hidden units -> 1 output squashed to (0, 1)
model = nn.Sequential(
    nn.Linear(6, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)
```

Also, define a loss function and an optimizer:

```python
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
```

Finally, we can train the model. Sonnet wrote this code.
```python
num_epochs = 1000
batch_size = 64

for epoch in range(num_epochs):
    # Walk through the training set in mini-batches
    for i in range(0, len(X_train_tensor), batch_size):
        batch_X = X_train_tensor[i:i+batch_size]
        batch_y = y_train_tensor[i:i+batch_size]

        outputs = model(batch_X)
        loss = criterion(outputs, batch_y.unsqueeze(1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Report the loss of the last mini-batch every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
```

    Epoch [100/1000], Loss: 0.3562
    Epoch [200/1000], Loss: 0.3216
    Epoch [300/1000], Loss: 0.3113
    Epoch [400/1000], Loss: 0.3065
    Epoch [500/1000], Loss: 0.3038
    Epoch [600/1000], Loss: 0.3024
    Epoch [700/1000], Loss: 0.2996
    Epoch [800/1000], Loss: 0.2975
    Epoch [900/1000], Loss: 0.2955
    Epoch [1000/1000], Loss: 0.2937

With the model trained, we can run inference on the test set and compare the results to the "Survived" column in `gender_submission.csv`.
Keep in mind that this file is Kaggle's sample submission, which predicts that every female passenger survives and every male passenger does not. It does not contain the true test labels, so the accuracy below measures agreement with that baseline.

```python
model.eval()
with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = (y_pred > 0.5).float()

correct_predictions = (y_pred_class == y_test_tensor.unsqueeze(1)).sum().item()
total_predictions = len(y_test_tensor)
acc = correct_predictions / total_predictions
print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
print(f"Accuracy: {acc:.2%}")
```

    Correct predictions: 368 out of 418
    Accuracy: 88.04%

Now, let's build what I think is a similar model with `fastai` primitives.
Load the data again to avoid any unintentional contamination.

```python
train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')
```

`TabularDataLoaders` from `fastai` needs the following configuration to create `DataLoaders`:
- `cat_names`: the names of the categorical variables
- `cont_names`: the names of the continuous variables
- `y_names`: the names of the dependent variables

```python
cat_names = ['Pclass', 'Sex']
cont_names = ['Age', 'SibSp', 'Parch', 'Fare']
dep_var = 'Survived'
```

Following a pattern similar to the one used in [chapter 1](https://github.com/fastai/fastbook/blob/master/01_intro.ipynb), we train the model:

```python
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(
    train_df,
    path='.',
    procs=procs,
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=dep_var,
    valid_pct=0.2,
    seed=42,
    bs=64,
)
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)
```

    /Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
    The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

    For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

      to[n].fillna(self.na_dict[n], inplace=True)

epoch	train_loss	valid_loss	accuracy	time
0	0.486258	0.233690	0.662921	00:02
1	0.378460	0.192642	0.662921	00:00
2	0.294309	0.132269	0.662921	00:00
3	0.248516	0.140377	0.662921	00:00
4	0.220335	0.132353	0.662921	00:00

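One detail in the table above stands out: `accuracy` is identical in every epoch. A possible explanation (an inference, not something the notebook confirms) is that `Survived` is numeric, so `fastai` sets the task up as regression with a single output unless a `y_block` such as `CategoryBlock()` is passed, and an argmax-based accuracy over a single output always picks index 0. The sketch below mimics that with plain Python lists. `argmax_accuracy` is a hypothetical stand-in for the metric, not `fastai` code, and the numbers are made up.

```python
def argmax_accuracy(outputs, targets):
    # Hypothetical stand-in for an argmax-based accuracy metric:
    # take the index of the largest value in each output row
    preds = [max(range(len(row)), key=row.__getitem__) for row in outputs]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

# Made-up targets: 2 of 3 passengers did not survive
targets = [0, 1, 0]

# With a single output per row, argmax is always 0, whatever the value is
early_epoch = [[0.1], [0.2], [0.3]]
late_epoch = [[0.05], [0.95], [0.02]]
print(argmax_accuracy(early_epoch, targets))
print(argmax_accuracy(late_epoch, targets))

# Both equal the share of zeros in the targets. For comparison, 118 / 178
# rounds to the 0.662921 in the table (178 is 20% of the 891 training rows,
# rounded down; the 118 is a guess at the number of non-survivors in that split)
print(round(118 / 178, 6))
```

If that is what happened here, passing `y_block=CategoryBlock()` to `TabularDataLoaders.from_df` would be the thing to try, with the caveat that the predictions would then have one column per class and the thresholding code that follows would need to change.
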
`learn.dls.test_dl` does not fill in the missing value in the `Fare` column of the test data.
This is most likely because `FillMissing` only records fill values for columns that had missing values in the training data, and `Fare` has none in `train.csv`.
So we fill it manually here.

```python
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
```

We run the test set through the model, then compare the results to the `Survived` column of `gender_submission.csv` and calculate the model accuracy.

```python
test_dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=test_dl)
binary_preds = (preds > 0.5).float()

y_test = pd.read_csv('titanic/gender_submission.csv')
correct_predictions = (binary_preds.numpy().flatten() == y_test['Survived']).sum()
total_predictions = len(y_test)
acc = correct_predictions / total_predictions
print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
print(f"Accuracy: {acc:.2%}")
```

    /Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
    The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

    For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

      to[n].fillna(self.na_dict[n], inplace=True)

    Correct predictions: 377 out of 418
    Accuracy: 90.19%

The accuracies of the two models are about the same!

For a first pass at training neural networks (with plenty of help from Sonnet), I think this went pretty well.
If you know deep learning well, let me know if I made any major mistakes.
It's a bit tough to know whether you're doing things correctly when working in isolation.
I suppose that's why Kaggle competitions can be useful for learning.