This notebook introduces users to machine learning using PyTorch and station data. The metloom package developed by M3Works is required to run it.
This notebook is adapted from a SnowEx Hackweek tutorial developed by Ibrahim Alabi. The full tutorial may be found here: https://
What is Machine Learning?¶
Machine learning (ML) is a field of artificial intelligence (AI) that focuses on developing algorithms or computer models using data. The goal is to use these “trained” computer models to make decisions. Unlike traditional programming, where we write explicit rules for every situation, ML models learn patterns from data to perform tasks. The sketch below makes this contrast concrete.
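As a toy illustration (the numbers below are made up purely for demonstration), compare a hand-written rule with a simple model whose parameters are fit from example data:
import numpy as np

# Traditional programming: a hand-coded rule with an explicit threshold
def rain_or_snow_rule(temp_c):
    return "snow" if temp_c <= 0.0 else "rain"

# Machine learning: fit a line's parameters from (made-up) example data
temps = np.array([-5.0, -2.0, 0.0, 1.0, 3.0, 6.0])     # feature: air temperature
fractions = np.array([1.0, 0.95, 0.7, 0.4, 0.1, 0.0])  # observed snow fraction
slope, intercept = np.polyfit(temps, fractions, deg=1)  # parameters learned from data

print(rain_or_snow_rule(-1.0))        # rule written by a human
print(slope * (-1.0) + intercept)     # prediction from learned parameters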
Data Download and Cleaning¶
To begin this tutorial, we will download SNOTEL data using the metloom package. Users who don’t have the required packages installed can run the cell below.
!pip install -q metloom torch torchvision torchaudio matplotlib scikit-learn
For a more detailed explanation of metloom, check out the tutorial on SNOTEL data access.
In this notebook, we will grab the following variables: SWE (SWE), average temperature (TEMPAVG), snow depth (SNOWDEPTH), and precipitation (PRECIPITATION). These variables will be obtained from the SNOTEL station at Green Lake, WA.
from datetime import datetime

from metloom.pointdata import SnotelPointData

# Define variables of interest
ALLOWED_VARIABLES = [
    SnotelPointData.ALLOWED_VARIABLES.SWE,
    SnotelPointData.ALLOWED_VARIABLES.TEMPAVG,
    SnotelPointData.ALLOWED_VARIABLES.SNOWDEPTH,
    SnotelPointData.ALLOWED_VARIABLES.PRECIPITATION,
]

# Define SNOTEL station
snotel_point = SnotelPointData(station_id="502:WA:SNTL", name="Green Lake")

# Get daily SNOTEL data from Green Lake, WA
data = snotel_point.get_daily_data(
    start_date=datetime(2010, 1, 1),
    end_date=datetime(2023, 1, 1),
    variables=ALLOWED_VARIABLES,
)

data.head()
Let’s take a look at the data that we just collected.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reset the index of the DataFrame for plotting purposes
for_plotting = data.reset_index()

# Define the units for each variable
units = {
    "SWE": "in",
    "SNOWDEPTH": "in",
    "AVG AIR TEMP": "degF",
    "PRECIPITATION": "in",
}

# List the variables for plotting
variables_to_plot = ["SWE", "SNOWDEPTH", "AVG AIR TEMP", "PRECIPITATION"]

# Make the plot
plt.figure(figsize=(12, 8))
for i, variable in enumerate(variables_to_plot):
    plt.subplot(2, 2, i + 1)
    plt.plot(for_plotting["datetime"], for_plotting[variable], label=variable)
    plt.ylabel(f"{variable} ({units[variable]})", fontsize=14)
    plt.xlabel("Date", fontsize=14)
plt.tight_layout()
plt.show()
We will now process this data so it’s easier to work with. First, we convert from imperial to metric units for easier interpretation. We then set the dates as the index so that we can derive weekly rolling averages of precipitation and temperature.
The dataframe is then trimmed so that only snow depth, SWE, weekly temperature, and weekly precipitation remain. Rows with NaN values, or with zero or unrealistically low snow depth and SWE values, are dropped. Finally, we derive snow density from the SWE and depth data.
clean_df = (
    for_plotting
    .assign(
        swe=lambda x: x.SWE.map(lambda y: y * 2.54 if y is not None else None),
        snowdepth=lambda x: x.SNOWDEPTH.map(lambda y: y * 2.54 if y is not None else None),
        precipitation=lambda x: x.PRECIPITATION.map(lambda y: y * 2.54 if y is not None else None),
        tempavg=lambda x: x["AVG AIR TEMP"].map(lambda y: (y - 32) * 5 / 9 if y is not None else None),
    )
    .set_index("datetime")
    .assign(
        precip_7_days_avg=lambda x: x.precipitation.shift().rolling(window="7D", min_periods=7).mean(),
        tempavg_7_days_avg=lambda x: x.tempavg.shift().rolling(window="7D", min_periods=7).mean(),
    )
    .filter(["datetime", "swe", "snowdepth", "tempavg_7_days_avg", "precip_7_days_avg"])
    .dropna()
    .query(
        "snowdepth != 0 and swe != 0 and "
        "snowdepth > 5 and swe > 3"
    )
    .assign(snowdensity=lambda x: x.swe / x.snowdepth)
)

clean_df
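Before modeling, a quick sanity check is worthwhile (this snippet is an optional addition to the tutorial). Since SWE can never exceed snow depth, snow density expressed as the SWE-to-depth ratio should fall between 0 and 1, typically roughly 0.1 to 0.6 for seasonal snow:
# Optional sanity check on the cleaned data
print(clean_df.describe())

# SWE can never exceed depth, so the ratio should stay between 0 and 1
print("Density range:", clean_df["snowdensity"].min(), "-", clean_df["snowdensity"].max())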
Building and Training a Neural Network¶
Now that we have SNOTEL data configured in a desirable format, we can build a simple neural network using PyTorch.
# Basic analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# PyTorch libraries
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Find the best available computing resource
available_device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

print(f"Available device: {available_device}")
The above cell identifies the most appropriate computing resource based on what is available. If an NVIDIA GPU (CUDA) or Apple’s Metal Performance Shaders (MPS) are available, they are prioritized; otherwise, the code defaults to the CPU. The GPU options offer faster training, while the CPU is universally available.
Data Splitting¶
We now split our SNOTEL data into training, validation, and testing sets. A typical split is 70% for training, 15% for validation, and 15% for testing.
# Features: everything except snow density
features = clean_df.drop("snowdensity", axis=1).values

# Target: snow density only
targets = clean_df["snowdensity"].values

# Split the data into training and temporary sets (70% training, 30% temporary)
features_train, features_temp, targets_train, targets_temp = train_test_split(
    features, targets, test_size=0.3, random_state=0
)

# Further split the temporary set into validation and test sets (15% each)
features_val, features_test, targets_val, targets_test = train_test_split(
    features_temp, targets_temp, test_size=0.5, random_state=0
)
Here is a breakdown of the above cell:
- We identify our “features”, the variables used to predict snow density: SWE, snow depth, temperature, and precipitation.
- We identify our “target”, which in this example is snow density.
- The data is split into training data (features_train, targets_train) and a temporary validation/testing set (features_temp, targets_temp).
- The temporary set is further split in half into validation data (features_val, targets_val) and test data (features_test, targets_test).
In both splitting operations, we also set random_state=0. This means the same training/validation/testing split occurs every time the code is run, which ensures reproducibility.
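As a quick check (an optional addition to the original tutorial), we can print the size of each split to confirm the roughly 70/15/15 proportions:
# Confirm the split proportions (roughly 70% / 15% / 15%)
n_total = len(features_train) + len(features_val) + len(features_test)
for name, split in [("train", features_train), ("val", features_val), ("test", features_test)]:
    print(f"{name}: {len(split)} samples ({len(split) / n_total:.0%})")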
Data Scaling¶
Now that we’ve split the data, we can apply scaling. The scaler is fit on the training data only and then used to transform the training, validation, and test sets; fitting on the full dataset would leak information from the validation and test sets into training.
# Generate scaler
scaler = StandardScaler()
# Fit scaler to training data
scaler.fit(features_train)
# Transform the training, validation, and test sets
features_train = scaler.transform(features_train)
features_val = scaler.transform(features_val)
features_test = scaler.transform(features_test)
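To see what StandardScaler did (an optional check, not part of the original tutorial), we can confirm that each transformed training feature column now has approximately zero mean and unit standard deviation:
# Each training feature column should now have ~0 mean and ~1 standard deviation
print("means:", features_train.mean(axis=0).round(6))
print("stds: ", features_train.std(axis=0).round(6))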
Create a Dataset Class¶
Next, we define a custom Dataset class and create instances of it for each of the three sets: training, validation, and testing.
# Create a custom Dataset class for the SNOTEL data
class SNOTELDataset(Dataset):
    def __init__(self, data, targets):
        self.data = torch.tensor(data, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32).view(-1, 1)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        target = self.targets[idx]
        return sample, target

# Create instances of the custom dataset for the training, validation, and testing sets
train_dataset = SNOTELDataset(data=features_train, targets=targets_train)
val_dataset = SNOTELDataset(data=features_val, targets=targets_val)
test_dataset = SNOTELDataset(data=features_test, targets=targets_test)
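Indexing a dataset calls __getitem__ under the hood. As an optional check, we can inspect one (sample, target) pair:
# Inspect a single (sample, target) pair from the training set
sample, target = train_dataset[0]
print(sample.shape, target.shape)  # e.g., torch.Size([4]) torch.Size([1])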
Now, we use DataLoader to manage our data in mini-batches during training, validation, and testing.
# Create DataLoaders for training, validation, and testing sets
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
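Each iteration over a DataLoader yields one mini-batch. Optionally, we can pull a single batch to confirm its shape:
# Grab one mini-batch to confirm its shape
inputs, labels = next(iter(train_loader))
print(inputs.shape, labels.shape)  # e.g., torch.Size([32, 4]) torch.Size([32, 1])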
Define the Neural Network¶
To set up our model, we begin by generating a simple feedforward neural network by subclassing torch.nn.Module.
# Define a new class for the neural network
class SNOTELNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SNOTELNN, self).__init__()  # Inherit from nn.Module
        # Define the layers
        self.fc1 = nn.Linear(input_size, hidden_size)  # Fully connected layer 1
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(hidden_size, output_size)  # Fully connected layer 2

    def forward(self, x):  # x is the batch of inputs
        # Define the forward pass
        out = self.fc1(x)  # Pass input through first layer
        out = self.relu(out)  # Apply ReLU activation
        out = self.fc2(out)  # Pass through second layer to get output
        return out

# Instantiate the model and move it to the device (GPU or CPU)
model = SNOTELNN(
    input_size=features_train.shape[1], hidden_size=128, output_size=1
).to(available_device)
The forward method defines how the input data flows through the network layers. It specifies the sequence of operations that the data undergoes as it moves from the input layer to the output layer. This method is called automatically when you pass data through the model (e.g., outputs = model(inputs)).
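To make this concrete (an optional check, not in the original tutorial), we can push a dummy batch through the untrained model and confirm we get one prediction per sample:
# Pass a dummy batch through the untrained model
dummy_batch = torch.randn(8, features_train.shape[1]).to(available_device)
with torch.no_grad():
    dummy_out = model(dummy_batch)  # implicitly calls model.forward(dummy_batch)
print(dummy_out.shape)  # torch.Size([8, 1]): one prediction per sample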
For any ML application, we need to define a loss function and an optimizer. Here, we’ll use Mean Squared Error Loss since we’re dealing with a regression problem. We’ll use the Adam optimizer, which is a good default choice due to its adaptive learning rates.
# Loss function
criterion = nn.MSELoss()
# Optimizer function
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
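Mean squared error is simply the average of the squared differences between predictions and targets. A tiny optional example shows criterion at work on made-up numbers:
# MSE on made-up numbers: ((1-2)^2 + (3-3)^2) / 2 = 0.5
demo_pred = torch.tensor([[1.0], [3.0]])
demo_true = torch.tensor([[2.0], [3.0]])
print(criterion(demo_pred, demo_true))  # tensor(0.5000)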
Training the Neural Network¶
We now write the training loop, which includes zeroing the gradients, performing the forward pass, computing the loss, backpropagating, and updating the model parameters. We will also validate the model on the validation set after each epoch.
num_epochs = 5

# Lists to store the training and validation losses
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss = 0.0  # Initialize cumulative training loss

    for inputs, labels in train_loader:
        # Move data to the appropriate device
        inputs, labels = inputs.to(available_device), labels.to(available_device)

        # Zero the gradients from the previous iteration
        optimizer.zero_grad()

        # Perform forward pass
        outputs = model(inputs)

        # Compute the loss
        loss = criterion(outputs, labels)

        # Perform backward pass (compute gradients)
        loss.backward()

        # Update the model parameters
        optimizer.step()

        # Accumulate training loss
        train_loss += loss.item()

    # Average training loss
    train_loss /= len(train_loader)
    train_losses.append(train_loss)  # Store the training loss for this epoch

    # Validation phase
    model.eval()  # Set model to evaluation mode
    val_loss = 0.0

    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(available_device), labels.to(available_device)  # Move to device
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item()

    # Average validation loss
    val_loss /= len(val_loader)
    val_losses.append(val_loss)  # Store the validation loss for this epoch

    print(f'Epoch [{epoch+1}/{num_epochs}], Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')
Let’s check out the loss from both the training data and the validation data.
# Plotting the training and validation losses
plt.figure(figsize=(10, 5))
plt.plot(range(1, num_epochs + 1), train_losses, label='Training Loss')
plt.plot(range(1, num_epochs + 1), val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss Over Epochs')
plt.legend()
plt.show()
Testing the Model¶
We’ve done a lot of work to prepare the machine learning model for our applications. Now, it’s finally time to test it against our observations.
# Evaluate the model on the test set and collect predictions
model.eval()  # Set the model to evaluation mode
test_loss = 0.0  # Initialize cumulative test loss
all_preds = []
all_labels = []

with torch.no_grad():  # Disable gradient computation for inference
    for inputs, labels in test_loader:
        # Move data to the appropriate device
        inputs, labels = inputs.to(available_device), labels.to(available_device)

        # Perform forward pass
        outputs = model(inputs)

        # Compute the loss
        loss = criterion(outputs, labels)

        # Accumulate test loss
        test_loss += loss.item()

        # Store the predictions and the corresponding labels
        all_preds.extend(outputs.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Calculate the average test loss
test_loss /= len(test_loader)
print(f'Test Loss: {test_loss:.4f}')

# Convert lists to numpy arrays for plotting
all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# Plot observed vs predicted
plt.figure(figsize=(8, 8))
plt.scatter(all_labels, all_preds, alpha=0.7)
plt.xlabel('Observed (Actual) Values')
plt.ylabel('Predicted Values')
plt.title('Observed vs. Predicted Values')
plt.grid(True)
plt.show()
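Beyond the scatter plot, a couple of summary metrics help quantify performance (an optional addition; RMSE is the square root of the mean squared error, and bias is the mean of the prediction errors):
# Summary error metrics on the test set
rmse = np.sqrt(np.mean((all_preds - all_labels) ** 2))
bias = np.mean(all_preds - all_labels)
print(f"RMSE: {rmse:.4f}, Bias: {bias:.4f}")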
Save the Model¶
We now have some pretty good results for predicting snow density. It is essential to save our trained model: doing so allows you to reuse it for predictions, further training, or sharing with others without having to retrain it from scratch. In PyTorch, saving and loading models is straightforward using the torch.save and torch.load functions.
# Save the model's state dictionary
torch.save(model.state_dict(), 'snotel_nn_model.pth')
# Initialize the model architecture
model = SNOTELNN(input_size=features_train.shape[1], hidden_size=128, output_size=1)
# Load the model's state dictionary
model.load_state_dict(torch.load('snotel_nn_model.pth', weights_only=True))
# Set the model to evaluation mode before inference
model.eval()
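With the model reloaded, we can predict the density for a new observation. The values below are hypothetical inputs, listed in the order of the feature columns (swe, snowdepth, tempavg_7_days_avg, precip_7_days_avg); note that the same scaler fit earlier must be applied first:
# Hypothetical new observation: [swe (cm), snowdepth (cm), 7-day avg temp (degC), 7-day avg precip (cm)]
new_obs = np.array([[25.0, 90.0, -3.0, 0.5]])

# Apply the SAME scaler that was fit on the training data
new_obs_scaled = scaler.transform(new_obs)

# Predict with the reloaded model (it was re-instantiated on the CPU)
with torch.no_grad():
    pred = model(torch.tensor(new_obs_scaled, dtype=torch.float32))
print(f"Predicted snow density: {pred.item():.3f}")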
Further Information¶
Hyperparameter Tuning¶
Hyperparameter tuning is a critical step in building machine learning models. Unlike model parameters (like weights and biases), which are learned from the data during training, hyperparameters are the settings you choose before the training process begins. These include:
- Learning Rate: Controls how much to adjust the model’s weights with respect to the loss gradient.
- Batch Size: Determines the number of training examples utilized in one iteration.
- Number of Hidden Layers and Neurons: Specifies the architecture of the neural network.
- Optimizer: The algorithm used to update model weights based on the computed gradients (e.g., Adam, SGD).
Tuning these hyperparameters can significantly affect the performance of your model. However, finding the optimal set of hyperparameters can be a challenging and time-consuming process, often requiring experimentation.
Manual vs. Automated Tuning¶
Manual Tuning: Involves adjusting hyperparameters based on intuition, experience, or trial and error. While straightforward, this approach can be inefficient and might not always yield the best results.
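A minimal sketch of such a manual search, reusing the objects defined earlier in this notebook (the candidate values below are arbitrary choices for illustration):
# Try a few hand-picked hyperparameter combinations and compare validation loss
for lr in [1e-4, 1e-3]:
    for hidden_size in [64, 128]:
        candidate = SNOTELNN(features_train.shape[1], hidden_size, 1).to(available_device)
        opt = torch.optim.Adam(candidate.parameters(), lr=lr)

        # Short training run (one pass over the training data, for illustration)
        candidate.train()
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(available_device), labels.to(available_device)
            opt.zero_grad()
            loss = criterion(candidate(inputs), labels)
            loss.backward()
            opt.step()

        # Score the candidate on the validation set
        candidate.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(available_device), labels.to(available_device)
                val_loss += criterion(candidate(inputs), labels).item()
        print(f"lr={lr}, hidden={hidden_size}: val loss {val_loss / len(val_loader):.4f}")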
Automated Tuning: Tools like Optuna can help automate the search for the best hyperparameters. These tools explore the hyperparameter space more systematically and can save a lot of time compared to manual tuning.
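As a sketch of how Optuna could be wired into this notebook (Optuna is not installed by the cell at the top, so pip install optuna would be needed first, and the search ranges below are arbitrary):
import optuna

def objective(trial):
    # Let Optuna suggest hyperparameters within (arbitrary) search ranges
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    hidden_size = trial.suggest_int("hidden_size", 32, 256)

    candidate = SNOTELNN(features_train.shape[1], hidden_size, 1).to(available_device)
    opt = torch.optim.Adam(candidate.parameters(), lr=lr)

    # Brief training run; a real study would train for more epochs
    candidate.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(available_device), labels.to(available_device)
        opt.zero_grad()
        loss = criterion(candidate(inputs), labels)
        loss.backward()
        opt.step()

    # Return the validation loss for Optuna to minimize
    candidate.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(available_device), labels.to(available_device)
            val_loss += criterion(candidate(inputs), labels).item()
    return val_loss / len(val_loader)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)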