
Evaluation Lab

Optional Lab: Core Skills

This lab provides hands-on practice with:

  • Splitting datasets into training, cross validation, and test sets
  • Evaluating regression and classification models
  • Improving models by adding polynomial features
  • Comparing different neural network architectures
  • Implementing systematic model selection
Dataset
  • Loaded a dataset with 50 examples of input feature x and target y
  • Plotted the dataset to visualize the relationship between input and target (sketched below)
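A minimal sketch of this step is below; the file name (data.csv) and its two-column layout are assumptions for illustration, not the lab's actual data file.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file name and layout: one input column, one target column
data = np.loadtxt('./data.csv', delimiter=',')
x = np.expand_dims(data[:, 0], axis=1)   # input feature, shape (50, 1)
y = np.expand_dims(data[:, 1], axis=1)   # target, shape (50, 1)

# Visualize the relationship between input and target
plt.scatter(x, y, marker='x', c='r')
plt.xlabel('x')
plt.ylabel('y')
plt.title('input vs. target')
plt.show()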
Three-Way Split
  • Split the data into:
  • Training set (60%): 30 examples
  • Cross validation set (20%): 10 examples
  • Test set (20%): 10 examples
Train-CV-Test Split
from sklearn.model_selection import train_test_split

# Get 60% of the dataset as the training set
x_train, x_, y_train, y_ = train_test_split(x, y, test_size=0.40, random_state=1)

# Split the remaining 40% in half: one half for the CV set, one half for the test set
x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.50, random_state=1)
  • Used StandardScaler to compute z-score of inputs: z = (x - μ)/σ
  • Crucial to use training set’s mean and standard deviation when scaling CV and test sets:
  • Fit and transform on training set: X_train_scaled = scaler.fit_transform(x_train)
  • Only transform on CV/test sets: X_cv_scaled = scaler.transform(x_cv)
  • Trained a linear regression model on the scaled training data
  • Calculated Mean Squared Error (MSE) for both the training and CV sets (see the sketch after this list):
  • J_train = (1/(2·m_train)) ∑ (f(x_train) - y_train)²
  • J_cv = (1/(2·m_cv)) ∑ (f(x_cv) - y_cv)²
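A minimal sketch of the scaling and baseline-model steps above, assuming scikit-learn throughout; the division by 2 matches the J_train/J_cv formulas, and the variable names are illustrative.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit the scaler on the training set only, then reuse its statistics for the CV set
scaler_linear = StandardScaler()
X_train_scaled = scaler_linear.fit_transform(x_train)
X_cv_scaled = scaler_linear.transform(x_cv)

# Baseline linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# J_train and J_cv as MSE divided by 2, matching the formulas above
J_train = mean_squared_error(y_train, linear_model.predict(X_train_scaled)) / 2
J_cv = mean_squared_error(y_cv, linear_model.predict(X_cv_scaled)) / 2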
Model Improvement
  • Created polynomial features up to degree 10 using PolynomialFeatures
  • For each polynomial degree:
  1. Added polynomial features
  2. Scaled features
  3. Trained linear regression model
  4. Computed training and CV MSEs
Polynomial Feature Testing
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Initialize lists to save errors, models, and transforms
train_mses = []
cv_mses = []
models = []
polys = []
scalers = []

# Loop over different polynomial degrees
for degree in range(1, 11):
    # Add polynomial features
    poly = PolynomialFeatures(degree, include_bias=False)
    X_train_mapped = poly.fit_transform(x_train)
    polys.append(poly)

    # Scale features using training set statistics
    scaler_poly = StandardScaler()
    X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
    scalers.append(scaler_poly)

    # Train the model and record the training and CV MSEs (divided by 2, per the formulas above)
    model = LinearRegression()
    model.fit(X_train_mapped_scaled, y_train)
    models.append(model)
    train_mses.append(mean_squared_error(y_train, model.predict(X_train_mapped_scaled)) / 2)
    X_cv_mapped_scaled = scaler_poly.transform(poly.transform(x_cv))
    cv_mses.append(mean_squared_error(y_cv, model.predict(X_cv_mapped_scaled)) / 2)
  • Results showed:
  • Linear model (degree=1) had high training and CV MSEs
  • Adding a quadratic term (degree=2) dramatically reduced both errors
  • Performance remained relatively stable through degree=5
  • Higher degrees (6-10) showed increasing CV error, indicating overfitting
  • Selected the model with the lowest CV MSE (degree = 5)
  • Computed the test MSE to estimate generalization error (see the sketch after this list)
  • Final results demonstrated the model selection process:
  • Training MSE, CV MSE, and Test MSE all reported
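A minimal sketch of the selection step, reusing the models, polys, scalers, train_mses, and cv_mses lists built in the loop above; the +1 index offset and the division by 2 are assumptions consistent with that loop.

import numpy as np
from sklearn.metrics import mean_squared_error

# Pick the degree with the lowest cross validation MSE (+1 because the loop started at degree=1)
degree = int(np.argmin(cv_mses)) + 1

# Apply that model's transforms to the test set and estimate the generalization error
X_test_mapped_scaled = scalers[degree - 1].transform(polys[degree - 1].transform(x_test))
test_mse = mean_squared_error(y_test, models[degree - 1].predict(X_test_mapped_scaled)) / 2

print(f"Training MSE: {train_mses[degree - 1]:.2f}")
print(f"Cross validation MSE: {cv_mses[degree - 1]:.2f}")
print(f"Test MSE: {test_mse:.2f}")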
Architecture Comparison
  • Used the same dataset but applied neural network models
  • Tested multiple architectures:
  1. Model 1: Small network (1 hidden layer)
  2. Model 2: Medium network (2 hidden layers)
  3. Model 3: Larger network (3 hidden layers)
  • For each architecture:
  1. Trained the model on scaled training data
  2. Computed training and CV MSEs
  3. Selected the model with the lowest CV MSE (a sketch follows this list)
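A minimal sketch of the architecture comparison using TensorFlow/Keras; the layer sizes, learning rate, and epoch count are illustrative assumptions rather than the lab's exact settings, and the scaled arrays come from the regression section above.

import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.metrics import mean_squared_error

# Illustrative architectures (unit counts assumed, not the lab's exact values)
model_1 = Sequential([Dense(25, activation='relu'), Dense(1, activation='linear')], name='model_1')
model_2 = Sequential([Dense(20, activation='relu'), Dense(12, activation='relu'), Dense(1, activation='linear')], name='model_2')
model_3 = Sequential([Dense(32, activation='relu'), Dense(16, activation='relu'), Dense(8, activation='relu'), Dense(1, activation='linear')], name='model_3')

nn_train_mses = []
nn_cv_mses = []
for model in [model_1, model_2, model_3]:
    model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))
    model.fit(X_train_scaled, y_train, epochs=300, verbose=0)
    # Record training and CV MSEs (divided by 2, as before)
    nn_train_mses.append(mean_squared_error(y_train, model.predict(X_train_scaled)) / 2)
    nn_cv_mses.append(mean_squared_error(y_cv, model.predict(X_cv_scaled)) / 2)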
Binary Classification
  • Loaded a binary classification dataset with 200 examples
  • Each example had 2 input features and a target (0 or 1)
  • Split into training (60%), CV (20%), and test (20%) sets
  • Scaled features using the training set statistics
  • Measured performance using misclassification rate:
  • Fraction of examples where predicted class != actual class
  • Computed as: np.mean(predictions != ground_truth)
Classification Error Calculation
import numpy as np
import tensorflow as tf

# Convert the model outputs (logits) to probabilities
yhat = tf.math.sigmoid(model.predict(x_scaled))
# Apply the threshold to get binary predictions
threshold = 0.5
yhat = np.where(yhat >= threshold, 1, 0)
# Compute the fraction of misclassified examples
error = np.mean(yhat != y_true)
  • Built the same neural network architectures as for regression
  • Configured them for classification:
  • Used a linear activation in the output layer
  • Applied binary crossentropy loss with from_logits=True
  • Used the sigmoid function to convert outputs to probabilities
  • Applied a threshold (0.5) to make binary predictions
  • Selected the best model based on CV error
  • Reported final training, CV, and test classification errors (a sketch follows this list)
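A minimal sketch of one classification model configured as described above; the layer sizes and the x_bc_*/y_bc_* variable names are assumptions for illustration.

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Illustrative architecture; linear output layer so the loss is computed from logits
model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(1, activation='linear'),
])
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
model.fit(x_bc_train_scaled, y_bc_train, epochs=200, verbose=0)   # hypothetical variable names

# Sigmoid converts logits to probabilities; threshold at 0.5 for binary predictions
threshold = 0.5
yhat_cv = np.where(tf.math.sigmoid(model.predict(x_bc_cv_scaled)) >= threshold, 1, 0)
cv_error = np.mean(yhat_cv != y_bc_cv)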

Lab Summary
  1. Three-way splitting is crucial for model selection and honest performance estimation
  2. Feature scaling should always use training set statistics
  3. Polynomial features can dramatically improve linear models for non-linear data
  4. Model selection should be based on cross-validation performance, not training performance
  5. Generalization error should be estimated using a separate test set that wasn’t used for model decisions

The systematic approach to model evaluation and selection demonstrated in this lab provides a solid foundation for developing models that generalize well to new data. By properly separating training, validation, and testing data, you can confidently select model architectures and report honest performance metrics.