
Evaluation Lab

Optional Lab: Core Skills

This lab provides hands-on practice with:

  • Splitting datasets into training, cross validation, and test sets
  • Evaluating regression and classification models
  • Improving models by adding polynomial features
  • Comparing different neural network architectures
  • Implementing systematic model selection
Dataset
  • Loaded a dataset with 50 examples of input feature x and target y
  • Plotted the dataset to visualize the relationship between input and target (sketched below)
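A minimal sketch of this step is below; the file name (data.csv) and its two-column layout are assumptions for illustration, not the lab's actual data file.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file name and layout: one input column, one target column
data = np.loadtxt('./data.csv', delimiter=',')
x = np.expand_dims(data[:, 0], axis=1)   # input feature, shape (50, 1)
y = np.expand_dims(data[:, 1], axis=1)   # target, shape (50, 1)

# Visualize the relationship between input and target
plt.scatter(x, y, marker='x', c='r')
plt.xlabel('x')
plt.ylabel('y')
plt.title('input vs. target')
plt.show()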
Three-Way Split
  • Split the data into:
  • Training set (60%): 30 examples
  • Cross validation set (20%): 10 examples
  • Test set (20%): 10 examples
Train-CV-Test Split
from sklearn.model_selection import train_test_split

# Get 60% of the dataset as the training set
x_train, x_, y_train, y_ = train_test_split(x, y, test_size=0.40, random_state=1)

# Split the remaining 40% in half: one half for the CV set, one half for the test set
x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.50, random_state=1)
  • Used StandardScaler to compute z-score of inputs: z = (x - μ)/σ
  • Crucial to use training set’s mean and standard deviation when scaling CV and test sets:
  • Fit and transform on training set: X_train_scaled = scaler.fit_transform(x_train)
  • Only transform on CV/test sets: X_cv_scaled = scaler.transform(x_cv)
  • Trained a linear regression model on the scaled training data
  • Calculated Mean Squared Error (MSE) for both the training and CV sets (see the sketch after this list):
  • J_train = (1/(2·m_train)) ∑ (f(x_train) - y_train)²
  • J_cv = (1/(2·m_cv)) ∑ (f(x_cv) - y_cv)²
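A minimal sketch of the scaling and baseline-model steps above, assuming scikit-learn throughout; the division by 2 matches the J_train/J_cv formulas, and the variable names are illustrative.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit the scaler on the training set only, then reuse its statistics for the CV set
scaler_linear = StandardScaler()
X_train_scaled = scaler_linear.fit_transform(x_train)
X_cv_scaled = scaler_linear.transform(x_cv)

# Baseline linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# J_train and J_cv as MSE divided by 2, matching the formulas above
J_train = mean_squared_error(y_train, linear_model.predict(X_train_scaled)) / 2
J_cv = mean_squared_error(y_cv, linear_model.predict(X_cv_scaled)) / 2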
Model Improvement
  • Created polynomial features up to degree 10 using PolynomialFeatures
  • For each polynomial degree:
  1. Added polynomial features
  2. Scaled features
  3. Trained linear regression model
  4. Computed training and CV MSEs
Polynomial Feature Testing
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Initialize lists to save errors, models, and transforms
train_mses = []
cv_mses = []
models = []
polys = []
scalers = []

# Loop over different polynomial degrees
for degree in range(1, 11):
    # Add polynomial features
    poly = PolynomialFeatures(degree, include_bias=False)
    X_train_mapped = poly.fit_transform(x_train)
    polys.append(poly)

    # Scale features using training set statistics
    scaler_poly = StandardScaler()
    X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
    scalers.append(scaler_poly)

    # Train the model and record the training and CV MSEs (divided by 2, per the formulas above)
    model = LinearRegression()
    model.fit(X_train_mapped_scaled, y_train)
    models.append(model)
    train_mses.append(mean_squared_error(y_train, model.predict(X_train_mapped_scaled)) / 2)
    X_cv_mapped_scaled = scaler_poly.transform(poly.transform(x_cv))
    cv_mses.append(mean_squared_error(y_cv, model.predict(X_cv_mapped_scaled)) / 2)
  • Results showed:
  • Linear model (degree=1) had high training and CV MSEs
  • Adding a quadratic term (degree=2) dramatically reduced both errors
  • Performance remained relatively stable through degree=5
  • Higher degrees (6-10) showed increasing CV error, indicating overfitting
  • Selected the model with the lowest CV MSE (degree = 5)
  • Computed the test MSE to estimate generalization error (see the sketch after this list)
  • Final results demonstrated the model selection process:
  • Training MSE, CV MSE, and Test MSE all reported
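A minimal sketch of the selection step, reusing the models, polys, scalers, train_mses, and cv_mses lists built in the loop above; the +1 index offset and the division by 2 are assumptions consistent with that loop.

import numpy as np
from sklearn.metrics import mean_squared_error

# Pick the degree with the lowest cross validation MSE (+1 because the loop started at degree=1)
degree = int(np.argmin(cv_mses)) + 1

# Apply that model's transforms to the test set and estimate the generalization error
X_test_mapped_scaled = scalers[degree - 1].transform(polys[degree - 1].transform(x_test))
test_mse = mean_squared_error(y_test, models[degree - 1].predict(X_test_mapped_scaled)) / 2

print(f"Training MSE: {train_mses[degree - 1]:.2f}")
print(f"Cross validation MSE: {cv_mses[degree - 1]:.2f}")
print(f"Test MSE: {test_mse:.2f}")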
Architecture Comparison
  • Used the same dataset but applied neural network models
  • Tested multiple architectures:
  1. Model 1: Small network (1 hidden layer)
  2. Model 2: Medium network (2 hidden layers)
  3. Model 3: Larger network (3 hidden layers)
  • For each architecture:
  1. Trained the model on scaled training data
  2. Computed training and CV MSEs
  3. Selected the model with the lowest CV MSE (a sketch follows this list)
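A minimal sketch of the architecture comparison using TensorFlow/Keras; the layer sizes, learning rate, and epoch count are illustrative assumptions rather than the lab's exact settings, and the scaled arrays come from the regression section above.

import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.metrics import mean_squared_error

# Illustrative architectures (unit counts assumed, not the lab's exact values)
model_1 = Sequential([Dense(25, activation='relu'), Dense(1, activation='linear')], name='model_1')
model_2 = Sequential([Dense(20, activation='relu'), Dense(12, activation='relu'), Dense(1, activation='linear')], name='model_2')
model_3 = Sequential([Dense(32, activation='relu'), Dense(16, activation='relu'), Dense(8, activation='relu'), Dense(1, activation='linear')], name='model_3')

nn_train_mses = []
nn_cv_mses = []
for model in [model_1, model_2, model_3]:
    model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.1))
    model.fit(X_train_scaled, y_train, epochs=300, verbose=0)
    # Record training and CV MSEs (divided by 2, as before)
    nn_train_mses.append(mean_squared_error(y_train, model.predict(X_train_scaled)) / 2)
    nn_cv_mses.append(mean_squared_error(y_cv, model.predict(X_cv_scaled)) / 2)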
Binary Classification
  • Loaded a binary classification dataset with 200 examples
  • Each example had 2 input features and a target (0 or 1)
  • Split into training (60%), CV (20%), and test (20%) sets
  • Scaled features using the training set statistics
  • Measured performance using misclassification rate:
  • Fraction of examples where predicted class != actual class
  • Computed as: np.mean(predictions != ground_truth)
Classification Error Calculation
import numpy as np
import tensorflow as tf

# Convert the model outputs (logits) to probabilities
yhat = tf.math.sigmoid(model.predict(x_scaled))
# Apply the threshold to get binary predictions
threshold = 0.5
yhat = np.where(yhat >= threshold, 1, 0)
# Compute the fraction of misclassified examples
error = np.mean(yhat != y_true)
  • Built the same neural network architectures as for regression
  • Configured them for classification:
  • Used a linear activation in the output layer
  • Applied binary crossentropy loss with from_logits=True
  • Used the sigmoid function to convert outputs to probabilities
  • Applied a threshold (0.5) to make binary predictions
  • Selected the best model based on CV error
  • Reported final training, CV, and test classification errors (a sketch follows this list)
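A minimal sketch of one classification model configured as described above; the layer sizes and the x_bc_*/y_bc_* variable names are assumptions for illustration.

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Illustrative architecture; linear output layer so the loss is computed from logits
model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(1, activation='linear'),
])
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
model.fit(x_bc_train_scaled, y_bc_train, epochs=200, verbose=0)   # hypothetical variable names

# Sigmoid converts logits to probabilities; threshold at 0.5 for binary predictions
threshold = 0.5
yhat_cv = np.where(tf.math.sigmoid(model.predict(x_bc_cv_scaled)) >= threshold, 1, 0)
cv_error = np.mean(yhat_cv != y_bc_cv)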

Lab Summary
  1. Three-way splitting is crucial for model selection and honest performance estimation
  2. Feature scaling should always use training set statistics
  3. Polynomial features can dramatically improve linear models for non-linear data
  4. Model selection should be based on cross-validation performance, not training performance
  5. Generalization error should be estimated using a separate test set that wasn’t used for model decisions

The systematic approach to model evaluation and selection demonstrated in this lab provides a solid foundation for developing models that generalize well to new data. By properly separating training, validation, and testing data, you can confidently select model architectures and report honest performance metrics.