Selecting
By Angela C
October 6, 2021
Reading time: 4 minutes.
Select and train a model
After framing the problem, getting and exploring the data, sampling a training set and a test set, and writing transformation pipelines to clean up and prepare the data for machine learning algorithms automatically, the next step is to select and train a machine learning model. Because of all the previous steps, this will be relatively easy.
Train a linear regression model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
LinearRegression()
Try it on a few instances from the training set:
# try the full preprocessing pipeline on a few training instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [210816. 317904. 211040. 59112. 189832.]
Compare against the actual values:
print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
some_data_prepared
array([[-1.15604281, 0.77194962, 0.74333089, -0.49323393, -0.44543821,
-0.63621141, -0.42069842, -0.61493744, -0.31205452, -0.08649871,
0.15531753, 1. , 0. , 0. , 0. ,
0. ],
[-1.17602483, 0.6596948 , -1.1653172 , -0.90896655, -1.0369278 ,
-0.99833135, -1.02222705, 1.33645936, 0.21768338, -0.03353391,
-0.83628902, 1. , 0. , 0. , 0. ,
0. ],
[ 1.18684903, -1.34218285, 0.18664186, -0.31365989, -0.15334458,
-0.43363936, -0.0933178 , -0.5320456 , -0.46531516, -0.09240499,
0.4222004 , 0. , 0. , 0. , 0. ,
1. ],
[-0.01706767, 0.31357576, -0.29052016, -0.36276217, -0.39675594,
0.03604096, -0.38343559, -1.04556555, -0.07966124, 0.08973561,
-0.19645314, 0. , 1. , 0. , 0. ,
0. ],
[ 0.49247384, -0.65929936, -0.92673619, 1.85619316, 2.41221109,
2.72415407, 2.57097492, -0.44143679, -0.35783383, -0.00419445,
0.2699277 , 1. , 0. , 0. , 0. ,
0. ]])
Measure the regression model's RMSE
- See the Scikit-learn metrics documentation on regression metrics.
- Scikit-learn's mean_squared_error function computes the mean squared error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss.
- Measure the RMSE on the whole training set.
- On newer versions you can set squared=False to avoid having to take the square root yourself.
Measure the regression model’s RMSE on the whole training set:
import numpy as np
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
68628.32454669532
You can get the RMSE directly by calling the mean_squared_error() function with squared=False.
# set squared=False to avoid having to take the square root
lin_rmse = mean_squared_error(housing_labels, housing_predictions, squared=False)
lin_rmse
68628.32454669532
A prediction error of over $68,000 when the median housing values range from about $120,000 to $265,000 is not very satisfying. This is an example of the model underfitting the training data: either the model is not powerful enough, or the features do not provide enough information to make good predictions. To fix underfitting, select a more powerful model, feed the algorithm better features, or, if the model is regularised, reduce the constraints on the model.
Mean absolute error
The related median_absolute_error metric is particularly interesting because it is robust to outliers: the loss is calculated by taking the median of all absolute differences between the target and the prediction. The code below computes the mean absolute error on the training set:
from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae
49444.22728924418
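As a quick comparison, the outlier-robust median_absolute_error mentioned above can be computed in the same way. This is a minimal sketch using the same predictions; the variable name is illustrative and the resulting value is not shown here.
from sklearn.metrics import median_absolute_error
# median of the absolute differences between targets and predictions (robust to outliers)
lin_medae = median_absolute_error(housing_labels, housing_predictions)
lin_medae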
A Decision Tree Regressor
A DecisionTreeRegressor is a powerful model that can find complex non-linear relationships in the data.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class, which is imported from sklearn.tree.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
DecisionTreeRegressor(random_state=42)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
0.0
This zero error does not mean that the model is absolutely perfect. Instead, it implies that the model has badly overfit the training data.
Cross Validation
Note that you should not touch the test set until you are ready to launch a model you are confident in. Therefore, you need to perform model validation on part of the training set.
This reduces the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.
- You could use the train_test_split function to split the training set into a smaller training set and a validation set, then train the models on the smaller training set and evaluate them on the validation set (see the sketch after this list).
- Alternatively, use K-fold cross-validation to split the training set into k folds, then train and evaluate the model k times, picking a different fold for evaluation every time and training on the other k-1 folds. The performance measure reported by k-fold cross-validation is then the average of the k values computed in the loop.
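A minimal sketch of the first option, assuming the housing_prepared and housing_labels objects from above; the 80/20 split, random_state and variable names are illustrative choices, not part of the original workflow.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# hold out part of the (already prepared) training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    housing_prepared, housing_labels, test_size=0.2, random_state=42)
val_reg = LinearRegression()
val_reg.fit(X_train, y_train)
val_rmse = mean_squared_error(y_val, val_reg.predict(X_val), squared=False)
val_rmse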
Note that Scikit-learn's cross-validation feature expects a utility function rather than a cost function, so the scoring function is the opposite of the MSE (with a cost function, lower is better; with a utility function, greater is better). This is why the code below negates the scores before calculating the square root.
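A minimal sketch of what that code might look like, using the decision tree trained above; the choice of 10 folds is an arbitrary but common one.
import numpy as np
from sklearn.model_selection import cross_val_score
# scoring="neg_mean_squared_error" returns negative MSE values (utility scores),
# so negate them before taking the square root to recover the RMSE per fold
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
tree_rmse_scores.mean()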