Machine Learning Data Preprocessing: Feature Engineering
George Jen, Jen Tek LLC
Data preprocessing is a large part of data science and data engineering; it is arguably the most time-consuming phase of a machine learning project. Feature engineering is a portion of data preprocessing. Part of the work is to understand the business meaning of the available feature variables (columns) and to select, transform, and optimize the right ones to participate in training the machine learning model. Dimensionality reduction, meaning removing feature columns that do not help the model, is part of that process. The goal of feature engineering is to prepare the right feature columns for machine learning.
The quality of data preprocessing often determines the success or failure of a machine learning project.
The use case demonstrated here is an example of dimensionality reduction using the housing dataset from an active Kaggle competition:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The housing price to be predicted is continuous and is learned from past sale prices, so this is prediction by regression.
Please do not confuse this dataset with the well-known Boston Housing dataset.
This dataset contains 80 feature columns (variables) and 1 target label, SalePrice, the price at which the house was sold. It contains both continuous numeric columns and categorical character-string columns, which makes it a good dataset for practicing data preprocessing with commonly accepted mathematical/statistical methodology.
Let’s carry on.
Numeric feature columns that are continuous:
LotFrontage: Linear feet of street connected to property (number)
LotArea: Lot size in square feet (number)
YearBuilt: Original construction date (number)
YearRemodAdd: Remodel date (number)
BsmtFinSF1: Type 1 finished square feet (number)
BsmtFinSF2: Type 2 finished square feet (number)
BsmtUnfSF: Unfinished square feet of basement area (number)
TotalBsmtSF: Total square feet of basement area (number)
BsmtFullBath: Basement full bathrooms (number)
BsmtHalfBath: Basement half bathrooms (number)
FullBath: Full bathrooms above grade (number)
HalfBath: Half baths above grade (number)
Bedroom: Number of bedrooms above basement level (number)
Kitchen: Number of kitchens (number)
WoodDeckSF: Wood deck area in square feet (number)
OpenPorchSF: Open porch area in square feet (number)
EnclosedPorch: Enclosed porch area in square feet (number)
3SsnPorch: Three season porch area in square feet (number)
ScreenPorch: Screen porch area in square feet (number)
PoolArea: Pool area in square feet (number)
MiscVal: Value of miscellaneous feature (number)
MoSold: Month Sold (number)
YrSold: Year Sold (number)
1stFlrSF: First Floor square feet (number)
2ndFlrSF: Second floor square feet (number)
LowQualFinSF: Low quality finished square feet (all floors) (number)
GrLivArea: Above grade (ground) living area square feet (number)
Categorical/ordinal feature columns (some are numeric rather than text strings, but they are categorical or ordinal in nature):
MSSubClass: The building class (number, cat)
MSZoning: The general zoning classification (String)
Street: Type of road access (String)
Alley: Type of alley access (String)
LotShape: General shape of property (String)
LandContour: Flatness of the property (String)
Utilities: Type of utilities available (String)
LotConfig: Lot configuration (String)
LandSlope: Slope of property (String)
Neighborhood: Physical locations within Ames city limits (String)
Condition1: Proximity to main road or railroad (String)
Condition2: Proximity to main road or railroad (if a second is present) (String)
BldgType: Type of dwelling (String)
HouseStyle: Style of dwelling (String)
OverallQual: Overall material and finish quality (number, cat)
OverallCond: Overall condition rating (number, cat)
RoofStyle: Type of roof (String)
RoofMatl: Roof material (String)
Exterior1st: Exterior covering on house (String)
Exterior2nd: Exterior covering on house (if more than one material) (String)
MasVnrType: Masonry veneer type (String)
ExterQual: Exterior material quality (String)
ExterCond: Present condition of the material on the exterior (String)
Foundation: Type of foundation (String)
BsmtQual: Height of the basement (String)
BsmtCond: General condition of the basement (String)
BsmtExposure: Walkout or garden level basement walls (String)
BsmtFinType1: Quality of basement finished area (String)
BsmtFinType2: Quality of second finished area (if present) (String)
Heating: Type of heating (String)
HeatingQC: Heating quality and condition (String)
CentralAir: Central air conditioning (String)
Electrical: Electrical system (String)
KitchenQual: Kitchen quality (String)
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) (number)
Functional: Home functionality rating (String)
Fireplaces: Number of fireplaces (number)
FireplaceQu: Fireplace quality (String)
GarageType: Garage location (String)
GarageYrBlt: Year garage was built (number)
GarageFinish: Interior finish of the garage (String)
GarageCars: Size of garage in car capacity (number)
GarageArea: Size of garage in square feet (number)
GarageQual: Garage quality (String)
GarageCond: Garage condition (String)
PavedDrive: Paved driveway (String)
PoolQC: Pool quality (String)
Fence: Fence quality (String)
MiscFeature: Miscellaneous feature not covered in other categories (String)
SaleType: Type of sale (String)
SaleCondition: Condition of sale (String)
Target:
SalePrice — the property’s sale price in dollars. This is the target variable that you’re trying to predict for the future.
Most of the numeric feature variables are continuous, with the exception of a few that are ordinal, i.e., categorical.
All non-numeric features are categorical and need to be label encoded into numeric ordinal values before they can be used by machine learning algorithms, in this instance regressors.
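As a quick illustration (a toy sketch with illustrative values, not part of the housing pipeline below), label encoding with pandas category codes looks like this:
import pandas as pd
#Toy column, only to illustrate label encoding with category codes
toy = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', 'Gable', 'Flat']})
toy['RoofStyle'] = toy['RoofStyle'].astype('category')
toy['RoofStyle_code'] = toy['RoofStyle'].cat.codes  #Flat->0, Gable->1, Hip->2 (alphabetical category order)
print(toy)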
Feature engineering is the work done on the feature columns in supervised machine learning. For prediction by regression, it consists of the following:
1. Exploratory data analysis to understand each feature column, whether it is ordinal or ratio (the statistical terms for categorical and continuous values, respectively).
2. Identify the correlation between each feature column and the target label. Pearson correlation and Spearman correlation are often employed.
The Pearson correlation evaluates the linear relationship between two continuous variables.
Spearman correlation is often used to evaluate relationships involving ordinal (or categorical) variables.
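As a small self-contained sketch (toy data, not the housing dataset), both coefficients can be computed directly with pandas:
import pandas as pd
#Toy illustration: Pearson measures linear correlation, Spearman measures rank (monotonic) correlation
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([2, 4, 5, 4, 6])
print(s1.corr(s2, method='pearson'))   #linear correlation
print(s1.corr(s2, method='spearman'))  #rank correlation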
The goal of the correlation evaluation is to sort out the features that are more strongly correlated, either positively or negatively, with the target label, and to weed out the features that are barely correlated with it, in an attempt to reduce the feature dimension without sacrificing regression learning quality, which can be measured by R square.
https://en.wikipedia.org/wiki/Coefficient_of_determination
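For reference, R square (the coefficient of determination) is one minus the ratio of the residual sum of squares to the total sum of squares; a minimal sketch with made-up numbers:
import numpy as np
#Toy illustration of R square: 1 - SS_res / SS_tot
y_true = np.array([200000., 150000., 320000., 180000.])
y_hat  = np.array([210000., 140000., 300000., 190000.])
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)  #the same value sklearn's r2_score would return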
Reducing the feature dimensions, i.e., eliminating feature columns, has the following benefits when done properly:
It may improve model performance in terms of a higher R Square, or at least not significantly decrease it. Why can reducing the column dimension of the dataset actually improve regression quality? Because the eliminated feature columns are mostly noise and have little influence on the variation of the target label.
The ML code runs faster, because it processes less data, uses less memory, and consumes fewer CPU cycles. If the ML code runs on the cloud, that translates into real money saved, because it demands fewer cloud resources.
Action:
1. First, I run the regressors below on all feature columns using sklearn in Python:
Regressors:
Linear:
LinearRegression
Lasso
Ridge
Non-Linear:
Decision Tree
Random Forest
Gradient Boosting
Following is the code:
import numpy as np
import pandas as pd
import math
#Load csv file into pandas dataframe
df=pd.read_csv("/home/bigdata2/kaggle/housing.csv",sep=',',header='infer')
#The data contains both number and non-number columns; get all the number columns into pandas dataframe df_num
df_num=df.select_dtypes(include='number')
#Get all the non-number columns into another pandas dataframe df_non_num
df_non_num=df.select_dtypes(exclude='number')
#Convert the datatype of all non-number columns to the category datatype
df_non_num=df_non_num.astype('category')
#Get all the non-number column names into a pandas Index (similar to a list of column names) called cat_columns
cat_columns = df_non_num.select_dtypes(['category']).columns
#Convert the text values of each categorical column into numeric values, which is called label encoding; ML algorithms can only deal with numeric values, even when they are categorical.
df_non_num[cat_columns]=df_non_num[cat_columns].apply(lambda x: x.cat.codes)
#Finally, join df_non_num with df_num; the last column of df_num is SalePrice, the target, which
#becomes the last column of the combined dataframe df_all_num
df_all_num=df_non_num.join(df_num)
#There are NaN values in df_all_num, which ML algorithms cannot handle; replace NaN with 0
df_all_num=df_all_num.fillna(0)
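As a side note, filling every NaN with 0 is the simplest choice; a hypothetical alternative (not used for the results below) is to impute numeric NaNs with each column's median:
#Hypothetical alternative to fillna(0): impute with column medians instead
#(kept out of the pipeline here so the results below stay as reported)
df_median_imputed=df_all_num.fillna(df_all_num.median(numeric_only=True))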
#Now create feature matrix X and target vector y.
#columns 0 through 78 of pandas dataframe df_all_num go into the feature matrix, a numpy array X
#column 80, SalePrice, the target, goes into numpy array y
X=df_all_num.iloc[:,0:79].to_numpy()
y=df_all_num.iloc[:,80].to_numpy()
#Then split X, y into Xtrain, Ytrain 70%, Xtest, Ytest 30%
import sklearn
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest=train_test_split(X,y,test_size=0.30, random_state=0)
#Now start regression.
#Start with LinearRegression on all columns
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
model.fit(Xtrain, Ytrain)
y_pred=model.predict(Xtest)
#Evaluate mean squared error and R Square; the higher the R Square (towards 1), the better
from sklearn.metrics import mean_squared_error,r2_score
r2_score(Ytest,y_pred)
0.644618860108294
#Let's try Ridge regression, which applies L2 regularization: linear regression with a penalty on the size of the coefficients, which helps reduce overfitting
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=0.5,normalize=True)
ridge_model.fit(Xtrain, Ytrain)
y_pred=ridge_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.7495168575082853
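#Next, Lasso regression, which applies L1 regularization; it can shrink some coefficients all the way to zero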
from sklearn.linear_model import Lasso
lasso_model =Lasso(alpha=2.0, normalize=True)
lasso_model.fit(Xtrain, Ytrain)
y_pred=lasso_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.6588186254023227
#Now try the non-linear regressors, i.e., Decision Tree, Random Forest and Gradient Boosting
from sklearn.tree import DecisionTreeRegressor
#Let's tune the decision tree by setting the tree depth (the number of levels); start with max_depth=2
dt_model=DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
dt_model.fit(Xtrain, Ytrain)
y_pred=dt_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.6368370135372332
#Now set the hyperparameter max_depth of DecisionTreeRegressor to 5
dt_model=DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
dt_model.fit(Xtrain, Ytrain)
y_pred=dt_model.predict(Xtest)
#Increasing max_depth to 5 improves the R Square,
#but increasing tree depth can lead to another problem: overfitting
#For this decision tree, a shallower depth such as 2 may generalize better
r2_score(Ytest,y_pred)
0.8091114463457022
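One way to check whether the deeper tree is overfitting (a sketch reusing Xtrain and Ytrain from above; not part of the original exercise) is cross-validation on the training split:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
#5-fold cross-validated R Square for the depth-5 tree on the training data;
#a large gap between these scores and the test score above would suggest overfitting
scores = cross_val_score(DecisionTreeRegressor(max_depth=5), Xtrain, Ytrain, cv=5, scoring='r2')
print(scores.mean())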
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
rf_model=RandomForestRegressor()
rf_model.fit(Xtrain, Ytrain)
y_pred=rf_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.8631843000569566
#Then try GradientBoostingRegressor
gb_model=GradientBoostingRegressor()
gb_model.fit(Xtrain, Ytrain)
y_pred=gb_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.8709366125367382
In summary, R Square scores for each of the regressors are below:
LinearRegression 0.644618860108294
Lasso 0.6588186254023227
Ridge 0.7495168575082853
DecisionTreeRegressor 0.6368370135372332
DecisionTreeRegressor(max_depth=5) 0.8091114463457022
RandomForestRegressor 0.8631843000569566
GradientBoostingRegressor 0.8709366125367382
Apparently, RandomForestRegressor and GradientBoostingRegressor have the best R Square scores.
2. Next, I run the same regressors on only the columns whose correlation with the target label, positive or negative, has an absolute value of 0.3 or greater:
Regressors:
Linear:
LinearRegression
Lasso
Ridge
Non-Linear:
Decision Tree
Random Forest
Gradient Boosting
Before running the regressors, the data needs to be preprocessed. Following is the code:
import numpy as np
import pandas as pd
import math
#Load csv file into pandas dataframe
df=pd.read_csv("/home/bigdata2/kaggle/housing.csv",sep=',',header='infer')
Now split the dataframe into two:
a dataframe with continuous/ratio/numeric data
a dataframe with nominal/ordinal/categorical data
Continuous/ratio/numeric feature columns (the same numeric columns listed at the beginning of this article):
df_num=df.loc[:,['LotFrontage', 'LotArea', 'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', \
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', \
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', \
'PoolArea', 'MiscVal', 'MoSold', 'YrSold', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea']]
Ordinal/nominal (categorical) feature columns (the same categorical columns listed earlier; some are numeric but categorical in nature):
df_non_num=df.loc[:,['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', \
'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', \
'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', \
'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', \
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', \
'CentralAir', 'Electrical', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', \
'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', \
'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'MiscFeature', 'SaleType', 'SaleCondition']]
#Set df_non_num to category datatype
df_non_num=df_non_num.astype('category')
cat_columns=df_non_num.columns.to_numpy()
#Convert the text values of each categorical column into numeric values (label encoding); ML algorithms can only deal with numeric values.
df_non_num[cat_columns]=df_non_num[cat_columns].apply(lambda x: x.cat.codes)
#Replace NaN to 0 for both numeric and categorical dataframes
df_num=df_num.fillna(0)
df_non_num=df_non_num.fillna(0)
#Add the target label df['SalePrice'] to both df_num and df_non_num so their correlation with the target can be evaluated.
df_num=df_num.join(df.loc[:,['SalePrice']])
df_non_num=df_non_num.join(df.loc[:,['SalePrice']])
#For continuous columns, using Pearson correlation, keep only the columns whose absolute correlation with SalePrice is 0.3 or greater
start=True
for i in df_num.columns.to_numpy():
    if i=='Id' or i=='SalePrice':
        continue
    if abs(df_num.loc[:,[i,'SalePrice']].corr().to_numpy()[0][1])>=0.3:
        if start==True:
            df_num_keep=df_num.loc[:,[i]]
            df_num_keep.head(2)
            start=False
        else:
            df_num_keep=df_num_keep.join(df_num.loc[:,[i]])
df_num_keep
#For categorical columns, using Spearman correlation, keep only the columns whose absolute correlation with SalePrice is 0.3 or greater
start=True
for i in df_non_num.columns.to_numpy():
    if i=='SalePrice':
        continue
    if abs(df_non_num.loc[:,[i,'SalePrice']].corr('spearman').to_numpy()[0][1])>=0.3:
        if start==True:
            df_non_num_keep=df_non_num.loc[:,[i]]
            df_non_num_keep.head(2)
            start=False
        else:
            df_non_num_keep=df_non_num_keep.join(df_non_num.loc[:,[i]])
df_non_num_keep
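As a side note, the same filtering can be written more compactly (a sketch under the same assumptions; this is not the code used for the results below), by computing all correlations with SalePrice at once:
#Compact alternative sketch: correlation of every column with SalePrice in one call
num_corr = df_num.corr(method='pearson')['SalePrice'].abs()
keep_num_cols = [c for c in num_corr.index if c not in ('Id','SalePrice') and num_corr[c] >= 0.3]
cat_corr = df_non_num.corr(method='spearman')['SalePrice'].abs()
keep_cat_cols = [c for c in cat_corr.index if c != 'SalePrice' and cat_corr[c] >= 0.3]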
#Finally, join the kept categorical columns with the kept numeric columns; SalePrice, the target,
#is appended as the last column of the combined dataframe df_all_num
df_all_num=df_non_num_keep.iloc[:,0:df_non_num_keep.columns.to_numpy().size-1].join(df_num_keep)
df_all_num=df_all_num.join(df_num.loc[:,['SalePrice']])
#Preprocessing done
#Now create feature matrix X and target vector y.
#columns 0 through 28 of pandas dataframe df_all_num go into the feature matrix, a numpy array X
#column 29, SalePrice, the target, goes into numpy array y
#Note the feature set has been reduced from the original 80 feature columns to 29 feature columns.
X=df_all_num.iloc[:,0:df_all_num.columns.to_numpy().size-1].to_numpy()
y=df_all_num.iloc[:,df_all_num.columns.to_numpy().size-1].to_numpy()
#Then split X, y into Xtrain, Ytrain 70%, Xtest, Ytest 30%
import sklearn
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest=train_test_split(X,y,test_size=0.30, random_state=0)
#Now start regression.
#Start with LinearRegression on all columns
from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
model.fit(Xtrain, Ytrain)
y_pred=model.predict(Xtest)
#Evaluate mean squared error and R Square; the higher the R Square (towards 1), the better
from sklearn.metrics import mean_squared_error,r2_score
r2_score(Ytest,y_pred)
0.7426271204373056
#Let's try Ridge regression, which applies L2 regularization: linear regression with a penalty on the size of the coefficients, which helps reduce overfitting
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=0.5,normalize=True)
ridge_model.fit(Xtrain, Ytrain)
y_pred=ridge_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.7580168370452363
from sklearn.linear_model import Lasso
lasso_model =Lasso(alpha=2.0, normalize=True)
lasso_model.fit(Xtrain, Ytrain)
y_pred=lasso_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.7428797561000005
#Now try the non-linear regressors: Decision Tree, Random Forest and Gradient Boosting
#Start with DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor
#Tune the decision tree by setting the tree depth; start with max_depth=2
dt_model=DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
dt_model.fit(Xtrain, Ytrain)
y_pred=dt_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.6368370135372332
#Now set the hyperparameter max_depth of DecisionTreeRegressor to 5
dt_model=DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
dt_model.fit(Xtrain, Ytrain)
y_pred=dt_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.8037083582504732
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
rf_model=RandomForestRegressor()
rf_model.fit(Xtrain, Ytrain)
y_pred=rf_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.8561015676909451
#Then try GradientBoostingRegressor
gb_model=GradientBoostingRegressor()
gb_model.fit(Xtrain, Ytrain)
y_pred=gb_model.predict(Xtest)
r2_score(Ytest,y_pred)
0.8710268665585925
In summary, R Square scores for each of the regressors on the reduced-column dataset are below:
LinearRegression 0.7426271204373056
Lasso 0.7428797561000005
Ridge 0.7580168370452363
DecisionTreeRegressor 0.6368370135372332
DecisionTreeRegressor(max_depth=5) 0.8037083582504732
RandomForestRegressor 0.8561015676909451
GradientBoostingRegressor 0.8710268665585925
Apparently, RandomForestRegressor and GradientBoostingRegressor have the best R Square scores.
Now compare the regressions on the whole dataset with those on the reduced dataset, which keeps only the columns whose absolute correlation with the target is 0.3 or greater.
Observation:
For the linear regressors, removing the columns that have little correlation with the target label actually yields better R Square values, meaning the regression models make more accurate predictions. It appears the removed columns are simply noise and add no value when building linear regression models.
For the non-linear regressors, removing columns that have little linear correlation with the target label has little impact on prediction accuracy; the results are much the same. However, you get the benefit of running on a smaller dataset, reduced from the original 80 columns to 29, which saves system resources and even money if you run on a cloud service provider.
As always, the code used in this writing is on my GitHub site:
https://github.com/geyungjen/jentekllc
Thanks for viewing.
George Jen
Jen Tek LLC, data science/engineering consultation and education.