House Prices — Part IV
Introduction
Now that we have the processed data that we need to build our model, let’s jump right into it. Let’s start by creating more directories in our project tree:
project/
├── data/
│   ├── raw/
│   │   ├── kc_house_data.csv
│   ├── interim/
│   │   ├── kc_house_data.csv
│   │   ├── kc_house_data_no_outlier.csv
│   ├── processed/
│   │   ├── kc_house_data.csv
│   │   ├── kc_house_data_no_outlier.csv
├── src/
│   ├── data/
│   │   ├── utils.py
│   │   ├── cleaning.py
│   │   ├── modeling.py
│   │   ├── __init__.py
│   ├── models/
│   │   ├── utils.py
│   │   ├── model_build.py
│   │   ├── __init__.py
│   ├── clean.py
│   ├── modeling.py
│   ├── model_build.py
│   ├── __init__.py
├── requirements.txt
└── README.md
- The models/utils.py file will contain helper functions used to build our models.
- The models/model_build.py file will contain the code to build the model.
- The src/model_build.py file will contain the main logic to build our model.
Price Forecast
The first thing we’re going to do is forecast the price of a house, but there’s a little problem with doing this: our dataset doesn’t track the price of each house changing over time, so there’s no series to learn from and forecast future values of. To bypass this problem we’re going to do a little trick: we’ll split the houses into clusters and treat the prices of the houses inside each cluster as a continuous price series over time. In the models/utils.py file let’s create the code for it:
from typing import List, Tuple

import pandas as pd
from sklearn.cluster import KMeans


def get_clusters(dataframe: pd.DataFrame, n_clusters: int = 10, ignore_cols: List[str] = []) -> Tuple[pd.DataFrame, KMeans]:
    """ Assigns each house in the dataset to a cluster

    Parameters:
        dataframe: Dataframe to be labeled
        n_clusters: Number of clusters to split the houses into
        ignore_cols: List of columns to ignore while creating the clusters

    Returns:
        dataframe: The labeled dataframe
        kmeans: The trained KMeans model to be used in future predictions
    """
    data = dataframe.drop(ignore_cols, axis="columns")
    kmeans = KMeans(n_clusters)
    labels = kmeans.fit_predict(data)
    dataframe["cluster"] = labels
    return dataframe, kmeans
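The post doesn’t show how the labeled dataframe becomes the per-cluster dictionary that the forecasting code below consumes, so here is a minimal sketch of how src/model_build.py might do it. The helper name split_by_cluster and the columns passed to ignore_cols are assumptions for illustration, not code from the original project:

from typing import Dict

import pandas as pd

from models.utils import get_clusters


def split_by_cluster(dataframe: pd.DataFrame) -> Dict[int, pd.DataFrame]:
    """ Hypothetical helper: labels the houses and splits them into one dataframe per cluster """
    # "date" is non-numeric, so it is ignored while fitting KMeans (an assumption
    # about the processed dataset); the returned dataframe still keeps the column
    labeled_df, _ = get_clusters(dataframe, n_clusters=10, ignore_cols=["date"])
    return {label: group for label, group in labeled_df.groupby("cluster")}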
Now that each house belongs to a cluster, let’s create the necessary code to forecast future values following this guide provided by Peixeiro. In the models/model_build.py file we can add the following code:
from typing import Dict, List

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_percentage_error as mape

# Helper functions defined in models/utils.py (shown below)
from models.utils import normalize_clusters_by_date, window_input_output


def get_forecast_by_cluster(clusters: Dict[str, pd.DataFrame]) -> Dict[str, List]:
    """ Gets the forecast model for each cluster

    Parameters:
        clusters: Dict containing the cluster as key to dataframe

    Returns:
        results: Dict containing the cluster as key to array of test values
                 and predicted values
    """
    results = {}
    for key in clusters:
        cluster_df = clusters[key]
        cluster_df = normalize_clusters_by_date(cluster_df)
        seq_df = window_input_output(cluster_df, 5, 5)

        # Builds train and test set
        X_cols = [col for col in seq_df.columns if col.startswith("x")]
        X_cols.insert(0, "price")
        y_cols = [col for col in seq_df.columns if col.startswith("y")]

        # Will use all but the last two rows for train
        X_train = seq_df[X_cols][:-2].values
        y_train = seq_df[y_cols][:-2].values

        # Will use the last two rows for test
        X_test = seq_df[X_cols][-2:].values
        y_test = seq_df[y_cols][-2:].values

        dt_seq = DecisionTreeRegressor(random_state=42)
        dt_seq.fit(X_train, y_train)
        dt_seq_preds = dt_seq.predict(X_test)

        results[key] = []
        results[key].append(y_test[1])
        results[key].append(dt_seq_preds[1])

        # mean_absolute_percentage_error expects (y_true, y_pred) in that order
        print(f"MAPE for cluster {key} is {mape(y_test.reshape(1, -1), dt_seq_preds.reshape(1, -1))}")

    return results
In this function, we have three particular points to highlight:
1. Since we grouped the houses into clusters to treat them as a single price series changing over time, we need to group all the data by the date provided in the dataset. In the models/utils.py file we can add:
def normalize_clusters_by_date(dataframe: pd.DataFrame) -> pd.DataFrame:
    """ Groups each cluster by the date

    Parameters:
        dataframe: Dataframe to be manipulated

    Returns:
        dataframe: Dataframe grouped by date, with the mean price per date
    """
    dataframe = dataframe[["date", "price"]]
    dataframe = dataframe.groupby("date").mean()
    return dataframe
2. Following the guide provided by Peixeiro, in order to do a multi-step prediction it’s necessary to create multiple input and output columns for our model to train on and predict. This is done in the following function (a small toy example follows this list):
def window_input_output(dataframe: pd.DataFrame, input_length: int, output_length: int) -> pd.DataFrame:
    """ Creates a sequence of observations to train the model

    Parameters:
        dataframe: Dataframe to be manipulated
        input_length: Number of training observations
        output_length: Number of labels

    Returns:
        dataframe: Dataframe with artificial observations
    """
    df = dataframe.copy()

    # Each x_i column is the price shifted i steps into the future
    for i in range(1, input_length):
        df[f"x_{i}"] = df["price"].shift(-i)

    # Each y_j column is one of the future values the model must predict
    for j in range(output_length):
        df[f"y_{j}"] = df["price"].shift(-output_length - j)

    df = df.dropna()
    return df
3. We’re using the Mean Absolute Percentage Error as our metric to evaluate the quality of our model.
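To make the grouping and windowing steps a bit more concrete, here is a quick sanity check on a made-up toy series (it assumes both helpers live in models/utils.py, and all numbers are purely illustrative):

import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error as mape

from models.utils import normalize_clusters_by_date, window_input_output

# Toy cluster: a handful of sales spread over a few dates (values are made up)
toy_cluster = pd.DataFrame({
    "date": ["20140502", "20140502", "20140503", "20140504", "20140505",
             "20140506", "20140507", "20140508", "20140509", "20140510",
             "20140511", "20140512", "20140513", "20140514", "20140515"],
    "price": [220000, 240000, 231000, 255000, 248000, 260000, 251000, 270000,
              265000, 280000, 272000, 290000, 285000, 300000, 295000],
})

series = normalize_clusters_by_date(toy_cluster)  # one mean price per date
windowed = window_input_output(series, 5, 5)      # columns: price, x_1..x_4, y_0..y_4
print(windowed.columns.tolist())

# MAPE is the mean of |actual - predicted| / |actual|, expressed as a fraction
print(mape([100.0, 200.0], [110.0, 190.0]))       # -> 0.075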
Now that we’ve considered these points, we can visualize the results obtained with the training data.
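The post shows the resulting images rather than the plotting code; a minimal sketch of how the actual and predicted values in the results dictionary could be plotted per cluster (assuming matplotlib is installed) might look like this:

import matplotlib.pyplot as plt


def plot_forecast_results(results):
    """ Hypothetical sketch: plots actual vs. predicted values for each cluster """
    for key, (actual, predicted) in results.items():
        plt.figure()
        plt.plot(actual, marker="o", label="actual")
        plt.plot(predicted, marker="x", label="predicted")
        plt.title(f"Cluster {key}")
        plt.xlabel("Forecast step")
        plt.ylabel("Mean price")
        plt.legend()
        plt.show()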
In these images, we can see the evaluation by cluster for each of the datasets created in Part II of this series. In general, our model didn’t achieve good results, which is probably related to the fact that we didn’t have a large amount of continuous data for each house and had to group the houses into clusters. Because of this, unfortunately, we won’t be able to answer the questions proposed in Part I.
Price Prediction
Well, now that we’ve done all this work, we’ll at least use it to build a second model: if we can’t help our company by predicting how prices change over time, we can help it by evaluating a house given its characteristics.
Let’s start by loading our dataset again and splitting it into training and test data. In the data/utils.py file let’s add the necessary code for it:
from typing import List, Tuple

import pandas as pd
from sklearn.model_selection import train_test_split


def get_train_test_data(
    dataframe: pd.DataFrame, target_variables: List[str],
    test_size: float = 0.2
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """ Splits the dataset for training and testing

    Parameters:
        dataframe: Dataframe to be split
        target_variables: Target variables to be used by the model
        test_size: Fraction of the data that will go to the test set

    Returns:
        X_train, X_test, y_train, y_test: The split features and targets
    """
    features = list(dataframe.columns)
    for var in target_variables:
        features.remove(var)

    X = dataframe[features]
    y = dataframe[target_variables]
    return train_test_split(X, y, test_size=test_size, random_state=42)
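For reference, this is roughly how src/model_build.py could load the processed data and call this helper. It is only a sketch: the import layout and file path follow the project tree above, and dropping the non-numeric date column before training is an assumption about the processed dataset:

import pandas as pd

from data.utils import get_train_test_data

# Path assumed from the project tree above
df = pd.read_csv("data/processed/kc_house_data.csv")

# "price" is the target; the "date" column is assumed to be dropped (or encoded) here
X_train, X_test, y_train, y_test = get_train_test_data(
    df.drop("date", axis="columns"), target_variables=["price"]
)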
Now that we have our data loaded, we’re going to build a model based on the XGBoost library; gradient-boosted trees like this tend to perform very well on tabular regression problems such as ours. We’ll also use the RandomizedSearchCV method from Scikit-Learn to tune our model and choose the best hyperparameters for it. In the models/model_build.py file let’s add the following code:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV


def xgboost_train(X_train: pd.DataFrame, y_train: pd.DataFrame, n_iter: int = 5) -> RandomizedSearchCV:
    """ Trains the XGBoost model against the data

    Parameters:
        X_train: The data to be used during training
        y_train: The labels for each training sample
        n_iter: Number of iterations to be used during parameter tuning

    Returns:
        search: The randomized search object containing the tuned model
    """
    params = {
        'max_depth': [3, 5, 6, 10, 15, 20],
        'learning_rate': [0.01, 0.1, 0.2, 0.3],
        'subsample': np.arange(0.5, 1.0, 0.1),
        'colsample_bytree': np.arange(0.4, 1.0, 0.1),
        'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
        'n_estimators': [100, 500, 1000]
    }

    xgb_model = xgb.XGBRegressor(random_state=42)
    search = RandomizedSearchCV(
        estimator=xgb_model,
        param_distributions=params,
        scoring='neg_mean_squared_error',
        random_state=42,
        n_iter=n_iter, verbose=2)

    search.fit(X_train, y_train)
    return search
And now we can test the built model with the test data. In the models/utils.py file let’s add the following code:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV


def xgboost_test(xgb_search: RandomizedSearchCV, X_test: pd.DataFrame, y_test: pd.DataFrame) -> np.ndarray:
    """ Tests the trained XGBoost model against the data

    Parameters:
        xgb_search: The randomized search object containing the tuned model
        X_test: The data to be tested
        y_test: The labels for each test sample

    Returns:
        y_pred: The predicted values for each test sample
    """
    y_pred = xgb_search.predict(X_test)
    print("Predicted RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
    return y_pred
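Putting the pieces together, the main logic in src/model_build.py (mentioned at the start of this post) could look roughly like the sketch below, under the same assumptions as the previous snippet about paths and dropped columns:

import pandas as pd

from data.utils import get_train_test_data
from models.model_build import xgboost_train
from models.utils import xgboost_test

# Load the processed dataset and split it, as shown earlier
df = pd.read_csv("data/processed/kc_house_data.csv")
X_train, X_test, y_train, y_test = get_train_test_data(
    df.drop("date", axis="columns"), target_variables=["price"]
)

# Tune and train the XGBoost model, then evaluate it on the held-out test data
search = xgboost_train(X_train, y_train, n_iter=5)
print("Best parameters found: ", search.best_params_)

y_pred = xgboost_test(search, X_test, y_test)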
When plotting the obtained results, we can see:
The proposed model performs better on the dataset where the outliers weren’t removed, which supports what we initially supposed: these outliers aren’t actually a problem, but valuable real-world data.
Business Analysis
Based on the obtained results, it’s possible to say that it would be risky to use our model in a real-world scenario, especially for houses priced above US$1,000,000.
Given these limitations, it would be worthwhile in future work to explore alternatives to the XGBoost model, or even to improve the feature selection for this process.
Conclusion
Throughout this series, we were able to learn multiple concepts related to data manipulation and machine learning models. Unfortunately, the questions initially proposed couldn’t be answered, but we were able to use all of our hard work to build a predictive model!
More of these posts will come soon, probably in a less expanded format since I observed that this format takes far more time than I initially thought. Thank you for following these posts, and remember that all the code is available on GitHub.
Related Posts
- House Prices — Data Report
- House Prices — Part I
- House Prices — Part II
- House Prices — Part III
- House Prices — Part IV [You are here 😄]