Predictive Model For Houses Built Before 1980 with scikit-learn
This project was made using:
- Python
- NumPy
- Altair
- scikit-learn
- pandas
Introduction: Predictive Model 🚀
This project uses machine learning to build a predictive model that helps the state of Colorado make use of its large collection of dwelling records by predicting whether a house was built before 1980.
Data Visualizations 🏙
Here are a couple of data visualizations that explore the dataset, showing the distribution of key features across houses of different ages.
Feature Performance 💪🏻
In this section, we look at the results of the model, what we can do with those results to make predictions in the future, and whether the model is actually useful.
The feature-importance chart shows which variables helped the model reach higher accuracy. The most useful variable is arcstyle_ONE-STORY, which refers to the architectural style of the house. It is followed by gartype (the type, design, and details of the garage) and quality_C (the quality of the dwelling on a scale from A to D), and then by variables such as the number of bathrooms and the number of bedrooms.
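For reference, here is a minimal sketch of how importances like these can be extracted and ranked from a fitted scikit-learn model (it uses the tree_clf classifier and X_train training frame defined in the code section below):

import pandas as pd

# Rank features by how much they contributed to the fitted model
importances = pd.DataFrame({
    "feature": X_train.columns,
    "importance": tree_clf.feature_importances_
}).sort_values("importance", ascending=False)
print(importances.head(10))  # the top predictors, e.g. arcstyle_ONE-STORY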
Model Metrics 📈
Metrics
              precision    recall  f1-score   support

           0       0.87      0.86      0.86      2884
           1       0.92      0.93      0.92      4907

    accuracy                           0.90      7791
   macro avg       0.89      0.89      0.89      7791
weighted avg       0.90      0.90      0.90      7791
Model Accuracy
0.9005262482351432 ≈ 90%
Based on the metrics table, the model has high precision: precision is the share of houses the model labeled as pre-1980 that actually were built before 1980. For this problem, though, we care more about accuracy than recall, because the goal is not to catch every relevant case but to make as many correct predictions as possible on the trained data. The model also has a high F1 score, the harmonic mean of precision and recall.
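To make those definitions concrete, here is a small sketch of the underlying arithmetic, using made-up counts rather than this model's actual results:

# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 90, 8, 7, 95
precision = tp / (tp + fp)                          # correct positives out of all predicted positives
recall = tp / (tp + fn)                             # correct positives out of all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct predictions out of all predictions
print(precision, recall, f1, accuracy)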
Looking at the confusion matrix, we also see a high number of true positives, that is, houses the model predicted were built before 1980 that actually were.
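As a reading aid: scikit-learn lays out the binary confusion matrix as [[TN, FP], [FN, TP]], so the four cells can be unpacked like this (a sketch against the y_test and y_pred produced in the code section below):

from sklearn.metrics import confusion_matrix

# ravel() flattens the 2x2 matrix in row order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"True positives: {tp}, true negatives: {tn}")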
We can therefore conclude that our predictive model does a good job identifying Colorado dwellings built before 1980.
Behind the code 🧑🏽‍💻
# Author: Daniel Hernández
#%%
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
#%%
alt.data_transformers.enable('json')
dwellings_denver = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_denver/dwellings_denver.csv")
dwellings_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
dwellings_neighborhoods_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")
#%%
# Split predictors and target: drop the target and the year built (which would leak the answer)
X = dwellings_ml.drop(['before1980', 'yrbuilt'], axis = 1)
y = dwellings_ml['before1980']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=76)
#%%
# Features to explore: stories, sales price, square footage, garage size, bathrooms
(alt.Chart(dwellings_denver.query('yrbuilt > 0'), title = "Year Built by Neighborhood, Denver Dwellings")
    .encode(
        alt.X('nbhd:O'),
        alt.Y('yrbuilt', scale = alt.Scale(zero = False)))
    .mark_boxplot())
# %%
(alt.Chart(dwellings_denver.query('yrbuilt > 0'), title = "Number of Baths Denver Dwellings by Year")
    .encode(
        alt.X('numbaths:O', scale = alt.Scale(zero = False)),
        alt.Y('yrbuilt', scale = alt.Scale(zero = False))
    )
    .mark_boxplot()
)
#%%
(alt.Chart(dwellings_denver.query('yrbuilt > 0'), title = "Garage Size by Number of Cars Denver Dwellings Through Years")
    .encode(
        alt.X('nocars:O', scale = alt.Scale(zero = False)),
        alt.Y('yrbuilt', scale = alt.Scale(zero = False))
    )
    .mark_boxplot()
)
#%%
(alt.Chart(dwellings_denver.query('yrbuilt > 0'), title = "Number of Bedrooms Denver Dwellings by Year")
    .encode(
        alt.X('numbdrm:O', scale = alt.Scale(zero = False)),
        alt.Y('yrbuilt', scale = alt.Scale(zero = False))
    )
    .mark_boxplot()
)
# %%
# Let's try a tree model.
tree_clf = tree.DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)
y_pred = tree_clf.predict(X_test)
# %%
feature_dat = pd.DataFrame({
    "values": tree_clf.feature_importances_,
    "features": X_train.columns
})
alt.Chart(feature_dat.query('values > .02')).encode(
    alt.X('values'),
    alt.Y('features', sort = "-x")).mark_bar()
# %%
# look at model performance
print(metrics.confusion_matrix(y_test, y_pred))
# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead
metrics.ConfusionMatrixDisplay.from_estimator(tree_clf, X_test, y_test)
# %%
# Render the classification report as a markdown table
report = pd.DataFrame(metrics.classification_report(y_test, y_pred, output_dict=True)).transpose()
print(report.to_markdown())
# %%
metrics.accuracy_score(y_test, y_pred)
#------Model Presented in the Markdown-------
# %%
from sklearn.ensemble import GradientBoostingClassifier
boost = GradientBoostingClassifier(random_state=76)
# %%
boost.fit(X_train, y_train)
y_pred_boost = boost.predict(X_test)
# %%
feature_dat_boost = pd.DataFrame({
    "values": boost.feature_importances_,
    "features": X_train.columns
})
alt.Chart(feature_dat_boost.query('values > .02')).encode(
    alt.X('values'),
    alt.Y('features', sort = "-x")).mark_bar()
# %%
# look at model performance
print(metrics.confusion_matrix(y_test, y_pred_boost))
metrics.ConfusionMatrixDisplay.from_estimator(boost, X_test, y_test)
# %%
print(metrics.classification_report(y_test, y_pred_boost))
# %%
metrics.accuracy_score(y_test, y_pred_boost)
# %%
To learn more about me, visit https://danhergir.com. You can also explore my Medium articles at https://danhergir.medium.com, or connect with me on Twitter: https://twitter.com/danhergir