Income Prediction with Random Forests and Scikit-learn
Decision trees are a popular and intuitive tool used in machine learning and data mining for classification and regression tasks. They represent a flowchart-like structure where each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or a numerical value. They are called "trees" because they start with a single node, which then splits into branches, and these branches can further split into more branches, forming a tree-like structure.
Now, Random Forests? It's like having a bunch of decision trees working together on the same problem. Each tree trains on its own random sample of the data, and then they all come together to vote on the best solution. It's like teamwork for models! And yeah, it's super handy because combining many trees helps prevent overfitting and deals well with noisy data. Plus, training parallelizes nicely since each tree can be trained independently. Overall, it's a cool method for both classification and regression tasks.
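To make the "voting" idea concrete, here is a minimal sketch on toy data (generated with make_classification, not the census dataset). It assumes scikit-learn's RandomForestClassifier, which combines its trees by averaging their predicted class probabilities; the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, only for illustration (not the census dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_toy, y_toy)

# Each fitted tree is available in forest.estimators_.
per_tree_proba = np.stack(
    [tree.predict_proba(X_toy) for tree in forest.estimators_]
)

# Average the per-tree class probabilities (a "soft" vote),
# then pick the most probable class for every row.
averaged_proba = per_tree_proba.mean(axis=0)
manual_prediction = forest.classes_[averaged_proba.argmax(axis=1)]

# The forest's own prediction, for comparison.
forest_prediction = forest.predict(X_toy)
print((manual_prediction == forest_prediction).all())
```

Averaging by hand and asking the forest directly should agree, which is a handy way to convince yourself that the ensemble is nothing more than its trees pooled together.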
Project Introduction
This time, we are going to use Random Forests for classification on the Census Income dataset: Census+Income
We're going to predict whether a person's income exceeds $50K/yr or not, and play with the features which contribute the most to our model.
Let's start by importing the data.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df1 = pd.read_table('./data/income/adult.data', delimiter=',', header=None)
df2 = pd.read_table('./data/income/adult.test', delimiter=',', header=None)
# Merge train and test dataframes
df = pd.concat([df1, df2])
I know the dataset already comes split into train and test files, but I merged them on purpose so we can play with randomness and make our own split later. Also, if you explore the dataset you're going to find a lot of null values, which we'd like to deal with so our model can perform better.
Let's change the name of our columns.
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']
I found out that null values in this dataset are actually marked with a ' ?' string (note the leading space).
# Note: applymap is named DataFrame.map in pandas 2.1 and later
contains_question_mark = df.applymap(lambda x: ' ?' in str(x)).any().any()
print("Does the DataFrame contain ' ?' anywhere?", contains_question_mark)
# If yes, count the occurrences
if contains_question_mark:
    count_question_marks = (df.applymap(lambda x: x == ' ?')).sum().sum()
    print("Number of occurrences of ' ?':", count_question_marks)
Here's how the null values are distributed across the different columns.
df = df.replace(' ?', pd.NA)
df.isnull().sum()
Now we get rid of the null values. This is not something I suggest doing all the time, but in this case it shouldn't noticeably affect the performance of our model.
df.dropna(inplace=True)
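As an alternative sketch, the missing-value markers can be handled while reading the file instead of afterwards. The example below uses three made-up rows in the same "comma plus space" layout and a shortened, hypothetical column list; skipinitialspace and na_values are standard pandas read_csv parameters.

```python
import io
import pandas as pd

# Three made-up rows in the same "comma + space" layout as the census files.
raw_text = (
    "39, State-gov, Bachelors, <=50K\n"
    "50, ?, HS-grad, >50K\n"
    "38, Private, ?, <=50K\n"
)

sample_df = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    names=["age", "workclass", "education", "target"],  # shortened for the demo
    skipinitialspace=True,  # drops the space that follows each comma
    na_values="?",          # treat a bare "?" as a missing value
)

# Count missing values per column, then drop incomplete rows.
missing_per_column = sample_df.isna().sum()
clean_df = sample_df.dropna()
print(missing_per_column)
```

A side effect of skipinitialspace=True is that the string values lose their leading space, so later comparisons would use '<=50K' rather than ' <=50K'.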
Now, we're going to find the unique values of our target column.
df_new = pd.get_dummies(df, columns=['target'], drop_first=True)
df.target.unique()
We see we have the following unique values:
' <=50K', ' >50K', ' <=50K.', ' >50K.'
This is something we want to standardize, so instead of working with categorical values, we can work with numerical ones.
def modify_value(value):
    # Some labels carry a trailing period, so both spellings map to 0
    if value == ' <=50K' or value == ' <=50K.':
        return 0
    else:
        return 1
# Apply the function to the column using apply()
df['target'] = df['target'].apply(modify_value)
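One thing to keep in mind with the function above is that any unexpected label silently becomes 1. A hedged alternative, sketched here on a small made-up Series (the ' unknown' value is invented for the demo), is to normalize the strings and use an explicit mapping so that unknown labels show up as NaN instead.

```python
import pandas as pd

# Made-up sample of raw labels, including one unexpected value.
raw_target = pd.Series([" <=50K", " >50K", " <=50K.", " >50K.", " unknown"])

label_map = {"<=50K": 0, ">50K": 1}

# Strip surrounding whitespace and the optional trailing period,
# then map each cleaned label to its numeric code.
normalized = raw_target.str.strip().str.rstrip(".")
encoded_target = normalized.map(label_map)

# Labels missing from label_map become NaN, so they are easy to spot.
print(encoded_target.isna().sum())
```

Whether you prefer this over the apply() version is a matter of taste; the mapping simply makes surprises in the data visible.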
Train & Test split data
I excluded these specific columns because of their continuous nature, which I suspected could encourage overfitting in our model. Our test set is going to hold 30% of our total data.
X = df.drop(['target', 'fnlwgt', 'age', 'capital-gain', 'capital-loss'], axis=1)
y = df.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
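Since the two income classes are not equally frequent, one optional variation is a stratified split, which keeps the class proportions roughly the same in both parts. This is a sketch on toy arrays with an assumed 75/25 class balance; stratify is a standard train_test_split parameter, but it is not what the code above uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced toy target (75% class 0, 25% class 1).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 750 + [1] * 250)

# stratify=y_toy asks for the same class balance in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

train_positive_rate = y_tr.mean()
test_positive_rate = y_te.mean()
print(train_positive_rate, test_positive_rate)
```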
Encoding our columns
This code uses the OrdinalEncoder from the category_encoders library to convert categorical variables into numerical values. It specifies certain columns to encode, fits the encoder to the training data, and then transforms both the training and test data accordingly.
It's necessary to perform this encoding to ensure that the categorical variables are compatible with machine learning algorithms, which typically operate on numerical data.
import category_encoders as ce
encoder = ce.OrdinalEncoder(cols = ['workclass', 'education', 'marital-status', 'occupation','relationship', 'race', 'sex', 'native-country'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
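If you would rather not install an extra library, a similar fit-on-train, transform-on-test pattern can be sketched with scikit-learn's own OrdinalEncoder. The tiny DataFrames below are made up, and mapping unseen categories to -1 is a choice made for this example, not something taken from the code above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up train and test columns; "Never-worked" only appears in test.
train_toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test_toy = pd.DataFrame({"workclass": ["State-gov", "Never-worked"]})

# Categories unseen during fit are encoded as -1 instead of raising an error.
sk_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then reuse the same mapping on test data.
train_encoded = sk_encoder.fit_transform(train_toy)
test_encoded = sk_encoder.transform(test_toy)
print(train_encoded.ravel(), test_encoded.ravel())
```

Fitting only on the training data matters in both libraries: the test set should be encoded with the mapping learned from training, never the other way around.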
Training model
Here, n_estimators refers to the number of decision trees that we're going to use in this model.
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=20, random_state=0)
random_forest.fit(X_train, y_train)
Evaluating our model
Here we try to obtain the accuracy score for our train data and for our test data.
The accuracy score represents the proportion of correctly classified instances out of the total instances in a dataset. It is calculated by dividing the number of correct predictions by the total number of predictions made. In classification tasks, accuracy is a commonly used metric to evaluate the performance of a model, indicating how well it predicts the correct class labels.
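As a tiny worked illustration of that definition, here are eight made-up labels (not the census data) scored by hand and with accuracy_score. Six of the eight predictions match, so both give 0.75.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions; positions 1 and 3 are wrong.
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred_toy = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Accuracy by hand: correct predictions divided by total predictions.
manual_accuracy = (y_true_toy == y_pred_toy).sum() / len(y_true_toy)

# The same number from scikit-learn.
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(manual_accuracy, library_accuracy)  # 0.75 0.75
```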
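from sklearn.metrics import accuracy_score
# Predict on both splits before scoring
y_train_pred_rf = random_forest.predict(X_train)
y_test_pred_rf = random_forest.predict(X_test)
train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)
test_accuracy_rf = accuracy_score(y_test, y_test_pred_rf)
print(f"Train: {train_accuracy_rf} / Test: {test_accuracy_rf}")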
Train: 0.8946138050860843 / Test: 0.8180880076656594
The training accuracy of 0.8946 suggests that the model is able to correctly classify approximately 89.46% of the instances in the training dataset. This indicates that the model performs relatively well on the data it was trained on.
The test accuracy of 0.8181 indicates that the model can correctly classify about 81.81% of the instances in the unseen test dataset. While this is lower than the training accuracy, it still shows that the model generalizes reasonably well to new, unseen data.
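A single train/test split gives only one estimate of how well the model generalizes. As an optional sanity check, here is a minimal cross-validation sketch on toy data; the dataset, the five folds, and the variable names are assumptions for the demo, not part of the workflow above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for the encoded census features.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# Train and score the model on 5 different train/validation splits.
cv_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
mean_cv_accuracy = cv_scores.mean()
print(cv_scores, mean_cv_accuracy)
```

If the fold scores vary a lot, that is a hint that one lucky or unlucky split could be misleading.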
Features Performance
This chart shows us the features which contributed the most to our model. Based on these results, I would suggest going back and playing with the features we dropped when we were splitting our dataset, to see how that changes the performance of our model.
The importance of the number of hours worked per week is apparent. Similarly, the influence of education on income seems intuitive. Relationship and marital status also rank as important; one possible reading is that commitments and responsibilities often necessitate increased effort and earnings, although feature importances alone don't show the direction of an effect or prove causation.
import seaborn as sns
import matplotlib.pyplot as plt
importances = random_forest.feature_importances_
columns = X.columns
sns.barplot(x=importances , y = columns)
plt.xlabel('Features Importance Score')
plt.ylabel('Feature')
plt.title("Visualizing Features Importances")
plt.show()
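If you prefer numbers over a chart, the importances can also be paired with their column names and sorted. This is a small sketch on toy data with hypothetical feature names (feature_0 to feature_4); with the real model you would pass its own feature_importances_ and columns instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with hypothetical feature names, only for illustration.
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]
X_toy = pd.DataFrame(X_arr, columns=feature_names)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important.
ranked_importances = pd.Series(
    model.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(ranked_importances)
```

The importances of a scikit-learn random forest sum to 1, so each value can be read as a share of the total.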
Finally, our classification report:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_test_pred_rf))
              precision    recall  f1-score   support

           0       0.86      0.90      0.88     10258
           1       0.65      0.55      0.60      3309

    accuracy                           0.82     13567
   macro avg       0.76      0.73      0.74     13567
weighted avg       0.81      0.82      0.81     13567
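- Precision: Of the instances predicted as class 0, 86% were correct; of those predicted as class 1, 65% were correct.
- Recall: It captured 90% of class 0 instances and 55% of class 1 instances.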
- F1-score: A balance of precision and recall, with higher scores indicating better performance. Class 0 has an F1-score of 0.88, while class 1 has 0.60.
- Accuracy: Overall correctness of the model, which is 82%.
- Support: Number of instances for each class.
- Macro avg: Unweighted average of precision, recall, and F1-score across classes.
- Weighted avg: Average precision, recall, and F1-score, weighted by each class's support to account for class imbalance.
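In summary, the model performs better at classifying class 0 and struggles with class 1, as shown by its lower precision, recall, and F1-score. Class 1 is also the minority class (3309 of the 13567 test instances), which likely contributes.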
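To see where numbers like these come from, here is a small sketch that derives precision, recall, and F1 for class 1 from a confusion matrix. The ten labels are made up for the demo and are unrelated to the report above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 6 negatives and 4 positives.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: how many predicted positives were right.
precision_class_1 = tp / (tp + fp)
# Recall: how many actual positives were found.
recall_class_1 = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1_class_1 = 2 * precision_class_1 * recall_class_1 / (precision_class_1 + recall_class_1)

print(precision_class_1, recall_class_1, f1_class_1)
```

Here the model finds only half of the positives (recall 0.5) while two out of three positive predictions are right (precision about 0.67), the same kind of gap the report shows for class 1.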
In conclusion, our model could be better, but that's something you can improve.
I'd ❤️ to see you play with this by dropping or adding new columns.
In this article you learned about Decision Trees and Random Forests with Scikit-learn, and you were able to find the features that contribute the most to classifying whether people make more or less than $50K.
More models here: https://github.com/danhergir/decision-trees
More info about this project: https://cseweb.ucsd.edu/classes/sp15/cse190-c/reports/sp15/048.pdf
To learn more about me and discover additional insights visit https://danhergir.com. You can also explore my Medium articles by visiting https://danhergir.medium.com for more in-depth content or connect with me on Twitter @ https://twitter.com/danhergir