Income Prediction with Random Forests and Scikit-learn🌳

6 min read · Feb 17, 2024

Decision trees are a popular and intuitive tool used in machine learning and data mining for classification and regression tasks. They represent a flowchart-like structure where each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or a numerical value. They are called “trees” because they start with a single node, which then splits into branches, and these branches can further split into more branches, forming a tree-like structure.

IBM | https://www.ibm.com/topics/decision-trees
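To make the flowchart idea concrete, here is a minimal sketch that fits a tiny tree and prints its tests as text. The toy data, the feature names and the labels are all made up for illustration; they are not taken from the Census dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [hours_per_week, years_of_education]
X_toy = [[20, 10], [25, 16], [40, 9], [45, 12], [50, 14], [60, 16]]
y_toy = [0, 0, 0, 1, 1, 1]  # 1 = "high income" (made-up labels)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Each internal node is a test on one attribute, each leaf is a class label
rules = export_text(tree, feature_names=["hours_per_week", "years_of_education"])
print(rules)

# Follow the branches for a new, unseen person
print(tree.predict([[48, 15]]))
```

In this toy data a single test on hours_per_week separates the two classes, so the printed tree has one internal node and two leaves.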

Now, Random Forests? It’s like having a bunch of decision trees working together on the same problem. Each tree gets its own slice of the data to train on, and then they all come together to vote on the best solution. It’s like teamwork for models! And yeah, it’s super handy because it helps prevent overfitting and deals well with noisy data. Plus, it’s quick since each tree can be trained independently. Overall, it’s a cool method for both classification and regression tasks.

IBM | https://www.ibm.com/topics/random-forest
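Here is a small sketch of that teamwork, using synthetic data from make_classification (an assumption for the demo, not our Census data). One detail worth knowing: in scikit-learn the “vote” is a soft one, where the forest averages the class probabilities predicted by its trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=5, random_state=0)
forest.fit(X_demo, y_demo)

# Every fitted tree is available in forest.estimators_
per_tree_proba = np.array(
    [t.predict_proba(X_demo[:3]) for t in forest.estimators_]
)

# Averaging the trees' probabilities reproduces the forest's own output
averaged = per_tree_proba.mean(axis=0)
print(averaged)
print(forest.predict_proba(X_demo[:3]))
```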

Project Introduction

This time, we’re going to use Random Forests to do some classification on the following dataset: Census+Income

We’re going to predict whether income exceeds $50K/yr or not, and play with the features which contribute the most to our model.

Let’s start by importing the data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df1 = pd.read_table('./data/income/adult.data', delimiter=',', header=None)

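# Note: the UCI copy of adult.test starts with a non-data line ('|1x3 Cross validator');
# if your file has it, add skiprows=1 to this call.
df2 = pd.read_table('./data/income/adult.test', delimiter=',', header=None)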

# Merge train and test dataframes
df = pd.concat([df1, df2])

I know the data already comes split into train and test files, but I merged them on purpose so we can make our own random split later. Also, if you explore the dataset you’re going to find a lot of null values, and this is something we’d like to deal with so our model can perform better.

Let’s change the name of our columns.

df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']

I found out that null values in this dataset are actually stored as ‘ ?’ (a question mark with a leading space).

# Missing values are stored as the string ' ?' (note the leading space)
is_question_mark = df.isin([' ?'])
contains_question_mark = is_question_mark.any().any()

print("Does the DataFrame contain ' ?' anywhere?", contains_question_mark)

# If yes, count the occurrences
if contains_question_mark:
    count_question_marks = is_question_mark.sum().sum()
    print("Number of occurrences of ' ?':", count_question_marks)

Here’s how the null values are distributed in the different columns.

df = df.replace(' ?', pd.NA)
df.isnull().sum()

Now we get rid of the null values. I don’t suggest doing this every time, but in this case it won’t really affect the performance of our model.

df.dropna(inplace=True)
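As an alternative, the find, replace and drop steps above can be folded into the loading step. The sketch below uses three made-up rows in the same “comma + space” layout and a shortened, hypothetical column list; skipinitialspace and na_values are regular pandas read_csv parameters.

```python
import io
import pandas as pd

# Three made-up rows that mimic the layout of adult.data
sample = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, HS-grad, >50K\n"
    "28, Private, ?, <=50K\n"
)

demo = pd.read_csv(
    sample,
    header=None,
    names=["age", "workclass", "education", "target"],  # shortened for the demo
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat a bare '?' as missing
)

print(demo.isnull().sum())

# Same cleanup as before: keep only complete rows
clean = demo.dropna()
print(len(clean))
```

A side effect of skipinitialspace is that every string loses its leading space, so the labels become '<=50K' instead of ' <=50K'. Any later comparison would need to match that.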

Now, we’re going to find the unique values of our target column.

df['target'].unique()

We see we have the following unique values:

‘ <=50K’, ‘ >50K’, ‘ <=50K.’, ‘ >50K.’

This is something we want to standardize (the labels ending in a period come from the test file), so instead of working with four categorical values we can work with two numerical ones.

def modify_value(value):
    if value == ' <=50K' or value == ' <=50K.':
        return 0
    else:
        return 1

# Apply the function to the column using apply()
df['target'] = df['target'].apply(modify_value)
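The same cleanup can also be written with pandas string methods. This is just a sketch on a four-value Series, not a replacement for the function above; one difference is that map leaves any unexpected label as NaN instead of silently turning it into 1.

```python
import pandas as pd

# The four raw labels found in the merged data
labels = pd.Series([" <=50K", " >50K", " <=50K.", " >50K."])

# Strip surrounding whitespace, then the trailing period
normalized = labels.str.strip().str.rstrip(".")

# Map the two remaining labels to 0 and 1
encoded = normalized.map({"<=50K": 0, ">50K": 1})
print(encoded.tolist())  # [0, 1, 0, 1]
```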

Train & Test split data

I excluded these specific columns (fnlwgt, age, capital-gain and capital-loss) because they are continuous, and my concern was that they could encourage overfitting in our model. Our test set is going to hold 30% of the total data.

X = df.drop(['target', 'fnlwgt', 'age', 'capital-gain', 'capital-loss'], axis=1)
y = df.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
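Since our two classes are not equally frequent, one optional tweak is the stratify parameter of train_test_split, which keeps the class proportions the same in both splits. The sketch below uses made-up labels (75% class 0, 25% class 1); it is not what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 750 zeros and 250 ones
y_demo = np.array([0] * 750 + [1] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Test size, then the share of class 1 in each split
print(len(y_te), y_tr.mean(), y_te.mean())
```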

Encoding our columns

This code uses the OrdinalEncoder from the category_encoders library to convert categorical variables into numerical values. It specifies certain columns to encode, fits the encoder to the training data, and then transforms both the training and test data accordingly.

It’s necessary to perform this encoding to ensure that the categorical variables are compatible with machine learning algorithms, which typically operate on numerical data.

import category_encoders as ce

encoder = ce.OrdinalEncoder(cols = ['workclass', 'education', 'marital-status', 'occupation','relationship', 'race', 'sex', 'native-country'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)
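Why fit on the training data only? Because the test set may contain categories the encoder never saw. The sketch below shows this with scikit-learn’s own OrdinalEncoder (a different class from the category_encoders one used above) on two tiny, made-up frames.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up frames: 'Never-worked' only appears in the test data
train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test = pd.DataFrame({"workclass": ["State-gov", "Never-worked"]})

# Unseen categories get the placeholder code -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = enc.fit_transform(train)  # learn the mapping from train only
test_codes = enc.transform(test)        # reuse that mapping on test

print(enc.categories_)
print(test_codes.ravel())
```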

Training model

Here, n_estimators refers to the number of decision trees in the forest.

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=20, random_state=0)

random_forest.fit(X_train, y_train)

Evaluating our model

Here we try to obtain the accuracy score for our train data and for our test data.

The accuracy score represents the proportion of correctly classified instances out of the total instances in a dataset. It is calculated by dividing the number of correct predictions by the total number of predictions made. In classification tasks, accuracy is a commonly used metric to evaluate the performance of a model, indicating how well it predicts the correct class labels.

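from sklearn.metrics import accuracy_score

# Predict on both splits before scoring
y_train_pred_rf = random_forest.predict(X_train)
y_test_pred_rf = random_forest.predict(X_test)

train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)

test_accuracy_rf = accuracy_score(y_test, y_test_pred_rf)

print(train_accuracy_rf, test_accuracy_rf)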

Train: 0.8946138050860843 / Test: 0.8180880076656594

The training accuracy of 0.8946 means the model correctly classifies approximately 89.46% of the instances in the training dataset, so it performs relatively well on the data it was trained on.

The test accuracy of 0.8181 means the model correctly classifies about 81.81% of the instances in the unseen test dataset. That is roughly 7.6 percentage points below the training accuracy, which hints at some overfitting, but it still shows that the model generalizes reasonably well to new, unseen data.
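A useful sanity check is to compare the test accuracy with what we’d get by always predicting the most common class. The sketch below only does arithmetic on numbers reported in this article: the test accuracy above and the per-class supports from the classification report.

```python
# Test accuracy as printed above
test_accuracy = 0.8180880076656594

# Number of test instances per class (the "support" column of the report)
support_class_0 = 10258
support_class_1 = 3309

# Accuracy of a "model" that always answers class 0
baseline_accuracy = support_class_0 / (support_class_0 + support_class_1)

# How much the Random Forest adds on top of that
improvement = test_accuracy - baseline_accuracy

print(round(baseline_accuracy, 4), round(improvement, 4))
```

So the fair question is not “is 82% good?” but “how much better is it than always guessing class 0?”.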

Features Performance

This chart shows us the features that contributed the most to our model. Based on these results, I’d suggest going back and playing with the features we dropped when we were splitting our dataset, to see how these changes affect the performance of our model.

The link between the number of hours worked per week and income shows up in the chart. Similarly, the influence of education on income seems intuitive. Relationship and marital status also appear to matter, perhaps because commitments and responsibilities often call for more effort and higher earnings. Keep in mind that feature importances tell us which columns the model relies on, not the direction of an effect or whether it is causal.

import seaborn as sns
import matplotlib.pyplot as plt

importances = random_forest.feature_importances_
columns = X.columns

sns.barplot(x=importances , y = columns)
plt.xlabel('Features Importance Score')
plt.ylabel('Feature')
plt.title("Visualizing Features Importances")
plt.show()
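If you want the bars ordered from most to least important, one option is to wrap the scores in a pandas Series and sort it before plotting. The sketch below uses synthetic data and made-up feature names (f0 to f3), so the ranking it prints says nothing about our Census model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
feature_names = ["f0", "f1", "f2", "f3"]
X_arr, y_arr = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_demo, y_arr)

# Pair each score with its column name, then sort in descending order
ranked = pd.Series(model.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

From there, sns.barplot(x=ranked.values, y=ranked.index) draws the same kind of chart as above, just sorted.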

Finally, our classification report:

from sklearn.metrics import classification_report

print(classification_report(y_test,y_test_pred_rf))

              precision    recall  f1-score   support

           0       0.86      0.90      0.88     10258
           1       0.65      0.55      0.60      3309

    accuracy                           0.82     13567
   macro avg       0.76      0.73      0.74     13567
weighted avg       0.81      0.82      0.81     13567

Precision: Of the instances predicted as class 0, 86% really were class 0; for class 1 it was 65%.
Recall: It captured 90% of class 0 instances and 55% of class 1 instances.
F1-score: The harmonic mean of precision and recall, with higher scores indicating better performance. Class 0 has an F1-score of 0.88, while class 1 has 0.60.
Accuracy: Overall correctness of the model, which is 82%.
Support: Number of instances of each class in the test set.
Macro avg: Unweighted average of precision, recall, and F1-score across classes.
Weighted avg: Average of precision, recall, and F1-score weighted by each class’s support, which accounts for class imbalance.

In summary, the model performs better at classifying class 0 and struggles with class 1 (income above $50K), as its lower precision, recall, and F1-score show.
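If you’re curious where the F1 and average rows come from, here is a small sketch that recomputes them from the rounded precision, recall and support values in the report above. Because the inputs are already rounded, the results only match the report to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the classification report
f1_class_0 = f1(0.86, 0.90)
f1_class_1 = f1(0.65, 0.55)
support_0, support_1 = 10258, 3309

# Macro avg: plain mean, every class counts the same
macro_f1 = (f1_class_0 + f1_class_1) / 2

# Weighted avg: each class counts in proportion to its support
weighted_f1 = (f1_class_0 * support_0 + f1_class_1 * support_1) / (support_0 + support_1)

print(round(f1_class_0, 2), round(f1_class_1, 2))
print(round(macro_f1, 2), round(weighted_f1, 2))
```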

In conclusion, our model could be better, but that’s something you can improve.

I’d ❤️ to see you play with this by dropping or adding new columns.
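One way to run that experiment is a small helper that retrains the model without a given set of columns and reports the test accuracy. This is only a sketch: the helper name accuracy_without, the column names a to e and the synthetic data are all made up, so swap in the Census DataFrame and its real columns to try it for real.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names
X_arr, y_arr = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["a", "b", "c", "d", "e"])

def accuracy_without(dropped):
    """Retrain without the given columns and return the test accuracy."""
    X_subset = X_demo.drop(columns=dropped)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_subset, y_arr, test_size=0.3, random_state=0
    )
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

# Compare a few column sets side by side
results = {tuple(d): accuracy_without(d) for d in ([], ["a"], ["a", "b"])}
for dropped, score in results.items():
    print(dropped, round(score, 3))
```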

In this article you learned about Decision Trees and Random Forests with Scikit-learn, and you saw which features contribute the most when classifying whether a person makes more or less than $50K.

More models here: https://github.com/danhergir/decision-trees

More info about this project: https://cseweb.ucsd.edu/classes/sp15/cse190-c/reports/sp15/048.pdf

To learn more about me and discover additional insights visit https://danhergir.com. You can also explore my Medium articles by visiting https://danhergir.medium.com for more in-depth content or connect with me on Twitter @ https://twitter.com/danhergir


Written by Daniel Hernandez
