Implementing Multi-class Logistic Regression with Scikit-learn 🎯

Daniel Hernandez
5 min read · Feb 10, 2024


Beans (Multiple Classes)

Normally we use logistic regression for binary classification, meaning that a prediction is either true or false. But with the same method, we can also classify observations into more than two classes.

To perform logistic regressions, we rely on the sigmoid function, which allows us to establish values between 0 and 1 in a single category. In contrast, multinomial logistic regression uses the softmax function. Softmax assigns probabilities to each category such that the sum of the probabilities for all categories equals one. This ensures that the predicted probabilities represent a valid probability distribution.
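To make that concrete, here is a tiny stand-alone sketch in plain NumPy (an illustration of the idea, not part of the original walkthrough) showing how softmax turns a vector of raw class scores into probabilities that sum to one; for two classes it reduces to the familiar sigmoid.

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then exponentiate and normalize
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

raw_scores = np.array([2.0, 1.0, 0.1])  # one raw score per class
probs = softmax(raw_scores)
print(probs)        # ~[0.66 0.24 0.10]
print(probs.sum())  # 1.0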

Let’s get into it! ✨

You can download the dataset from here

Here are some of the requirements you’ll need to install to start this project:

pip install matplotlib pandas scikit-learn imbalanced-learn seaborn

Let’s import some dependencies and take a look at our data

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
import seaborn as sns
sns.set()

df = pd.read_csv('./data/Dry_Bean.csv')
df['Class'].unique()

The “Class” column refers to the categories of dry beans in our dataset which are: Seker, Barbunya, Bombay, Cali, Horoz, Sira, Dermason.

Our dataset does not contain any null or NA values, which is going to make our exploratory analysis much easier. You can check this by running the following line.

df.isnull().sum()

Exploring categories and resampling 🚀

We’re going to check the number of samples we have for each of our classes:

sns.countplot(x='Class', data=df)
plt.xticks(rotation=45)
plt.show()

The first thing we notice in this chart is that some classes have far fewer samples than others. In order to properly train our model, it would be better to have a more balanced distribution in our dataset.

Bean Classes Count

In the piece of code below, we use a technique called random under-sampling: instances of the majority classes are selected at random and removed from the dataset until a more balanced distribution is reached.

from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler(random_state=42)

X = df.drop('Class', axis=1)
y = df.Class

X_over, y_over = undersample.fit_resample(X, y)

sns.countplot(x=y_over)
plt.xticks(rotation=45)
plt.show()

This is what a more balanced distribution looks like 👀

Under-sampling class distribution

If you run the following code, you’ll see that we ended up with fewer rows in our dataset.

# This is the shape of our dataset before balancing
df.shape

# Shape of dataset after balance
X_over.shape

Now, let’s convert our categories into numbers, giving a reference number to every class.

import numpy as np
y_over.replace(list(np.unique(y_over)), [1, 2, 3, 4, 5, 6, 7], inplace=True)
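If you also want to keep a record of which number ends up assigned to which class (handy later when reading the confusion matrix), you could build the mapping explicitly before running the replace above. This is a small optional sketch, and class_mapping is just an illustrative name, not part of the original code:

# Optional: make the label encoding explicit and keep it around for later
class_names = list(np.unique(df['Class']))  # original string labels
class_mapping = {name: i + 1 for i, name in enumerate(class_names)}
print(class_mapping)  # shows which number corresponds to which bean class

# Equivalent encoding using the mapping:
# y_over = y_over.map(class_mapping)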

Once we’ve done this, we can move into finding correlations between our data columns.

df_dea = X_over.copy()  # work on a copy so the 'Class' column isn't added to our feature matrix
df_dea['Class'] = y_over

plt.figure(figsize=(15, 10))
sns.heatmap(df_dea.corr(), annot=True)
plt.show()

Here we can observe that there’s a high correlation between the columns ‘ConvexArea’ and ‘EquivDiameter’.

Correlation Heatmap

The high correlation between those two columns could lead us to an overfitted model, and that’s something we really want to avoid. Let’s get rid of those columns.

# These columns may create an overfitted model
X_over.drop(['ConvexArea', 'EquivDiameter'], axis=1, inplace=True)

Building our model 🦾

Now, let’s get down to business! We are going to start training our model!

In order to do that, we will start by splitting our data into training and test sets, using 20% of the dataset for testing.

X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, random_state=42, shuffle=True, test_size=.2)

Of course, we can’t forget to scale our data. This is important because it balances the impact of all the features and helps the solver converge, which can improve the performance of the algorithm.

st_x = StandardScaler()
X_train = st_x.fit_transform(X_train)
X_test = st_x.transform(X_test)

For our multi-class model, we are going to try several solvers and multi-class strategies, which is why we’ll use the following helper function.

def logistic_model(C, solver_, multiclass_):
    logistic_regression_model = LogisticRegression(random_state=42, solver=solver_, multi_class=multiclass_, n_jobs=1, C=C)
    return logistic_regression_model

And we’re going to pass it the following arguments, obtain the accuracy scores, and then display the results.

multiclass = ['ovr', 'multinomial']
solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']
scores = []
params = []

for i in multiclass:
    for j in solver_list:
        try:
            model = logistic_model(1, j, i)
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            params.append(i + ' ' + j)
            accuracy = accuracy_score(y_test, predictions)
            scores.append(accuracy)
        except ValueError:
            # 'liblinear' does not support the multinomial setting, so skip invalid combinations
            pass

sns.barplot(x=params, y=scores).set_title('Beans Accuracy')
plt.xticks(rotation=90)
plt.show()

We can see that all of our models performed remarkably well, but we’ll stick with ‘multinomial newton-cg’, the combination with the highest accuracy score.

Accuracy scores
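If you’d rather pick the winner programmatically instead of reading it off the chart, a quick sketch using the params and scores lists built in the loop above would be:

# Index of the highest accuracy collected in the loop
best_idx = int(np.argmax(scores))
print(f'Best combination: {params[best_idx]} with accuracy {scores[best_idx]:.4f}')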

Here we can take a closer look at the results of that model:

model = logistic_model(1, 'newton-cg', 'multinomial')
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))


# Confusion Matrix Heatmap
cm = confusion_matrix(y_test, predictions, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()
[[128   0   0   0   0   0   0]
 [  0  98   0   0   0   0   0]
 [  0   0 102   1   0   0   0]
 [  0   0   0 105   0   0   0]
 [  0   0   1   1  94   0   0]
 [  0   0   0   0   0  90   1]
 [  0   0   0   0   0   0 110]]

Accuracy: 0.9945280437756497

Here are the results for our ‘multinomial newton-cg’ model. Remarkably, it made only 4 mistakes 🎯.

Confusion Matrix Heat map
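If you want a more detailed breakdown of where those mistakes happen, scikit-learn’s classification_report prints per-class precision, recall, and F1-score; this is an optional addition, not part of the original walkthrough:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1-score for the final model
print(classification_report(y_test, predictions))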

Conclusion:

If you’d like to explore this code a little bit more or take a look at other Logistic Regression projects, visit this repository. I’d ❤️ to see any kind of PR or feedback.

In this article, you learned how to use multi-class logistic regression, how to scale your data, and why random under-sampling helps with imbalanced classes.

To learn more about me and discover additional insights visit https://danhergir.com. You can also explore my Medium articles by visiting https://danhergir.medium.com for more in-depth content or connect with me on Twitter @ https://twitter.com/danhergir
