Implementing Multi-class Logistic Regression with Scikit-learn
Normally, we use logistic regression for binary classification, meaning a prediction is either true or false. But the same method can also be used to classify observations into multiple classes.
Binary logistic regression relies on the sigmoid function, which maps a score to a probability between 0 and 1 for a single class. Multinomial logistic regression instead uses the softmax function, which assigns a probability to each category so that the probabilities across all categories sum to one. This ensures the predicted probabilities form a valid probability distribution.
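To make the difference concrete, here is a minimal NumPy sketch (the scores are made-up logits, purely for illustration) of how sigmoid turns one score into one probability, while softmax turns a vector of scores into a distribution that sums to one.
import numpy as np

def sigmoid(z):
    # Binary case: one score -> one probability between 0 and 1
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Multinomial case: one score per class -> probabilities that sum to 1
    z = z - np.max(z)  # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(sigmoid(2.0))                        # ~0.88
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10], sums to 1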
Let's get into it! ✨
You can download the dataset from here
Here are some of the requirements you'll need to install to start this project:
pip install matplotlib pandas seaborn scikit-learn imbalanced-learn
Let's import some dependencies and take a look at our data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
import seaborn as sns
sns.set()
df = pd.read_csv('./data/Dry_Bean.csv')
df['Class'].unique()
The 'Class' column refers to the categories of dry beans in our dataset, which are: Seker, Barbunya, Bombay, Cali, Horoz, Sira, and Dermason.
Our dataset has no null or NA values, which will make our exploratory analysis much easier. You can check this by running the following line.
df.isnull().sum()
Exploring categories and resampling
We're going to check the number of values we have for our different classes.
sns.countplot(x='Class', data=df)
plt.xticks(rotation=45)
plt.show()
The first thing we notice from this chart is that some classes have far fewer samples than others. To train our model properly, it would help to have a more balanced distribution in our dataset.
In the piece of code below, we use a technique called random under-sampling: examples from the larger classes are randomly selected and removed from the training dataset until the class distribution is balanced. By default, RandomUnderSampler brings every class down to the size of the smallest one.
import imblearn
from imblearn.under_sampling import RandomUnderSampler
undersample = RandomUnderSampler(random_state=42)
X = df.drop('Class', axis=1)
y = df.Class
X_over, y_over = undersample.fit_resample(X, y)
sns.countplot(x=y_over)
plt.xticks(rotation=45)
plt.show()
This is what a more balanced distribution looks like.
If you run the following code, you'll see that we ended up with fewer rows in our dataset.
# This is the shape of our dataset before balancing
df.shape
# Shape of our dataset after balancing
X_over.shape
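As an extra sanity check (not part of the original flow), you can also confirm that every class now has the same number of rows:
# Each class should now have the same number of samples
print(y_over.value_counts())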
Now, let's convert our categories into numbers and give a reference number to every class.
import numpy as np
y_over.replace(list(np.unique(y_over)), [1, 2, 3, 4, 5, 6, 7], inplace=True)
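Since np.unique returns the class names in alphabetical order, it can be handy to keep the mapping from number back to class name for interpreting results later. This little lookup table is just a convenience I'm adding here:
# Map each numeric label back to its original class name (alphabetical order)
class_names = list(np.unique(df['Class']))
label_map = {number: name for number, name in enumerate(class_names, start=1)}
print(label_map)  # e.g. {1: 'Barbunya', 2: 'Bombay', ...}, depending on how the names appear in the CSV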
Once we've done this, we can move on to finding correlations between our data columns.
# Work on a copy so that adding the target column doesn't modify X_over itself
df_dea = X_over.copy()
df_dea['Class'] = y_over
plt.figure(figsize=(15, 10))
sns.heatmap(df_dea.corr(), annot=True)
plt.show()
Here we can observe that there's a high correlation between the columns 'ConvexArea' and 'EquivDiameter'.
The high correlation between those two columns means they carry redundant information, which could lead to an overfitted model, and that's something we really want to avoid. Let's get rid of those columns.
# These columns may lead to an overfitted model
X_over.drop(['ConvexArea', 'EquivDiameter'], axis=1, inplace=True)
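If you prefer to find highly correlated pairs programmatically instead of reading them off the heatmap, a small helper like this works (my own addition, with an arbitrary 0.9 threshold):
# List feature pairs whose absolute correlation exceeds a threshold
corr = df_dea.drop('Class', axis=1).corr().abs()
threshold = 0.9
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if corr.loc[col_a, col_b] > threshold:
            print(col_a, col_b, round(corr.loc[col_a, col_b], 3))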
Building our model
Now, let's get down to business! We are going to start training our model!
To do that, we start by splitting our data into train and test sets, using 20% of the dataset as test data.
X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, random_state=42, shuffle=True, test_size=.2)
Obviously, we can't forget to scale our data. This is important because it balances the influence of all variables on the model and helps the gradient-based solvers converge faster.
st_x = StandardScaler()
X_train = st_x.fit_transform(X_train)
X_test = st_x.transform(X_test)
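As a quick optional check, after standardization every training feature should have a mean close to 0 and a standard deviation close to 1:
# Scaled training features are roughly zero-mean, unit-variance
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))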
For our multi-class model, we are going to try several solvers and multi-class strategies, which is why we define the following function.
def logistic_model(C, solver_, multiclass_):
    logistic_regression_model = LogisticRegression(random_state=42, solver=solver_, multi_class=multiclass_, n_jobs=1, C=C)
    return logistic_regression_model
We then pass it the following arguments, collect the accuracy score for each combination, and plot the results.
multiclass = ['ovr', 'multinomial']
solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']
scores = []
params = []
for i in multiclass:
    for j in solver_list:
        try:
            model = logistic_model(1, j, i)
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            params.append(i + ' ' + j)
            accuracy = accuracy_score(y_test, predictions)
            scores.append(accuracy)
        except ValueError:
            # Some combinations are not supported (e.g. liblinear with multinomial)
            continue

sns.barplot(x=params, y=scores).set_title('Beans Accuracy')
plt.xticks(rotation=90)
plt.show()
All of our models performed really well, but we'll stick with 'multinomial newton-cg', which has the highest accuracy score.
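Rather than eyeballing the bar chart, you can also pick the best combination directly from the lists we collected (a small convenience snippet, not part of the original code):
# Pair every accuracy with its parameter combination and keep the best one
best_score, best_params = max(zip(scores, params))
print(f'Best combination: {best_params} with accuracy {best_score:.4f}')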
Here we can take a closer look at the results of our model:
model = logistic_model(1, 'newton-cg', 'multinomial')
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
# Confusion Matrix Heatmap
cm = confusion_matrix(y_test, predictions, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()
[[128 0 0 0 0 0 0]
[ 0 98 0 0 0 0 0]
[ 0 0 102 1 0 0 0]
[ 0 0 0 105 0 0 0]
[ 0 0 1 1 94 0 0]
[ 0 0 0 0 0 90 1]
[ 0 0 0 0 0 0 110]]
Accuracy: 0.9945280437756497
Here are the results for our 'multinomial newton-cg' model. Impressively, it made only 4 mistakes.
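If you want per-class precision and recall on top of the raw confusion matrix, scikit-learn's classification_report gives a compact summary (an optional extra, not shown in the original results):
from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for the final model
print(classification_report(y_test, predictions))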
Conclusion:
If you'd like to explore this code a little bit more or take a look at other Logistic Regression projects, visit this repository. I'd ❤️ to see any kind of PR or feedback.
In this reading, you learned how to use multi-class Logistic Regression, how to scale your data, and when to use random under-sampling.
To learn more about me and discover additional insights, visit https://danhergir.com. You can also explore my Medium articles at https://danhergir.medium.com for more in-depth content, or connect with me on Twitter @ https://twitter.com/danhergir