College admit prediction - Logistic Regression¶

1.1.0 Import packages¶

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import seaborn as sns

In [2]:
#import dataset
ohe_df = pd.get_dummies(df['rank'],prefix='rank',drop_first=True)
df = pd.concat([df,ohe_df],axis=1)

Out[2]:
admit gre gpa rank rank_2 rank_3 rank_4
0 0 380 3.61 3 0 1 0
1 1 660 3.67 3 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 4 0 0 1
4 0 520 2.93 4 0 0 1
In [22]:
# covariance matrix of the ohe concat matrix
sns.heatmap(correlation,annot=True)

Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f567e664d50>

2.1.0 Pre-Processing data¶

• There are no missing values in the dataset - so we got lucky here
• The interesting thing about the features gre,gpa,rank is that they are of drastically different scales
• This means we need to normalize the features
• Ideally Regression problems don't require normalization
• The caveat however is that for some value $x$, $e^x$ might become too small or too large to be represented if we don't normalize the features
• This means that the sigmoid function would be non-computable at these above values of $x$
• I've chosen to normalize each feature based on its min() and max()
• I've chosen to normalize features just before starting an epoch of gradient descent - this handles cases where user forgot to provide normalized features to the gradient descent function

3.1.0 Splitting data sets¶

• Final Feature space
7
• Training Data Validation Data Test Data Total Data Points
70% 20% 10% 400
In [23]:
# doing a randomized split of the dataset so as to reduce training bias

df_train = df.sample(n=int((70/100)*(400)))
df_validation = df.sample(n=int((20/100)*(400)))
df_test = df.sample(n=int((10/100)*(400)))
df_all = df.sample(df.shape[0])

In [24]:
# Data point counts

print('{0:15} : ({1},{2})'.format('Training Data',df_train.shape[0],df_train.shape[1]))
print('{0:15} : ({1},{2})'.format('Validation Data',df_validation.shape[0],df_validation.shape[1]))
print('{0:15} : ({1},{2})'.format('Test Data',df_test.shape[0],df_test.shape[1]))

Training Data   : (280,7)
Validation Data : (80,7)
Test Data       : (40,7)


4.1.0 Logistic Regression - Gradient Ascent - Maximizing the log-likelihood¶

• The key idea behind logistic regression is the sigmoid function
• Our model is
• $h(\vec{x}) = \sigma(\vec{\text{w}}^T\vec{x})$
• The key idea here is to figure out the weights $\vec{\text{w}}$
• Our goal is to maximize the log-likelihood given by
• $l(\text{w}) = \displaystyle\sum_{k=1}^m \hat{y}_k \ln(\sigma(\text{w}^T\hat{x}_k)) + (1-\hat{y}_k)\ln(1 - \sigma(\text{w}^T\hat{x}_k))$ , where $\text{w}, \hat{y}_k, \hat{x}_k$ are vectors
• The perceptron learns by updating it's weights using learning rate $\alpha$ as follows
• $\text{w} = \text{w} + \alpha\displaystyle\sum_{k=1}^m(\hat{y}_k - \sigma(\text{w}^T\hat{x}_k))\hat{x}_k$
In [25]:
# helper functions
def sigmoid(_X):
return 1/(1 + np.exp(-_X))

# normalizing features
for col in range(_X.shape[1]):
_X[:,col] = (_X[:,col] - _X[:,col].min())/(_X[:,col].max() - _X[:,col].min())

# initializing weights to 1.0f and accounting for w0
W = np.ones([_X.shape[1]+1,1])
X = np.concatenate((np.ones([_X.shape[0],1]),_X),axis=1)

for i in range(_epochs):
h = sigmoid(np.dot(X,W))
W = W + _alpha*X.T.dot((_Y - h))
if _verbose and i%1000 == 0:
cost = (-_Y*np.log(h) - (1-_Y)*np.log(1-h))
print('epoch( {0:4d}/{1:} ) loss_avg : {2}'.format(i,_epochs,((cost - cost.min())/(cost.max() - cost.min())).mean()))
return W

def train(_X,_Y,_epochs,_alpha,_verbose=True):

def predict(_X,_W,show_prob=False):
# normalizing features
for col in range(_X.shape[1]):
_X[:,col] = (_X[:,col] - _X[:,col].min())/(_X[:,col].max() - _X[:,col].min())
X = np.insert(_X,0,np.ones(_X.shape[0]),axis=1)

if show_prob:
return sigmoid(np.dot(X,_W))

return sigmoid(np.dot(X,_W)).round().astype(int)

def plot(df,feature):
plt.xlabel(feature)
plt.legend(loc='best')
plt.show()

def pair_plot(df,f1,f2):
# plots
plt.legend()
plt.show()


4.2.0 Logistic Regression - Model Training¶

In [26]:
x_cols = ['gre','rank_2','rank_3','rank_4']
W = train(df_train[x_cols].to_numpy(),df_train[y_col].to_numpy(),20000,0.001,True)

epoch(    0/20000 ) loss_avg : 0.46142299532601927
epoch( 1000/20000 ) loss_avg : 0.2724538350706027
epoch( 2000/20000 ) loss_avg : 0.2689768182124872
epoch( 3000/20000 ) loss_avg : 0.26855544265727493
epoch( 4000/20000 ) loss_avg : 0.2685016694646699
epoch( 5000/20000 ) loss_avg : 0.26849476750003903
epoch( 6000/20000 ) loss_avg : 0.2684938810321483
epoch( 7000/20000 ) loss_avg : 0.26849376716845014
epoch( 8000/20000 ) loss_avg : 0.268493752542935
epoch( 9000/20000 ) loss_avg : 0.2684937506643213
epoch( 10000/20000 ) loss_avg : 0.26849375042301776
epoch( 11000/20000 ) loss_avg : 0.26849375039202283
epoch( 12000/20000 ) loss_avg : 0.26849375038804163
epoch( 13000/20000 ) loss_avg : 0.2684937503875302
epoch( 14000/20000 ) loss_avg : 0.2684937503874647
epoch( 15000/20000 ) loss_avg : 0.2684937503874562
epoch( 16000/20000 ) loss_avg : 0.2684937503874559
epoch( 17000/20000 ) loss_avg : 0.2684937503874559
epoch( 18000/20000 ) loss_avg : 0.2684937503874559
epoch( 19000/20000 ) loss_avg : 0.2684937503874559


4.3.0 Logistic Regression - Model Prediction¶

In [27]:
df_validation['predicted admit'] = predict(df_validation[x_cols].to_numpy(),W)

Out[27]:
186 0 560 3.61 3 0 1 0 0
225 1 720 3.50 3 0 1 0 0
302 1 400 3.15 2 1 0 0 0
54 0 660 3.34 3 0 1 0 0
309 0 440 2.98 3 0 1 0 0

4.4.0 Logistic Regression - Model Validation against scikit¶

• We've finished training our model and running it on validation data, but if we don't have any standard to compare it with then we basically don't know if the model is working as it should

• Let's use scikit-learn on the same data and check predictions

In [28]:
# setting up scikit learns logistic regression

clf = LogisticRegression(random_state=0).fit(df_train[x_cols].to_numpy(),df_train[y_col].to_numpy().ravel())

In [29]:
# generating scikit learn predictions - validation data


Out[29]:
186 0 560 3.61 3 0 1 0 0 0
225 1 720 3.50 3 0 1 0 0 0
302 1 400 3.15 2 1 0 0 0 0
54 0 660 3.34 3 0 1 0 0 0
309 0 440 2.98 3 0 1 0 0 0

5.1.0 Model Results¶

• We've finished training, validating and testing out model - let's now actually graphically see our results
In [30]:
plot(df_validation,'gre')

In [31]:
plot(df_validation,'gpa')

In [32]:
plot(df_validation,'rank')

In [33]:
pair_plot(df_validation,'gre','gpa')

In [34]:
plot(df_validation,'rank')

In [35]:
df_validation.dtypes

Out[35]:
admit                       int64
gre                         int64
gpa                       float64
rank                        int64
rank_2                      uint8
rank_3                      uint8
rank_4                      uint8
dtype: object

5.2.0 Model Performance and comparision¶

In [41]:
#test data predictions