College admit prediction - Logistic Regression


1.1.0 Import packages

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import seaborn as sns
In [2]:
# import dataset and one-hot encode the 'rank' feature
df = pd.read_csv("data/RAW_Admit.csv")
ohe_df = pd.get_dummies(df['rank'],prefix='rank',drop_first=True)   # drop_first avoids the dummy-variable trap
df = pd.concat([df,ohe_df],axis=1)
df.head()
Out[2]:
admit gre gpa rank rank_2 rank_3 rank_4
0 0 380 3.61 3 0 1 0
1 1 660 3.67 3 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 4 0 0 1
4 0 520 2.93 4 0 0 1
In [22]:
# correlation matrix of the one-hot encoded features (gre and the rank dummies)
correlation = df.drop(['admit','rank','gpa'],axis=1).corr()
sns.heatmap(correlation,annot=True)
Out[22]:
[seaborn heatmap of the correlation matrix between gre and the rank dummy variables]

2.1.0 Pre-Processing data

  • There are no missing values in the dataset - so we got lucky here
  • The interesting thing about the features gre, gpa and rank is that they are on drastically different scales
  • This means we need to normalize the features
  • Strictly speaking, logistic regression doesn't require normalization to be well-defined
  • The caveat, however, is that for a large-magnitude score $x = \text{w}^T\vec{x}$, $e^{-x}$ can overflow or underflow in floating point if we don't normalize the features
  • This means the sigmoid function becomes numerically unusable at those values of $x$
  • I've chosen to min-max normalize each feature using its min() and max()
  • I've chosen to normalize features inside the training and prediction functions themselves - this handles the case where the user forgets to pass normalized features (see the short sketch after this list)
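
To make the overflow point concrete, here's a quick standalone sketch (the ±800 scores are just illustrative; the gre values come from the dataset's head above) showing how an unnormalized score breaks the exponential while min-max scaling keeps things in a safe range:

import numpy as np

# raw scores on the gre scale: exp(800) overflows float64, so the sigmoid saturates
raw_scores = np.array([-800.0, 0.0, 800.0])
with np.errstate(over='ignore'):
    print(1/(1 + np.exp(-raw_scores)))      # [0.  0.5 1. ]

# min-max scaling maps a feature into [0, 1], keeping exp() well-behaved
gre = np.array([380.0, 660.0, 800.0, 640.0, 520.0])
gre_scaled = (gre - gre.min())/(gre.max() - gre.min())
print(gre_scaled)                           # all values lie in [0, 1]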

3.1.0 Splitting data sets

  • Final feature space: 7 columns
  • Data split: Training Data 70%, Validation Data 20%, Test Data 10% (400 total data points)
In [23]:
# doing a randomized split of the dataset so as to reduce training bias
# shuffle once, then slice into disjoint train / validation / test sets (70/20/10)

df_all = df.sample(frac=1)
n_train, n_validation = int((70/100)*df.shape[0]), int((20/100)*df.shape[0])
df_train = df_all.iloc[:n_train].copy()
df_validation = df_all.iloc[n_train:n_train + n_validation].copy()
df_test = df_all.iloc[n_train + n_validation:].copy()
In [24]:
# Data point counts

print('{0:15} : ({1},{2})'.format('Training Data',df_train.shape[0],df_train.shape[1]))
print('{0:15} : ({1},{2})'.format('Validation Data',df_validation.shape[0],df_validation.shape[1]))
print('{0:15} : ({1},{2})'.format('Test Data',df_test.shape[0],df_test.shape[1]))
Training Data   : (280,7)
Validation Data : (80,7)
Test Data       : (40,7)

4.1.0 Logistic Regression - Gradient Ascent - Maximizing the log-likelihood

  • The key idea behind logistic regression is the sigmoid function
  • Our model is
    • $h(\vec{x}) = \sigma(\vec{\text{w}}^T\vec{x})$
  • The key idea here is to figure out the weights $\vec{\text{w}}$
  • Our goal is to maximize the log-likelihood given by
    • $l(\text{w}) = \displaystyle\sum_{k=1}^m \hat{y}_k \ln(\sigma(\text{w}^T\hat{x}_k)) + (1-\hat{y}_k)\ln(1 - \sigma(\text{w}^T\hat{x}_k))$ , where $\text{w}, \hat{x}_k$ are vectors and $\hat{y}_k \in \{0,1\}$ is the label
    • The model learns by updating its weights using learning rate $\alpha$ as follows (the short derivation after this list shows where this update comes from)
    • $\text{w} = \text{w} + \alpha\displaystyle\sum_{k=1}^m(\hat{y}_k - \sigma(\text{w}^T\hat{x}_k))\hat{x}_k$
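
A quick check on where this update comes from, using the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

  • Let $z_k = \text{w}^T\hat{x}_k$ and $h_k = \sigma(z_k)$; differentiating $l(\text{w})$ gives $\dfrac{\partial l}{\partial \text{w}} = \displaystyle\sum_{k=1}^m \left(\dfrac{\hat{y}_k}{h_k} - \dfrac{1-\hat{y}_k}{1-h_k}\right)h_k(1-h_k)\hat{x}_k$
  • The factors cancel, leaving $\dfrac{\partial l}{\partial \text{w}} = \displaystyle\sum_{k=1}^m (\hat{y}_k - h_k)\hat{x}_k$
  • Stepping in the direction of this gradient with step size $\alpha$ - ascent, since we are maximizing - recovers the weight update above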
In [25]:
# helper functions
def sigmoid(_X):
    return 1/(1 + np.exp(-_X))

def gradient_descent(_X,_Y,_epochs,_alpha,_verbose):
    # min-max normalize each feature column (see section 2.1.0)
    for col in range(_X.shape[1]):
        _X[:,col] = (_X[:,col] - _X[:,col].min())/(_X[:,col].max() - _X[:,col].min())
        
    # initializing weights to 1.0 and accounting for the bias weight w0
    W = np.ones([_X.shape[1]+1,1])
    X = np.concatenate((np.ones([_X.shape[0],1]),_X),axis=1)
    
    # gradient ascent on the log-likelihood (despite the function's name)
    for i in range(_epochs):
        h = sigmoid(np.dot(X,W))
        W = W + _alpha*X.T.dot((_Y - h))
        if _verbose and i%1000 == 0:
            # per-sample cross-entropy, min-max scaled before averaging
            cost = (-_Y*np.log(h) - (1-_Y)*np.log(1-h))
            print('epoch( {0:4d}/{1:} ) loss_avg : {2}'.format(i,_epochs,((cost - cost.min())/(cost.max() - cost.min())).mean()))
    return W

def train(_X,_Y,_epochs,_alpha,_verbose=True):
    return gradient_descent(_X,_Y,_epochs,_alpha,_verbose)

def predict(_X,_W,show_prob=False):
    # min-max normalize with this batch's own min()/max(), mirroring what training did
    for col in range(_X.shape[1]):
        _X[:,col] = (_X[:,col] - _X[:,col].min())/(_X[:,col].max() - _X[:,col].min())
    # prepend the bias column of ones
    X = np.insert(_X,0,np.ones(_X.shape[0]),axis=1)
    
    if show_prob:
        return sigmoid(np.dot(X,_W))
    
    # round the probabilities to get hard 0/1 predictions
    return sigmoid(np.dot(X,_W)).round().astype(int)

def plot(df,feature):
    plt.plot(df[feature],df['admit'],'.',color='xkcd:azure',label='real admit')
    plt.plot(df[feature],df['predicted admit'],'+',color='xkcd:orange',label='predicted admit')
    plt.xlabel(feature)
    plt.ylabel('admit')
    plt.legend(loc='best')
    plt.show()
    
def pair_plot(df,f1,f2):
    # plots
    plt.scatter(df[df['admit']==1][f1], df[df['admit']==1][f2], s=10, label='admit')
    plt.scatter(df[df['admit']==0][f1], df[df['admit']==0][f2], s=10, label='not admit')
    plt.legend()
    plt.show()

4.2.0 Logistic Regression - Model Training

In [26]:
x_cols = ['gre','rank_2','rank_3','rank_4']
y_col = ['admit']
W = train(df_train[x_cols].to_numpy(),df_train[y_col].to_numpy(),20000,0.001,True)
df_train['predicted admit'] = predict(df_train[x_cols].to_numpy(),W)
epoch(    0/20000 ) loss_avg : 0.46142299532601927
epoch( 1000/20000 ) loss_avg : 0.2724538350706027
epoch( 2000/20000 ) loss_avg : 0.2689768182124872
epoch( 3000/20000 ) loss_avg : 0.26855544265727493
epoch( 4000/20000 ) loss_avg : 0.2685016694646699
epoch( 5000/20000 ) loss_avg : 0.26849476750003903
epoch( 6000/20000 ) loss_avg : 0.2684938810321483
epoch( 7000/20000 ) loss_avg : 0.26849376716845014
epoch( 8000/20000 ) loss_avg : 0.268493752542935
epoch( 9000/20000 ) loss_avg : 0.2684937506643213
epoch( 10000/20000 ) loss_avg : 0.26849375042301776
epoch( 11000/20000 ) loss_avg : 0.26849375039202283
epoch( 12000/20000 ) loss_avg : 0.26849375038804163
epoch( 13000/20000 ) loss_avg : 0.2684937503875302
epoch( 14000/20000 ) loss_avg : 0.2684937503874647
epoch( 15000/20000 ) loss_avg : 0.2684937503874562
epoch( 16000/20000 ) loss_avg : 0.2684937503874559
epoch( 17000/20000 ) loss_avg : 0.2684937503874559
epoch( 18000/20000 ) loss_avg : 0.2684937503874559
epoch( 19000/20000 ) loss_avg : 0.2684937503874559

4.3.0 Logistic Regression - Model Prediction

In [27]:
df_validation['predicted admit'] = predict(df_validation[x_cols].to_numpy(),W)
df_validation.head()
Out[27]:
admit gre gpa rank rank_2 rank_3 rank_4 predicted admit
186 0 560 3.61 3 0 1 0 0
225 1 720 3.50 3 0 1 0 0
302 1 400 3.15 2 1 0 0 0
54 0 660 3.34 3 0 1 0 0
309 0 440 2.98 3 0 1 0 0

4.4.0 Logistic Regression - Model Validation against scikit

  • We've finished training our model and running it on the validation data, but without a standard implementation to compare against we can't really tell whether the model behaves as it should

  • Let's run scikit-learn's LogisticRegression on the same data and compare the predictions

In [28]:
# setting up scikit-learn's logistic regression (L2-regularized by default, trained on the raw features)

clf = LogisticRegression(random_state=0).fit(df_train[x_cols].to_numpy(),df_train[y_col].to_numpy().ravel())
In [29]:
# generating scikit learn predictions - validation data

df_validation['scikit predicted admit'] = clf.predict(df_validation[x_cols].to_numpy())
df_validation.head()
Out[29]:
admit gre gpa rank rank_2 rank_3 rank_4 predicted admit scikit predicted admit
186 0 560 3.61 3 0 1 0 0 0
225 1 720 3.50 3 0 1 0 0 0
302 1 400 3.15 2 1 0 0 0 0
54 0 660 3.34 3 0 1 0 0 0
309 0 440 2.98 3 0 1 0 0 0

5.1.0 Model Results

  • We've finished training and validating our model - let's now actually look at the results graphically
In [30]:
plot(df_validation,'gre')
In [31]:
plot(df_validation,'gpa')
In [32]:
plot(df_validation,'rank')
In [33]:
pair_plot(df_validation,'gre','gpa')
In [35]:
df_validation.dtypes
Out[35]:
admit                       int64
gre                         int64
gpa                       float64
rank                        int64
rank_2                      uint8
rank_3                      uint8
rank_4                      uint8
predicted admit             int64
scikit predicted admit      int64
dtype: object

5.2.0 Model Performance and comparison

In [41]:
# test data predictions
df_test['predicted admit'] = predict(df_test[x_cols].to_numpy(),W)
df_test['scikit predicted admit'] = clf.predict(df_test[x_cols].to_numpy())

# accuracy scores (accuracy_score expects y_true first, then y_pred)
my_accuracy = accuracy_score(df_test[y_col].to_numpy(),df_test[['predicted admit']].to_numpy())
scikit_accuracy = accuracy_score(df_test[y_col].to_numpy(),df_test[['scikit predicted admit']].to_numpy())
df_accuracy = pd.DataFrame([[my_accuracy,scikit_accuracy]],columns=['MY ACCURACY','SCIKIT ACCURACY'])
df_accuracy.head()
Out[41]:
MY ACCURACY SCIKIT ACCURACY
0 0.7 0.725