College admit prediction - Logistic Regression


1.1.0 Import packages

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import seaborn as sns
In [2]:
# import dataset and one-hot encode the 'rank' feature
df = pd.read_csv("data/RAW_Admit.csv")
ohe_df = pd.get_dummies(df['rank'],prefix='rank',drop_first=True)   # drop_first avoids the dummy-variable trap
df = pd.concat([df,ohe_df],axis=1)
df.head()
Out[2]:
admit gre gpa rank rank_2 rank_3 rank_4
0 0 380 3.61 3 0 1 0
1 1 660 3.67 3 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 4 0 0 1
4 0 520 2.93 4 0 0 1
In [22]:
# correlation matrix of the one-hot encoded features (gre and the rank dummies)
correlation = df.drop(['admit','rank','gpa'],axis=1).corr()
sns.heatmap(correlation,annot=True)
Out[22]:
[seaborn heatmap of the correlation matrix between gre and the rank dummy variables]

2.1.0 Pre-Processing data

  • There are no missing values in the dataset - so we got lucky here
  • The interesting thing about the features gre, gpa and rank is that they are on drastically different scales
  • This means we need to normalize the features
  • Strictly speaking, logistic regression doesn't require normalization to be well-defined
  • The caveat, however, is that for a large-magnitude score $x = \text{w}^T\vec{x}$, $e^{-x}$ can overflow or underflow in floating point if we don't normalize the features
  • This means the sigmoid function becomes numerically unusable at those values of $x$
  • I've chosen to min-max normalize each feature using its min() and max()
  • I've chosen to normalize features inside the training and prediction functions themselves - this handles the case where the user forgets to pass normalized features (see the short sketch after this list)
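
To make the overflow point concrete, here's a quick standalone sketch (the ±800 scores are just illustrative; the gre values come from the dataset's head above) showing how an unnormalized score breaks the exponential while min-max scaling keeps things in a safe range:

import numpy as np

# raw scores on the gre scale: exp(800) overflows float64, so the sigmoid saturates
raw_scores = np.array([-800.0, 0.0, 800.0])
with np.errstate(over='ignore'):
    print(1/(1 + np.exp(-raw_scores)))      # [0.  0.5 1. ]

# min-max scaling maps a feature into [0, 1], keeping exp() well-behaved
gre = np.array([380.0, 660.0, 800.0, 640.0, 520.0])
gre_scaled = (gre - gre.min())/(gre.max() - gre.min())
print(gre_scaled)                           # all values lie in [0, 1]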

3.1.0 Splitting data sets

  • Final feature space: 7 columns
  • Data split: Training Data 70%, Validation Data 20%, Test Data 10% (400 total data points)
In [23]:
# doing a randomized split of the dataset so as to reduce training bias
# shuffle once, then slice into disjoint train / validation / test sets (70/20/10)

df_all = df.sample(frac=1)
n_train, n_validation = int((70/100)*df.shape[0]), int((20/100)*df.shape[0])
df_train = df_all.iloc[:n_train].copy()
df_validation = df_all.iloc[n_train:n_train + n_validation].copy()
df_test = df_all.iloc[n_train + n_validation:].copy()
In [24]:
# Data point counts

print('{0:15} : ({1},{2})'.format('Training Data',df_train.shape[0],df_train.shape[1]))
print('{0:15} : ({1},{2})'.format('Validation Data',df_validation.shape[0],df_validation.shape[1]))
print('{0:15} : ({1},{2})'.format('Test Data',df_test.shape[0],df_test.shape[1]))
Training Data   : (280,7)
Validation Data : (80,7)
Test Data       : (40,7)

4.1.0 Logistic Regression - Gradient Ascent - Maximizing the log-likelihood

  • The key idea behind logistic regression is the sigmoid function
  • Our model is
    • $h(\vec{x}) = \sigma(\vec{\text{w}}^T\vec{x})$
  • The key idea here is to figure out the weights $\vec{\text{w}}$
  • Our goal is to maximize the log-likelihood given by
    • $l(\text{w}) = \displaystyle\sum_{k=1}^m \hat{y}_k \ln(\sigma(\text{w}^T\hat{x}_k)) + (1-\hat{y}_k)\ln(1 - \sigma(\text{w}^T\hat{x}_k))$ , where $\text{w}, \hat{x}_k$ are vectors and $\hat{y}_k \in \{0,1\}$ is the label
    • The model learns by updating its weights using learning rate $\alpha$ as follows (the short derivation after this list shows where this update comes from)
    • $\text{w} = \text{w} + \alpha\displaystyle\sum_{k=1}^m(\hat{y}_k - \sigma(\text{w}^T\hat{x}_k))\hat{x}_k$
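
A quick check on where this update comes from, using the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

  • Let $z_k = \text{w}^T\hat{x}_k$ and $h_k = \sigma(z_k)$; differentiating $l(\text{w})$ gives $\dfrac{\partial l}{\partial \text{w}} = \displaystyle\sum_{k=1}^m \left(\dfrac{\hat{y}_k}{h_k} - \dfrac{1-\hat{y}_k}{1-h_k}\right)h_k(1-h_k)\hat{x}_k$
  • The factors cancel, leaving $\dfrac{\partial l}{\partial \text{w}} = \displaystyle\sum_{k=1}^m (\hat{y}_k - h_k)\hat{x}_k$
  • Stepping in the direction of this gradient with step size $\alpha$ - ascent, since we are maximizing - recovers the weight update above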
In [25]:
# helper functions
def sigmoid(_X):
    return 1/(1 + np.exp(-_X))

def gradient_descent(_X,_Y,_epochs,_alpha,_verbose):
    # min-max normalize each feature column (see section 2.1.0)
    for col in range(_X.shape[1]):
        _X[:,col] = (_X[:,col] - _X[:,col].min())/(_X[:,col].max() - _X[:,col].min())
        
    # initializing weights to 1.0 and accounting for the bias weight w0
    W = np.ones([_X.shape[1]+1,1])
    X = np.concatenate((np.ones([_X.shape[0],1]),_X),axis=1)
    
    # gradient ascent on the log-likelihood (despite the function's name)
    for i in range(_epochs):
        h = sigmoid(np.dot(X,W))
        W = W + _alpha*X.T.dot((_Y - h))
        if _verbose and i%1000 == 0:
            # per-sample cross-entropy, min-max scaled before averaging
            cost = (-_Y*np.log(h) - (1-_Y)*np.log(1-h))
            print('epoch( {0:4d}/{1:} ) loss_avg : {2}'.format(i,_epochs,((cost - cost.min())/(cost.max() - cost.min())).mean()))
    return W

def train(_X,_Y,_epochs,_alpha,_verbose=True):
    return gradient_descent(_X,_Y,_epochs,_alpha,_verbose)

def predict(_X,_W,show_prob=False):
    # min-max normalize with this batch's own min()/max(), mirroring what training did
    for col in range(_X.shape[1]):
        _X[:,col] = (_X[:,col] - _X[:,col].min())/(_X[:,col].max() - _X[:,col].min())
    # prepend the bias column of ones
    X = np.insert(_X,0,np.ones(_X.shape[0]),axis=1)
    
    if show_prob:
        return sigmoid(np.dot(X,_W))
    
    # round the probabilities to get hard 0/1 predictions
    return sigmoid(np.dot(X,_W)).round().astype(int)

def plot(df,feature):
    plt.plot(df[feature],df['admit'],'.',color='xkcd:azure',label='real admit')
    plt.plot(df[feature],df['predicted admit'],'+',color='xkcd:orange',label='predicted admit')
    plt.xlabel(feature)
    plt.ylabel('admit')
    plt.legend(loc='best')
    plt.show()
    
def pair_plot(df,f1,f2):
    # plots
    plt.scatter(df[df['admit']==1][f1], df[df['admit']==1][f2], s=10, label='admit')
    plt.scatter(df[df['admit']==0][f1], df[df['admit']==0][f2], s=10, label='not admit')
    plt.legend()
    plt.show()

4.2.0 Logistic Regression - Model Training

In [26]:
x_cols = ['gre','rank_2','rank_3','rank_4']
y_col = ['admit']
W = train(df_train[x_cols].to_numpy(),df_train[y_col].to_numpy(),20000,0.001,True)
df_train['predicted admit'] = predict(df_train[x_cols].to_numpy(),W)
epoch(    0/20000 ) loss_avg : 0.46142299532601927
epoch( 1000/20000 ) loss_avg : 0.2724538350706027
epoch( 2000/20000 ) loss_avg : 0.2689768182124872
epoch( 3000/20000 ) loss_avg : 0.26855544265727493
epoch( 4000/20000 ) loss_avg : 0.2685016694646699
epoch( 5000/20000 ) loss_avg : 0.26849476750003903
epoch( 6000/20000 ) loss_avg : 0.2684938810321483
epoch( 7000/20000 ) loss_avg : 0.26849376716845014
epoch( 8000/20000 ) loss_avg : 0.268493752542935
epoch( 9000/20000 ) loss_avg : 0.2684937506643213
epoch( 10000/20000 ) loss_avg : 0.26849375042301776
epoch( 11000/20000 ) loss_avg : 0.26849375039202283
epoch( 12000/20000 ) loss_avg : 0.26849375038804163
epoch( 13000/20000 ) loss_avg : 0.2684937503875302
epoch( 14000/20000 ) loss_avg : 0.2684937503874647
epoch( 15000/20000 ) loss_avg : 0.2684937503874562
epoch( 16000/20000 ) loss_avg : 0.2684937503874559
epoch( 17000/20000 ) loss_avg : 0.2684937503874559
epoch( 18000/20000 ) loss_avg : 0.2684937503874559
epoch( 19000/20000 ) loss_avg : 0.2684937503874559

4.3.0 Logistic Regression - Model Prediction

In [27]:
df_validation['predicted admit'] = predict(df_validation[x_cols].to_numpy(),W)
df_validation.head()
Out[27]:
admit gre gpa rank rank_2 rank_3 rank_4 predicted admit
186 0 560 3.61 3 0 1 0 0
225 1 720 3.50 3 0 1 0 0
302 1 400 3.15 2 1 0 0 0
54 0 660 3.34 3 0 1 0 0
309 0 440 2.98 3 0 1 0 0

4.4.0 Logistic Regression - Model Validation against scikit

  • We've finished training our model and running it on the validation data, but without a standard implementation to compare against we can't really tell whether the model behaves as it should

  • Let's run scikit-learn's LogisticRegression on the same data and compare the predictions

In [28]:
# setting up scikit-learn's logistic regression (L2-regularized by default, trained on the raw features)

clf = LogisticRegression(random_state=0).fit(df_train[x_cols].to_numpy(),df_train[y_col].to_numpy().ravel())
In [29]:
# generating scikit learn predictions - validation data

df_validation['scikit predicted admit'] = clf.predict(df_validation[x_cols].to_numpy())
df_validation.head()
Out[29]:
admit gre gpa rank rank_2 rank_3 rank_4 predicted admit scikit predicted admit
186 0 560 3.61 3 0 1 0 0 0
225 1 720 3.50 3 0 1 0 0 0
302 1 400 3.15 2 1 0 0 0 0
54 0 660 3.34 3 0 1 0 0 0
309 0 440 2.98 3 0 1 0 0 0

5.1.0 Model Results

  • We've finished training and validating our model - let's now actually look at the results graphically
In [30]:
plot(df_validation,'gre')
In [31]:
plot(df_validation,'gpa')
In [32]:
plot(df_validation,'rank')
In [33]:
pair_plot(df_validation,'gre','gpa')
In [35]:
df_validation.dtypes
Out[35]:
admit                       int64
gre                         int64
gpa                       float64
rank                        int64
rank_2                      uint8
rank_3                      uint8
rank_4                      uint8
predicted admit             int64
scikit predicted admit      int64
dtype: object

5.2.0 Model Performance and comparison

In [41]:
# test data predictions
df_test['predicted admit'] = predict(df_test[x_cols].to_numpy(),W)
df_test['scikit predicted admit'] = clf.predict(df_test[x_cols].to_numpy())

# accuracy scores (accuracy_score expects y_true first, then y_pred)
my_accuracy = accuracy_score(df_test[y_col].to_numpy(),df_test[['predicted admit']].to_numpy())
scikit_accuracy = accuracy_score(df_test[y_col].to_numpy(),df_test[['scikit predicted admit']].to_numpy())
df_accuracy = pd.DataFrame([[my_accuracy,scikit_accuracy]],columns=['MY ACCURACY','SCIKIT ACCURACY'])
df_accuracy.head()
Out[41]:
MY ACCURACY SCIKIT ACCURACY
0 0.7 0.725