1.1.0 Import Packages¶
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import explained_variance_score,r2_score
from sklearn import linear_model
2.1.0 Pre-Processing Raw Data¶
The following columns are dropped from the raw export: SALE TYPE, SOLD DATE, FAVORITE, INTERESTED, STATE, STATUS, NEXT OPEN HOUSE START TIME, NEXT OPEN HOUSE END TIME, SOURCE.
HOA/MONTH is also ignored, as it has too many (60) missing data points to be reliable.
# raw data
raw_df = pd.read_csv(r'data/RAW_redfin_2020-02-01-16-52-40.csv')
raw_df.shape
# modified data
df = pd.read_csv(r'data/MOD_redfin_2020-02-01-16-52-40.csv')
df.shape
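Before trusting any column, it's worth checking how much data is actually missing; a quick sketch (the column names are whatever the Redfin export contains):
# count missing values per column - HOA/MONTH was dropped above for its 60 missing entries
print(raw_df.isna().sum().sort_values(ascending=False))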
2.2.0 Measures of Central Tendencies¶
# some stats on the numerical features to get a general sense of the dataset
cols = ['PRICE','BEDS','BATHS','SQUARE FEET','$/SQUARE FEET','HOA/MONTH','DAYS ON MARKET']
mean = {}
std = {}
median = {}
dfprint = pd.DataFrame(index=cols)
for col in cols:
    std[col] = df[col].std()
    mean[col] = df[col].mean()
    median[col] = df[col].median()
dfprint['MEAN'] = mean.values()
dfprint['MEDIAN'] = median.values()
dfprint['STD DEVIATION'] = std.values()
dfprint.head(len(cols))
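As an aside, pandas can build the same summary table in a single call; a minimal equivalent sketch:
# one-liner equivalent: aggregate mean/median/std for each column
print(df[cols].agg(['mean','median','std']).T)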
2.3.0 Determining the viability of the features¶
The suggested feature space: BEDS, BATHS, SQUARE FEET, LOT SIZE.

| Original Feature space | Suggested Feature space |
|---|---|
| 27 | 4 |
A useful indicator might be the correlation matrix between the chosen features.
# correlation between the above features and price
df_temp = df[['PRICE','SQUARE FEET','LOT SIZE','ZIP OR POSTAL CODE']]
df = df[['BEDS','BATHS','SQUARE FEET','LOT SIZE','ZIP OR POSTAL CODE','PRICE']]
correlation = df.corr()
sns.heatmap(correlation,annot=True)
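Reading exact values off a heatmap can be fiddly; the same information is available directly, as in this small sketch:
# correlation of each feature with PRICE, strongest first
print(correlation['PRICE'].drop('PRICE').sort_values(ascending=False))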
2.3.1 Explaining the correlation matrix above¶
- (BEDS, SQUARE FEET) - very strong correlation.
- (BATHS, SQUARE FEET) - very strong correlation.
- (BEDS, BATHS) - a strong correlation, but BEDS and BATHS are not as strong indicators of each other as each is of SQUARE FEET.
- (SQUARE FEET, LOT SIZE) - the negative correlation makes sense: as SQUARE FEET grows, LOT SIZE likely shrinks. They are correlated, but not as strongly as the previous pairs, so we can probably keep both features.
- ZIP OR POSTAL CODE actually looks very promising in the correlation matrix - this is a prime candidate for further exploration.
2.4.0 Dealing with missing values¶
- SQUARE FEET - setting missing values to 0 - these are just empty land properties.
- LOT SIZE - setting missing values to 0 - these are zero-lot-line houses with structures extending almost to the edge of the property.
- ZIP CODE is next, and it doesn't have any missing data, so it seems to be our lucky day here.
# filling missing values with 0
cols = ['SQUARE FEET','PRICE','LOT SIZE','BEDS','BATHS']
for col in cols:
    df[col] = df[col].fillna(0)
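A quick sanity check (a sketch) that the fill actually removed every gap:
# confirm no missing values remain in the selected columns
assert df[cols].isna().sum().sum() == 0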
2.5.0 Exploring other features¶
ZIP CODE is a categorical feature. To make use of the ZIP CODE's, we can use something called one-hot encoding: each distinct ZIP CODE becomes a new feature, and each row marks 1 if it belongs to that ZIP CODE and 0 if it does not.
# one hot encoding the ZIP CODE feature
ohe_df = pd.get_dummies(df['ZIP OR POSTAL CODE'],prefix='ZIP',drop_first=True)
df = pd.concat([df,ohe_df],axis=1)
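To make the encoding concrete, here is a toy sketch with made-up ZIP values (hypothetical, not from the dataset):
# toy example: with drop_first=True the first category (98012) gets no column,
# so a row belongs to 98012 exactly when all ZIP_* columns are 0
toy = pd.Series([98012,98021,98012],name='ZIP')
print(pd.get_dummies(toy,prefix='ZIP',drop_first=True))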
2.5.1 Determining the viability of the new features¶
# correlation matrix of the one-hot-encoded features
correlation = df[['PRICE','BEDS','BATHS','SQUARE FEET','ZIP_98012','ZIP_98021','ZIP_98028','ZIP_98036']].corr()
sns.heatmap(correlation,annot=True)
df = df[['PRICE','BEDS','BATHS','SQUARE FEET','ZIP_98012','ZIP_98021','ZIP_98028','ZIP_98036']]
2.5.2 Explaining the above correlation matrix¶
Based on the heatmap above, the one-hot encoded ZIP features are kept alongside BEDS, BATHS, and SQUARE FEET in the working feature set.
2.6.0 Visualizing the data¶
# The pairwise plot of each of the features - giving us a sense of the spread of the data
sns.pairplot(df_temp,hue='ZIP OR POSTAL CODE',diag_kind='hist')
plt.show()
3.1.0 Splitting data sets¶
| Original Feature space | Final Feature space |
|---|---|
| 27 | 7 |

| Training Data | Validation Data | Test Data | Total Data Points |
|---|---|---|---|
| 60% | 20% | 20% | 120 |
# doing a randomized split of the dataset so as to reduce training bias
# shuffle once, then slice into disjoint subsets so no row lands in two splits
df_shuffled = df.sample(frac=1)
df_train = df_shuffled.iloc[:72].copy()
df_validation = df_shuffled.iloc[72:96].copy()
df_test = df_shuffled.iloc[96:].copy()
# Data point counts
print('Training Data : ',df_train.shape)
print('Validation Data : ',df_validation.shape)
print('Test Data : ',df_test.shape)
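For comparison, scikit-learn can produce the same disjoint 60/20/20 split; a sketch assuming a fixed random_state for reproducibility:
from sklearn.model_selection import train_test_split
# 60% for training, then split the remaining 40% in half for validation/test
df_tr, df_rest = train_test_split(df,train_size=0.6,random_state=42)
df_val, df_te = train_test_split(df_rest,test_size=0.5,random_state=42)
print(df_tr.shape,df_val.shape,df_te.shape)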
4.1.0 Linear Regression - Using The Normal Equation¶
# helper functions
def train(df,num_features,epsilon):
    # fit weights via the regularized normal equation (PRICE is column 0)
    column = np.full(df.shape[0],1.0)
    mat_xy = df.to_numpy()
    X = mat_xy[:,1:num_features+1]
    X = np.insert(X,0,column,axis=1) # accounting for w0
    Y = mat_xy[:,0]
    W = np.linalg.inv(X.T.dot(X) + epsilon*np.identity(X.shape[1])).dot(X.T).dot(Y)
    return W

def predict(df,w):
    # apply the learned weights to the feature columns
    column = np.full(df.shape[0],1.0)
    mat_xy = df.to_numpy()
    mat_xy = np.insert(mat_xy,1,column,axis=1) # accounting for w0
    return mat_xy[:,1:(len(w)+1)].dot(w)

def plot(df,feature):
    # scatter real vs predicted price against a single feature
    plt.plot(df[feature],df['PRICE'],'.',color='xkcd:azure',label='REAL PRICE')
    plt.plot(df[feature],df['PREDICTED PRICE'],'+',color='xkcd:orange',label='PREDICTED PRICE')
    plt.xlabel(feature)
    plt.ylabel('PRICE')
    plt.legend(loc='best')
    plt.show()
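For reference, the `train` helper above implements the closed-form normal-equation solution, with a small ridge term $\epsilon I$ added so that $X^{\top}X$ is always invertible:

$$W = \left(X^{\top}X + \epsilon I\right)^{-1}X^{\top}Y$$

Here $X$ is the feature matrix with a leading column of ones (accounting for $w_0$) and $Y$ is the vector of PRICE values.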
4.2.0 Linear Regression - Model Training¶
W = train(df_train,7,1e-32)
4.3.0 Linear Regression - Model Prediction¶
df_validation['PREDICTED PRICE'] = predict(df_validation,W)
df_validation.head()
4.4.0 Linear Regression - Model Validation against scikit¶We've finished training our model and running it on validation data, but if we don't have any standard to compare it with then we basically don't know if the model is working as it should
Let's use scikit-learn on the same data and check if got the correct values for the weights w
# cross checking my model against scikit-learn's Linear Regressor
regr = linear_model.LinearRegression()
regr.fit(df_train.iloc[:,1:].values,df_train.iloc[:,0].values)
# comparing feature weights
df_weights = pd.DataFrame(W[1:],columns=['MY WEIGHTS'])
df_weights['SCIKIT WEIGHTS'] = pd.DataFrame(regr.coef_,columns=['SCIKIT WEIGHTS'])
df_weights.head()
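The intercept deserves the same check; a short sketch comparing the bias term:
# comparing the bias/intercept term w0
print('MY INTERCEPT     : ',W[0])
print('SCIKIT INTERCEPT : ',regr.intercept_)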
4.4.1 Linear Regression - Model Testing¶
df_test['PREDICTED PRICE'] = predict(df_test,W)
df_test.head()
5.1.0 Model Results¶
Now we can plot the real PRICE and compare it with the PREDICTED PRICE, which is our prediction from the n-dimensional regression line.
# plotting features vs price comparison graphs
plot(df_test,'SQUARE FEET')
plot(df_test,'BEDS')
5.2.0 Model Performance and comparison¶
df_test.head()
# comparing r2 scores of my model vs scikit-learn
my_r2 = r2_score(df_validation.iloc[:,0],predict(df_validation,W))
scikit_r2 = r2_score(df_validation.iloc[:,0],regr.predict(df_validation.iloc[:,1:8]))
df_r2 = pd.DataFrame([[my_r2,scikit_r2]],columns=['MY R2 SCORE','SCIKIT R2 SCORE'])
df_r2.head()
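Since explained_variance_score was imported alongside r2_score, it can serve as a second metric on the same validation data; a sketch:
# explained variance as an additional check on both models
my_ev = explained_variance_score(df_validation.iloc[:,0],predict(df_validation,W))
scikit_ev = explained_variance_score(df_validation.iloc[:,0],regr.predict(df_validation.iloc[:,1:8]))
print('MY EXPLAINED VARIANCE     : ',my_ev)
print('SCIKIT EXPLAINED VARIANCE : ',scikit_ev)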
Data sourced from - Redfin - Homes in Bothell