1.1.0 Import Packages

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import explained_variance_score, r2_score
from sklearn import linear_model
2.1.0 Pre-Processing Raw Data

Columns dropped from the raw export:

- SALE TYPE
- SOLD DATE
- FAVORITE
- INTERESTED
- STATE
- STATUS
- NEXT OPEN HOUSE START TIME
- NEXT OPEN HOUSE END TIME
- SOURCE

HOA/MONTH - ignoring this feature as it has too many (60) missing data points to be reliable.

# raw data
raw_df = pd.read_csv(r'data/RAW_redfin_2020-02-01-16-52-40.csv')
raw_df.shape

# modified data
df = pd.read_csv(r'data/MOD_redfin_2020-02-01-16-52-40.csv')
df.shape
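The MOD file isn't generated in this notebook; below is a minimal sketch of how it could be derived from the RAW export, assuming the column names listed above. The `drop_cols` list and output path are illustrative, not the actual preprocessing script.

# hypothetical reconstruction of the MOD file: drop the columns listed above
drop_cols = ['SALE TYPE','SOLD DATE','FAVORITE','INTERESTED','STATE','STATUS',
             'NEXT OPEN HOUSE START TIME','NEXT OPEN HOUSE END TIME','SOURCE']
# HOA/MONTH stays in the file; it is simply not used as a model feature
mod_df = raw_df.drop(columns=drop_cols, errors='ignore')
mod_df.to_csv(r'data/MOD_redfin_2020-02-01-16-52-40.csv', index=False)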
2.2.0 Measures of Central Tendencies

# some stats on the numerical features to get a general sense of the dataset
cols = ['PRICE','BEDS','BATHS','SQUARE FEET','$/SQUARE FEET','HOA/MONTH','DAYS ON MARKET']
mean = {}
std = {}
median = {}
dfprint = pd.DataFrame(index=cols)
for col in cols:
    std[col] = df[col].std()
    mean[col] = df[col].mean()
    median[col] = df[col].median()
dfprint['MEAN'] = mean.values()
dfprint['MEDIAN'] = median.values()
dfprint['STD DEVIATION'] = std.values()
dfprint.head(len(cols))
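The same summary table can also be produced in a single call with pandas aggregation; an equivalent sketch:

# one-liner equivalent using pandas aggregation
df[cols].agg(['mean','median','std']).T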
2.3.0
Determining the viability of the features¶BEDS
BATHS
SQUARE FEET
LOT SIZE
Original Feature space | Suggested Feature space |
---|---|
27 | 4 |
A useful indicator might be the correlation matrix between the chosen features.
# correlation between the above features and price
df_temp = df[['PRICE','SQUARE FEET','LOT SIZE','ZIP OR POSTAL CODE']]
df = df[['BEDS','BATHS','SQUARE FEET','LOT SIZE','ZIP OR POSTAL CODE','PRICE']]
correlation = df.corr()
sns.heatmap(correlation,annot=True)
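Since it is the relationship with price that matters most for feature selection, it can help to pull out just that column of the matrix; a small sketch:

# correlation of each feature with PRICE, strongest first
correlation['PRICE'].drop('PRICE').sort_values(ascending=False)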
2.3.1 Explaining the correlation matrix above

- (BEDS, SQUARE FEET) - very strong correlation
- (BATHS, SQUARE FEET) - very strong correlation
- (BEDS, BATHS) - strong correlation, but not as significant: BEDS and BATHS are not as strong indicators of each other as SQUARE FEET is of either
- (SQUARE FEET, LOT SIZE) - correlated, but not as strongly as the previous cases; the negative correlation makes sense in that as SQUARE FEET grows, LOT SIZE likely shrinks, so we can probably keep both of these features
- ZIP OR POSTAL CODE actually looks very promising judging by its row of the correlation matrix - this is a prime candidate for further exploration

2.4.0 Dealing with missing values
- SQUARE FEET - setting missing values to 0 - these are just empty land properties
- LOT SIZE - setting missing values to 0 - these are zero-lot-line houses with structures almost at the edge of the property
- ZIP CODE is next, and it doesn't have any missing data, so it seems to be our lucky day here

cols = ['SQUARE FEET','PRICE','LOT SIZE','BEDS','BATHS']
for col in cols:
    # fill missing values with 0 (direct assignment avoids chained inplace pitfalls)
    df[col] = df[col].fillna(0)
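Before and after the fill, it's worth confirming which columns actually have gaps; a quick check:

# count missing values per column to verify the fill worked
df[cols].isna().sum()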
2.5.0 Exploring other features

ZIP CODE is categorical, so it can't be fed to the regression directly. To use the ZIP CODEs, we can use something called one-hot encoding: each distinct ZIP CODE becomes a new feature, and each row marks 1 if it belongs, or 0 if it does not belong, to that ZIP CODE.

# one hot encoding the ZIP CODE feature
ohe_df = pd.get_dummies(df['ZIP OR POSTAL CODE'],prefix='ZIP',drop_first=True)
df = pd.concat([df,ohe_df],axis=1)
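To make the encoding concrete, here is a tiny illustrative example on toy data (not from the dataset):

# toy example of one-hot encoding; drop_first=True drops one redundant column
toy = pd.Series([98012, 98021, 98012, 98036], name='ZIP OR POSTAL CODE')
pd.get_dummies(toy, prefix='ZIP', drop_first=True)
# columns: ZIP_98021, ZIP_98036; an all-zeros row encodes the dropped ZIP_98012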
2.5.1 Determining the viability of the new features

# correlation matrix of the one-hot encoded concat matrix
correlation = df[['PRICE','BEDS','BATHS','SQUARE FEET','ZIP_98012','ZIP_98021','ZIP_98028','ZIP_98036']].corr()
sns.heatmap(correlation,annot=True)
df = df[['PRICE','BEDS','BATHS','SQUARE FEET','ZIP_98012','ZIP_98021','ZIP_98028','ZIP_98036']]
2.5.2 Explaining the above correlation matrix

2.6.0 Visualizing the data

# The pairwise plot of each of the features - giving us a sense of the spread of the data
sns.pairplot(df_temp,hue='ZIP OR POSTAL CODE',diag_kind='hist')
plt.show()
3.1.0 Splitting data sets

| Original Feature space | Final Feature space |
| --- | --- |
| 27 | 7 |

| Training Data | Validation Data | Test Data | Total Data Points |
| --- | --- | --- | --- |
| 60% | 20% | 20% | 120 |

# doing a randomized split of the dataset so as to reduce training bias
# shuffle once, then slice, so the three sets cannot overlap
shuffled = df.sample(frac=1)
df_train = shuffled.iloc[:72].copy()
df_validation = shuffled.iloc[72:96].copy()
df_test = shuffled.iloc[96:].copy()

# Data point counts
print('Training Data : ',df_train.shape)
print('Validation Data : ',df_validation.shape)
print('Test Data : ',df_test.shape)
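A quick sanity check that the three splits are disjoint, using the DataFrame indices:

# verify the splits share no rows
assert len(df_train.index.intersection(df_validation.index)) == 0
assert len(df_train.index.intersection(df_test.index)) == 0
assert len(df_validation.index.intersection(df_test.index)) == 0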
4.1.0 Linear Regression - Using The Normal Equation

# helper functions
def train(df,num_features,epsilon):
    column = np.full(df.shape[0],1.0)
    mat_xy = df.to_numpy()
    X = mat_xy[:,1:num_features+1] # feature columns follow PRICE at column 0
    X = np.insert(X,0,column,axis=1) # accounting for w0
    Y = mat_xy[:,0] # PRICE is the target
    # solve the regularized normal equation for the weight vector
    W = np.linalg.inv(X.T.dot(X) + epsilon*np.identity(X.shape[1])).dot(X.T).dot(Y)
    return W

def predict(df,w):
    column = np.full(df.shape[0],1.0)
    mat_xy = df.to_numpy()
    mat_xy = np.insert(mat_xy,1,column,axis=1) # accounting for w0
    # columns 1..len(w) are [1, features]; PRICE at column 0 is skipped
    return mat_xy[:,1:(len(w)+1)].dot(w)

def plot(df,feature):
    plt.plot(df[feature],df['PRICE'],'.',color='xkcd:azure',label='REAL PRICE')
    plt.plot(df[feature],df['PREDICTED PRICE'],'+',color='xkcd:orange',label='PREDICTED PRICE')
    plt.xlabel(feature)
    plt.ylabel('PRICE')
    plt.legend(loc='best')
    plt.show()
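For reference, the closed form that train solves, with the leading column of ones supplying the intercept w0 and epsilon a small ridge term that keeps the matrix invertible, is:

$$W = \left(X^{T}X + \epsilon I\right)^{-1} X^{T} Y$$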
4.2.0 Linear Regression - Model Training

W = train(df_train,7,1e-32) # 7 features, tiny epsilon = 1e-32
4.3.0 Linear Regression - Model Prediction

df_validation['PREDICTED PRICE'] = predict(df_validation,W)
df_validation.head()
4.4.0 Linear Regression - Model Validation against scikit-learn

We've finished training our model and running it on the validation data, but if we don't have any standard to compare it with, then we basically don't know whether the model is working as it should.

Let's use scikit-learn on the same data and check whether we got the correct values for the weights w.

# cross checking my model against scikit-learn's Linear Regressor
regr = linear_model.LinearRegression()
regr.fit(df_train.iloc[:,1:].values,df_train.iloc[:,0].values)

# comparing feature weights
df_weights = pd.DataFrame(W[1:],columns=['MY WEIGHTS'])
df_weights['SCIKIT WEIGHTS'] = pd.DataFrame(regr.coef_,columns=['SCIKIT WEIGHTS'])
df_weights.head()
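The table above compares only the slope weights; the intercept can be checked the same way, since regr.intercept_ is scikit-learn's w0 and should match our W[0]:

# comparing the intercept term w0
print('MY INTERCEPT     : ', W[0])
print('SCIKIT INTERCEPT : ', regr.intercept_)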
4.4.1 Linear Regression - Model Testing

df_test['PREDICTED PRICE'] = predict(df_test,W)
df_test.head()
5.1.0 Model Results

We can't draw the n-dimensional regression line itself, so instead we plot each feature against the real PRICE and compare that with the PREDICTED PRICE from our model.

# plotting features vs price comparison graphs
plot(df_test,'SQUARE FEET')
plot(df_test,'BEDS')
5.2.0 Model Performance and comparison

df_test.head()

# comparing r2 scores of my model vs scikit-learn
my_r2 = r2_score(df_validation.iloc[:,0],predict(df_validation,W))
scikit_r2 = r2_score(df_validation.iloc[:,0],regr.predict(df_validation.iloc[:,1:8]))
df_r2 = pd.DataFrame([[my_r2,scikit_r2]],columns=['MY R2 SCORE','SCIKIT R2 SCORE'])
df_r2.head()
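Since explained_variance_score was imported at the top but never used, the same comparison can be extended with it; a small sketch:

# explained variance alongside R2 (the two agree when residuals have zero mean)
my_ev = explained_variance_score(df_validation.iloc[:,0],predict(df_validation,W))
scikit_ev = explained_variance_score(df_validation.iloc[:,0],regr.predict(df_validation.iloc[:,1:8]))
print('MY EV SCORE     : ', my_ev)
print('SCIKIT EV SCORE : ', scikit_ev)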
Data sourced from Redfin - Homes in Bothell