House price prediction-Multivariate regression Flashcards

Question 1

Q

what is the difference between bunch and dictionary?

Answer

A

Bunch is a subclass of Python's built-in dict and supports every method that dict does. In addition, it lets you access the keys as attributes (for example, boston_data.data instead of boston_data["data"]).
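
the attribute-style access can be illustrated with a toy class. this is only a minimal sketch and not scikit-learn's actual Bunch implementation; the name MiniBunch and the sample values are made up for illustration.

```python
class MiniBunch(dict):
    """Toy dict subclass whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails,
        # so dict methods such as keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


bunch = MiniBunch(data=[6.5, 7.1], feature_names=["RM"])  # hypothetical values
print(bunch["data"])        # dictionary-style access
print(bunch.data)           # attribute-style access, same object
print(list(bunch.keys()))   # ordinary dict methods still work
```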

Question 2

Q

what are the things to do before dealing with a dataset?

Answer

A

1-where does the data come from (source)
2-find a brief description of the data
3-how big is the data (number of datapoints)
4-how many features does the data have
5-the name and description of each feature
6-the units of each feature

Question 3

Q

which function is used to list all the methods and attributes of an object?

Question 4

Q

How to create a Dataframe? how will you add an additional column to that dataframe?

Answer

A

TODO create pandas dataframe

data=pd.DataFrame(data=boston_data.data,columns=boston_data.feature_names)

data["PRICE"]=boston_data.target

Question 5

Q

how to access first and last couple of data in a dataframe?

Answer

A

use the head() and tail() methods, e.g. data.head() and data.tail() (each returns 5 rows by default)

Question 6

Q

what does count method do for a dataframe?

Answer

A

the count method returns the number of non-null values in every column
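
a small sketch of how count() treats missing values, assuming pandas and numpy are installed; the three-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing RM value
frame = pd.DataFrame({"RM": [6.5, np.nan, 7.1], "AGE": [65.2, 78.9, 61.1]})

counts = frame.count()   # non-null values per column
print(counts)
print(len(frame))        # total number of rows, missing or not
```

the RM column is counted as 2 even though the frame has 3 rows, which is why count() is a quick check for missing data.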

Question 7

Q

How do you check for missing values in a dataframe?

Answer

A

pd.isnull(data).any() - returns False for a column with no missing data, True otherwise

CRIM False
ZN False
INDUS False
CHAS False
NOX False
RM False
AGE False
DIS False
RAD False
TAX False
PTRATIO False
B False
LSTAT False
PRICE False

data.info()

0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 PRICE 506 non-null float64

Question 8

Q

what is the difference between float32 and float64?

Answer

A

a float32 number takes up 32 bits of storage and a float64 number takes up 64 bits
float64 carries roughly twice as many significant decimal digits (about 15 to 16, versus about 7 for float32)
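
the size and precision difference can be checked with the standard library alone. this is a sketch that uses the struct module to round-trip 0.1 through each format; the variable names are illustrative.

```python
import struct

# Storage size in bytes: 4 bytes = 32 bits, 8 bytes = 64 bits
size32 = struct.calcsize("<f")
size64 = struct.calcsize("<d")

# Round-trip 0.1 through each format to see how much precision survives
as32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]
as64 = struct.unpack("<d", struct.pack("<d", 0.1))[0]

print(size32 * 8, size64 * 8)
print(repr(as32))   # drifts away from 0.1 after about 7 significant digits
print(repr(as64))   # Python floats are already 64-bit, so this is unchanged
```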

Question 9

Q

how to plot a histogram using matplotlib?

Answer

A

plt.figure(figsize=(10,6))
plt.hist(data['PRICE'],ec='black',bins=50,color='#2196f3')
plt.grid(color='black',alpha=0.4)
plt.xlabel('Price in $1000')
plt.ylabel('no:of houses')
plt.show()

Question 10

Q

what is seaborn module?

Answer

A

it is a data visualization library based on matplotlib. it provides many different types of statistical graphs and plots

Question 11

Q

what is distplot ? what is kde and pdf?

Answer

A

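distplot draws a histogram together with an estimate of the pdf (probability density function). in seaborn 0.11 and later distplot is deprecated in favour of histplot and displot

kernel density estimation (KDE) is a way to estimate the probability density function of a random variable.

different types of kernel can be used: gaussian kernel, triangular kernel, cosine kernel, etc.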
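
a gaussian KDE can be written by hand to see what it does: every sample contributes one small bell curve and the curves are averaged. this is a minimal sketch, not seaborn's implementation; the sample prices and the bandwidth of 3.0 are made up.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples`."""
    n = len(samples)

    def density(x):
        # Average of Gaussian bumps centred on each sample
        total = 0.0
        for s in samples:
            z = (x - s) / bandwidth
            total += math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        return total / (n * bandwidth)

    return density


prices = [21.0, 22.5, 23.1, 24.0, 35.4, 50.0]   # made-up values
pdf = gaussian_kde(prices, bandwidth=3.0)        # bandwidth picked by hand

# A density must integrate to about 1; check with a simple Riemann sum
step = 0.1
area = sum(pdf(i * step) * step for i in range(0, 1000))
print(round(area, 3))
```

a smaller bandwidth gives a bumpier curve and a larger one gives a smoother curve.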

Question 12

Q

what does value_counts() do?

Answer

A

returns every unique value in a column together with the number of times it occurs

data['RAD'].value_counts() - all unique elements and their counts
data['RAD'].value_counts().index - only the unique elements
data['RAD'].value_counts().axes[0] - same result as above
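
a small sketch of value_counts() on a made-up column, assuming pandas is installed. nunique() is included only to show the contrast with a plain count of distinct values.

```python
import pandas as pd

rad = pd.Series([4, 24, 5, 24, 4, 24], name="RAD")   # made-up values

freq = rad.value_counts()        # each distinct value with its frequency
print(freq)
print(list(freq.index))          # the distinct values, most frequent first
print(rad.nunique())             # how many distinct values there are
```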

Question 13

Q

what is the difference between a barplot and a histogram?

Answer

A

both look similar, but a barplot draws one bar per distinct value and does not require bins, so it represents discrete data (such as RAD) more faithfully

freq=data['RAD'].value_counts()
plt.bar(freq.axes[0],height=freq)

Question 14

Q

Interpret mean and median for normal and other distribution?

Answer

A

for a normal (symmetric) distribution the mean and median are equal. for a skewed distribution they differ, e.g. a few very rich people pull the mean income well above the median
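
a quick check with the standard library, using two made-up lists that differ only in their largest value:

```python
import statistics

symmetric = [10, 20, 30, 40, 50]
right_skewed = [10, 20, 30, 40, 500]   # one very high value

print(statistics.mean(symmetric), statistics.median(symmetric))
print(statistics.mean(right_skewed), statistics.median(right_skewed))
```

the median stays at 30 in both lists while the mean jumps from 30 to 120, so the median is the more robust summary when the data has a long tail.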

Question 15

Q

implement descriptive statistics in python?

Answer

A

data['PRICE'].min() - minimum of a single column
data['PRICE'].max() - maximum of a single column
data.max() - returns the maximum of every column in the dataframe
data.describe() - summary statistics for every column; the 50% row is the median

Question 16

Q

what is correlation? what is its importance?

Answer

A

it is the mutual relationship or connection between two things:
how two features move together
-it can be a positive or a negative correlation
temperature vs ice cream eaten - positive correlation

ρ(X,Y) = corr(X,Y)
it is a number between -1 and +1: +1 is a perfect positive correlation,
-1 is a perfect negative correlation, 0 is no linear correlation

it is important to include features that are correlated with the target variable in order to create a better machine learning model
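
the coefficient can be computed by hand to see where the -1 to +1 range comes from. this is a sketch of the Pearson formula (covariance divided by the product of the standard deviations); all the numbers are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


temperature = [20, 25, 30, 35]          # made-up values
ice_creams = [110, 135, 160, 185]       # rises with temperature
heating_bill = [80, 60, 40, 20]         # falls with temperature

print(pearson(temperature, ice_creams))
print(pearson(temperature, heating_bill))
```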

Question 17

Q

how do you find the correlation in python?

Answer

A

data['PRICE'].corr(data['RM'])

data.corr() - gives the correlation between all the features

we look for strength and direction to find the best correlated features

Question 18

Q

what is multi collinearity in machine learning?

Answer

A

-it occurs when two or more features used in a machine learning model are highly correlated with each other. the findings then become unreliable and hard to interpret because the features carry redundant information, i.e. both give us the same (not unique) information
eg: using body fat and weight to estimate bone density

hence a high correlation between two features can also signal multicollinearity if features are not chosen wisely

Question 19

Q

how do you mask the upper triangle and how do you access heatmap?

Answer

A

mask=np.zeros_like(data.corr())
upper_trian_indices=np.triu_indices_from(mask)
mask[upper_trian_indices]=True
print(mask)

[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]

plt.figure(figsize=(15,10))
sns.heatmap(data.corr(),mask=mask,annot=True,cmap=cm.coolwarm,annot_kws={"size":14})
sns.set_style('white')
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

heatmap only shows the cells where the mask is False (i.e. 0)
set_style is used to change the background style of the plot

Question 20

Q

limitation of pearson correlation method?

Answer

A

it cannot tell which variable is dependent and which is independent, because it is symmetric (it doesn't imply causation; it only shows that two variables move together)
it is only valid for continuous data, not for dummy variables
it can only detect linear relationships
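
the last limitation is easy to demonstrate: below, y is completely determined by x, yet the Pearson correlation is zero because the relationship is a parabola rather than a line. a sketch assuming numpy is installed.

```python
import numpy as np

x = np.arange(-5, 6)     # -5 ... 5, symmetric around zero
y = x ** 2               # y is fully determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))
```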

Question 21

Q

what is Anscombe’s Quartet?

Answer

A

it is a set of four datasets that have nearly identical descriptive statistics but very different distributions
it demonstrates the effect of outliers and other influential observations on statistical properties
all four of them have the same correlation value of 0.816

all of them give the same fitted regression line, but their distributions are completely different; the effect of outliers and non-linear relationships can only be identified by visualizing the data

Question 22

Q

what are TODOs with correlation?

Answer

A

TODO check the correlation with the target

Question 23

Q

what is jointplot and how to implement it in python?

Answer

A

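it is a combination of a scatter plot and histograms (one along each axis)

sns.set()
sns.set_style('white')
sns.jointplot(x=data['DIS'],y=data['NOX'],height=7,joint_kws={'alpha':0.5},kind='scatter')
plt.show()

set() - resets all the styling to the seaborn defaults
in current seaborn versions x and y are passed as keywords and the size parameter is called height
in seaborn, alpha is passed through joint_kws rather than as a direct argument
kind can be changed to hex, scatter, kde, etc.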

Question 24

Q

effect of outliers on machine learning model?

Answer

A

a single outlier can change the fitted model: the model tries to fit the outlier and may end up fitting the rest of the data poorly

Question 25

Q

What is the difference between lmplot and regplot?

Answer

A

lm - linear model

regplot() fits a simple linear regression model and plots it. lmplot() combines regplot() and FacetGrid

FacetGrid draws separate plots (or colours, via hue) for subsets of the data, e.g. one regression for male and one for female

sns.lmplot(x='TAX',y='RAD',data=data,height=7)
plt.show()

Question 26

Q

what does pairplot do?

Answer

A

it draws the scatter plots between every pair of features all at once,
helping us choose suitable features for the machine learning model

Question 27

Q

what is jupyter notebook micro benchmarking?

Answer

A

a microbenchmark is a small program that measures and tests the performance of a single component or task.
it reports properties like elapsed time, rate of operations, bandwidth, etc.

this can be used to find out which algorithm is faster

eg: the %%time cell magic in jupyter notebook reports how long a cell took to run
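
outside a notebook the same idea can be sketched with the standard timeit module. the two functions below are made-up candidates that compute the same sum in different ways; the timings depend on the machine.

```python
import timeit

def total_with_loop(n):
    total = 0
    for i in range(n):
        total += i
    return total

def total_with_builtin(n):
    return sum(range(n))

# Run each candidate 200 times and report the total elapsed seconds
loop_seconds = timeit.timeit(lambda: total_with_loop(10_000), number=200)
builtin_seconds = timeit.timeit(lambda: total_with_builtin(10_000), number=200)

print(f"loop:    {loop_seconds:.4f} s")
print(f"builtin: {builtin_seconds:.4f} s")
```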

Question 28

Q

how can you add regression line in pairplot ?

Answer

A

a regression line can be added by setting the kind parameter to 'reg'

sns.pairplot(data,kind='reg',plot_kws={'line_kws':{'color':'cyan'}})
plt.show()

Question 29

Q

what is explanatory and response variable ?

Answer

A

An explanatory variable is the expected cause; it explains the results (the chosen features).
A response variable is the expected effect; it responds to the explanatory variables (the target).

Question 30

Q

what is multivariable regression ?

Answer

A

multivariable regression is used to establish the relationship between a dependent variable and more than one independent variable

ŷ = θ_0 + θ_1 X_1 + θ_2 X_2 + θ_3 X_3 + … + θ_n X_n

log(price) = θ_0 + θ_1 RM + θ_2 NOX + θ_3 DIS + … + θ_13 LSTAT

where the sign of each θ depends on the type of relation (positive or negative)
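
the equation can be tried out on synthetic data where the true θ values are known. this is a sketch using numpy's least-squares solver rather than the scikit-learn or statsmodels models used elsewhere in these cards; all the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 features, known parameters
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, 0.5, -1.5, 3.0])   # theta_0 (intercept) first
y = true_theta[0] + X @ true_theta[1:] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so theta_0 is estimated with the rest
X_with_const = np.column_stack([np.ones(len(X)), X])
theta_hat, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

print(np.round(theta_hat, 2))
```

the estimated θ for the second feature comes out negative, matching the negative relation built into the data.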

Question 31

Q

write a python code to drop a column from a dataset ?

Answer

A

features=data.drop('PRICE',axis=1)

Question 32

Q

what is train test split?write code for it

Answer

A

from sklearn.model_selection import train_test_split
prices=data['PRICE']
features=data.drop('PRICE',axis=1)
X_train,X_test,y_train,y_test=train_test_split(features,prices,test_size=0.2,random_state=10)

train test split divides the data into one part for training the model and one part for testing it (here 80% and 20%)
random_state - seeds the shuffle that happens before the split, so the same split is produced on every run
it is important to shuffle the data so that similarity between adjacent datapoints does not bias the split
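
the role of the seed can be seen with a hand-rolled split. this is an illustration of the idea only and not how scikit-learn implements train_test_split; the function simple_split is hypothetical.

```python
import random

def simple_split(rows, test_size, seed):
    """Shuffle a copy of rows with a fixed seed, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))                       # stand-in for 10 data points
train_a, test_a = simple_split(rows, test_size=0.2, seed=10)
train_b, test_b = simple_split(rows, test_size=0.2, seed=10)

print(train_a, test_a)
print(train_a == train_b)                    # same seed, same split
```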

Question 33

Q

write a python code to print out intercept and coefficient after training the model

Answer

A

# regr is the model already fitted on X_train and y_train
print('Intercept : ',regr.intercept_)
pd.DataFrame(data=regr.coef_,index=X_train.columns,columns=['Coef'])

Question 34

Q

what do you mean by the term skew?

Answer

A

skew is when the distribution has a longer tail on one side than on the other
when the tail extends towards higher values (the right side) the data is said to be right skewed; when it extends towards lower values (the left side) it is left skewed
skew measures this asymmetry between the two sides; for a normal distribution skew is 0
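
skew can be computed by hand as the third central moment divided by the cube of the standard deviation. this is a sketch of the simple population formula; pandas' skew() applies a bias correction, so its numbers differ slightly. the lists are made up.

```python
def skewness(values):
    """Population skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 50]    # long tail towards high values
left_tail = [-50, 2, 3, 4, 5]    # long tail towards low values

print(skewness(symmetric))
print(skewness(right_tail))
print(skewness(left_tail))
```

the symmetric list gives 0, the right-tailed list a positive value and the left-tailed list a negative value.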

Question 35

Q

what is the need for transforming the data before applying the algorithm ? how can you implement it ?

Answer

A

skew should be reduced because we are fitting a linear regression model, and a skewed target could make the data deviate from a linear pattern
to reduce skew we can use a log transformation, i.e. take the log of every house price
ln(7) ≈ 1.95, a reduction of 5.05
ln(50) ≈ 3.91, a reduction of 46.09
large prices are reduced by a far greater amount than smaller prices, which pulls in the long tail and reduces skew

the log function of numpy computes the natural logarithm of every value in an array
prices=np.log(arr)
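
the arithmetic for the two prices used above can be reproduced with the standard library:

```python
import math

prices = [7, 50]   # in thousands of dollars, as in the example above

for price in prices:
    log_price = math.log(price)          # natural logarithm
    print(price, round(log_price, 2), round(price - log_price, 2))

# The gap between the two prices shrinks from 43 to about 1.97
gap_before = prices[1] - prices[0]
gap_after = math.log(prices[1]) - math.log(prices[0])
print(gap_before, round(gap_after, 2))
```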

Question 36

Q

What are p values of regression coefficient?

Answer

A

p values and regression coefficients work together to tell you which relationships in your model are statistically significant

p values help to determine whether the relationship you observed in the sample also exists in the larger population

it is common practice to use the coefficient p values when deciding whether to include variables in the final model.

when a coefficient's p value is greater than the significance level (commonly 0.05), there is insufficient evidence in your sample to conclude that a non-zero relationship exists.

on the other hand, if the p value is less than the significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population

null hypothesis - the assumption that no statistical relationship exists between the feature and the target (i.e. the coefficient is zero)

Question 37

Q

How do you find the pvalues of regression coefficient?

Answer

A

OLS - ordinary least squares. the fitted scikit-learn linear regression does not report p values, so we fit the model with statsmodels OLS instead

import statsmodels.api as sm

X_incl_const=sm.add_constant(X_train)
model=sm.OLS(y_train,X_incl_const)
results=model.fit()
pd.DataFrame({'Coef':results.params,'Pvalues':round(results.pvalues,4)})

Question 38

Q

what are symptoms of multicollinearity?

Answer

A

-loss of reliability
-high variability in θ estimates (if we change any feature then the θ varies abruptly)
-strange findings

Question 39

Q

what is VIF?

Answer

A

VIF-variance inflation factor
it is a factor used to find multicollinearity in features

Question 40

Q

How is VIF calculated?

Answer

A

to find the VIF of a feature, say TAX:

we first fit a regression model with TAX as the target and the other features as predictors
TAX = α_0 + α_1 RM + α_2 NOX + … + α_12 LSTAT
VIF_TAX = 1 / (1 - R²_TAX)
where R²_TAX is the r-squared (coefficient of determination) of the TAX model

if the VIF of a variable is greater than 10, it is considered problematic (strongly collinear with the other features)
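
the calculation can be sketched with numpy on synthetic features. this follows the definition directly (regress one feature on the others, then apply 1 / (1 - R²)) and is not the statsmodels helper; the three features are made up so that two of them are nearly identical.

```python
import numpy as np

def vif(features, column):
    """VIF of one column: regress it on the others, then 1 / (1 - R^2)."""
    target = features[:, column]
    others = np.delete(features, column, axis=1)
    design = np.column_stack([np.ones(len(others)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1 - residuals.var() / target.var()
    return 1 / (1 - r_squared)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b
features = np.column_stack([a, b, c])

print([round(vif(features, i), 1) for i in range(3)])
```

the two near-duplicate features get a VIF far above 10 while the unrelated one stays close to 1.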

Question 41

Q

how do you round a number using numpy?

Answer

A

np.around(num,2)

Question 42

Q

what is Bayesian information criterion (BIC) ?

Answer

A

It measures how well a parameterized model fits the data while accounting for the size of the model.

It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.

we compute it for different candidate models and choose the model with the lowest BIC
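
the trade-off between fit and complexity can be sketched with one common form of BIC for least-squares models, n·ln(RSS/n) + k·ln(n). for a fixed number of observations this should differ from the value statsmodels reports only by a constant, so the ranking of models is the same. the RSS values and sample size below are made up.

```python
import math

def bic_gaussian(rss, n_obs, n_params):
    """BIC for a least-squares fit, dropping constants that depend only on n_obs."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


n_obs = 404                                   # hypothetical training-set size
full = bic_gaussian(rss=14.0, n_obs=n_obs, n_params=14)     # made-up RSS
reduced = bic_gaussian(rss=14.1, n_obs=n_obs, n_params=12)  # made-up RSS

print(round(full, 1), round(reduced, 1))
print("prefer reduced model" if reduced < full else "prefer full model")
```

here dropping two parameters barely changes the fit, so the smaller penalty wins and the reduced model has the lower BIC.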

Question 43

Q

briefly explain how you can effectively use correlation, multicollinearity and p values to remove a feature?

Answer

A

correlation with target -> correlation with other features -> p value
but before removing a feature we have to compare the models with and without it using BIC

Question 44

Q

How do you check multicollinearity by observing variability in coefficient after running BIC for different model?

Answer

A

frames=[org_coef,orgcoef_minus_indus,reduced_coef]
pd.concat(frames,axis=1)

NaN (not a number) appears in a column for any feature that was dropped from that model

Question 45

Q

what can you conclude from patterns in residuals after prediction?

Answer

A

after prediction the residuals should be random with no pattern; if a pattern is present, our model can be improved

for a good model the residuals look like a cloud, with most of them centered around zero

the residuals should be roughly normally distributed with mean 0 (a standard normal distribution has mean=0 and standard deviation=1); this can be used to check the residuals

Question 46

Q

what is the code to find the residuals from statsmodels?

Question 47

Q

can you use correlation on fitted and actual values? what would the graph of actual vs fitted values look like?

Answer

A

yes, correlation can be computed between the actual and predicted values to see how closely they agree, i.e. how much of the actual data our model was able to fit

for a perfect fit the points lie on a straight line (actual = predicted)

Question 48

Q

what does a good residual vs prediction graph look like? how do you check this?

Answer

A

normal distribution
for normally distributed residuals
mean=0
skew=0
a distribution plot (distplot or displot) can be used to see whether the residuals are normally distributed

Question 49

Q

what happens when you omit important key features? what will the graphs of residual vs prediction and actual vs prediction look like?

Answer

A

when we leave out important features we start to see clusters (linear patterns) in these graphs

Question 50

Q

how to access mse and rsquared from statsmodels? explain their units and what information is obtained from them?

Answer

A

results.mse_resid
results.rsquared

rsquared is a number b/w 0 and 1 - it shows the proportion of the variance in the target that the model explains
mse has the unit of the target squared, e.g. an mse of 19.6 with prices in thousands of dollars means 19.6 (thousand dollars) squared, not an error of 19.6 thousand dollars

write out the mse equation to see where that unit comes from
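
both quantities can be computed by hand from actual and predicted values. this sketch divides by the number of datapoints; statsmodels' mse_resid divides by the residual degrees of freedom instead, so its value is slightly larger. all the numbers are made up.

```python
actual = [24.0, 21.6, 34.7, 33.4, 36.2]      # made-up prices in $1000s
predicted = [25.0, 22.0, 33.0, 32.0, 35.0]   # made-up model output

n = len(actual)
residuals = [a - p for a, p in zip(actual, predicted)]

mse = sum(r ** 2 for r in residuals) / n     # units: ($1000s) squared
mean_actual = sum(actual) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot              # unitless

print(round(mse, 3), round(r_squared, 3))
```

because every residual is squared before averaging, the mse ends up in squared units, while r-squared is a ratio of two sums of squares and has no unit.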

Question 51

Q

how do you give a range of error for your predictions? visualize standard deviation in the normal distribution? what is RMSE? what is its unit?

Answer

A

RMSE - root mean square error, the square root of the mse
for a normal distribution 68% of the values lie b/w -1σ and 1σ
95% of the values lie b/w -2σ and 2σ

if we take our prediction as the mean, then -2σ to 2σ can be considered the 95% error range, with σ estimated by the RMSE

RMSE gives the range for 1 standard deviation
RMSE has the same unit as the target
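
the 68% and 95% figures can be checked on simulated residuals. this is a sketch using the standard library; the residuals are drawn from a normal distribution with a made-up spread of 0.2.

```python
import math
import random

rng = random.Random(0)
# Simulated residuals: normally distributed around 0 with spread 0.2
residuals = [rng.gauss(0, 0.2) for _ in range(10_000)]

rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

within_1 = sum(abs(r) <= rmse for r in residuals) / len(residuals)
within_2 = sum(abs(r) <= 2 * rmse for r in residuals) / len(residuals)

print(round(rmse, 3))
print(round(within_1, 3), round(within_2, 3))   # close to 0.68 and 0.95
```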

Question 52

Q

how to access RMSE in python? how do you find upper and lower bound limits?

Answer

A

estimate=$30000

np.sqrt(mse)

upper=np.log(30)+2*np.sqrt(reduced_log_mse)
lower=np.log(30)-2*np.sqrt(reduced_log_mse)
print('upper bound in log prices :',upper)
print('upper bound in normal prices :',np.e**upper*1000)
print('lower bound in log prices :',lower)
print('lower bound in normal prices :',np.e**lower*1000)

Question 53

Q

what is docstring?how do you add it?

Answer

A

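a docstring is a string literal placed as the first statement of a function, class or module; it documents the object and is shown by help(). it is added using triple quotes right after the def line

def get_dollar_estimate(rm,ptratio,chas=False,large_range=True):
    """Estimate the price of property in boston

    Keyword arguments:
    rm--no:of rooms in the property
    ptratio--no:of students per teacher
    chas--True if the property is near charles river,False otherwise
    large_range--True for a 95% prediction interval,False for a 68% prediction interval

    """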