House price prediction-Multivariate regression Flashcards
What is the difference between a Bunch and a dictionary?
Bunch (sklearn.utils.Bunch) is a subclass of dict and supports all the methods that dict does. In addition, it lets you access the keys as attributes, e.g. boston_data.data instead of boston_data['data'].
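A minimal sketch of how a dict subclass can expose its keys as attributes. AttrDict is a hypothetical stand-in written for illustration; it is not scikit-learn's actual Bunch implementation.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Attribute assignment stores the value under a key.
        self[name] = value


bunch = AttrDict(data=[[1.0, 2.0]], target=[24.0])  # toy values
same_object = bunch["target"] is bunch.target       # key and attribute access agree
bunch.feature_names = ["RM", "LSTAT"]
keys = sorted(bunch.keys())
print(keys)
```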
What should you find out before working with a dataset?
1-where the data comes from (source)
2-a brief description of the data
3-how big the data is
4-how many features the data has
5-the name and description of each feature
6-the units of the values
what is the function used to get information about all methods and attributes about an object?
dir()
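A small illustration, using the built-in dict type as the object being inspected. The filter on the leading underscore is just one possible way to hide special methods.

```python
attributes = dir(dict)  # names of every attribute and method of dict

# Keep only the public names (those without a leading underscore).
public = [name for name in attributes if not name.startswith("_")]
print(public)
```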
How do you create a DataFrame? How do you add an additional column to it?
# create a pandas DataFrame (assumes pandas is imported as pd and boston_data is the loaded Bunch)
data=pd.DataFrame(data=boston_data.data,columns=boston_data.feature_names)
data['PRICE']=boston_data.target
How do you access the first and last few rows of a DataFrame?
Use the head() and tail() methods; each returns 5 rows by default, or n rows with head(n) / tail(n).
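The same steps can be sketched without the Boston dataset. The numbers below are toy values made up for illustration, and the variable names features, feature_names and target are assumptions standing in for the Bunch fields.

```python
import pandas as pd

# Toy stand-ins for boston_data.data, .feature_names and .target.
features = [[0.1, 6.5], [0.2, 5.9], [0.3, 7.1]]
feature_names = ["CRIM", "RM"]
target = [24.0, 21.6, 34.7]

data = pd.DataFrame(data=features, columns=feature_names)
data["PRICE"] = target  # assigning to a new key appends a column

first_rows = data.head(2)  # first two rows
last_rows = data.tail(1)   # last row
print(first_rows)
print(last_rows)
```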
What does the count method do for a DataFrame?
count() returns the number of non-null values in each column.
How do you check for missing values in a DataFrame?
pd.isnull(data).any() - returns, for each column, True if any value is missing and False otherwise
CRIM False
ZN False
INDUS False
CHAS False
NOX False
RM False
AGE False
DIS False
RAD False
TAX False
PTRATIO False
B False
LSTAT False
PRICE False
data.info()
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null float64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null float64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 PRICE 506 non-null float64
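Since the Boston data has no missing values, a toy frame with one deliberately missing entry shows what these checks report. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"RM": [6.5, np.nan, 7.1], "PRICE": [24.0, 21.6, 34.7]})

non_null = toy.count()                   # non-null entries per column
has_missing = pd.isnull(toy).any()       # True where a column has at least one NaN
missing_per_column = toy.isnull().sum()  # how many values are missing per column
print(has_missing)
```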
What is the difference between float32 and float64?
A float32 takes up 32 bits (4 bytes) and a float64 takes up 64 bits (8 bytes).
float64 has more than double the precision: about 15-16 significant decimal digits versus about 7 for float32.
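A quick way to see the difference is to round-trip a number through both formats with the standard struct module. This is only a sketch of the storage sizes and rounding, not of how pandas stores columns.

```python
import struct

value = 0.1  # a Python float is a 64-bit double

size32 = struct.calcsize("<f")  # bytes used by a 32-bit float
size64 = struct.calcsize("<d")  # bytes used by a 64-bit float

# Pack into each format and unpack again to see what survives.
as_float32 = struct.unpack("<f", struct.pack("<f", value))[0]
as_float64 = struct.unpack("<d", struct.pack("<d", value))[0]
print(size32, size64, as_float32, as_float64)
```

The 32-bit round trip comes back slightly different from 0.1, while the 64-bit one returns the original value unchanged.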
how to plot a histogram using matplotlib?
plt.figure(figsize=(10,6))
plt.hist(data['PRICE'],ec='black',bins=50,color='#2196f3')
plt.grid(color='black',alpha=0.4)
plt.xlabel('Price in $1000s')
plt.ylabel('Number of houses')
plt.show()
What is the seaborn module?
It is a data visualization library built on matplotlib. It provides many types of statistical graphs and plots.
What is distplot? What are KDE and PDF?
distplot draws a histogram together with an estimate of the probability density function (PDF). It is deprecated in recent seaborn versions in favour of histplot and displot.
Kernel density estimation (KDE) is a way to estimate the PDF of a random variable from a sample.
Different kernels can be used: Gaussian, triangular, cosine, etc.
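To make the idea concrete, here is a minimal Gaussian KDE written from scratch. The prices and the bandwidth of 2.0 are arbitrary toy choices, and gaussian_kde is a hypothetical helper name, not seaborn's own routine.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump per sample, averaged and scaled by the bandwidth.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


prices = [21.0, 22.5, 23.0, 24.0, 35.0]  # toy values
density = gaussian_kde(prices, bandwidth=2.0)

# A density should integrate to about 1; check with a Riemann sum from 0 to 60.
step = 0.1
area = sum(density(i * step) * step for i in range(601))
print(round(area, 3))
```

The estimate is highest where the samples cluster (around 23) and low in the gap before the isolated value at 35.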
What does value_counts() do?
It returns each unique value in a column together with the number of times it occurs.
data['RAD'].value_counts() - all unique values and their counts
data['RAD'].value_counts().index - only the unique values
data['RAD'].value_counts().axes[0] - same result as above
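A short sketch on a toy column (the values are made up, not taken from the real RAD feature):

```python
import pandas as pd

rad = pd.Series([4, 24, 4, 5, 24, 24], name="RAD")  # toy values

freq = rad.value_counts()         # counts per unique value, most frequent first
unique_values = list(freq.index)  # just the unique values
print(freq)
```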
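What is the difference between a bar plot and a histogram?
Both show frequencies, but a histogram groups continuous values into bins, while a bar plot draws one bar per distinct value, so it suits discrete data such as RAD better.
freq=data['RAD'].value_counts()
plt.bar(freq.axes[0],height=freq)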
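How do the mean and median compare for a normal distribution and for other distributions?
For a symmetric distribution such as the normal, the mean and median coincide. For a skewed distribution they differ: for example, a few very rich people pull the mean income above the median.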
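A toy illustration with the standard statistics module; both lists are made-up numbers chosen so that one is symmetric and the other has a single large outlier.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
skewed = [20, 22, 25, 27, 30, 500]  # one very high income

sym_mean = statistics.mean(symmetric)
sym_median = statistics.median(symmetric)
skew_mean = statistics.mean(skewed)
skew_median = statistics.median(skewed)
print(sym_mean, sym_median, skew_mean, skew_median)
```

In the symmetric list the two measures agree, while the single outlier drags the mean of the skewed list far above its median.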
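How do you compute descriptive statistics in Python?
data['PRICE'].min() - minimum of a single column
data['PRICE'].max() - maximum of a single column
data.max() - returns the maximum of every column in the DataFrame
data.describe() - count, mean, std, min, quartiles and max for every column; the 50% row is the median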
What is correlation? What is its importance?
It measures the degree to which two variables move together.
-correlation can be positive or negative
e.g. temperature vs. ice cream eaten: positive correlation
ρ(X,Y) = corr(X,Y) = cov(X,Y) / (σ_X σ_Y)
It is a number between -1 and +1: +1 is perfect positive correlation,
-1 is perfect negative correlation, 0 is no linear correlation.
Features that are correlated with the target variable are good candidates for a machine learning model, because they carry information about the target.
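The Pearson coefficient can be computed by hand as covariance divided by the product of the standard deviations. The helper name pearson and the toy numbers below are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


temperature = [15, 20, 25, 30, 35]
ice_creams = [1, 2, 3, 4, 5]      # rises in step with temperature
heating = [50, 40, 30, 20, 10]    # falls in step with temperature

r_positive = pearson(temperature, ice_creams)
r_negative = pearson(temperature, heating)
print(r_positive, r_negative)
```

Because both toy relationships are exactly linear, the results sit at the two extremes of the range.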
how do you find the correlation in python?
data['PRICE'].corr(data['RM']) - correlation between two columns
data.corr() - correlation between every pair of columns
We look at both strength and direction to find the features best correlated with the target.
What is multicollinearity in machine learning?
-It occurs when two or more features used in a model are highly correlated with each other. They carry redundant (not unique) information, so the estimated coefficients become unreliable and hard to interpret.
e.g. using body fat and weight to estimate bone density
So a high correlation between two features, as opposed to between a feature and the target, is a warning sign of multicollinearity.
How do you mask the upper triangle, and how do you plot a heatmap?
mask=np.zeros_like(data.corr())
upper_trian_indices=np.triu_indices_from(mask)
mask[upper_trian_indices]=True
print(mask)
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
sns.set_style('white')  # set the style first so it applies to this plot
plt.figure(figsize=(15,10))
sns.heatmap(data.corr(),mask=mask,annot=True,cmap='coolwarm',annot_kws={"size":14})
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
The heatmap only shows cells where the mask is False (i.e. 0).
sns.set_style() sets the plot theme, including the background color; it applies to plots created after the call.
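The masking step can be checked on a small matrix without plotting anything. The 3x3 correlation matrix below is a toy example, and a boolean mask is used here as one possible variant of the float mask above.

```python
import numpy as np

# Toy symmetric correlation matrix.
corr = np.array([[1.0, 0.7, -0.4],
                 [0.7, 1.0, 0.2],
                 [-0.4, 0.2, 1.0]])

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True  # hide the diagonal and everything above it

visible = corr[~mask]  # the cells a masked heatmap would still draw
print(mask)
print(visible)
```

Only the entries below the diagonal remain, which removes the duplicate half of the symmetric matrix.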
What are the limitations of the Pearson correlation method?
It is symmetric, so it cannot tell which variable is dependent and which is independent; correlation does not imply causation, it only shows that two variables move together.
It is meant for continuous data, not for dummy (0/1) variables.
It only captures linear relationships.
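The last limitation is easy to demonstrate: below, y is fully determined by x, yet the covariance (the numerator of the Pearson coefficient) is zero. The data are a toy example.

```python
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # perfect, but non-linear, dependence on x

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Numerator of the Pearson coefficient; zero here means r is zero as well.
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
print(covariance)
```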
What is Anscombe’s Quartet?
Four datasets with very different distributions but nearly identical descriptive statistics.
It demonstrates the effect of outliers and other influential observations on statistical properties.
All four have the same correlation, about 0.816.
All four give nearly the same fitted regression line, yet the data look completely different; the effect of outliers and non-linear relationships only becomes apparent when the data are plotted.
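The claim can be checked numerically. The values below are the quartet as commonly reproduced from Anscombe's 1973 paper, and summarize is a hypothetical helper that fits a least-squares line and computes the correlation by hand.

```python
import math


def summarize(xs, ys):
    """Return (slope, intercept, r) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

results = [summarize(xs, ys) for xs, ys in quartet]
for slope, intercept, r in results:
    print(round(slope, 2), round(intercept, 2), round(r, 3))
```

The summary numbers barely differ between the four sets, which is why plotting them is the only way to see how different they are.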