152 - 200 Flashcards
numpy.info(object=None, maxwidth=76, output=None, toplevel='numpy')
Get help information for a function, class, or module.
np.info(np.polyval) 👉 polyval(p, x) Evaluate the polynomial p at x.
np.info('fft') 👉 * Found in numpy * Core FFT routines * Found in numpy.fft * fft(a, n=None, axis=-1) * Repeat reference found in numpy.fft.fftpack * Total of 3 references found.
numpy.ndarray.view([dtype][, type])
Returns a new view of the array that shares the same underlying data, optionally reinterpreted with a different dtype or array subclass.
a = np.arange(10, dtype='int16') print("a is: \n", a) 👉 [0 1 2 3 4 5 6 7 8 9]
v = a.view('int32') print("\n After using view() with dtype = 'int32' a is : \n", a) After using view() with dtype = 'int32' a is : 👉 [0 1 2 3 4 5 6 7 8 9]
v += 1 print("\n After using view() with dtype = 'int32' and adding 1 a is : \n", a) After using view() with dtype = 'int32' and adding 1 a is : 👉 [1 1 3 3 5 5 7 7 9 9]
numpy.r_
Translates slice objects to concatenation along the first axis. This is a simple way to build up arrays quickly.
np.r_['r', [1,2,3], [4,5,6]] 👉 matrix([[1, 2, 3, 4, 5, 6]])
np.r_['0,2,0', [1,2,3], [4,5,6]] 👉 array([[1],[2],[3],[4],[5],[6]])
np.r_['1,2,0', [1,2,3], [4,5,6]] 👉 array([[1, 4], [2, 5], [3, 6]])
a = np.array([[0, 1, 2], [3, 4, 5]]) np.r_['-1', a, a] # concatenate along last axis 👉 array([[0, 1, 2, 0, 1, 2], [3, 4, 5, 3, 4, 5]])
np.r_['0,2', [1,2,3], [4,5,6]] # concatenate along first axis, dim>=2 👉 array([[1, 2, 3], [4, 5, 6]])
numpy.c_
Translates slice objects to concatenation along the second axis.
np.c_[np.array([1,2,3]), np.array([4,5,6])] 👉 array([[1, 4], [2, 5], [3, 6]])
np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])] 👉 array([[1, 2, 3, 0, 0, 4, 5, 6]])
pandas.DataFrame.sum(axis=None, skipna=True, level=None, numeric_only=None, min_count=0, **kwargs)
Return the sum of the values over the requested axis.
idx = pd.MultiIndex.from_arrays([ ['warm', 'warm', 'cold', 'cold'], ['dog', 'falcon', 'fish', 'spider']], names=['blooded', 'animal']) s = pd.Series([4, 2, 0, 8], name='legs', index=idx) s.sum() 👉 14
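For a DataFrame, the axis argument picks the direction of the reduction; a minimal sketch with made-up values:
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
df.sum()        # column sums: a 3, b 30
df.sum(axis=1)  # row sums: 0 -> 11, 1 -> 22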
pandas.DataFrame.dtypes
Return the dtypes in the DataFrame. This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns.
df = pd.DataFrame({'float': [1.0], 'int': [1], 'datetime': [pd.Timestamp('20180310')], 'string': ['foo']}) df.dtypes 👉 float float64 int int64 datetime datetime64[ns] string object dtype: object
pandas.DataFrame.isna()
Detect missing values. Return a boolean same-sized object indicating if the values are NA.
df = pd.DataFrame(dict(age=[5, 6, np.NaN], born=[pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')], name=['Alfred', 'Batman', ''], toy=[None, 'Batmobile', 'Joker'])) df.isna() 👉 age born name toy 0 False True False True 1 False False False False 2 True False False False
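A common follow-up is counting the missing values per column; a short sketch on the same df:
df.isna().sum()   # age 1, born 1, name 0, toy 1 (the empty string is not NA)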
statsmodels.formula.api.ols(formula, data, subset=None, drop_cols=None, *args, **kwargs)
Performs ordinary least squares linear regression from an R-style formula; the fit() method of the returned model fits it to the data.
import pandas as pd import statsmodels.formula.api as smf df = pd.read_csv('headbrain1.csv') df.columns = ['Head_size', 'Brain_weight'] model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit() # model summary print(model.summary())
statsmodels.api.OLS(y, x)
statsmodels.api.OLS(y, x) — ordinary least squares linear regression.
statsmodels.api.add_constant — adds a constant (intercept) column to the regressors, giving the model a baseline term to start from.
- y : the variable which is dependent on x
- x : the independent variable
.fit() | .summary() | .params | .predict()
import pandas as pd import statsmodels.api as sm data = pd.read_csv('train.csv') x = data['x'].tolist() y = data['y'].tolist() x = sm.add_constant(x) # add the intercept column result = sm.OLS(y, x).fit() print(result.summary())
Multicollinearity
Occurs when two or more independent variables in a multiple regression model are highly correlated with one another. When features are highly correlated, it is difficult to distinguish their individual effects on the dependent variable; a common check is sketched below.
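A minimal variance inflation factor (VIF) sketch, using a made-up toy DataFrame of regressors; values well above roughly 5-10 hint at multicollinearity:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = pd.DataFrame({'x1': [1, 2, 3, 4, 5], 'x2': [2, 4, 6, 8, 11], 'x3': [5, 3, 8, 1, 2]})  # toy data
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
print(vif)  # x2 is nearly 2*x1, so both get large VIFs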
statsmodels.api.qqplot(data, line=None, ...) — Quantile-Quantile Plot
The plot compares the quantiles of a sample against the quantiles of a reference distribution (standard normal by default), giving a visual summary of whether the two distributions are similar in location and shape.
import numpy as np import statsmodels.api as sm import pylab as py data_points = np.random.normal(0, 1, 100) sm.qqplot(data_points, line='45') py.show()
seaborn.load_dataset(name, cache=True, data_home=None, **kws)
Load an example dataset from the online repository (requires internet).
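A quick sketch using the "tips" example dataset from seaborn's online repository:
import seaborn as sns
tips = sns.load_dataset("tips")  # downloaded on first use, then cached locally
print(tips.head())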
pandas_profiling.ProfileReport(df, **kwargs)
Generates an exploratory profiling report for the input DataFrame.
import pandas_profiling as pp data = pd.DataFrame(data_dict) # data_dict: a dictionary of column name -> values print(data)
# form the ProfileReport and save it as output.html profile = pp.ProfileReport(data) profile.to_file("output.html")
statsmodels.api.Logit()
Function for performing logistic regression: Logit() accepts y and X as parameters and returns a Logit model object; fit() then fits it to the data. Logistic regression is the type of regression analysis used to find the probability of a certain event occurring.
import pandas as pd import statsmodels.api as sm df = pd.read_csv('logit_train1.csv', index_col=0) Xtrain = df[['gmat', 'gpa', 'work_experience']] ytrain = df[['admitted']]
log_reg = sm.Logit(ytrain, Xtrain).fit() print(log_reg.summary())
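After fitting, the result object can score new observations; a minimal sketch, assuming a test file with the same columns (the file name is hypothetical):
dft = pd.read_csv('logit_test1.csv', index_col=0)   # hypothetical test file
Xtest = dft[['gmat', 'gpa', 'work_experience']]
pred = log_reg.predict(Xtest)        # predicted probabilities
print((pred > 0.5).astype(int))      # threshold into 0/1 class labels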
class.__dict__
contains all the attributes of the class. A dictionary or other mapping object used to store an object’s (writable) attributes.
class Shape(object): def __init__(self, **kwargs): self.__dict__.update(**kwargs)
class Circle(Shape): def __init__(self, **kwargs): super(Circle, self).__init__(**kwargs)
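A small sketch of what ends up in the instance dictionary with the classes above:
c = Circle(radius=2, color='red')
print(c.__dict__)        # {'radius': 2, 'color': 'red'}
print(Circle.__dict__)   # the class's own attributes (__init__, __doc__, ...)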
pandas.DataFrame.pivot(index=None, columns=None, values=None)
Return reshaped DataFrame organized by given index/column values. In short, it turns the unique values of one column into new columns, indexed by the values of another column.
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'], 'bar': ['A', 'B', 'C', 'A', 'B', 'C'], 'baz': [1, 2, 3, 4, 5, 6], 'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df.pivot(index='foo', columns='bar', values='baz') 👉 bar A B C foo one 1 2 3 two 4 5 6
df.pivot(index='foo', columns='bar')['baz'] 👉 bar A B C foo one 1 2 3 two 4 5 6
df.pivot(index='foo', columns='bar', values=['baz', 'zoo']) 👉 baz zoo bar A B C A B C foo one 1 2 3 x y z two 4 5 6 q w t
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
Used to sort the elements of an array into discrete bins. cut is mainly used to turn a continuous variable into a categorical one for statistical analysis.
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3) [(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ... Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
df= pd.DataFrame({'number': np.random.randint(1, 100, 10)}) df['bins'] = pd.cut(x=df['number'], bins=[1, 20, 40, 60, 80, 100])
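Bins can also be given readable labels; a small sketch on the same column (the label names are made up):
df['size'] = pd.cut(df['number'], bins=[1, 20, 40, 60, 80, 100], labels=['xs', 's', 'm', 'l', 'xl'])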
numpy.ones_like(a, dtype=None, order='K', subok=True, shape=None)
Return an array of ones with the same shape and type as a given array.
x = np.array([[0, 1, 2], [3, 4, 5]]) np.ones_like(x) 👉 array([[1, 1, 1], [1, 1, 1]])
y = np.arange(3, dtype=float) 👉 array([0., 1., 2.]) np.ones_like(y) 👉 array([1., 1., 1.])
pandas.DataFrame.corrwith(other, axis=0, drop=False, method='pearson')
Compute pairwise correlation. Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.
df1 = pd.DataFrame({"A":[1, 5, 7, 8], "B":[5, 8, 4, 3], "C":[10, 4, 9, 3]}) df2 = pd.DataFrame({"A":[5, 3, 6, 4], "B":[11, 2, 4, 3], "C":[4, 3, 8, 5]}) #To find the correlation among the columns of df1 and df2 along the column axis df1.corrwith(df2, axis = 0)
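Passing axis=1 correlates the two frames row by row instead of column by column; a small sketch on the same frames:
df1.corrwith(df2, axis=1)   # one correlation value per row label (0..3)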