Data Science Flashcards

1
Q

Radial basis function (RBF) kernel

A
  • A radial basis function is a real-valued function whose value depends only on the distance from the origin, i.e. φ(x) = φ(‖x‖). It can also be defined by the distance from some center point c, i.e. φ(x, c) = φ(‖x − c‖). Any function satisfying φ(x) = φ(‖x‖) is called a radial function. The norm is usually the Euclidean distance, though other distance functions can also be used. Sums of radial basis functions can be used to approximate a given function, and this approximation process can be viewed as a simple neural network.
  • from sklearn.svm import SVR; regressor = SVR(kernel='rbf')
  • The most important SVR parameter is the kernel type. It can be linear, polynomial, or Gaussian; the Gaussian kernel is 'rbf'.
  • https://en.wikipedia.org/wiki/Radial_basis_function
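A minimal numpy sketch of what the Gaussian (RBF) kernel computes; the sample point, center, and gamma are made-up values:

import numpy as np

# Gaussian RBF: phi(x, c) = exp(-gamma * ||x - c||^2)
def rbf(x, c, gamma=1.0):
    # the value depends only on the distance ||x - c||, not on direction
    return np.exp(-gamma * np.linalg.norm(x - c) ** 2)

x = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])  # hypothetical center point
print(rbf(x, c))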
2
Q

Radial basis function network

A
3
Q
  1. Standardization
    1. Formula
    2. Python code
  2. Normalization (min-max scaling)
    1. Formula
    2. Python code
A
  1. Standardization subtracts each attribute's (column's) mean from the data and then divides by the standard deviation. Geometrically, this first shifts the axis origin onto the column mean and then rescales: a translation followed by a scaling. Afterwards, the data in each attribute (column) cluster around 0 with variance 1. The computation is performed separately for each attribute/column.
    1. (X - mean) / std
    2. from sklearn.preprocessing import StandardScaler
  2. Normalization is purely a scaling operation. It treats the distance between an attribute's (column's) maximum and minimum as one unit, then expresses each x as the fraction of that distance lying between x and the minimum, so the result is always a percentage between 0 and 1.
    1. (X - min) / (max - min)
    2. from sklearn.preprocessing import MinMaxScaler
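A minimal sketch of both scalers on a made-up single-column array:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_std = StandardScaler().fit_transform(X)   # per column: (X - mean) / std
X_mm = MinMaxScaler().fit_transform(X)      # per column: (X - min) / (max - min)
print(X_std.ravel())   # centered on 0, unit variance
print(X_mm.ravel())    # values in [0, 1]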
4
Q

sklearn.preprocessing

  1. fit()
  2. transform()
  3. fit_transform()
  4. inverse_transform()
A
  1. fit(): calculates the parameters μ and σ and saves them as internal objects. In other words, it learns the properties inherent to the training set X: its mean, variance, maximum, minimum, and so on.
  2. transform(): uses these calculated parameters to apply the transformation to a particular dataset.
    Building on fit(), it performs the standardization, dimensionality reduction, normalization, etc. (depending on the tool in use, e.g. PCA or StandardScaler).
  3. fit_transform(): joins the fit() and transform() methods for transformation of a dataset.
    fit_transform is the combination of fit and transform: it both learns the parameters and applies the transformation.
    transform() and fit_transform() both apply some uniform treatment to the data, such as standardization to N(0, 1), scaling (mapping) to a fixed interval, normalization, or regularization.
  4. inverse_transform(): converts the transformed (e.g. standardized) data back to the original data.
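A minimal sketch of all four methods using StandardScaler on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.5]])

scaler = StandardScaler()
scaler.fit(X_train)                              # learn the mean and std of X_train
X_train_std = scaler.transform(X_train)          # apply (X - mean) / std
# one-step equivalent: X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)            # reuse the training-set parameters
X_back = scaler.inverse_transform(X_train_std)   # recover the original values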
5
Q
  1. reshape(1,-1)
  2. reshape(2,-1)
  3. reshape(-1,1)
  4. reshape(-1,2)
  5. reshape(N, -1)
  6. reshape(-1, N)
  7. -1
  8. numpy.arange(a,b,c)
A
  1. reshape(1,-1): reshape into 1 row
  2. reshape(2,-1): reshape into 2 rows
  3. reshape(-1,1): reshape into 1 column
  4. reshape(-1,2): reshape into 2 columns
  5. reshape(N, -1): fix N rows; the number of columns is determined automatically
  6. reshape(-1, N): fix N columns; the number of rows is determined automatically
  7. This is exactly the role of -1: compute the missing dimension d automatically as d = (total number of elements in the array or matrix) / c; d must be an integer, otherwise an error is raised
  8. numpy.arange(a,b,c): generate an array starting at a, stepping by c, and stopping before b
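A quick sketch of these calls on a 10-element array (resulting shapes in the comments):

import numpy as np

a = np.arange(0, 10, 1)   # array([0, 1, ..., 9]); the stop value 10 is excluded
print(a.reshape(1, -1).shape)   # (1, 10): one row
print(a.reshape(2, -1).shape)   # (2, 5): two rows, columns inferred
print(a.reshape(-1, 1).shape)   # (10, 1): one column
print(a.reshape(-1, 2).shape)   # (5, 2): two columns, rows inferred
# a.reshape(3, -1) would raise an error because 10 is not divisible by 3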
6
Q

support vector regression (SVR)

A

https://www.saedsayad.com/support_vector_machine_reg.htm

  • The Support Vector Machine can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. Because the output is a real number with infinitely many possible values, an exact fit is very difficult to achieve; instead, a margin of tolerance (epsilon) is set around the prediction, and only errors outside that margin count. The main idea is always the same: minimize error by individualizing the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
  • In machine learning, support vector machines are supervised learning models, with associated learning algorithms, that analyze data for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall.
  • In addition to performing linear classification, SVMs can efficiently perform nonlinear classification using the so-called kernel trick, implicitly mapping their inputs into a high-dimensional feature space. When the data are unlabeled, supervised learning is impossible and an unsupervised approach is required, one that tries to find a natural clustering of the data into groups and then maps new data onto the groups it has formed. The clustering algorithm that adapts support vector machines is called support vector clustering [2]; when data are unlabeled or only partially labeled, it is often used in industrial applications as a preprocessing step for classification.
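A minimal SVR sketch on the position/salary-style data used in the later cards; feature scaling is usually recommended before SVR but is omitted here for brevity:

import numpy as np
from sklearn.svm import SVR

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

regressor = SVR(kernel='rbf')
regressor.fit(X, y)
print(regressor.predict([[6.5]]))   # prediction for a new position level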
7
Q

simple regression

A
8
Q

multiple linear regression

9
Q

polynomial regression

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

A

Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x).

https://www.geeksforgeeks.org/python-implementation-of-polynomial-regression/
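A minimal sketch for the dataset above using PolynomialFeatures with LinearRegression; degree=4 is an arbitrary illustrative choice:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

poly = PolynomialFeatures(degree=4)         # expand X into [1, x, x^2, x^3, x^4]
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.predict(poly.transform([[6.5]])))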

10
Q

support vector regression (SVR)

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

11
Q

Decision Tree regression

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

A

Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
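A minimal sketch for the dataset above; note that a tree predicts a piecewise-constant function, so single-feature predictions look like steps:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
print(regressor.predict([[6.5]]))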

12
Q

Random Forest Regression

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

A

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
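A minimal sketch for the dataset above; n_estimators=10 is an arbitrary illustrative choice:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)
print(regressor.predict([[6.5]]))   # mean of the individual trees' predictions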

13
Q

logistic regression

  • a solution for classification
  • logistic function
A

https://christophm.github.io/interpretable-ml-book/logistic.html

  • In statistics, the logistic model is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.
  • Why linear regression is not suitable for classification:
    • Linear regression does not produce probabilities; its predicted values can be less than 0 or greater than 1.
    • The distribution of the data affects the fitted line, so yes/no questions (is this tumor benign? is there a dog in this picture?) call for classification instead.
  • Theory:
    • Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as: logistic(η) = 1 / (1 + exp(−η)).
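A tiny numpy sketch of the squashing step; the weights, bias, and input are made-up:

import numpy as np

def logistic(eta):
    # squeezes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

w, b = np.array([0.8, -0.4]), 0.1   # hypothetical linear-model parameters
x = np.array([2.0, 1.0])
eta = w @ x + b                      # raw linear output, any real number
print(logistic(eta))                 # a probability between 0 and 1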
14
Q

Confusion matrix

A

https://www.geeksforgeeks.org/confusion-matrix-machine-learning/

A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix: it shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors being made.
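A minimal sklearn sketch with made-up label vectors:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# rows = actual class, columns = predicted class; for binary labels:
# [[TN FP]
#  [FN TP]]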

15
Q

Supervised Learning

A
  • Definition: In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
    • We have a data set
    • We know what the correct output should look like
    • We have an idea of the relationship between the input and the output
  • Categories:
    • Regression: to predict results within a continuous output
    • Classification: to predict results in a discrete output
16
Q

Unsupervised learning

A
  • Definition: Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.
    • We can derive this structure by clustering the data based on relationships among the variables in the data.
    • With unsupervised learning there is no feedback based on the prediction results.
  • Applications:
    • Organize computing clusters
    • Social network analysis
    • Market segmentation
    • Astronomical data analysis
17
Q

Logistic Regression

Python code

A
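Presumably the card wants the standard sklearn workflow; a minimal sketch using the built-in iris data as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

classifier = LogisticRegression(max_iter=200)   # raise max_iter so the solver converges
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classifier.score(X_test, y_test))         # mean accuracy on the test set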
18
Q

k-nearest neighbors algorithm

A

https://www.saedsayad.com/k_nearest_neighbors.htm

https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN was already in use in statistical estimation and pattern recognition by the beginning of the 1970s as a non-parametric technique.
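A minimal sketch with sklearn, again using iris as a stand-in; n_neighbors=5 is the sklearn default:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # classify by majority vote of the 5 nearest points
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))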

19
Q

SVM and Kernel SVM

A

https://towardsdatascience.com/svm-and-kernel-svm-fed02bef1200

  • SVM is one of the top AI algorithms; it deals well with nonlinear and high-dimensional data
  • It is a supervised learning algorithm
  • It is mostly used for classification, but it can also be used for regression.
  • The main idea is that based on the labeled data (training data) the algorithm tries to find the optimal hyperplane which can be used to classify new data points. In two dimensions the hyperplane is a simple line.
  • As an example, lets consider two classes, apples and lemons.
    • Other algorithms will learn the most evident, most representative characteristics of apples and lemons, like apples are green and rounded while lemons are yellow and have elliptic form.
    • In contrast, SVM will search for apples that are very similar to lemons, for example apples which are yellow and have elliptic form. This will be a support vector. The other support vector will be a lemon similar to an apple (green and rounded). So other algorithms learn the differences while SVM learns similarities.
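A minimal sketch contrasting a linear and an RBF-kernel SVM, with iris as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))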
20
Q

Logistic Regression

A
21
Q

K-Nearest Neighbors

A
22
Q

Support Vector Machine

A
23
Q

Kernel Support Vector Machine

A
24
Q

Naive Bayes

A
25
Q

Decision Tree Classification

A
26
Q

Random Forest Classification

A
27
Q

K-means clustering

28
Q

K-Means Clustering

python

A
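A minimal sketch on made-up 2-D points that fall into two obvious groups; n_clusters=2 matches that assumption:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)             # cluster index assigned to each point
print(kmeans.cluster_centers_)    # coordinates of the centroids
print(kmeans.predict([[0, 0]]))   # assign a new point to the nearest centroid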
29
Q

Hierarchical Clustering

A
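A minimal sketch of bottom-up (agglomerative) clustering on the same kind of made-up points:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
print(model.fit_predict(X))   # cluster index assigned to each point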
30
Q

Lagrange multiplier

A

https://medium.com/@ecyY/an-introduction-to-lagrange-multiplier-on-solving-optimization-questions-under-economic-constraints-f9bc9b439169

The method of Lagrange multipliers is a technique for finding the extrema of a multivariate function when its variables are subject to one or more equality constraints. It converts an optimization problem with n variables and k constraints into the problem of solving a system of equations in n + k variables.
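A minimal sympy sketch of the textbook setup, maximizing f(x, y) = x*y subject to x + y = 1:

import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x * y        # objective
g = x + y - 1    # constraint, written as g = 0

# stationary points of the Lagrangian L = f - lambda * g
L = f - lam * g
solution = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam))
print(solution)   # x = 1/2, y = 1/2, lambda = 1/2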

31
Q

Precision and recall

A

Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

In pattern recognition, information retrieval, and classification, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of all relevant instances that are actually retrieved; both are therefore measures based on an understanding of relevance. For example, suppose a computer program for recognizing dogs in photographs identifies 8 dogs in an image containing 12 dogs and some cats. Of the 8 identified as dogs, 5 actually are dogs while the rest are cats. The program's precision is 5/8 and its recall is 5/12.
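The same dog/cat example as plain arithmetic (TP = true positives, FP = false positives, FN = false negatives):

TP = 5        # identified as a dog and actually a dog
FP = 8 - TP   # identified as a dog but actually a cat -> 3
FN = 12 - TP  # dogs the program missed -> 7

precision = TP / (TP + FP)   # 5/8 = 0.625
recall = TP / (TP + FN)      # 5/12 ~ 0.417
print(precision, recall)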

32
Q

First-order predicate logic

A

First-order logic is a formal system used in mathematics, philosophy, linguistics, and computer science. Over the past century or so it has gone by many names, including first-order predicate calculus, lower predicate calculus, quantification theory, and predicate logic. First-order logic differs from propositional logic in its use of quantified variables.

First-order logic—also known as predicate logic, quantificational logic, and first-order predicate calculus—is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science.

33
Q

Remove whitespace from column headers

A

df.columns = [x.strip() for x in df.columns]

34
Q

combining objects

A

A more efficient approach: zip

combined_zip = zip(name, hps)

35
Q

counting with loop

A

A more efficient approach:

from collections import Counter

type_counts = Counter(poke_types)

36
Q

combination with loop

A

A more efficient approach (see the sketch below):
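Presumably, by analogy with the Counter card above, the intended answer is itertools.combinations; a sketch with a made-up poke_types list:

from itertools import combinations

poke_types = ['bug', 'fire', 'ghost', 'grass', 'water']
combos_obj = combinations(poke_types, 2)   # lazy iterator over all unordered pairs
print(list(combos_obj))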

37
Q

Engines and connection strings

  • Import create_engine from the sqlalchemy module.
  • Using the create_engine() function, create an engine for a local file named census.sqlite with sqlite as the driver. Be sure to enclose the connection string within quotation marks.
  • Print the output from the .table_names() method on the engine.
A
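A sketch following the bullets; census.sqlite comes from the exercise, and .table_names() is the method the prompt names (deprecated in newer SQLAlchemy in favor of inspection):

from sqlalchemy import create_engine

# 'sqlite:///census.sqlite' = driver plus the path to the local file
engine = create_engine('sqlite:///census.sqlite')
print(engine.table_names())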
38
Q

Autoloading Tables from a database

A
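A sketch of SQLAlchemy 1.x-style reflection, reusing the census.sqlite engine from the previous card; the table name census is an assumption:

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine('sqlite:///census.sqlite')
metadata = MetaData()

# autoload reads the column definitions from the database itself
census = Table('census', metadata, autoload=True, autoload_with=engine)
print(repr(census))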
39
Q

Computing percentiles

A

percentiles = np.percentile(data, np.array([25, 50, 75]))

i.e. percentiles = np.percentile(data, array_of_percentiles)

40
Q
  1. Compute variance and standard deviation with numpy
  2. Square root
A
  1. np.var(),np.std()
  2. np.sqrt()
41
Q
  1. Compute covariance
A

np.cov(x, y)

42
Q
  1. Compute the Pearson correlation coefficient
A

np.corrcoef(x, y)

43
Q
  1. Draw a number between 0 and 1
  2. Make the random data identical on every run (generate reproducible code)
A
  1. np.random.random()
    np.random.random(size=5) generates 5 random numbers
  2. np.random.seed()
44
Q

create table in sql

  • Create a table named ‘results’ with 3 VARCHAR columns called track, artist, and album, with lengths 200, 120, and 160, respectively.
  • Create one integer column called track_length_mins.
  • SELECT all the columns from your new table. No rows will be returned, but you can confirm that the table has been created.
A
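A sketch of the DDL the bullets describe, run through a SQLAlchemy 1.x-style engine.execute with an in-memory SQLite database as a stand-in:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
engine.execute("""
    CREATE TABLE results (
        track VARCHAR(200),
        artist VARCHAR(120),
        album VARCHAR(160),
        track_length_mins INTEGER
    )
""")
print(engine.execute('SELECT * FROM results').fetchall())   # no rows yet: []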
45
Q

insert SQL

  • Create a table called tracks with 2 VARCHAR columns named track and album, and one integer column named track_length_mins. Then, select all columns from the new table using the * notation.
  • Insert the track ‘Basket Case’, from the album ‘Dookie’, with a track length of 3, into the appropriate columns.
A
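A sketch under the same assumptions; the VARCHAR lengths are illustrative since the prompt does not specify them:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
engine.execute('CREATE TABLE tracks (track VARCHAR(200), album VARCHAR(160), track_length_mins INTEGER)')
engine.execute("INSERT INTO tracks VALUES ('Basket Case', 'Dookie', 3)")
print(engine.execute('SELECT * FROM tracks').fetchall())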
46
Q

Linear Model in Anthropology

  • import LinearRegression from sklearn.linear_model and initialize the model with fit_intercept=False.
  • Reshape the pre-loaded data arrays legs and heights, from “1-by-N” to “N-by-1” arrays.
  • Pass the reshaped arrays legs and heights into model.fit().
  • use model.predict() to predict the value fossil_height for the newly found fossil fossil_leg = 50.7.
A
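A sketch following the bullets; legs, heights, and fossil_leg = 50.7 come from the exercise, but the array contents here are made-up stand-ins:

import numpy as np
from sklearn.linear_model import LinearRegression

legs = np.array([35.2, 40.1, 44.7, 50.3])         # made-up data
heights = np.array([120.5, 138.0, 152.9, 173.2])  # made-up data

model = LinearRegression(fit_intercept=False)
# reshape from 1-by-N to N-by-1, as sklearn expects
model.fit(legs.reshape(-1, 1), heights.reshape(-1, 1))

fossil_leg = 50.7
fossil_height = model.predict(np.array([[fossil_leg]]))
print(fossil_height)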
47
Q

data importing in python

A
48
Q

from sqlalchemy import create_engine

A
49
Q

RSS: Residual Sum of Squares (how to compute it in code)

A

residuals = y_model - y_data

RSS = np.sum(np.square(residuals))

mean_square_residuals = RSS / len(residuals)

MSE = np.mean(np.square(residuals))            # mean squared error

RMSE = np.sqrt(np.mean(np.square(residuals)))  # root mean squared error = sqrt(MSE)

RMSE = np.std(residuals)                       # equivalent only when the residuals have zero mean

50
Q

Computing in Python

Deviations

Residuals

R-Squared

A
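A sketch using the convention from the RSS card above (y_data observed, y_model predicted); the arrays are made-up:

import numpy as np

y_data = np.array([1.0, 2.1, 2.9, 4.2])    # observations
y_model = np.array([1.1, 2.0, 3.0, 4.0])   # model predictions

deviations = y_data - np.mean(y_data)   # spread of the data around its mean
residuals = y_model - y_data            # model errors

# R-squared: the fraction of the variance in the data explained by the model
r_squared = 1 - np.mean(np.square(residuals)) / np.var(y_data)
print(r_squared)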
51
Q

What is the range of R-squared, and how should it be interpreted?

A

Notice that R-squared varies from 0 to 1, where a value of 1 means that the model and the data are perfectly correlated and all variation in the data is predicted by the model. A value of zero would mean none of the variation in the data is predicted by the model.

In the accompanying plot (not reproduced here), the data points are close to the line, so R-squared is close to 1.0.

52
Q

Variation Around the Trend

A
53
Q

likelihood vs. probability

A
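Presumably the intended distinction: probability measures how plausible an outcome is given fixed model parameters, P(data | θ), while likelihood treats the observed data as fixed and evaluates the same expression as a function of the parameters, L(θ | data). Probabilities over all outcomes sum to 1; likelihoods over parameter values need not.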
54
Q

When using resampling methods to draw conclusions about the whole population, what types of errors can occur?

A
55
Q

cd

ls

..

cd ~

A
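These are standard shell commands, so the intended answers are presumably:

  • cd: change the working directory
  • ls: list the contents of the current directory
  • ..: the parent directory (so cd .. moves up one level)
  • cd ~: change to the home directory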
57
Q

OKR

A

OKR (Objectives and Key Results) is a simple, effective system for goal management in companies, able to carry goal management from the top down through every level of the organization. The system was created at Intel and was brought to Google by the investor John Doerr less than a year after the company's founding; Google has used it ever since.

OKR is a management tool and method for defining objectives and tracking their results. The method originated at Intel, and John Doerr later promoted it at Oracle, Google, LinkedIn, and other technology companies, from where it spread widely; it is now used in project-driven businesses of all sizes in IT, venture capital, gaming, and the creative industries.

58
Q

APIs

A
59
Q

four steps for running an A/B test

A
  1. Picking a metric to track
  2. Calculating sample size
  3. Running the experiment
  4. Checking for significance
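A sketch of step 4 using a two-sample proportions z-test; the conversion counts and group sizes are made-up:

from statsmodels.stats.proportion import proportions_ztest

conversions = [48, 56]       # successes in control and treatment
group_sizes = [1000, 1000]   # visitors assigned to each variant

stat, p_value = proportions_ztest(conversions, group_sizes)
print(p_value)   # below the chosen alpha (e.g. 0.05) -> significant difference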
60
Q

Counter()

A
61
Q

Random row selection

Randomly select 75% of a DataFrame's rows

A
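A sketch with pandas on a made-up frame:

import pandas as pd

df = pd.DataFrame({'a': range(8), 'b': range(8)})
train = df.sample(frac=0.75, random_state=0)   # a random 75% of the rows
rest = df.drop(train.index)                    # the remaining 25%
print(len(train), len(rest))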
62
Q

Lists are mutable.

Integers are immutable.

A
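A tiny demonstration of the difference:

nums = [1, 2, 3]
nums[0] = 99   # fine: lists can be modified in place
print(nums)    # [99, 2, 3]

n = 5
m = n
n += 1         # rebinds n to a new integer object; the value 5 itself never changes
print(n, m)    # 6 5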
64
Q

C regularization

What effect does the size of C have?

A

As you probably noticed, smaller values of C lead to less confident predictions. That’s because smaller C means more regularization, which in turn means smaller coefficients, which means raw model outputs closer to zero and, thus, probabilities closer to 0.5 after the raw model output is squashed through the sigmoid function. That’s quite a chain of events!
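A sketch of the effect described above, with iris as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=500).fit(X, y)
    # smaller C -> more regularization -> smaller coefficients -> probabilities nearer 0.5
    print(C, abs(clf.coef_).max(), clf.predict_proba(X[:1]).max())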

65
Q

bias variance tradeoff

A
66
Q

random forest

A
67
Q

List python

remove data from sets

A
68
Q

List python

Set Operations - Similarities

A
69
Q

Model stability

Bootstrapping the mean

A

Bootstrapping is a common way to assess variability. The bootstrap:

  1. Take a random sample of the data with replacement
  2. Calculate the mean of the sample
  3. Repeat this process many times (1000s)
  4. Calculate the percentiles of the result (usually 2.5 and 97.5)

The result is a 95% confidence interval of the mean of each coefficient.
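A numpy sketch of those four steps on a made-up sample:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=100)   # made-up sample

# steps 1-3: resample with replacement many times, recording the mean each time
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10000)]

# step 4: the 2.5th and 97.5th percentiles bound a 95% confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)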