Data Science Flashcards

1
Q

Radial basis function (RBF) kernel

A
  • A radial basis function is a real-valued function whose value depends only on the distance from the origin, i.e. φ(x) = φ(‖x‖). It can also be defined by the distance from some center point c, i.e. φ(x, c) = φ(‖x − c‖). Any function satisfying φ(x) = φ(‖x‖) is called a radial function. The norm is usually the Euclidean distance, though other distance functions can also be used. Sums of radial basis functions can be used to approximate a given function, and this approximation process can be viewed as a simple neural network.
  • from sklearn.svm import SVR; regressor = SVR(kernel='rbf')
  • The most important SVR parameter is the kernel type. It can be linear, polynomial, or Gaussian; the Gaussian kernel is 'rbf'.
  • https://en.wikipedia.org/wiki/Radial_basis_function
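A minimal numpy sketch of what the Gaussian (RBF) kernel computes; the sample point, center, and gamma are made-up values:

import numpy as np

# Gaussian RBF: phi(x, c) = exp(-gamma * ||x - c||^2)
def rbf(x, c, gamma=1.0):
    # the value depends only on the distance ||x - c||, not on direction
    return np.exp(-gamma * np.linalg.norm(x - c) ** 2)

x = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])  # hypothetical center point
print(rbf(x, c))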
2
Q

Radial basis function network

A
3
Q
  1. Standardization
    1. Formula
    2. Python code
  2. Normalization (min-max scaling)
    1. Formula
    2. Python code
A
  1. Standardization subtracts each attribute's (column's) mean from the data and then divides by the standard deviation. Geometrically, this first shifts the axis origin onto the column mean and then rescales: a translation followed by a scaling. Afterwards, the data in each attribute (column) cluster around 0 with variance 1. The computation is performed separately for each attribute/column.
    1. (X - mean) / std
    2. from sklearn.preprocessing import StandardScaler
  2. Normalization is purely a scaling operation. It treats the distance between an attribute's (column's) maximum and minimum as one unit, then expresses each x as the fraction of that distance lying between x and the minimum, so the result is always a percentage between 0 and 1.
    1. (X - min) / (max - min)
    2. from sklearn.preprocessing import MinMaxScaler
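A minimal sketch of both scalers on a made-up single-column array:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_std = StandardScaler().fit_transform(X)   # per column: (X - mean) / std
X_mm = MinMaxScaler().fit_transform(X)      # per column: (X - min) / (max - min)
print(X_std.ravel())   # centered on 0, unit variance
print(X_mm.ravel())    # values in [0, 1]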
4
Q

sklearn.preprocessing

  1. fit()
  2. transform()
  3. fit_transform()
  4. inverse_transform()
A
  1. fit(): calculates the parameters μ and σ and saves them as internal objects. In other words, it learns the properties inherent to the training set X: its mean, variance, maximum, minimum, and so on.
  2. transform(): uses these calculated parameters to apply the transformation to a particular dataset.
    Building on fit(), it performs the standardization, dimensionality reduction, normalization, etc. (depending on the tool in use, e.g. PCA or StandardScaler).
  3. fit_transform(): joins the fit() and transform() methods for transformation of a dataset.
    fit_transform is the combination of fit and transform: it both learns the parameters and applies the transformation.
    transform() and fit_transform() both apply some uniform treatment to the data, such as standardization to N(0, 1), scaling (mapping) to a fixed interval, normalization, or regularization.
  4. inverse_transform(): converts the transformed (e.g. standardized) data back to the original data.
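A minimal sketch of all four methods using StandardScaler on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.5]])

scaler = StandardScaler()
scaler.fit(X_train)                              # learn the mean and std of X_train
X_train_std = scaler.transform(X_train)          # apply (X - mean) / std
# one-step equivalent: X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)            # reuse the training-set parameters
X_back = scaler.inverse_transform(X_train_std)   # recover the original values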
5
Q
  1. reshape(1,-1)
  2. reshape(2,-1)
  3. reshape(-1,1)
  4. reshape(-1,2)
  5. reshape(N, -1)
  6. reshape(-1, N)
  7. -1
  8. numpy.arange(a,b,c)
A
  1. reshape(1,-1): reshape into 1 row
  2. reshape(2,-1): reshape into 2 rows
  3. reshape(-1,1): reshape into 1 column
  4. reshape(-1,2): reshape into 2 columns
  5. reshape(N, -1): fix N rows; the number of columns is determined automatically
  6. reshape(-1, N): fix N columns; the number of rows is determined automatically
  7. This is exactly the role of -1: compute the missing dimension d automatically as d = (total number of elements in the array or matrix) / c; d must be an integer, otherwise an error is raised
  8. numpy.arange(a,b,c): generate an array starting at a, stepping by c, and stopping before b
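A quick sketch of these calls on a 10-element array (resulting shapes in the comments):

import numpy as np

a = np.arange(0, 10, 1)   # array([0, 1, ..., 9]); the stop value 10 is excluded
print(a.reshape(1, -1).shape)   # (1, 10): one row
print(a.reshape(2, -1).shape)   # (2, 5): two rows, columns inferred
print(a.reshape(-1, 1).shape)   # (10, 1): one column
print(a.reshape(-1, 2).shape)   # (5, 2): two columns, rows inferred
# a.reshape(3, -1) would raise an error because 10 is not divisible by 3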
6
Q

support vector regression (SVR)

A

https://www.saedsayad.com/support_vector_machine_reg.htm

  • The Support Vector Machine can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. Because the output is a real number with infinitely many possible values, an exact fit is very difficult to achieve; instead, a margin of tolerance (epsilon) is set around the prediction, and only errors outside that margin count. The main idea is always the same: minimize error by individualizing the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
  • In machine learning, support vector machines are supervised learning models, with associated learning algorithms, that analyze data for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall.
  • In addition to performing linear classification, SVMs can efficiently perform nonlinear classification using the so-called kernel trick, implicitly mapping their inputs into a high-dimensional feature space. When the data are unlabeled, supervised learning is impossible and an unsupervised approach is required, one that tries to find a natural clustering of the data into groups and then maps new data onto the groups it has formed. The clustering algorithm that adapts support vector machines is called support vector clustering [2]; when data are unlabeled or only partially labeled, it is often used in industrial applications as a preprocessing step for classification.
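A minimal SVR sketch on the position/salary-style data used in the later cards; feature scaling is usually recommended before SVR but is omitted here for brevity:

import numpy as np
from sklearn.svm import SVR

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

regressor = SVR(kernel='rbf')
regressor.fit(X, y)
print(regressor.predict([[6.5]]))   # prediction for a new position level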
7
Q

simple regression

A
8
Q

multiple linear regression

9
Q

polynomial regression

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

A

Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x).

https://www.geeksforgeeks.org/python-implementation-of-polynomial-regression/
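A minimal sketch for the dataset above using PolynomialFeatures with LinearRegression; degree=4 is an arbitrary illustrative choice:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

poly = PolynomialFeatures(degree=4)         # expand X into [1, x, x^2, x^3, x^4]
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.predict(poly.transform([[6.5]])))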

10
Q

support vector regression (SVR)

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

11
Q

Decision Tree regression

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

A

Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
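A minimal sketch for the dataset above; note that a tree predicts a piecewise-constant function, so single-feature predictions look like steps:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
print(regressor.predict([[6.5]]))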

12
Q

Random Forest Regression

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]

y = [45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000]

A

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
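A minimal sketch for the dataset above; n_estimators=10 is an arbitrary illustrative choice:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)
print(regressor.predict([[6.5]]))   # mean of the individual trees' predictions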

13
Q

logistic regression

  • a solution for classification
  • logistic function
A

https://christophm.github.io/interpretable-ml-book/logistic.html

  • In statistics, the logistic model is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.
  • Why linear regression is not suitable for classification:
    • Linear regression does not produce probabilities; its predicted values can be less than 0 or greater than 1.
    • The distribution of the data affects the fitted line, so yes/no questions (is this tumor benign? is there a dog in this picture?) call for classification instead.
  • Theory:
    • Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as: logistic(η) = 1 / (1 + exp(−η)).
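A tiny numpy sketch of the squashing step; the weights, bias, and input are made-up:

import numpy as np

def logistic(eta):
    # squeezes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

w, b = np.array([0.8, -0.4]), 0.1   # hypothetical linear-model parameters
x = np.array([2.0, 1.0])
eta = w @ x + b                      # raw linear output, any real number
print(logistic(eta))                 # a probability between 0 and 1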
14
Q

Confusion matrix

A

https://www.geeksforgeeks.org/confusion-matrix-machine-learning/

A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix: it shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors being made.
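A minimal sklearn sketch with made-up label vectors:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# rows = actual class, columns = predicted class; for binary labels:
# [[TN FP]
#  [FN TP]]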

15
Q

Supervised Learning

A
  • Definition: In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
    • We have a data set
    • We know what the correct output should look like
    • We have an idea of the relationship between the input and the output
  • Categories:
    • Regression: to predict results within a continuous output
    • Classification: to predict results in a discrete output
16
Q

Unsupervised learning

A
  • Definition: Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.
    • We can derive this structure by clustering the data based on relationships among the variables in the data.
    • With unsupervised learning there is no feedback based on the prediction results.
  • Applications:
    • Organize computing clusters
    • Social network analysis
    • Market segmentation
    • Astronomical data analysis
17
Q

Logistic Regression

Python code

A
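Presumably the card wants the standard sklearn workflow; a minimal sketch using the built-in iris data as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

classifier = LogisticRegression(max_iter=200)   # raise max_iter so the solver converges
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classifier.score(X_test, y_test))         # mean accuracy on the test set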
18
Q

k-nearest neighbors algorithm

A

https://www.saedsayad.com/k_nearest_neighbors.htm

https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN was already in use in statistical estimation and pattern recognition by the beginning of the 1970s as a non-parametric technique.
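A minimal sketch with sklearn, again using iris as a stand-in; n_neighbors=5 is the sklearn default:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # classify by majority vote of the 5 nearest points
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))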

19
Q

SVM and Kernel SVM

A

https://towardsdatascience.com/svm-and-kernel-svm-fed02bef1200

  • SVM is one of the top AI algorithms; it deals well with nonlinear and high-dimensional data
  • It is a supervised learning algorithm
  • It is mostly used for classification, but it can also be used for regression.
  • The main idea is that based on the labeled data (training data) the algorithm tries to find the optimal hyperplane which can be used to classify new data points. In two dimensions the hyperplane is a simple line.
  • As an example, lets consider two classes, apples and lemons.
    • Other algorithms will learn the most evident, most representative characteristics of apples and lemons, like apples are green and rounded while lemons are yellow and have elliptic form.
    • In contrast, SVM will search for apples that are very similar to lemons, for example apples which are yellow and have elliptic form. This will be a support vector. The other support vector will be a lemon similar to an apple (green and rounded). So other algorithms learn the differences while SVM learns similarities.
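A minimal sketch contrasting a linear and an RBF-kernel SVM, with iris as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))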
20
Q

Logistic Regression

A
21
Q

K-Nearest Neighbors

A
22
Q

Support Vector Machine

A
23
Q

Kernel Support Vector Machine

A
24
Q

Naive Bayes

A
25
Q

Decision Tree Classification

A
26
Q

Random Forest Classification

A
27
Q

K-means clustering

28
Q

K-Means Clustering

python

A
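A minimal sketch on made-up 2-D points that fall into two obvious groups; n_clusters=2 matches that assumption:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)             # cluster index assigned to each point
print(kmeans.cluster_centers_)    # coordinates of the centroids
print(kmeans.predict([[0, 0]]))   # assign a new point to the nearest centroid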
29
Q

Hierarchical Clustering

A
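A minimal sketch of bottom-up (agglomerative) clustering on the same kind of made-up points:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
print(model.fit_predict(X))   # cluster index assigned to each point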
30
Q

Lagrange multiplier

A

https://medium.com/@ecyY/an-introduction-to-lagrange-multiplier-on-solving-optimization-questions-under-economic-constraints-f9bc9b439169

The method of Lagrange multipliers is a technique for finding the extrema of a multivariate function when its variables are subject to one or more equality constraints. It converts an optimization problem with n variables and k constraints into the problem of solving a system of equations in n + k variables.
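A minimal sympy sketch of the textbook setup, maximizing f(x, y) = x*y subject to x + y = 1:

import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x * y        # objective
g = x + y - 1    # constraint, written as g = 0

# stationary points of the Lagrangian L = f - lambda * g
L = f - lam * g
solution = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam))
print(solution)   # x = 1/2, y = 1/2, lambda = 1/2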

31
Q

Precision and recall

A

Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

In pattern recognition, information retrieval, and classification, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of all relevant instances that are actually retrieved; both are therefore measures based on an understanding of relevance. For example, suppose a computer program for recognizing dogs in photographs identifies 8 dogs in an image containing 12 dogs and some cats. Of the 8 identified as dogs, 5 actually are dogs while the rest are cats. The program's precision is 5/8 and its recall is 5/12.
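The same dog/cat example as plain arithmetic (TP = true positives, FP = false positives, FN = false negatives):

TP = 5        # identified as a dog and actually a dog
FP = 8 - TP   # identified as a dog but actually a cat -> 3
FN = 12 - TP  # dogs the program missed -> 7

precision = TP / (TP + FP)   # 5/8 = 0.625
recall = TP / (TP + FN)      # 5/12 ~ 0.417
print(precision, recall)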

32
Q

First-order predicate logic

A

First-order logic is a formal system used in mathematics, philosophy, linguistics, and computer science. Over the past century or so it has gone by many names, including first-order predicate calculus, lower predicate calculus, quantification theory, and predicate logic. First-order logic differs from propositional logic in its use of quantified variables.

First-order logic—also known as predicate logic, quantificational logic, and first-order predicate calculus—is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science.

33
Q

Remove whitespace from column headers

A

df.columns = [x.strip() for x in df.columns]

34
Q

combining objects

A

A more efficient approach: zip

combined_zip = zip(name, hps)

35
Q

counting with loop

A

A more efficient approach:

from collections import Counter

type_counts = Counter(poke_types)

36
Q

combination with loop

A

A more efficient approach (see the sketch below):
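Presumably, by analogy with the Counter card above, the intended answer is itertools.combinations; a sketch with a made-up poke_types list:

from itertools import combinations

poke_types = ['bug', 'fire', 'ghost', 'grass', 'water']
combos_obj = combinations(poke_types, 2)   # lazy iterator over all unordered pairs
print(list(combos_obj))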

37
Q

Engines and connection strings

  • Import create_engine from the sqlalchemy module.
  • Using the create_engine() function, create an engine for a local file named census.sqlite with sqlite as the driver. Be sure to enclose the connection string within quotation marks.
  • Print the output from the .table_names() method on the engine.
A
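A sketch following the bullets; census.sqlite comes from the exercise, and .table_names() is the method the prompt names (deprecated in newer SQLAlchemy in favor of inspection):

from sqlalchemy import create_engine

# 'sqlite:///census.sqlite' = driver plus the path to the local file
engine = create_engine('sqlite:///census.sqlite')
print(engine.table_names())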
38
Q

Autoloading Tables from a database

A
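A sketch of SQLAlchemy 1.x-style reflection, reusing the census.sqlite engine from the previous card; the table name census is an assumption:

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine('sqlite:///census.sqlite')
metadata = MetaData()

# autoload reads the column definitions from the database itself
census = Table('census', metadata, autoload=True, autoload_with=engine)
print(repr(census))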
39
Q

Computing percentiles

A

percentiles = np.percentile(data, np.array([25, 50, 75]))

i.e. percentiles = np.percentile(data, array_of_percentiles)

40
Q
  1. Compute variance and standard deviation with numpy
  2. Square root
A
  1. np.var(),np.std()
  2. np.sqrt()
41
Q
  1. Compute covariance
A

np.cov(x, y)

42
Q
  1. Compute the Pearson correlation coefficient
A

np.corrcoef(x, y)

43
Q
  1. Draw a number between 0 and 1
  2. Make the random data identical on every run (generate reproducible code)
A
  1. np.random.random()
    np.random.random(size=5) generates 5 random numbers
  2. np.random.seed()
44
Q

create table in sql

  • Create a table named ‘results’ with 3 VARCHAR columns called track, artist, and album, with lengths 200, 120, and 160, respectively.
  • Create one integer column called track_length_mins.
  • SELECT all the columns from your new table. No rows will be returned, but you can confirm that the table has been created.
A
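A sketch of the DDL the bullets describe, run through a SQLAlchemy 1.x-style engine.execute with an in-memory SQLite database as a stand-in:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
engine.execute("""
    CREATE TABLE results (
        track VARCHAR(200),
        artist VARCHAR(120),
        album VARCHAR(160),
        track_length_mins INTEGER
    )
""")
print(engine.execute('SELECT * FROM results').fetchall())   # no rows yet: []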
45
Q

insert SQL

  • Create a table called tracks with 2 VARCHAR columns named track and album, and one integer column named track_length_mins. Then, select all columns from the new table using the * notation.
  • Insert the track ‘Basket Case’, from the album ‘Dookie’, with a track length of 3, into the appropriate columns.
A
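A sketch under the same assumptions; the VARCHAR lengths are illustrative since the prompt does not specify them:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
engine.execute('CREATE TABLE tracks (track VARCHAR(200), album VARCHAR(160), track_length_mins INTEGER)')
engine.execute("INSERT INTO tracks VALUES ('Basket Case', 'Dookie', 3)")
print(engine.execute('SELECT * FROM tracks').fetchall())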
46
Q

Linear Model in Anthropology

  • import LinearRegression from sklearn.linear_model and initialize the model with fit_intercept=False.
  • Reshape the pre-loaded data arrays legs and heights, from “1-by-N” to “N-by-1” arrays.
  • Pass the reshaped arrays legs and heights into model.fit().
  • use model.predict() to predict the value fossil_height for the newly found fossil fossil_leg = 50.7.
A
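A sketch following the bullets; legs, heights, and fossil_leg = 50.7 come from the exercise, but the array contents here are made-up stand-ins:

import numpy as np
from sklearn.linear_model import LinearRegression

legs = np.array([35.2, 40.1, 44.7, 50.3])         # made-up data
heights = np.array([120.5, 138.0, 152.9, 173.2])  # made-up data

model = LinearRegression(fit_intercept=False)
# reshape from 1-by-N to N-by-1, as sklearn expects
model.fit(legs.reshape(-1, 1), heights.reshape(-1, 1))

fossil_leg = 50.7
fossil_height = model.predict(np.array([[fossil_leg]]))
print(fossil_height)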
47
Q

data importing in python

A
48
Q

from sqlalchemy import create_engine

A
49
Q

RSS: Residual Sum of Squares (how to compute it in code)

A

residuals = y_model - y_data

RSS = np.sum(np.square(residuals))

mean_square_residuals = RSS / len(residuals)

MSE = np.mean(np.square(residuals))            # mean squared error

RMSE = np.sqrt(np.mean(np.square(residuals)))  # root mean squared error = sqrt(MSE)

RMSE = np.std(residuals)                       # equivalent only when the residuals have zero mean

50
Q

Computing in Python

Deviations

Residuals

R-Squared

A
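A sketch using the convention from the RSS card above (y_data observed, y_model predicted); the arrays are made-up:

import numpy as np

y_data = np.array([1.0, 2.1, 2.9, 4.2])    # observations
y_model = np.array([1.1, 2.0, 3.0, 4.0])   # model predictions

deviations = y_data - np.mean(y_data)   # spread of the data around its mean
residuals = y_model - y_data            # model errors

# R-squared: the fraction of the variance in the data explained by the model
r_squared = 1 - np.mean(np.square(residuals)) / np.var(y_data)
print(r_squared)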
51
Q

What is the range of R-squared, and how should it be interpreted?

A

Notice that R-squared varies from 0 to 1, where a value of 1 means that the model and the data are perfectly correlated and all variation in the data is predicted by the model. A value of zero would mean none of the variation in the data is predicted by the model.

In the accompanying plot (not reproduced here), the data points are close to the line, so R-squared is close to 1.0.

52
Q

Variation Around the Trend

A
53
Q

likelihood vs. probability

A
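Presumably the intended distinction: probability measures how plausible an outcome is given fixed model parameters, P(data | θ), while likelihood treats the observed data as fixed and evaluates the same expression as a function of the parameters, L(θ | data). Probabilities over all outcomes sum to 1; likelihoods over parameter values need not.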
54
Q

When using resampling methods to draw conclusions about the whole population, what types of errors can occur?

A
55
Q

cd

ls

..

cd ~

A
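These are standard shell commands, so the intended answers are presumably:

  • cd: change the working directory
  • ls: list the contents of the current directory
  • ..: the parent directory (so cd .. moves up one level)
  • cd ~: change to the home directory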
57
Q

OKR

A

OKR (Objectives and Key Results) is a simple, effective system for goal management in companies, able to carry goal management from the top down through every level of the organization. The system was created at Intel and was brought to Google by the investor John Doerr less than a year after the company's founding; Google has used it ever since.

OKR is a management tool and method for defining objectives and tracking their results. The method originated at Intel, and John Doerr later promoted it at Oracle, Google, LinkedIn, and other technology companies, from where it spread widely; it is now used in project-driven businesses of all sizes in IT, venture capital, gaming, and the creative industries.

58
Q

APIs

A
59
Q

four steps for running an A/B test

A
  1. Picking a metric to track
  2. Calculating sample size
  3. Running the experiment
  4. Checking for significance
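A sketch of step 4 using a two-sample proportions z-test; the conversion counts and group sizes are made-up:

from statsmodels.stats.proportion import proportions_ztest

conversions = [48, 56]       # successes in control and treatment
group_sizes = [1000, 1000]   # visitors assigned to each variant

stat, p_value = proportions_ztest(conversions, group_sizes)
print(p_value)   # below the chosen alpha (e.g. 0.05) -> significant difference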
60
Q

Counter()

A
61
Q

Random row selection

Randomly select 75% of a DataFrame's rows

A
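A sketch with pandas on a made-up frame:

import pandas as pd

df = pd.DataFrame({'a': range(8), 'b': range(8)})
train = df.sample(frac=0.75, random_state=0)   # a random 75% of the rows
rest = df.drop(train.index)                    # the remaining 25%
print(len(train), len(rest))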
62
Q

Lists are mutable.

Integers are immutable.

A
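A tiny demonstration of the difference:

nums = [1, 2, 3]
nums[0] = 99   # fine: lists can be modified in place
print(nums)    # [99, 2, 3]

n = 5
m = n
n += 1         # rebinds n to a new integer object; the value 5 itself never changes
print(n, m)    # 6 5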
64
Q

C regularization

What effect does the size of C have?

A

As you probably noticed, smaller values of C lead to less confident predictions. That’s because smaller C means more regularization, which in turn means smaller coefficients, which means raw model outputs closer to zero and, thus, probabilities closer to 0.5 after the raw model output is squashed through the sigmoid function. That’s quite a chain of events!
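A sketch of the effect described above, with iris as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=500).fit(X, y)
    # smaller C -> more regularization -> smaller coefficients -> probabilities nearer 0.5
    print(C, abs(clf.coef_).max(), clf.predict_proba(X[:1]).max())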

65
Q

bias variance tradeoff

A
66
Q

random forest

A
67
Q

List python

remove data from sets

A
68
Q

List python

Set Operations - Similarities

A
69
Q

Model stability

Bootstrapping the mean

A

Bootstrapping is a common way to assess variability. The bootstrap:

  1. Take a random sample of the data with replacement
  2. Calculate the mean of the sample
  3. Repeat this process many times (1000s)
  4. Calculate the percentiles of the result (usually 2.5 and 97.5)

The result is a 95% confidence interval of the mean of each coefficient.
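A numpy sketch of those four steps on a made-up sample:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=100)   # made-up sample

# steps 1-3: resample with replacement many times, recording the mean each time
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10000)]

# step 4: the 2.5th and 97.5th percentiles bound a 95% confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)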