Machine Learning One Flashcards
Explain the three main data types
Numerical - numbers
- discrete data = counted data that are limited to integers (Number of cars passing by)
- continuous data = measured data that can be any number (price of item, size of item)
Categorical - values that can’t be measured against each other like color or yes/no
Ordinal - Values that can be measured against each other (Like school grades, if A is better than B)
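A small sketch of what each data type can look like in Python. All variable names and sample values here are hypothetical and chosen only for illustration; the point is that ordinal values can be ranked once an order is defined, while categorical values cannot.

```python
# Hypothetical sample values, one list per data type
cars_passing = [3, 7, 2]             # numerical, discrete (counted integers)
item_prices = [9.99, 24.5, 3.75]     # numerical, continuous (measured values)
colors = ["red", "blue", "red"]      # categorical (no meaningful order)
grades = ["B", "A", "C"]             # ordinal (categories with an order)

# Ordinal values can be compared once the order is written down
grade_order = ["C", "B", "A"]        # worst to best (assumed ordering)
best_grade = max(grades, key=grade_order.index)
print(best_grade)
```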
Define the below
- mean
- median
- mode
Mean - average value
sum of all values divided by the number of values
Median - midpoint value
sort the numbers; the number in the middle is the median. If there are two middle numbers, divide the sum of those two numbers by 2
Mode - most common value
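A minimal sketch of the three definitions worked out by hand in plain Python. The list `values` is a hypothetical sample, picked with an even count so the median has to average the two middle numbers.

```python
# Hypothetical sample with an even number of values
values = [2, 4, 4, 7, 9, 10]

# Mean: sum of all values divided by how many values there are
mean = sum(values) / len(values)

# Median: sort, then take the middle value (or average the two middle values)
ordered = sorted(values)
mid = len(ordered) // 2
if len(ordered) % 2 == 1:
    median = ordered[mid]
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the value that appears most often
mode = max(set(values), key=values.count)

print(mean, median, mode)
```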
Calculate the mean, median, and mode of
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
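import numpy
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
# mode lives in scipy.stats, not in the top-level scipy namespace
print(numpy.mean(speed))
print(numpy.median(speed))
print(stats.mode(speed))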
What is standard deviation?
Number that describes how spread out values are.
Low Standard Deviation - numbers are close together
High Standard Deviation - numbers are further apart
Standard Deviation is often represented by the symbol Sigma: σ
Variance is often represented by the symbol Sigma Squared: σ²
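A quick sketch comparing a low and a high standard deviation. Both lists appear elsewhere in these cards; the variable names are assumptions made for illustration.

```python
import numpy

# Values that sit close together vs. values that are spread out
close_together = [86, 87, 88, 86, 87, 85, 86]
spread_out = [32, 111, 138, 28, 59, 77, 97]

low_std = numpy.std(close_together)
high_std = numpy.std(spread_out)

# The second number printed should be much larger than the first
print(low_std, high_std)
```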
Calculate the standard deviation of the below:
speed = [86,87,88,86,87,85,86]
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
What is variance?
How do you find it?
Indicates how spread out values are.
In fact, the square root of the variance will get you the standard deviation.
If you multiply the standard deviation by itself, you get the variance.
First find the mean: add all numbers together and divide by the amount of numbers
Next subtract the mean from each number, then square each of these differences
Finally add all of the squared differences together and divide by the amount of numbers and you will have your variance
Standard Deviation is often represented by the symbol Sigma: σ
Variance is often represented by the symbol Sigma Squared: σ²
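A minimal sketch of finding the variance by hand: find the mean, square each value's difference from the mean, then average the squares. It divides by the number of values (population variance), which is assumed here because it matches the default behavior of numpy.var.

```python
import math

speed = [32, 111, 138, 28, 59, 77, 97]

# Step 1: the mean
mean = sum(speed) / len(speed)

# Step 2: squared difference between each value and the mean
squared_diffs = [(value - mean) ** 2 for value in speed]

# Step 3: the average of the squared differences is the variance
variance = sum(squared_diffs) / len(squared_diffs)

# The square root of the variance is the standard deviation
std_dev = math.sqrt(variance)

print(variance, std_dev)
```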
Use a module to help you find the variance
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
What are percentiles?
A percentile is the value that a given percent of the values in a data set are lower than or equal to.
Example:
What is the 75th percentile of the following list:
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
75% of the people here are 43 or younger
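A short sketch that checks this claim by counting, without any percentile function. The names `at_or_below` and `share` are hypothetical helpers.

```python
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

# Count how many people are 43 or younger
at_or_below = sum(1 for age in ages if age <= 43)

# Share of the whole list, as a fraction between 0 and 1
share = at_or_below / len(ages)

print(at_or_below, len(ages), share)
```

The share comes out slightly above 0.75 because the list has only 21 entries, so exact 75% is not possible.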
Find the 75th Percentile of the following list
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
import numpy
x = numpy.percentile(ages, 75)
print(x)
Create an array containing 250 random floats between 0 and 5
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
Create a histogram with 100 bars and a random data set of 10000 numbers between 0.0 and 5.0
import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 10000)
plt.hist(x, 100)
plt.show()
What is normal data distribution?
Create an array with 10000 values, a mean value of 5.0 and the standard deviation of 1.0
Array where values are concentrated around a given value.
Plotting the array described on this card as a histogram gives a shape known as a bell curve
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 10000)
plt.hist(x, 100)
plt.show()
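A small sketch that checks the generated array without plotting it: the sample mean and standard deviation should land close to the requested 5.0 and 1.0. The values are random, so they will differ slightly on every run; the names `sample_mean` and `sample_std` are assumptions.

```python
import numpy

# 10000 values concentrated around 5.0 with a standard deviation of 1.0
x = numpy.random.normal(5.0, 1.0, 10000)

sample_mean = numpy.mean(x)
sample_std = numpy.std(x)

# Both numbers should be close to, but not exactly, 5.0 and 1.0
print(sample_mean, sample_std)
```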
What is a scatter plot?
Diagram where each value in the data is represented by a dot.
A scatter plot needs two arrays of the same length: one with the values for the x axis and one with the values for the y axis.
Create a scatter plot
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Create a scatter plot with 1000 random numbers
The x axis will have a mean of 5.0 and a standard deviation of 1.0
The y axis will have a mean of 10.0 and a standard deviation of 2.0
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()