Data Science Flashcards
quantitative data
used to measure the amount of something (eg: mass)
categorical data
used to classify instead of measure (eg: species of an animal)
Definitions area
where you write the code in Pyret
Interactions area
where the output is
Pyret decimals
must have a digit before the decimal point, so values below 1 start with 0 (eg: 0.5, not .5)
bar chart
count
Visual representation of how often each value occurs (its frequency)
Column for every category
Pie chart
Percentage
Visual representation of RELATIVE frequency
Slice for every category
Max 7 slices, generally 5
stacked bar chart
shows more detail about another column (eg: count of species and sex)
data cycle
ask questions, consider data, analyze data, interpret data (mnemonic: QCAI)
lookup questions
answered by looking up a single value in a table
arithmetic questions
answered by computing a value within a single column (eg: the average, max, or min of a column)
statistical questions
ask about the relationship between two columns
null hypothesis
a statistical hypothesis proposing that there is no real effect or relationship in a set of observations, so any pattern seen is due to chance
random samples
a subset of a population in which each member has an equal chance of being chosen. The larger the random sample, the more accurately it tends to represent the population
grouped samples
a subset of the population in which each member of the subset was chosen for a specific reason
file extension purpose
tells your computer which application created or can open the file, and which icon to use for the file
what does CSV stand for?
comma-separated values
histograms
shows the number of rows that fall within certain intervals (or “bins”) along the horizontal axis
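A minimal sketch of how the bins are counted, written in Python for illustration. The helper name `bin_counts` and the weights are made up, and each bin is assumed to include its left edge but not its right edge.

```python
def bin_counts(values, bin_width, start):
    """Count how many values fall in each interval [left, left + bin_width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)   # which bin the value lands in
        left = start + k * bin_width        # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

weights = [3, 7, 12, 14, 18, 21, 25, 48]    # made-up values
counts = bin_counts(weights, bin_width=10, start=0)
print(counts)  # keys are left edges of bins; bins with no values are left out
```

Each key would be one bar on the horizontal axis, and its count would be the height of that bar.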
type of histogram
📉
(imagine this as a histogram: most bars on the left, with a tail stretching to the right)
skew right
type of histogram
📈
(imagine this as a histogram: most bars on the right, with a tail stretching to the left)
skew left
what does it mean to be an outlier
a data point that lies far from the rest of the data; judge it by comparing it to the other data. It is important to think about all extreme data points, not just outliers
mean
average (sum of the values divided by how many values there are)
best for symmetric, medium-to-large datasets
median
the middle value of the sorted data (or the average of the two middle values): half the values are smaller and half are larger
If the data is asymmetric (skewed), use the median
mode
the value or values that occur most often in a dataset
in a small dataset with few distinct values, the mode will likely be the most useful measure of center
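As a rough illustration of the three measures of center, here is a small Python sketch using the standard `statistics` module. The ages are made up, with one extreme value (30) pulling the data to the right.

```python
from statistics import mean, median, multimode

ages = [1, 2, 2, 3, 4, 5, 30]        # made-up values, skewed right by the 30

center_mean = mean(ages)             # pulled upward by the extreme value
center_median = median(ages)         # middle value of the sorted data
center_modes = multimode(ages)       # list of the most frequent value(s)

print(center_mean, center_median, center_modes)
```

In this made-up data the mean ends up larger than every value except one, while the median stays near where most of the values sit, which is why the median is preferred for asymmetric data.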
how many quartiles are in a box plot?
3
histogram and box plot shape
the longer whisker of the box plot points in the same direction as the skew of the histogram
standard deviation
the most common way to summarize the spread of a quantitative column
how to calculate standard deviation
roughly the typical distance of the values from the mean
standard deviation equation
sqrt([sum of squared distances from the mean] / [# of values - 1])
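The calculation can be sketched step by step in Python: find the mean, square each value's distance from it, add the squares, divide by one less than the number of values, and take the square root. The function name `sample_std_dev` and the data are made up for illustration.

```python
import math
from statistics import stdev

def sample_std_dev(values):
    n = len(values)
    mean = sum(values) / n
    # squared distance of each value from the mean
    squared_distances = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_distances) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up values with mean 5
result = sample_std_dev(data)

# Cross-check against Python's built-in sample standard deviation
print(result, stdev(data))
```

For this data the squared distances add up to 32, so the result is sqrt(32 / 7).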
explanatory variable
the independent variable, plotted on the x-axis of a scatterplot
response variable
the dependent variable, plotted on the y-axis of a scatterplot
r
correlation statistic
between -1 and +1
-1 = strongest negative correlation
+1 = strongest positive correlation
0 = no linear correlation
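One common way to compute r is the Pearson formula, sketched here in Python. The function name `correlation` and the data are made up, and the sketch assumes both lists have the same length and neither is constant.

```python
import math

def correlation(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # how x and y vary together, and how much each varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2, 4, 6, 8, 10])     # y rises steadily with x
r_down = correlation(xs, [10, 8, 6, 4, 2])   # y falls steadily as x rises
r_flat = correlation(xs, [2, 1, 3, 1, 2])    # no upward or downward trend

print(r_up, r_down, r_flat)
```

The three made-up datasets land at the two ends and the middle of the scale: a perfect upward line, a perfect downward line, and a pattern with no linear trend.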
what is the regression line also known as?
Line of best fit, least squares line, predictor, trendline
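The name "least squares" comes from choosing the slope and intercept that make the squared vertical distances from the points to the line as small as possible. A minimal Python sketch of the standard formulas follows; the function name `least_squares_line` and the data are made up.

```python
def least_squares_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return slope, intercept

xs = [1, 2, 3, 4, 5]                      # explanatory variable (made up)
ys = [2, 4, 5, 4, 5]                      # response variable (made up)
slope, intercept = least_squares_line(xs, ys)

# Use the line as a predictor for a new x value
predicted = slope * 6 + intercept
print(slope, intercept, predicted)        # slope is about 0.6, intercept about 2.2
```

The last step shows why the line is also called a predictor: plugging in an x value gives an estimated y value.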
definition of a row
cat-row = row-n(animals-table, #)
how to look up a value in a row
cat-row["species"]
(cat-row must already be defined)
how to make a function
fun name(parameters): body end
what is an example for functions?
a test case that shows what the function should produce for a given input
example example
fun f(x): x / 2 end
examples:
  f(2) is 2 / 2
  f(10) is 10 / 2
end
which functions need a helper function?
image-scatter-plot, build-column, filter
which functions make a specific table?
sort, filter, or build-column
filter(build-column(animals-table, "kilos", kilogram), is-heavy)
syntax errors
typos in the code that are usually easy to spot. The code will not run
runtime error
the program runs for a bit and then crashes at a specific point in the code
logic error
the program runs completely but produces the wrong output
four categories of dirty data
missing data, inconsistent types, inconsistent units/invalid range, inconsistent naming
missing data
some cells have data. Some do not
inconsistent types
a column where the values have different data types. (eg: 2, two)
inconsistent units/invalid range
where the data types are the same but the values are in different units or fall outside the valid range
inconsistent naming
inconsistent spelling and capitalization in entries
selection bias
when the participants selected are not representative of the group being studied
bias in the study design
when the study is not designed carefully and ends up not measuring what the question actually asked
poor choice of summary
using a summary that does not fit the data (eg: the mean instead of the median for skewed data)
confounding variables
an outside influence, other than the one being studied, that affects the results. Reminder: correlation does not imply causation
intentionally using the wrong chart
misleads the audience and can hide gaps in the data, making the picture inaccurate
changing the scale of a chart
can exaggerate or shrink differences, making the data look a certain way