Introductory terms Flashcards
Descriptive statistics
statistics that describe a dataset using calculated values such as the mean and standard deviation
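A minimal sketch of the two values named on this card, using Python's standard `statistics` module. The `yields` list is made-up illustration data, not from the flashcards.

```python
import statistics

# Hypothetical sample: harvest weights (kg) from five plots
yields = [12.0, 15.0, 11.0, 14.0, 13.0]

# The mean summarizes the center of the data
mean_yield = statistics.mean(yields)

# The sample standard deviation (n - 1 denominator) summarizes the spread
std_yield = statistics.stdev(yields)

print(f"mean = {mean_yield:.2f}, std deviation = {std_yield:.2f}")
```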
Inferential statistics
statistical calculations that enable us to draw conclusions about the larger population from a sample of it
Normal distribution
a symmetric, bell-shaped distribution in which the mean sets the middle of the distribution and the standard deviation sets the width.
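A small sketch of how the standard deviation controls the width, using `statistics.NormalDist` from the standard library. The means, standard deviations, and the 45 to 55 range are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

# Two normal distributions with the same middle (mean) but different widths
narrow = NormalDist(mu=50, sigma=5)
wide = NormalDist(mu=50, sigma=15)

def share_within(dist, low, high):
    """Probability that a value drawn from dist falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)

# For the narrow curve, 45..55 is one standard deviation either side of the mean
narrow_share = share_within(narrow, 45, 55)
# For the wide curve, the same range covers much less of the distribution
wide_share = share_within(wide, 45, 55)

print(f"narrow: {narrow_share:.3f}, wide: {wide_share:.3f}")
```

The narrow curve puts roughly 68% of its values within one standard deviation of the mean; the wider curve spreads the same total probability over a larger range.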
Probability
the mathematical study of what could happen. In data science, probability calculations are used to simulate scenarios and build models, which help us understand data that does not yet exist.
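One way to picture "simulating scenarios" is a small Monte Carlo sketch: roll two dice many times and estimate how often they sum to 7. The function name, seed, and trial count are illustrative choices, not part of the flashcards.

```python
import random

def estimate_sum_probability(target, trials, seed=0):
    """Estimate P(two dice sum to target) by simulating many rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == target:
            hits += 1
    return hits / trials

estimate = estimate_sum_probability(target=7, trials=100_000)
exact = 6 / 36  # six of the 36 equally likely outcomes sum to 7

print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```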
Programming
the act of giving the computer instructions to perform a task
Clustering
a technique in data science that groups similar data points together, without predefined labels. Programming makes clustering data time-efficient.
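A rough sketch of the idea in one dimension, using a k-means style loop (one common clustering method; the card does not name a specific algorithm). The `prices` data and starting centers are made up for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around starting centers with a k-means style loop."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centers

# Hypothetical prices that visibly fall into a cheap group and an expensive group
prices = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
clusters, centers = kmeans_1d(prices, centers=[0.0, 6.0])

print(clusters)
print(centers)
```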
Domain expertise
the particular set of knowledge that someone cultivates in order to understand their data. My domain expertise is the food system, agriculture, and sustainability.
Data science process (8 steps)
Ask a question; determine the necessary data; get the data; clean and organize the data; explore the data; model and analyze the data; communicate findings; make the work reproducible and automated.
A/B testing
a process of showing two variants of the same web page to different segments of website visitors at the same time and comparing which variant drives more conversions.
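A sketch of comparing two variants, assuming made-up visitor and conversion counts. The two-proportion z-test used to judge whether the difference is likely real is an added assumption; the card itself only says to compare conversions.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical results: visitors and conversions for each page variant
visitors_a, conversions_a = 1000, 100
visitors_b, conversions_b = 1000, 130

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Two-proportion z-test with a pooled conversion rate
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
z = (rate_b - rate_a) / std_error

# Two-sided p-value: chance of a gap this large if the variants were equal
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"A: {rate_a:.1%}, B: {rate_b:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```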
Margin of error
the amount by which survey results are expected to differ from the true population value. The larger the margin of error, the less confidence we have in the results.
Confidence level
the probability that, if we ran another survey with the same metrics, we would get the same results. Common values are 90%, 95%, and 99%.
Population size
size of the population we’re collecting data on. A common number in sample size calculations is 100,000.
Likely sample proportion
the percentage of people surveyed whose results we expect to match the expected outcome. If we don’t have historical data, we normally use 50%.
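The four survey terms above (margin of error, confidence level, population size, and likely sample proportion) are the inputs to a sample size calculation. The sketch below assumes the standard formula with a finite population correction; the flashcards do not say which calculator or formula they have in mind, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size(margin_of_error, confidence_level, population_size, proportion=0.5):
    """Estimate how many people to survey (assumed standard formula)."""
    # z-score that leaves (1 - confidence_level) split across both tails
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction shrinks n0 for smaller populations
    n = n0 / (1 + (n0 - 1) / population_size)
    return ceil(n)

# 5% margin of error, 95% confidence, population 100,000, proportion 50%
needed = sample_size(0.05, 0.95, 100_000, 0.5)
print(needed)  # 383
```

Tightening the margin of error or raising the confidence level increases the required sample size, while a proportion of 50% gives the largest (most conservative) estimate.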
Active data collection
setting up specific conditions in which to get data. You're on the hunt. E.g., running different experiments and surveys.
Passive data collection
looking for data that already exists. You're foraging for data. E.g., locating datasets and web scraping.