Data Analyst Concept Flashcards
State the five sub-categories of the routine steps in data analysis
1) Discovering the problem
2) Data preparation
3) Fitting models to the data
4) Understanding the results
5) Sharing your work
Describe the two elements of a problem hypothesis
Problem statement - outlines the problem in simple real world terms.
Hypothesis - a simple, testable theory that addresses the problem
Name 5 requirements for a good hypothesis
1) Must be testable and written in non-ambiguous language
2) Must at least partly answer the problem statement
3) Must make at least one clear prediction
4) Must be based on relevant and reliable information
5) Must contain a dependent and an independent variable
Explain the difference between a dependent and independent variable
The independent variable is the one we control or simply observe as it changes on its own, such as time; it does not depend on the other variable. The dependent variable is the outcome we measure, and its value is expected to change in response to the independent variable.
Explain the difference between a hypothesis and a prediction
A hypothesis proposes a broad trend, e.g. the more time a customer spends on an online shop, the more likely they are to buy an item, whereas a prediction states a specific, measurable outcome, e.g. for every extra 2 minutes spent, the customer is 5% more likely to buy an item.
What does ETL stand for?
Extract, transform, load
Define a ‘mathematical model’
A model that describes some features of the data using equations and parameters.
State the four steps involved in Cross-Validation
1) Randomly split the original dataset into two sets: Training Data and Testing Data
2) Fit the model (trend line) to the Training Data
3) Plot the Testing Data against the trend line
4) Assess how well the trend line predicts the Testing Data
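The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive method: the data is made up, the 35/15 split is an arbitrary choice, and the helper names fit_line and mean_squared_error are hypothetical. It performs a single train/test split; fuller cross-validation schemes repeat the split several times.

```python
import random

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(points, slope, intercept):
    """Average squared gap between the trend line and the actual y values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Made-up data: y is roughly 2x + 1 plus random noise
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(50)]

# Step 1: randomly split into Training and Testing Data
shuffled = data[:]
rng.shuffle(shuffled)
training, testing = shuffled[:35], shuffled[35:]

# Step 2: fit the trend line to the Training Data only
slope, intercept = fit_line(training)

# Steps 3 and 4: compare the trend line with the Testing Data
test_error = mean_squared_error(testing, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test error={test_error:.2f}")
```

A low test error relative to the spread of the data suggests the trend line generalises beyond the points it was fitted to.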
How can you identify a ‘good’ mathematical model?
The model accurately predicts the Testing Data even when fitted to a small amount of Training Data
The fitted model looks similar even if you randomly pick different samples of Training Data
How can you identify a ‘bad’ mathematical model?
The model predicts the Testing Data poorly or inconsistently
The fitted model looks very different each time you use randomly selected Training Data
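One way to check the "looks similar on different samples" criterion is to refit the model on several random training samples and compare the fitted parameters. The sketch below assumes a straight-line model and made-up data; the helper fit_slope, the sample size of 10, and the 20 repetitions are illustrative choices only.

```python
import random

def fit_slope(points):
    """Least-squares slope of a straight line through the points (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

rng = random.Random(1)  # fixed seed so the run is repeatable
# Made-up data: y is roughly 3x plus random noise
data = [(x, 3 * x + rng.gauss(0, 2)) for x in range(100)]

# Refit the line on 20 different small random training samples
slopes = [fit_slope(rng.sample(data, 10)) for _ in range(20)]

# A small spread means the model looks similar whichever sample is used
spread = max(slopes) - min(slopes)
print(f"slopes range from {min(slopes):.2f} to {max(slopes):.2f} (spread {spread:.2f})")
```

If the spread were large compared with the slope itself, that would point towards the 'bad' model described above.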
Define ‘null hypothesis’
The null hypothesis is the theory that the relationship you are studying is not due to a real effect but was observed only because of random sampling.
Define ‘alternative hypothesis’
The opposite of the null hypothesis: the theory that the observed relationship is due to a real effect rather than random sampling.
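A permutation test is one common way to weigh the null hypothesis against the alternative. The sketch below is an illustration under assumptions: the two groups are made-up numbers (say, minutes on a shop for buyers and non-buyers), and 5000 shuffles is an arbitrary choice. Shuffling the group labels simulates a world where the null hypothesis is true, so we can count how often chance alone produces a difference as large as the observed one.

```python
import random

def mean(values):
    return sum(values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Made-up measurements for two groups (illustrative only)
group_a = [12.1, 9.8, 11.5, 13.0, 10.7, 12.4, 11.9, 10.2]
group_b = [8.9, 9.5, 10.1, 8.2, 9.0, 9.7, 8.6, 9.9]

observed = mean(group_a) - mean(group_b)

# Under the null hypothesis the group labels are meaningless,
# so shuffle the pooled values and re-split them many times
pooled = group_a + group_b
trials = 5000
as_extreme = 0
for _ in range(trials):
    rng.shuffle(pooled)
    diff = mean(pooled[:len(group_a)]) - mean(pooled[len(group_a):])
    if abs(diff) >= abs(observed):
        as_extreme += 1

# Share of shuffles that matched or beat the observed difference
p_value = as_extreme / trials
print(f"observed difference={observed:.2f} p-value={p_value:.4f}")
```

A small p-value means random sampling alone rarely produces such a difference, which counts as evidence against the null hypothesis and in favour of the alternative.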