Quiz 1 Flashcards
what are the steps of the data analysis pipeline
- Figure out the question.
- Find/get relevant data.
- Clean & prepare the data.
- Analyze the data.
- Interpret & present results.
why is full data analysis broken up into many steps
it is impractical to rerun the first few steps over and over, e.g. repeated API calls.
what does np.vectorize do
turns a function into a function that can operate on an entire array in an element by element fashion
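A minimal sketch of np.vectorize in action (the function and data here are made up for illustration):

import numpy as np

def sign_label(x):
    # written for a single number
    return 'pos' if x > 0 else 'non-pos'

sign_label_v = np.vectorize(sign_label)      # now works element by element on arrays
print(sign_label_v(np.array([-2, 0, 3])))    # ['non-pos' 'non-pos' 'pos']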
what is NumPy good for
storing and operating on arrays
what is pandas good for
pandas is good for manipulating and analyzing tabular data
what is a Series in pandas
a 1D labelled array, stored internally as a NumPy array; conceptually a single column
what is a DataFrame in pandas
a 2D table of data: a collection of Series (columns) that share a common index
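A small sketch of the relationship between the two (the column names and values are made up):

import pandas as pd

# a Series: a 1D labelled array (one column), backed by a NumPy array
population = pd.Series([5000, 12000, 800], index=['A', 'B', 'C'])

# a DataFrame: a 2D table; each column is a Series sharing the same index
city_data = pd.DataFrame({
    'population': [5000, 12000, 800],
    'area': [2.0, 9.5, 0.7],
}, index=['A', 'B', 'C'])
print(city_data['population'])   # selecting one column gives back a Series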
define the steps of extract-transform-load
extract: get the data you need
transform: fix the data, clean data, get it in a form you want to work with
load: load it into the next step of your pipeline
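A hedged pandas sketch of the three steps, assuming a hypothetical input file cities_raw.csv with population and area columns:

import pandas as pd

# extract: get the data you need (hypothetical input file)
raw = pd.read_csv('cities_raw.csv')

# transform: clean it and get it into the form you want to work with
cleaned = raw.dropna(subset=['population', 'area']).copy()
cleaned['density'] = cleaned['population'] / cleaned['area']

# load: hand it to the next step of the pipeline (here, save to disk)
cleaned.to_csv('cities_clean.csv', index=False)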
what are signal processing and filtering algorithms
signal processing uses filtering algorithms to remove noise from a signal
what does LOESS smoothing do
it's a technique to smooth a curve and remove noise
LOESS smoothing takes a local area of the data and fits a line to it. We have to decide how big this area is.
What happens if we pick a small area or a large area?
if small, we are more sensitive to noise
if large, we are less sensitive to signal changes
is LOESS better with lots of samples or sparse samples
better with more samples
true or false: LOESS’s parameters are y then x
true
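A minimal sketch using statsmodels' lowess (the data and the frac value are made up; frac is the "how big is the local area" decision):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.normal(0, 0.3, 200)   # noisy signal

# parameters are y first, then x; frac sets how big the local fitting area is
# (small frac: more sensitive to noise; large frac: less sensitive to real signal changes)
smoothed = lowess(y, x, frac=0.25)
# smoothed[:, 0] is x (sorted), smoothed[:, 1] is the smoothed y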
what is kalman filtering
it allows you to express what you know to predict the most likely value for the truth
what does the Kalman filter need
we need to give the variance of
1. our observations
2. our predictions
and the covariance between each pair
we also need matrices for both our observations and our predictions
the covariance matrices express our uncertainty in the measurements and predictions
in your observation_covariance matrix, which expresses error in the observations, what do lower and higher values mean
lower values: less sensor error is assumed, so the observations/measurements have more of an effect on the result
higher values: more sensor noise is assumed, so the observations have less of an effect on the result
the transition_covariance says what you think about the error in your prediction. what do lower and higher values mean
lower: less prediction error is assumed, so the prediction affects the result more (less noise)
higher: the prediction is assumed to be less accurate, so it affects the result less
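A minimal sketch of these pieces put together, assuming the pykalman library; the sensor readings and the covariance values are made up for illustration:

import numpy as np
from pykalman import KalmanFilter

observations = np.array([1.1, 1.3, 1.2, 1.6, 1.5, 1.9, 2.1])   # noisy 1D readings

kf = KalmanFilter(
    initial_state_mean=observations[0],
    observation_covariance=0.5 ** 2,   # error we expect in the measurements
    transition_covariance=0.1 ** 2,    # error we expect in our prediction
    transition_matrices=[[1]],         # prediction: next value = current value
    observation_matrices=[[1]],        # the sensor measures the state directly
)
smoothed_means, _ = kf.smooth(observations)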
what does it mean to impute data
replacing missing values (or deleted outliers) with plausible, calculated values
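A tiny sketch of one way to impute, filling missing values with the observed mean (the data is made up; interpolation or model-based values are other common choices):

import numpy as np
import pandas as pd

temps = pd.Series([18.0, np.nan, 21.0, 22.5, np.nan])
temps_imputed = temps.fillna(temps.mean())   # fill the gaps with the mean of the observed values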
what is entity resolution or record linkage
the process of finding multiple records that actually refer to the same entity
what is the difference between
city_data = city_data[city_data['area'] <= 10000]
city_data = city_data['area'] <= 10000
city_data = city_data[city_data['area'] <= 10000]
keeps only the rows where area is <= 10000 (boolean indexing filters the DataFrame)
city_data = city_data['area'] <= 10000
replaces city_data with a single boolean Series of True/False values indicating whether each row's area is <= 10000
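A small runnable sketch of the difference (the data is made up):

import pandas as pd

city_data = pd.DataFrame({'name': ['A', 'B', 'C'],
                          'area': [5000, 20000, 8000]})

mask = city_data['area'] <= 10000   # a boolean Series: True, False, True
filtered = city_data[mask]          # keeps only the rows for A and C
# assigning the mask itself back to city_data would discard the table
# and leave just the True/False column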
how do you write sums in numpy and sums in pandas
numpy:
np.sum(totals, axis=x)
pandas:
totals.sum(axis=x)
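A quick sketch of the two spellings on a made-up DataFrame (axis=0 sums down each column, axis=1 sums across each row):

import numpy as np
import pandas as pd

totals = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

totals.sum(axis=0)              # pandas: sum down each column -> x=4, y=6
np.sum(totals.values, axis=1)   # numpy: sum across each row -> [3, 7]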
you have a df called counts, and you have a column in the dataframe called 'date'. make it a datetime column
counts['date'] = pd.to_datetime(counts['date'])