Simplilearn Data Analyst Questions Flashcards
What is data mining?
process of finding new, relevant information; it takes raw data and transforms it into valuable information
What is data profiling?
process of assessing a dataset for uniqueness, consistency, and logic; it usually doesn’t involve identifying incorrect/inaccurate data values
What is data wrangling?
the process of cleaning, structuring, and enriching raw data into a desired, usable format for better decision-making
What is a simple process of data wrangling?
discover -> structure -> clean -> enrich -> validate -> analyze
Data Wrangling vs Data Cooking?
Data cooking involves falsifying data or selectively deleting data to improve a hypothesis
An example is demographic data being manipulated by fieldworkers, researchers, etc. to support behavioral science theory.
What are common problems data analysts encounter during analysis?
1) handling duplicate/missing values
2) collecting meaningful, correct data at the right time
3) making data secure
4) dealing with compliance issues (ensuring that sensitive data is organized and meets enterprise business rules and legal/govt regulations)
5) handling data purging (freeing up database space) and storage issues
What are some steps in the analytics project?
1) state/understand the problem
2) collect data
3) clean data
4) explore and analyze the data
5) interpret results
What are some technical tools used for analysis and presentation?
MS SQL Server, MySQL, MS Excel, IBM SPSS, Tableau, Python, MS PowerPoint
What are some best practices for data cleaning?
Make a data cleaning plan by understanding where common errors happen and keep communications open;
Identify and remove duplicates before working with data;
Focus on accuracy, maintain value types of data, provide mandatory constraints, and set cross-field validation;
Standardize the data at the point of entry so that there’s less chaos and fewer errors occur
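A minimal Python sketch of three of the practices above: standardizing values, removing duplicates, and cross-field validation. The records, the field names (name, email, start, end), and the helper functions are all hypothetical and only illustrate the idea.

```python
from datetime import date

# Hypothetical raw records; the field names are made up for illustration.
raw_records = [
    {"name": " Alice ", "email": "ALICE@EXAMPLE.COM",
     "start": date(2021, 1, 5), "end": date(2021, 3, 1)},
    {"name": "alice", "email": "alice@example.com",
     "start": date(2021, 1, 5), "end": date(2021, 3, 1)},
    {"name": "Bob", "email": "bob@example.com",
     "start": date(2021, 6, 1), "end": date(2021, 2, 1)},
]

def standardize(record):
    """Trim whitespace and normalize case so equivalent entries compare equal."""
    cleaned = dict(record)
    cleaned["name"] = record["name"].strip().title()
    cleaned["email"] = record["email"].strip().lower()
    return cleaned

def remove_duplicates(records, key="email"):
    """Keep only the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

def is_valid(record):
    """Cross-field validation: the end date must not precede the start date."""
    return record["start"] <= record["end"]

standardized = [standardize(r) for r in raw_records]
deduplicated = remove_duplicates(standardized)
valid_records = [r for r in deduplicated if is_valid(r)]
rejected_records = [r for r in deduplicated if not is_valid(r)]
```

Standardizing before deduplicating matters here: the two Alice rows only match once case and whitespace are normalized. Bob's row survives deduplication but fails the cross-field check because its end date is earlier than its start date.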
How can you handle missing values in a dataset?
Listwise deletion: an entire record is excluded from analysis if any single value is missing
Average imputation: Use the average value of the responses from other participants to replace missing values
Regression substitution: use multiple-regression analysis to estimate a missing value
Multiple imputation: create several plausible values for each missing entry based on correlations in the data, incorporating random errors in the predictions, and then average the results across the simulated datasets
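The first three approaches above can be sketched in plain Python. The dataset and the column names (hours, score) are hypothetical, and the regression step uses a single predictor for brevity, which is a simplification of the multiple-regression approach. Multiple imputation is not shown.

```python
from statistics import mean

# Hypothetical dataset: one record is missing its "score" value.
rows = [
    {"hours": 1, "score": 2.0},
    {"hours": 2, "score": 4.0},
    {"hours": 3, "score": None},
    {"hours": 4, "score": 8.0},
]

# Listwise deletion: drop any record that has a missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Average imputation: replace the missing value with the mean of observed values.
score_mean = mean(r["score"] for r in complete_rows)
mean_imputed = [
    dict(r, score=score_mean if r["score"] is None else r["score"])
    for r in rows
]

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Regression substitution (simplified): predict the missing score from hours.
slope, intercept = fit_line(
    [r["hours"] for r in complete_rows],
    [r["score"] for r in complete_rows],
)
regression_imputed = [
    dict(r, score=intercept + slope * r["hours"] if r["score"] is None else r["score"])
    for r in rows
]
```

In this toy data the observed scores follow score = 2 * hours exactly, so regression substitution fills in a value close to 6, while average imputation fills in the mean of 2, 4, and 8 regardless of the hours value.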
What is a normal distribution?
It is a type of continuous probability distribution that is symmetric about a mean and appears as a bell curve.
mean = median = mode and they are located at the center of the graph
about 68% of the data lies within 1 standard deviation (std) of the mean
about 95% of the data falls within 2 std of the mean
about 99.7% of the data falls within 3 std of the mean
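The three percentages above can be checked with Python's standard library. This is a small sketch using statistics.NormalDist (available since Python 3.8); the helper name share_within is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} std: {share:.4f}")
```

The printed shares are approximately 0.6827, 0.9545, and 0.9973, which is where the rounded 68%, 95%, and 99.7% figures come from.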
What is time series analysis?
Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals
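As a small illustration of the definition above, the sketch below checks that a series is equally spaced and then smooths it with a moving average. The moving average is just one common technique, and the data and function names are hypothetical.

```python
from datetime import date

# Hypothetical daily observations as (date, value) pairs.
series = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 2), 12.0),
    (date(2024, 1, 3), 11.0),
    (date(2024, 1, 4), 15.0),
    (date(2024, 1, 5), 14.0),
]

def is_equally_spaced(observations):
    """True when every gap between consecutive timestamps is the same."""
    gaps = {
        later[0] - earlier[0]
        for earlier, later in zip(observations, observations[1:])
    }
    return len(gaps) <= 1

def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

values = [value for _, value in series]
smoothed = moving_average(values, window=3)
```

With a window of 3, the five observations produce three smoothed values, the first being the average of 10, 12, and 11.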
In Tableau, what differs between joining and blending?
Data joining can only be done when the data comes from the same source. So to combine two tables, the tables must be from the same database, or be two or more sheets from the same Excel file.
Meanwhile, data blending is used when data is from 2/more different sources. An example would be combining an Oracle table with SQL Server or combining an Excel sheet and Oracle table.
In data joining, all combined tables/sheets contain a common set of dimensions/measures. In data blending, on the other hand, each data source contains its own set of dimensions/measures.
Overfitting vs Underfitting
Overfitting: A model fits the training set too closely, so its performance drops significantly on the test/validation set. This happens when the model learns the noise and random fluctuations in the training data instead of the underlying pattern.
Underfitting: A model can neither fit the training data well nor generalize to new data, so it performs poorly on both training and testing data. This happens when there is too little data to build an accurate model or when the model does not suit the data (e.g., using a linear model on non-linear data).
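The two failure modes above can be seen by comparing training and test error for models of different flexibility. The sketch below is a toy illustration, not a standard benchmark. The data is hypothetical and hand-picked (true trend y = 2x with alternating +/-2 noise hard-coded so the result is deterministic), and the three models (a constant, a least-squares line, and a nearest-point "memorizer") are assumptions chosen for the demonstration.

```python
# Hypothetical data: true trend y = 2x, with hard-coded alternating +/-2 noise.
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_y = [2, 0, 6, 4, 10, 8, 14, 12]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4, 5.4, 6.4]
test_y = [-1.2, 4.8, 2.8, 8.8, 6.8, 12.8, 10.8]

def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_constant(xs, ys):
    """Too simple: always predicts the training mean (tends to underfit)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares line: matches the true linear trend."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """Too flexible: returns the y of the nearest training x, noise included."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

fitters = {"constant": fit_constant, "line": fit_line, "memorizer": fit_memorizer}
errors = {}  # name -> (training MSE, test MSE)
for name, fit in fitters.items():
    model = fit(train_x, train_y)
    errors[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))

for name, (train_error, test_error) in errors.items():
    print(f"{name:>9}: train MSE = {train_error:.2f}, test MSE = {test_error:.2f}")
```

The constant model has high error on both sets (underfitting). The memorizer has zero training error but a much larger test error than the line (overfitting). The line has similar, moderate error on both sets.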
In MS Excel, a numeric value can be treated as a text value if it is preceded by an…
Apostrophe (')