Simplilearn Data Analyst Questions Flashcards
What is data mining?
process of finding new, relevant information; it takes raw data and transforms it into valuable information
What is data profiling?
process of assessing a dataset for uniqueness, consistency, and logic; it usually doesn’t involving identifying incorrect/inaccurate data
What is data wrangling?
the process of cleaning, structuring, and enriching raw data into a desired, usable format for better decision-making
What is a simple process of data wrangling?
discover –> structure –> clean –> enrich –> validate –> analyze
Data Wrangling vs Data Cooking?
Data cooking involves falsifying data or selectively deleting data to improve a hypothesis
An example is demographic data being manipulated by fieldworkers, researchers, etc. to support behavioral science theory.
What are common problems data analysts encounter during analysis?
1) handling duplicate/missing values
2) collecting meaningful, correct data at the right time
3) making data secure
4) dealing with compliance issues (ensuring that sensitive data is organized and meets enterprise business rules and legal/govt regulations)
5) handling data purging (freeing up database space) and storage issues
What are some steps in the analytics project?
1) state/understand the problem
2) collect data
3) clean data
4) explore and analyze the data
5) interpret results
What are some technical tools used for analysis and presentation?
MS SQL Server, MySQL, MS Excel, IBM SPSS, Tableau, Python, MS PowerPoint
What are some best practices for data cleaning?
Make a data cleaning plan by understanding where common errors happen and keep communications open;
Identify and remove duplicates before working with data;
Focus on accuracy, maintain value types of data, provide mandatory constraints, and set cross-field validation;
Standardize the data at the point of entry so that there’s less chaos and fewer errors occur
How can you handle missing values in a dataset?
Listwise deletion: an entire record is excluded from analysis if any single value is missing
Average imputation: Use the average value of the responses from other participants to replace missing values
Regression substitution: use multiple-regression analysis to estimate a missing value
Multiple imputation: create plausible values based on correlations for missing data and then averages the simulated datasets by incorporating random errors in predictions
What is a normal distribution?
It is a type of continuous probability distribution that is symmetric about a mean and appears as a bell curve.
mean = median = mode and they are located at the center of the graph
68% of data lies within 1 std away from the mean/avg
95% data falls within 2 std away from mean/avg
99.7% data falls within 3 std away from mean/avg
What is time series analysis?
Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals
In Tableau, what differs between joining and blending?
Data joining can only be done when data comes from the same source. So to combine two tables, the tables must be from the same databases or two/more sheets from the same Excel file.
Meanwhile, data blending is used when data is from 2/more different sources. An example would be combining an Oracle table with SQL Server or combining an Excel sheet and Oracle table.
In data joining, all combined tables/sheets contains a common set of dimensions/measures. On the other hand, data blending has each data source contain its own set of dimensions/measures.
Overfitting vs Underfitting
Overfitting: A model trains the data too well using the training set, causing significant performance drops over the test/validation set. This happens when the model understands noise and random fluctuations too well and over-specifies.
Underfitting: A model is not able to train data well or generalize new data. It performs poorly on both training and testing data. This happens when there is less data to build an accurate model or if the model does not suit the data (e.g.: using a linear model on non-linear data).
In MS Excel, a numeric value can be treated as a text value if it is preceded by an…
Apostrophe (‘)
In Excel, what is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF?
COUNT: returns the number of numeric cells in a range
COUNTA: returns the number of non-blank cells in a range
COUNTBLANK: returns the number of blank cells in a range
COUNTIF: returns the number of values by checking a given condition
How does the function VLOOKUP work in Excel?
Can be used to find things in a table or a range by row.
VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])
lookup_value: value you want to look up
table_array: range where lookup_value is located
col_index_num: column number in the range that contains the return value
range_lookup: specify TRUE if you want to approximate match or FALSE if you want an exact match of the return value
In SQL, how do you subset or filter data?
Use WHERE and HAVING clauses
Ex1: select * from MOVIES where Director = ‘Brad Bird’;
Ex2: select Director, sum(Duration) as total_duration, avg(Duration) as avg_duration from Movies group by Director having avg(Duration)>151;
In SQL what is the difference between WHERE and HAVING clauses?
the WHERE clause: works on row data, filter occurs before any groupings; aggregate functions cannot be used
SELECT col1, col2,…
FROM table_name
WHERE condition;
Meanwhile…
the HAVING clause: works on aggregated data; used to filter values from a group; allows aggregate functions
SELECT col1, col2, ... FROM table_name WHERE condition GROUP BY column_names HAVING condition ORDER BY column_names
How do you use the reshape() function in NumPy?
reshape(array, shape)
Ex1:
import numpy as np
a = np.array([[1,2,3,4,5],[6,7,8,9,10]])
np.reshape(a, (2,5))
What are different ways to create a dataframe in Pandas?
import pandas as pd…
1) by initializing a list
data = [[‘tom’, 30], [‘jerry’, 20], [‘angela’, 35]]
df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])
2) by initializing a dictionary
data = {‘Name’:[‘Tom’, ‘Jerry’, ‘Angela’], ‘Age’: [20, 35, 19]}
df = pd.DataFrame(data)
Suppose you have an “emp.csv” file. Using Python, how can you create an employees dataframe and display the head and summary?
import pandas as pd
employees = pd.read_csv(“File_path”)
employees. head()
employees. info()
In Python, how do you select specific columns from a dataframe?
for one column: df[col_name]
for two or more columns: df[[col1, col2,…]]
What criteria entails that a developed data model is good or bad?
Good models are…
intuitive, insightful, and self-explanatory
easily consumed by the clients for actionable and profitable results
easily adapted to changes according to business requirements
scale according to new data when data gets updated
Why is Exploratory Data Analysis so important?
EDA helps understand data better.
Helps obtain confidence in data, giving reassurance to go ahead and engage in developing a ML algorithm
Allows you to refine your selection of feature variables for later model building
Allows you discover hidden trends and insights
How do you treat outliers?
drop, cap your outliers, assign new values, or try a new transformation (as last resort)
Descriptive vs predictive vs prescriptive (analytics)?
Descriptive: provides insights into past/history to answer “What has happened?”; uses data aggregation and mining techniques
Predictive: understands the futur and answers “What could happen?”; uses statistical models and forecasting techniques
Prescriptive: suggests various courses of action to answer “What should you do?”; uses optimization and simulation algorithms to advise possible outcomes
What are different types of sampling techniques?
Sampling = statistical method that selects a subset of data from an entire dataset (population) to estimate the characteristics of the population
simple random sampling
systematic sampling
cluster sampling
stratified sampling
judgmental/purposive sampling
What are the different types of hypothesis testing?
null hypothesis: states there’s no relation between the predictor and outcome variables in the population; denoted by H0
alternative hypothesis: states there’s some relation between the predictor and outcome variables in the population; denoted by H1