Simplilearn Data Analyst Questions Flashcards by Nafisa Sabah

What is data mining?

process of finding new, relevant information; it takes raw data and transforms it into valuable information

How well did you know this?

Not at all

Perfectly

What is data profiling?

process of assessing a dataset for uniqueness, consistency, and logic; it usually doesn’t involving identifying incorrect/inaccurate data

How well did you know this?

Not at all

Perfectly

What is data wrangling?

the process of cleaning, structuring, and enriching raw data into a desired, usable format for better decision-making

How well did you know this?

Not at all

Perfectly

What is a simple process of data wrangling?

discover –> structure –> clean –> enrich –> validate –> analyze

How well did you know this?

Not at all

Perfectly

Data Wrangling vs Data Cooking?

Data cooking involves falsifying data or selectively deleting data to improve a hypothesis

An example is demographic data being manipulated by fieldworkers, researchers, etc. to support behavioral science theory.

How well did you know this?

Not at all

Perfectly

What are common problems data analysts encounter during analysis?

1) handling duplicate/missing values
2) collecting meaningful, correct data at the right time
3) making data secure
4) dealing with compliance issues (ensuring that sensitive data is organized and meets enterprise business rules and legal/govt regulations)
5) handling data purging (freeing up database space) and storage issues

How well did you know this?

Not at all

Perfectly

What are some steps in the analytics project?

1) state/understand the problem
2) collect data
3) clean data
4) explore and analyze the data
5) interpret results

How well did you know this?

Not at all

Perfectly

What are some technical tools used for analysis and presentation?

MS SQL Server, MySQL, MS Excel, IBM SPSS, Tableau, Python, MS PowerPoint

How well did you know this?

Not at all

Perfectly

What are some best practices for data cleaning?

Make a data cleaning plan by understanding where common errors happen and keep communications open;

Identify and remove duplicates before working with data;

Focus on accuracy, maintain value types of data, provide mandatory constraints, and set cross-field validation;

Standardize the data at the point of entry so that there’s less chaos and fewer errors occur

How well did you know this?

Not at all

Perfectly

How can you handle missing values in a dataset?

Listwise deletion: an entire record is excluded from analysis if any single value is missing

Average imputation: Use the average value of the responses from other participants to replace missing values

Regression substitution: use multiple-regression analysis to estimate a missing value

Multiple imputation: create plausible values based on correlations for missing data and then averages the simulated datasets by incorporating random errors in predictions

How well did you know this?

Not at all

Perfectly

What is a normal distribution?

It is a type of continuous probability distribution that is symmetric about a mean and appears as a bell curve.

mean = median = mode and they are located at the center of the graph

68% of data lies within 1 std away from the mean/avg

95% data falls within 2 std away from mean/avg

99.7% data falls within 3 std away from mean/avg

How well did you know this?

Not at all

Perfectly

What is time series analysis?

Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals

How well did you know this?

Not at all

Perfectly

In Tableau, what differs between joining and blending?

Data joining can only be done when data comes from the same source. So to combine two tables, the tables must be from the same databases or two/more sheets from the same Excel file.

Meanwhile, data blending is used when data is from 2/more different sources. An example would be combining an Oracle table with SQL Server or combining an Excel sheet and Oracle table.

In data joining, all combined tables/sheets contains a common set of dimensions/measures. On the other hand, data blending has each data source contain its own set of dimensions/measures.

How well did you know this?

Not at all

Perfectly

Overfitting vs Underfitting

Overfitting: A model trains the data too well using the training set, causing significant performance drops over the test/validation set. This happens when the model understands noise and random fluctuations too well and over-specifies.

Underfitting: A model is not able to train data well or generalize new data. It performs poorly on both training and testing data. This happens when there is less data to build an accurate model or if the model does not suit the data (e.g.: using a linear model on non-linear data).

How well did you know this?

Not at all

Perfectly

In MS Excel, a numeric value can be treated as a text value if it is preceded by an…

Apostrophe (‘)

How well did you know this?

Not at all

Perfectly

In Excel, what is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF?

Study These Flashcards

COUNT: returns the number of numeric cells in a range

COUNTA: returns the number of non-blank cells in a range

COUNTBLANK: returns the number of blank cells in a range

COUNTIF: returns the number of values by checking a given condition

How does the function VLOOKUP work in Excel?

Study These Flashcards

Can be used to find things in a table or a range by row.

VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

lookup_value: value you want to look up

table_array: range where lookup_value is located

col_index_num: column number in the range that contains the return value

range_lookup: specify TRUE if you want to approximate match or FALSE if you want an exact match of the return value

In SQL, how do you subset or filter data?

Study These Flashcards

Use WHERE and HAVING clauses

Ex1: select * from MOVIES where Director = ‘Brad Bird’;

Ex2: 
select Director, 
sum(Duration) as total_duration,
avg(Duration) as avg_duration
from Movies
group by Director
having avg(Duration)>151;

In SQL what is the difference between WHERE and HAVING clauses?

Study These Flashcards

the WHERE clause: works on row data, filter occurs before any groupings; aggregate functions cannot be used

SELECT col1, col2,…
FROM table_name
WHERE condition;

Meanwhile…

the HAVING clause: works on aggregated data; used to filter values from a group; allows aggregate functions

SELECT col1, col2, ...
FROM table_name
WHERE condition
GROUP BY column_names
HAVING condition
ORDER BY column_names

How do you use the reshape() function in NumPy?

Study These Flashcards

reshape(array, shape)

Ex1:
import numpy as np
a = np.array([[1,2,3,4,5],[6,7,8,9,10]])
np.reshape(a, (2,5))

What are different ways to create a dataframe in Pandas?

Study These Flashcards

import pandas as pd…

1) by initializing a list
data = [[‘tom’, 30], [‘jerry’, 20], [‘angela’, 35]]
df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])

2) by initializing a dictionary
data = {‘Name’:[‘Tom’, ‘Jerry’, ‘Angela’], ‘Age’: [20, 35, 19]}
df = pd.DataFrame(data)

Suppose you have an “emp.csv” file. Using Python, how can you create an employees dataframe and display the head and summary?

Study These Flashcards

import pandas as pd

employees = pd.read_csv(“File_path”)

employees. head()
employees. info()

In Python, how do you select specific columns from a dataframe?

Study These Flashcards

for one column: df[col_name]

for two or more columns: df[[col1, col2,…]]

What criteria entails that a developed data model is good or bad?

Study These Flashcards

Good models are…

intuitive, insightful, and self-explanatory

easily consumed by the clients for actionable and profitable results

easily adapted to changes according to business requirements

scale according to new data when data gets updated

Why is Exploratory Data Analysis so important?

EDA helps understand data better. Helps obtain confidence in data, giving reassurance to go ahead and engage in developing a ML algorithm Allows you to refine your selection of feature variables for later model building Allows you discover hidden trends and insights

How do you treat outliers?

drop, cap your outliers, assign new values, or try a new transformation (as last resort)

Descriptive vs predictive vs prescriptive (analytics)?

Descriptive: provides insights into past/history to answer "What has happened?"; uses data aggregation and mining techniques Predictive: understands the futur and answers "What could happen?"; uses statistical models and forecasting techniques Prescriptive: suggests various courses of action to answer "What should you do?"; uses optimization and simulation algorithms to advise possible outcomes

What are different types of sampling techniques?

Sampling = statistical method that selects a subset of data from an entire dataset (population) to estimate the characteristics of the population simple random sampling systematic sampling cluster sampling stratified sampling judgmental/purposive sampling

What are the different types of hypothesis testing?

null hypothesis: states there's no relation between the predictor and outcome variables in the population; denoted by H0 alternative hypothesis: states there's some relation between the predictor and outcome variables in the population; denoted by H1

Simplilearn Data Analyst Questions Flashcards

(29 cards)