Compiled Summatives - Sheet1 Flashcards
What is the primary focus of statistics?
Predictive modeling
Data mining
Application of algorithms to inform strategic decisions
Collection, analysis, interpretation, presentation, and organization of data
Collection, analysis, interpretation, presentation, and organization of data
Which of the following methods is commonly used in statistics to understand data distributions and relationships?
Algorithm application
Data mining
Hypothesis testing and regression analysis
Predictive modeling
Hypothesis testing and regression analysis
What does analytics emphasize in addition to statistical methods?
Data presentation
Data interpretation
Predictive modeling and data mining
Data collection
Predictive modeling and data mining
Which of the following best describes the scope of analytics?
Integrates statistical methods with advanced computational techniques
Focuses solely on hypothesis testing
Limited to data collection and presentation
Only involves data organization
Integrates statistical methods with advanced computational techniques
What is the first step in the data analysis process?
Get actionable information
Extract patterns
Prepare data
Apply machine learning techniques
Prepare data
Which of the following is not listed as a data source from the chart?
Printed Books
Email
Social Media Posts
Audio
Printed Books
What does the second step of the process involve?
Finding patterns using algorithms
Making decisions based on information
Collecting raw information
Cleaning and transforming databases
Finding patterns using algorithms
In which step would you apply machine learning techniques according to this flowchart?
Step 2 - Extract Patterns
None of the above steps explicitly mention applying machine learning techniques
Step 3 - Get Actionable Information
Step 1 - Prepare Data
Step 2 - Extract Patterns
What outcome does this flowchart suggest as a result of following these steps?
Creation of new databases
Learning how to code in various programming languages
Development of new software programs
Gaining insights or making informed decisions based on analyzed data
Gaining insights or making informed decisions based on analyzed data
What does transactional data primarily consist of?
Visual representations of data
General summaries of transactions
Structured, detailed information
Unstructured and random information
Structured, detailed information
Which of the following is an example of transactional data?
Credit card payment
Social media posts
Weather forecasts
Movie reviews
Credit card payment
What type of information is included in contractual, subscription, or account data?
Social media interactions
General market trends
Information about the type of product combined with customer characteristics
Weather patterns
Information about the type of product combined with customer characteristics
Which of the following is an example of a product type mentioned in the statement?
Loan
Weather forecast
Movie review
Social media post
Loan
What is the primary aim of surveys?
To extract sociodemographic and behavioral data from a particular group of people
To organize social events for communities
To entertain a particular group of people
To provide financial assistance to people
To extract sociodemographic and behavioral data from a particular group of people
Surveys are typically in the form of:
Novels
Music albums
Questionnaires
Art exhibitions
Questionnaires
Which of the following is NOT an example of unstructured data?
Social media posts
Media files
Sensor data
Spreadsheets
Spreadsheets
What is unstructured data?
Information that resides in a traditional row-column database
Data that is always textual
Data that is always numerical
Information that does not reside in a traditional row-column database
Information that does not reside in a traditional row-column database
Which of the following is an example of a purpose for which data poolers gather data?
Marketing and credit risk assessment
Weather forecasting
Event planning
Cooking recipes
Marketing and credit risk assessment
What is the primary role of data poolers?
To provide financial advice
To gather and sell data for specific purposes
To develop software applications
To create new databases
To gather and sell data for specific purposes
What is the first phase in the data analytics process?
Business Understanding
Modelling
Data Preparation
Evaluation
Business Understanding
What is the primary goal of the Business Understanding phase?
Cleaning data for better quality
Identifying business tasks
Evaluating the model
Applying machine learning algorithms
Identifying business tasks
Which phase involves selecting related data from various databases?
Data Understanding
Deployment
Data Preparation
Modelling
Data Understanding
Which of the following is NOT a type of database mentioned in the Data Understanding phase?
Relational Databases
Temporal, Sequence or Time-Series Database
Social Media Databases
Data Warehouses
Social Media Databases
What is another term for Data Preparation?
Data Modelling
Data Preprocessing
Data Transformation
Data Cleaning
Data Preprocessing
Which of the following activities is NOT part of Data Preparation?
Aggregating data
Filling in missing values
Applying machine learning algorithms
Filtering outliers
Applying machine learning algorithms
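The Data Preparation activities named in these cards (filling in missing values, filtering outliers) can be sketched in plain Python. This is only an illustration: the helper names `fill_missing` and `filter_outliers`, the choice of the mean as the fill value, and the 1.5 x IQR outlier rule are assumptions, not something the deck specifies.

```python
import statistics


def fill_missing(values):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    present = [v for v in values if v is not None]
    fill = statistics.mean(present)
    return [fill if v is None else v for v in values]


def filter_outliers(values, k=1.5):
    """Keep only values within k * IQR of the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical readings: one missing value, then one extreme value.
filled = fill_missing([1, None, 3])
cleaned = filter_outliers([10, 12, 11, 13, 12, 100])
print(filled, cleaned)
```

Applying a learning algorithm is deliberately absent here; per the card, that belongs to the Modelling phase.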
What does Data Transformation involve?
Converting different measurements into a unified numerical scale
Evaluating the model
Selecting related data from databases
Cleaning data for better quality
Converting different measurements into a unified numerical scale
Which of the following is an example of categorical values?
Filtered data
Numerical scales
Ordinal values (less, moderate, strong)
Aggregated data
Ordinal values (less, moderate, strong)
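Data Transformation as described above (bringing different measurements onto one numerical scale, and handling ordinal categories) might look like the sketch below. Min-max scaling and the integer codes 0, 1, 2 for "less", "moderate", "strong" are assumed choices for illustration; the deck does not name a specific method.

```python
def min_max_scale(values):
    """Rescale numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Assumed encoding that preserves the order less < moderate < strong.
ORDINAL_LEVELS = {"less": 0, "moderate": 1, "strong": 2}


def encode_ordinal(labels):
    """Map ordinal labels to integers that keep their ranking."""
    return [ORDINAL_LEVELS[label] for label in labels]


scaled = min_max_scale([10, 20, 30])
encoded = encode_ordinal(["strong", "less", "moderate"])
print(scaled, encoded)
```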
What is the primary focus of the Modelling phase?
Applying statistical and machine learning algorithms
Identifying business tasks
Selecting related data
Cleaning data
Applying statistical and machine learning algorithms
Which phase involves evaluating the performance of the model?
Deployment
Data Preparation
Business Understanding
Evaluation
Evaluation
What is the final phase in the data analytics process?
Modelling
Deployment
Evaluation
Data Understanding
Deployment
Which activity is part of the Data Preparation phase?
Identifying relevant data for the problem description
Evaluating the model
Applying machine learning algorithms
Filtering outliers and redundancies
Filtering outliers and redundancies
What type of data can be found in a Temporal, Sequence or Time-Series Database?
Static data
Aggregated data
Time-based data
Categorical data
Time-based data
Which phase involves selecting the related data from many available databases to correctly describe a given business task?
Data Understanding
Evaluation
Data Preparation
Modelling
Data Understanding
What is the definition of Mean?
The range of values in a dataset
The average value of a dataset
The middle value in a dataset
The most frequently occurring value in a dataset
The average value of a dataset
How is the Mean calculated?
By identifying the most frequent value
By summing all values and dividing by the number of values
By subtracting the smallest value from the largest value
By finding the middle value
By summing all values and dividing by the number of values
What does the Median represent?
The most frequently occurring value in a dataset
The middle value when arranged in order
The difference between the highest and lowest values
The average value of a dataset
The middle value when arranged in order
Which measure of central tendency can have multiple values?
Median
Mean
Range
Mode
Mode
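The three measures of central tendency can be checked with Python's standard `statistics` module. The data set below is hypothetical and was picked so that two values tie for most frequent, which shows why the mode can have multiple values.

```python
import statistics

# Hypothetical data: 3 and 7 each occur twice, so there are two modes.
values = [2, 3, 3, 5, 7, 7, 8]

mean = statistics.mean(values)        # sum of all values / number of values
median = statistics.median(values)    # middle value of the sorted data
modes = statistics.multimode(values)  # every value tied for the highest count

print(mean, median, modes)
```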
What is the primary purpose of measures of central tendency?
Measuring dispersion
Solving equations
Calculating probability
Organizing, summarizing, and visualizing data
Organizing, summarizing, and visualizing data
Formula for mean of population data
Formula for mean of sample data
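The two cards above do not state the formulas themselves. The usual textbook forms, given here as an assumption about what the deck intends, are:

```latex
% Population mean: N is the population size
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i

% Sample mean: n is the sample size
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

Both divide the sum of the values by the number of values; only the notation differs between population and sample.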
What is the midrange of the data set 11, 13, 4, 30, 9, 15?
15
17
16
18
17
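The midrange answer can be verified directly, assuming the standard definition of midrange as the average of the smallest and largest values:

```python
data = [11, 13, 4, 30, 9, 15]

# Midrange = (minimum + maximum) / 2
midrange = (min(data) + max(data)) / 2

print(midrange)
```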
Median Formula for Grouped Datasets
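The card does not state the formula. A common textbook form, given here as an assumption, is:

```latex
\text{Median} = L + \left( \frac{\tfrac{N}{2} - cf}{f} \right) h
```

Here L is the lower boundary of the median class, N the total frequency, cf the cumulative frequency of the classes before the median class, f the frequency of the median class, and h the class width.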
What does the SUM function do?
Calculates the mean value of a dataset
Adds a range of cells
Returns the median value of a dataset
Returns the maximum value of a dataset
Adds a range of cells
Which function would you use to calculate the arithmetic average of a range of cells?
SUMIF
AVERAGE
MEDIAN
MAX
AVERAGE
For finding the smallest value in your data set, which function will you use?
AVERAGE
MIN
SUMIF
MAX
MIN
Given the class boundaries 50-60, 60-70, 70-80, 80-90, and 90-100 with frequencies 5, 12, 9, 6, and 4 respectively, what is the total frequency (N)?
30
36
45
40
36
What is the midpoint (d_i) of the class 70-80?
70
80
65
75
75
Calculate the product of the midpoint and frequency for the class 80-90.
600
570
540
510
510
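The three grouped-data answers above (N, the class midpoint, and the midpoint-frequency product) can be reproduced in a few lines. The final step, dividing the sum of products by N to get the grouped mean, is an extra not asked for on the cards; it is included on the assumption that this is where the calculation is heading.

```python
# Class boundaries and frequencies exactly as given on the card.
classes = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
frequencies = [5, 12, 9, 6, 4]

total_n = sum(frequencies)                                 # total frequency N
midpoints = [(low + high) / 2 for low, high in classes]    # class midpoints
products = [f * m for f, m in zip(frequencies, midpoints)] # frequency * midpoint

# Grouped mean (assumed next step): sum of products divided by N.
grouped_mean = sum(products) / total_n

print(total_n, midpoints[2], products[3], round(grouped_mean, 2))
```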
Match each question to its corresponding statistical method.
Which factors move together?
Are there differences in distribution?
Are two populations similar?
Correlation Coefficient
Categorical Distribution
Analysis of Variance (ANOVA, F-Test)
What branch of statistics involves using sample data to make conclusions or predictions about a larger population?
Inferential Statistics
Descriptive Statistics
Non-parametric Statistics
Bayesian Statistics
Inferential Statistics
Which method measures the linear relationship between two numerical variables?
Pearson Correlation Coefficient
Chi-Square Test
ANOVA
T-test
Pearson Correlation Coefficient
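A minimal sketch of the Pearson correlation coefficient, using the standard definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: a perfectly linear increasing and decreasing relationship.
r_up = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_down = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(r_up, r_down)
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.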
What does the F-Test in ANOVA compare?
Variances within and between groups
Means of two samples
Medians of two samples
Standard deviations of two samples
Variances within and between groups
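To make "variances within and between groups" concrete, here is a sketch of the one-way ANOVA F statistic computed by hand. The function name `one_way_f` and the three groups are hypothetical; the formula is the standard ratio of the between-group mean square to the within-group mean square.

```python
def one_way_f(groups):
    """F statistic for one-way ANOVA: MS_between / MS_within."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical groups with means 2, 3, and 6.
f_stat = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F means the group means differ by more than the spread inside the groups would suggest.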
Which technique is NOT used for hypothesis testing?
Predictive Modeling
Z-test
T-test
Chi-Square Test
Predictive Modeling
What does the significance level (α) indicate in hypothesis testing?
The probability of rejecting the null hypothesis when it is true.
The maximum allowed sample size.
The minimum sample size for an accurate test.
The variance within a sample.
The probability of rejecting the null hypothesis when it is true.
What is the significance level (α) commonly set at?
0.05 or 5%
0.10 or 10%
0.01 or 1%
0.20 or 20%
0.05 or 5%
In the given example, which two categorical variables are being tested for association?
Gender (male/female) and smoking status (smoker/non-smoker)
Age group and education level
Income level and exercise frequency
Ethnicity and diet preference
Gender (male/female) and smoking status (smoker/non-smoker)
A T-test is a statistical test used to determine whether there is a significant difference between sample and population means, or between the means of two samples.
True
False
True
Use a Z-test when the population standard deviation (σ) is unknown and must be estimated from the sample.
True
False
False
What is the formula for the Pearson Correlation Coefficient?
If you know the population standard deviation and have a large sample size (n > 30), you can use a Z-test for comparing means.
True
False
True
If the population standard deviation is unknown or the sample size is small (n < 30), use a t-test to compare means.
True
False
True
If the test statistic is greater than the critical t-value, we reject the null hypothesis.
True
False
False
In the given example, the calculated t-value of -4.22 is less than the critical t-value of -2.821, so we reject the null hypothesis.
True
False
True
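The decision in the card above can be sketched as follows. The deck does not say which tail is being tested; the negative critical value suggests a left-tailed test, so that is assumed here. The helper names and the small sample are hypothetical; only the values -4.22 and -2.821 come from the card.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t = (sample mean - hypothesized mean) / (sample std dev / sqrt(n))."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error


def reject_left_tailed(t_stat, t_crit):
    """Left-tailed rule (assumed): reject H0 when t falls below the critical value."""
    return t_stat < t_crit


# Values from the card: t = -4.22, critical t = -2.821.
card_decision = reject_left_tailed(-4.22, -2.821)

# Hypothetical sample tested against a hypothesized mean of 12.
t_example = one_sample_t([9, 10, 11], mu0=12)
print(card_decision, round(t_example, 3))
```

The direction of the comparison depends on the tail of the test, which is why "greater than the critical value" is not a universal rejection rule.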
Match the following types of analytics with their corresponding questions
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
What has happened or what is happening now?
Why it happened?
What will likely happen?
Which of the following activities are associated with Data Exploration?
A) Data cleaning
B) Data augmentation and transformation
C) Exploratory data analysis
D) Feature selection
E) Identify data dependencies and correlations
F) Identify trends or anomalies in the data
C, E, F
A, B, D
B, D, F
A, C, E
C, E, F
Which of the following activities are associated with Data Exploration? Choose 3 correct answers
Identify data dependencies and correlations
Identify trends or anomalies in the data
Exploratory data analysis
Data cleaning
Feature selection
Data augmentation and transformation
Identify data dependencies and correlations
Identify trends or anomalies in the data
Exploratory data analysis
Which of the following activities are associated with Data Modification?
Data cleaning
Data augmentation and transformation
Exploratory data analysis
Feature selection
Identify data dependencies and correlations
Identify trends or anomalies in the data
Data cleaning
Data augmentation and transformation
Feature selection
Which process involves removing or correcting errors in the data?
Data cleaning
Data augmentation
Data transformation
Feature selection
Data cleaning
What is the purpose of Feature Selection?
To reduce the number of variables for modeling
To identify trends in the data
To enhance the data with additional information
To clean the data
To reduce the number of variables for modeling
Which activity involves adding new data points or modifying existing ones to improve the dataset?
Data augmentation
Data cleaning
Exploratory data analysis
Feature selection
Data augmentation
Which of the following is NOT typically a part of Data Exploration?
Cleaning the data
Identifying data dependencies
Identifying trends in the data
Exploratory data analysis
Cleaning the data
Which activity is crucial for understanding the relationships between different variables in a dataset?
Identifying data dependencies and correlations
Data cleaning
Data augmentation
Feature selection
Identifying data dependencies and correlations
Can you use the model already for prediction purposes?
No, you still need to investigate the model’s goodness-of-fit.
Yes, the model is ready for predictions.
No, you still need to investigate the model’s goodness-of-fit.
What do you need to prove before using the model for predictions?
If your predictors are significant
The model’s accuracy
If your predictors are significant
Simple Linear Regression Match the Symbol:
y
β
x
α
ε
dependent variable
beta coefficient
independent variable
alpha intercept
error term
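The symbols above fit together as y = α + βx + ε. A minimal sketch of estimating α and β by ordinary least squares follows; the function name `fit_simple_linear` and the data are hypothetical, and the data were chosen to lie exactly on a line so the error term is zero.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates of the intercept (alpha) and slope (beta)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta = s_xy / s_xx               # slope: change in y per unit of x
    alpha = mean_y - beta * mean_x   # intercept: value of y when x = 0
    return alpha, beta


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
alpha, beta = fit_simple_linear(xs, ys)

# Residuals are the estimated error terms (epsilon) for each observation.
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
print(alpha, beta, residuals)
```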
Which of the following methods is best for visualizing the relationship between TV ad spend and sales?
Scatter plot
Line graph
Bar chart
Pie chart
Scatter plot
What does ANOVA stand for?
Analysis of Variance
Analysis of Variables
Analysis of Values
Analysis of Vectors
Analysis of Variance
In ANOVA, what does the explained variability represent?
The amount of variation in the response variable that may be attributed to the predictors explicitly stated in the model
The total variation in the response variable
The amount of variation that cannot be explained by the model
The amount of variation attributed to random error
The amount of variation in the response variable that may be attributed to the predictors explicitly stated in the model
Which part of the variation does ANOVA decompose?
Both explained and unexplained variability
Only the explained variability
Only the unexplained variability
Neither explained nor unexplained variability
Both explained and unexplained variability
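For a fitted regression line, the decomposition into explained and unexplained variability can be checked numerically: total sum of squares = explained + unexplained. The data below are hypothetical, and reporting R² (explained / total) as a goodness-of-fit summary is a standard convention rather than something the deck spells out.

```python
# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept.
beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
alpha = mean_y - beta * mean_x
predicted = [alpha + beta * x for x in xs]

sst = sum((y - mean_y) ** 2 for y in ys)                  # total variability
ssr = sum((p - mean_y) ** 2 for p in predicted)           # explained by the model
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))    # unexplained (residual)

r_squared = ssr / sst
print(round(sst, 3), round(ssr, 3), round(sse, 3), round(r_squared, 3))
```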
Why is ANOVA used in statistical analysis?
To compare the means of different groups
To measure the central tendency of data
To determine the correlation between variables
To visualize data distributions
To compare the means of different groups
In multiple regression, what is the purpose of including multiple independent variables?
To improve the prediction accuracy by accounting for more factors
To increase the complexity of the model
To ensure the residuals are normally distributed
To reduce the sample size
To improve the prediction accuracy by accounting for more factors
Which of the following is a key assumption of linear regression?
The residuals are normally distributed
The relationship between the independent and dependent variables is non-linear
The independent variables are highly correlated
The dependent variable is categorical
The residuals are normally distributed
Which of the following libraries are used for mathematical and statistical operations on multi-dimensional arrays and matrices in Python?
NumPy
Pandas
Matplotlib
NumPy
Which of the following libraries are used for data visualization in Python?
Matplotlib
SciPy
NumPy
Matplotlib
Which of the following libraries are used for sorting, grouping, and rearranging data in Python?
Pandas
NumPy
SciPy
Matplotlib
Pandas
Which of the following libraries are used for processing large multidimensional arrays and matrices in Python?
SciPy
Pandas
PyTorch
SciPy
Which of the following libraries are used for deep learning in Python?
TensorFlow
Keras
Scikit-learn
TensorFlow
Which of the following libraries are used for natural language processing in Python?
NLTK
Scrapy
Scikit-learn
NLTK
Which of the following libraries are used for data scraping in Python?
Scrapy
Gensim
NLTK
Pandas
Scrapy
Which of the following libraries are used for efficient learning of word representations in Python?
Gensim
Scrapy
NLTK
Gensim
Which of the following libraries are used for creating spider bots that scan website pages and collect structured data in Python?
Scrapy
SciPy
Pandas
Scrapy
Which of the following libraries are used for object identification, speech recognition, and more in Python?
PyTorch
Keras
Dist-keras
PyTorch
Which of the following libraries are used for reading data, selecting and filtering data, and data manipulation in Python? There are two correct answers among the options; choose just one.
NumPy
Pandas
SciPy
PyTorch
NumPy
Pandas
Which of the following libraries are used for creating two-dimensional diagrams and graphs in Python?
Matplotlib
NumPy
SciPy
Seaborn
Matplotlib
Which of the following libraries are used for creating interactive and scalable visualizations in a browser using JavaScript widgets in Python?
Plotly
Bokeh
Which Python libraries are built on NumPy? There are two correct answers among the choices; select just one.
Pandas
Scikit-Learn
Seaborn
Matplotlib
Pandas
Scikit-Learn
Which Python library provides machine learning algorithms?
Scikit-Learn
NumPy
Matplotlib
Pandas
Scikit-Learn
Which data type in Pandas corresponds to a column with mixed data types?
object
int64
float64
timedelta64[ns]
object