SECTION 2: The Data Analytics Lifecycle Flashcards

1
Q

What are the six phases in a data analytics project?

A

Discovery, Data preparation, Model planning, Model execution (also called model building), Communicate results, Operationalize

The phases may vary in terminology across organizations.

2
Q

What is the purpose of the discovery phase?

A

Identify the project’s purpose, define questions of interest, assess resources and constraints, and establish desired outcomes.

3
Q

What activities are involved in the discovery phase?

A
  • Assessing available resources
  • Framing the problem
  • Identifying key stakeholders
  • Interviewing the analytics sponsor
  • Developing the initial hypothesis
  • Identifying potential data sources
4
Q

What is the data preparation phase focused on?

A

Gathering and preparing the necessary data for analysis using various sources and tools.

5
Q

What does the model planning phase involve?

A

Choosing appropriate analytical models based on project objectives and available data.

6
Q

What occurs during the model execution phase?

A

Applying chosen models to prepared data, interpreting results, and refining models.

7
Q

What is the focus of the communicate results phase?

A

Presenting findings in a meaningful format for various stakeholders.

8
Q

What is the goal of the operationalize phase?

A

Implementing insights from the project into real-world applications.

9
Q

Identify the key roles involved in executing analytic projects.

A
  • End users
  • Project sponsors
  • Project managers
  • Data analysts
  • Business intelligence analysts
  • Database administrators
  • Data engineers
  • Data scientists
10
Q

What is the purpose of the data analytics lifecycle?

A

Provide a systematic and iterative framework for managing big data challenges and data science projects.

11
Q

What are the four main characteristics of big data?

A
  • Variety
  • Velocity
  • Veracity
  • Volume
12
Q

Define ‘variety’ in the context of big data.

A

Diverse types of data, including structured, semi-structured, and unstructured formats.

13
Q

What does ‘velocity’ refer to in big data?

A

The speed at which data is produced, collected, and processed.

14
Q

What is ‘veracity’ in data analytics?

A

The accuracy, reliability, and quality of the data collected and analyzed.

15
Q

What does ‘volume’ mean in big data?

A

The sheer amount of data generated and handled by businesses, ranging from terabytes to petabytes.

16
Q

What is the extract, load, transform (ELT) process?

A

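A key aspect of data preparation in which raw data is extracted and loaded into the target environment before it is transformed, combining data transformation flexibility with preservation of the raw data.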
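
As a rough illustration of the extract, load, transform order, the sketch below uses Python's built-in sqlite3 module, with an in-memory database standing in for the target environment. The table names (raw_sales, clean_sales) and the sample rows are hypothetical. The only point is that raw rows are loaded unchanged and the transformation runs afterwards, inside the target store.

```python
import sqlite3

# Extract: rows as they arrive from a (hypothetical) source system.
raw_rows = [
    ("2024-01-05", " north ", "120.50"),
    ("2024-01-06", "SOUTH", "80"),
    ("2024-01-06", "North", "n/a"),
]

conn = sqlite3.connect(":memory:")

# Load: store the raw rows unchanged so the original data is preserved.
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: derive a cleaned table inside the target store.
# Rows whose amount does not start with a digit are left out of the clean table.
conn.execute("""
    CREATE TABLE clean_sales AS
    SELECT sale_date,
           LOWER(TRIM(region)) AS region,
           CAST(amount AS REAL) AS amount
    FROM raw_sales
    WHERE amount GLOB '[0-9]*'
""")

clean_rows = conn.execute(
    "SELECT region, amount FROM clean_sales ORDER BY sale_date, region"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
```

Because raw_sales is never modified, the transformation can be revised and rerun later without extracting the data again.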

17
Q

What is data cleaning?

A

Processes for handling errors, missing data, and other problems in dirty data.
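
A minimal sketch of what such cleaning can look like, assuming a small list of records with a hypothetical "age" field. The chosen rules (drop duplicate ids, replace missing or impossible ages with the median of the valid ones) are one illustrative policy among many, not a recommended default.

```python
from statistics import median

# Hypothetical dirty records: a missing age, an impossible age, and a duplicate.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": -5},
    {"id": 4, "age": 51},
    {"id": 4, "age": 51},
]

def clean(rows, low=0, high=120):
    """Drop duplicate ids, then impute missing or out-of-range ages with the median."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))  # copy so the input is left untouched

    def is_valid(age):
        return age is not None and low <= age <= high

    fill = median(r["age"] for r in unique if is_valid(r["age"]))
    for r in unique:
        if not is_valid(r["age"]):
            r["age"] = fill
    return unique

cleaned = clean(records)
```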

18
Q

What is a corporate data warehouse?

A

A centralized storage system for a company’s data that is often the ideal location for data mining tasks.

19
Q

What are the types of qualitative data?

A
  • Nominal
  • Ordinal
20
Q

What are the types of quantitative data?

A
  • Interval
  • Ratio
21
Q

What is a type I error in hypothesis testing?

A

Rejection of the null hypothesis when the null hypothesis is true.

22
Q

What is a type II error in hypothesis testing?

A

Failure to reject the null hypothesis when the null hypothesis is false.
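
The two error types can be summarized as a small truth table. The helper below is a hypothetical illustration (the function name and return strings are not from the source): it takes whether the null hypothesis is actually true and whether the test rejected it, and names the outcome.

```python
def decision_outcome(null_is_true: bool, null_rejected: bool) -> str:
    """Name the outcome of a hypothesis test given the (normally unknown) truth."""
    if null_is_true and null_rejected:
        return "type I error"    # false positive: a true null was rejected
    if not null_is_true and not null_rejected:
        return "type II error"   # false negative: a false null was not rejected
    return "correct decision"
```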

23
Q

What is clustering in data analytics?

A

A technique used to group similar objects or data points together based on their characteristics.
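
One common clustering method is k-means, which the card does not name; it is used here only as an example. The sketch assumes two-dimensional points and caller-supplied starting centroids (both hypothetical choices), and alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Each new centroid is the mean of its cluster; an empty cluster keeps its old one.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
```

With these sample points, the three points near (1, 1.5) end up in one cluster and the three near (8.5, 8.5) in the other.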

24
Q

What are common tools used in the model planning phase?

A
  • R
  • SQL Server Analysis Services
  • Python
  • Apache Spark
  • RapidMiner
  • KNIME
25
Q

What is the purpose of the model execution phase?

A

To develop datasets for testing, training, and production purposes, and to build and execute models.

26
Q

What is the operationalization phase?

A

The phase where reports, briefings, and technical documents are delivered, and a pilot project may be implemented.

27
Q

What is the significance of the data analytics lifecycle?

A

It ensures that results are insightful, actionable, and valuable to the organization.

28
Q

What is the model planning phase?

A

The phase where suitable models for tasks such as clustering, classification, or discovering relationships are identified.

29
Q

What is the purpose of the model planning phase?

A

To ensure alignment with business goals and select key variables and models.

30
Q

List common tools used in the model planning phase.

A
  • R
  • SQL Server Analysis Services
  • Python
  • Apache Spark
  • RapidMiner
  • KNIME
31
Q

Define candidate models.

A

Potential models for clustering, classifying, or finding relationships in data.

32
Q

What is dataset structure?

A

The arrangement and organization of data used in the analysis process.

33
Q

What are analytical techniques?

A

Methods and tools used to analyze and process data to achieve business objectives.

34
Q

What is variable selection?

A

The process of identifying essential predictors and variables to include in the model.

35
Q

Differentiate between structured and unstructured data.

A
  • Structured data: Organized in a specific format or schema
  • Unstructured data: Lacks a specific format or structure
36
Q

What is training data?

A

The dataset used for model development, where the model learns patterns and relationships in the data.

37
Q

What is test data?

A

A separate dataset, also called hold-out data, used to evaluate the model’s performance and accuracy.
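
A minimal sketch of a hold-out split, assuming the records fit in a list and that a simple random split is appropriate (it may not be for time-ordered data). The function name, the 25% test fraction, and the seed are illustrative choices.

```python
import random

def hold_out_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and set aside a fraction as hold-out test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))
train_rows, test_rows = hold_out_split(rows)
```

The model is fitted on train_rows only; test_rows stays unseen until evaluation.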

38
Q

What is model assessment?

A

The process of evaluating the technical merits of a model, such as accuracy, comprehensibility, and confidence in predictions.

39
Q

What does error rate measure?

A

The percentage of records classified incorrectly. Its complement, 1 minus the error rate, is the percentage classified correctly, which is the accuracy of the model.
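
As a small worked example with made-up labels, the error rate is the share of records whose predicted class differs from the actual class, and accuracy is its complement.

```python
def error_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical labels for eight records; two predictions are wrong.
actual    = ["yes", "no", "yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

rate = error_rate(actual, predicted)  # 2 wrong out of 8
accuracy = 1 - rate
```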

40
Q

Define lift in the context of modeling.

A

A measure that indicates the change in concentration of a particular class when the model is used to select a group from the general population, compared with selecting at random.
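
One common way to compute lift, sketched here with made-up scores and labels, is to divide the positive rate among the top-scored fraction of records by the positive rate in the whole population. The function name and the 20% cutoff are illustrative assumptions.

```python
def lift_at(scores, labels, fraction):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Hypothetical model scores and true classes (1 = class of interest).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   1,   0,   1,   0,   0,   0,   1,   0,   0]

lift = lift_at(scores, labels, 0.2)
```

Here both of the two top-scored records are positive, against 4 in 10 overall, so the lift is about 2.5. A lift of 1 would mean the model selects no better than chance.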

41
Q

What are ROC charts used for?

A

Performance measurement for binary response models, comparing the true positive rate with the false positive rate.
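
The points on an ROC chart can be computed by sweeping a decision threshold over the model's scores. The sketch below uses made-up scores and labels, and assumes both classes are present so that neither rate divides by zero.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical scores and true classes (1 = positive).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0]

curve = roc_points(scores, labels)
```

A curve that rises quickly toward the top-left corner indicates a model that finds most positives before admitting many false positives.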

42
Q

What is the goal of the communicate results phase?

A

To deliver clear, articulate results, methodology, and business value to stakeholders.

43
Q

List the key terms associated with the communicate results phase.

A
  • Success and failure criteria
  • Stakeholders
  • Speculative analysis
  • Statistical significance
  • Model refinement
  • Control group
44
Q

What is the operationalize phase?

A

The phase where new analytical methods or models are deployed to a production environment.

45
Q

What is a pilot project?

A

A small-scale deployment of the model in a live setting to evaluate performance and manage risk.

46
Q

What are key performance metrics?

A

Business-relevant measures communicated on a dashboard for ongoing monitoring and decision support.

47
Q

What is model accuracy monitoring?

A

The ongoing process of checking the model’s performance and retraining it if its accuracy degrades.
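
A minimal sketch of such a check, assuming each past prediction has been recorded as correct or incorrect once the true outcome became known. The window size and the 0.8 accuracy threshold are arbitrary illustrative values.

```python
def needs_retraining(outcomes, window=5, min_accuracy=0.8):
    """outcomes: booleans, True where a prediction matched the actual value."""
    recent = outcomes[-window:]
    accuracy = sum(recent) / len(recent)
    return accuracy < min_accuracy

# Hypothetical monitoring logs, oldest first.
healthy  = [True, True, True, False, True, True,  True]
drifting = [True, True, True, False, True, False, False]

healthy_flag = needs_retraining(healthy)    # last five: 4 of 5 correct
drifting_flag = needs_retraining(drifting)  # last five: 2 of 5 correct
```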

48
Q

What is an out-of-bounds operation?

A

A situation where inputs to the model are outside the range it was trained on, causing inaccurate outputs.
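
One simple guard, sketched below under the assumption that inputs are numeric tuples, is to record the minimum and maximum of each feature in the training data and flag any new input that falls outside those ranges. The function names and sample values are hypothetical.

```python
def fit_bounds(training_rows):
    """Record the (min, max) of each feature seen during training."""
    return [(min(column), max(column)) for column in zip(*training_rows)]

def out_of_bounds(row, bounds):
    """Return the indexes of features whose value lies outside the training range."""
    return [
        i for i, (value, (low, high)) in enumerate(zip(row, bounds))
        if not low <= value <= high
    ]

# Hypothetical training data: (age, income).
training_rows = [(25, 30000), (40, 52000), (61, 87000)]
bounds = fit_bounds(training_rows)

flagged = out_of_bounds((72, 45000), bounds)  # age 72 exceeds the training maximum of 61
```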

49
Q

What are the six phases of the data analytics lifecycle?

A
  • Discovery
  • Data preparation
  • Model planning
  • Model building
  • Communicate results
  • Operationalize
50
Q

List the seven key roles necessary for a successful analytics project.

A
  • End user
  • Project sponsor
  • Project manager
  • Business intelligence analyst
  • Database administrator
  • Data engineer
  • Data scientist
51
Q

What is the focus of the data discovery phase?

A

Examining the problem, investigating available data sources, and formulating initial hypotheses.

52
Q

What is the purpose of data preparation?

A

To select appropriate data, comprehend it, and prepare it for analysis.

53
Q

What is the significance of establishing specific data mining goals?

A

To address well-defined business problems effectively.

54
Q

What is the purpose of creating an analytic sandbox?

A

To enable data exploration without affecting live production databases.

55
Q

What does ELT stand for?

A

Extract, Load, Transform.

56
Q

What is the model building phase?

A

The phase in which an analytical model is developed and fitted on the training data.

57
Q

What does model deployment involve?

A

Transitioning models from the data mining environment to the production environment.

58
Q

Why is reflection on project challenges valuable?

A

To learn from past events and improve future performance.

59
Q

What is the importance of continuous monitoring of model accuracy?

A

To ensure the model remains effective and retrain if its accuracy degrades.