SECTION 2: The Data Analytics Lifecycle Flashcards
What are the six phases in a data analytics project?
Discovery, Data preparation, Model planning, Model execution, Communicate results, Operationalize
The phases may vary in terminology across organizations.
What is the purpose of the discovery phase?
Identify the project’s purpose, define questions of interest, assess resources and constraints, and establish desired outcomes.
What activities are involved in the discovery phase?
- Assessing available resources
- Framing the problem
- Identifying key stakeholders
- Interviewing the analytics sponsor
- Developing the initial hypothesis
- Identifying potential data sources
What is the data preparation phase focused on?
Gathering and preparing the necessary data for analysis using various sources and tools.
What does the model planning phase involve?
Choosing appropriate analytical models based on project objectives and available data.
What occurs during the model execution phase?
Applying chosen models to prepared data, interpreting results, and refining models.
What is the focus of the communicate results phase?
Presenting findings in a meaningful format for various stakeholders.
What is the goal of the operationalize phase?
Implementing insights from the project into real-world applications.
Identify the key roles involved in executing analytic projects.
- End users
- Project sponsors
- Project managers
- Data analysts
- Business intelligence analysts
- Database administrators
- Data engineers
- Data scientists
What is the purpose of the data analytics lifecycle?
To provide a systematic and iterative framework for managing big data challenges and data science projects.
What are the four main characteristics of big data?
- Variety
- Velocity
- Veracity
- Volume
Define ‘variety’ in the context of big data.
Diverse types of data, including structured, semi-structured, and unstructured formats.
What does ‘velocity’ refer to in big data?
The speed at which data is produced, collected, and processed.
What is ‘veracity’ in data analytics?
The accuracy, reliability, and quality of the data collected and analyzed.
What does ‘volume’ mean in big data?
The sheer amount of data generated and handled by businesses, ranging from terabytes to petabytes.
What is the extract, load, transform (ELT) process?
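A data preparation process in which data is extracted from its sources, loaded into the target environment in raw form, and transformed there. It keeps transformations flexible while preserving the original data.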
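A minimal sketch of the ELT order of operations, assuming an in-memory SQLite database stands in for the target environment. The table names (`raw_orders`, `orders`) and the sample rows are hypothetical and serve only as illustration.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
raw_rows = [("1001", " 19.99 ", "2024-01-05"), ("1002", "5.00", "2024-01-06")]

conn = sqlite3.connect(":memory:")  # stands in for the target environment

# Extract + Load: store the records untouched, as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, order_date TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: derive a typed, trimmed table inside the target.
# raw_orders is left in place, so the original data is preserved.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           order_date
    FROM raw_orders
    """
)

clean_rows = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because the raw table survives the transform step, a different transformation can be applied later without re-extracting from the source.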
What is data cleaning?
Processes for handling errors, missing data, and other problems in dirty data.
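One possible cleaning step, sketched under stated assumptions: `None` marks a missing value, negative ages are entry errors, and median imputation is an acceptable fix. The data and the choice of imputation are illustrative, not a prescribed method.

```python
from statistics import median

# Hypothetical dirty records: None marks a missing value, -1 an entry error.
ages = [34, None, 29, -1, 41, 29, None]

def is_valid(age):
    """An age is usable if it is present and non-negative."""
    return age is not None and age >= 0

# Treat impossible values as missing, then impute with the median of valid values.
fill = median(a for a in ages if is_valid(a))
cleaned = [a if is_valid(a) else fill for a in ages]
```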
What is a corporate data warehouse?
A centralized storage system for a company’s data that is often the ideal location for data mining tasks.
What are the types of qualitative data?
- Nominal
- Ordinal
What are the types of quantitative data?
- Interval
- Ratio
What is a type I error in hypothesis testing?
Rejection of the null hypothesis when the null hypothesis is true.
What is a type II error in hypothesis testing?
Failure to reject the null hypothesis when the null hypothesis is false.
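The two error types can be estimated by simulation. The sketch below assumes a two-sided z-test of H0: mean = 0 with known standard deviation 1; the sample size, significance level, and the alternative mean of 0.5 are illustrative choices.

```python
import random
from statistics import NormalDist, mean

random.seed(0)
ALPHA, N, TRIALS = 0.05, 30, 2000
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value

def rejects_null(true_mean):
    """Run one z-test of H0: mean = 0 on a sample with known sigma = 1."""
    sample = [random.gauss(true_mean, 1.0) for _ in range(N)]
    z = mean(sample) * N ** 0.5  # (sample mean - 0) / (sigma / sqrt(N))
    return abs(z) > z_crit

# Type I error: H0 is true (the mean really is 0) but the test rejects it.
type_i_rate = sum(rejects_null(0.0) for _ in range(TRIALS)) / TRIALS

# Type II error: H0 is false (the mean is 0.5) but the test fails to reject it.
type_ii_rate = sum(not rejects_null(0.5) for _ in range(TRIALS)) / TRIALS
```

The type I rate should land near the chosen significance level, while the type II rate depends on the sample size and on how far the true mean is from 0.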
What is clustering in data analytics?
A technique used to group similar objects or data points together based on their characteristics.
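As one illustration of clustering, here is a small k-means sketch in plain Python. K-means is only one of many clustering techniques; the points and the fixed starting centroids are made up for the example.

```python
from math import dist

points = [(1.0, 1.1), (0.9, 1.3), (1.2, 0.8), (8.0, 8.2), (7.9, 7.7), (8.3, 8.0)]
centroids = [points[0], points[3]]  # fixed starting centroids for repeatability

def nearest_centroid(p):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

for _ in range(10):
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[nearest_centroid(p)].append(p)
    # Update step: move each centroid to the mean of its assigned points.
    centroids = [
        tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
        for i, c in enumerate(clusters)
    ]

cluster_labels = [nearest_centroid(p) for p in points]
```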
What are common tools used in the model planning phase?
- R
- SQL Analysis Services
- Python
- Apache Spark
- RapidMiner
- KNIME
What is the purpose of the model execution phase?
To develop datasets for testing, training, and production purposes, and to build and execute models.
What is the operationalization phase?
The phase where reports, briefings, and technical documents are delivered, and a pilot project may be implemented.
What is the significance of the data analytics lifecycle?
It ensures that results are insightful, actionable, and valuable to the organization.
What is the model planning phase?
The phase where suitable models for tasks such as clustering, classification, or discovering relationships are identified.
What is the purpose of the model planning phase?
To ensure alignment with business goals and select key variables and models.
Define candidate models.
Potential models for clustering, classifying, or finding relationships in data.
What is dataset structure?
The arrangement and organization of data used in the analysis process.
What are analytical techniques?
Methods and tools used to analyze and process data to achieve business objectives.
What is variable selection?
The process of identifying essential predictors and variables to include in the model.
Differentiate between structured and unstructured data.
- Structured data: Organized in a specific format or schema
- Unstructured data: Lacks a specific format or structure
What is training data?
The dataset used for model development, where the model learns patterns and relationships in the data.
What is test data?
A separate dataset, also called hold-out data, used to evaluate the model’s performance and accuracy.
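A hold-out split can be sketched as follows. The 80/20 ratio, the seed, and the integer stand-ins for records are assumptions made for the example.

```python
import random

records = list(range(100))  # stand-in for 100 labeled rows
rng = random.Random(42)     # fixed seed so the split is reproducible

shuffled = records[:]
rng.shuffle(shuffled)

cut = int(len(shuffled) * 0.8)  # 80/20 ratio is an illustrative assumption
train, test = shuffled[:cut], shuffled[cut:]
```

Shuffling before cutting avoids a split that simply follows the original row order, and keeping the two sets disjoint is what makes the test set a fair check.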
What is model assessment?
The process of evaluating the technical merits of a model, such as accuracy, comprehensibility, and confidence in predictions.
What does error rate measure?
The percentage of records classified incorrectly. Its complement (1 minus the error rate) is the accuracy of the model.
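A short worked example, counting the error rate as the share of misclassified records. The labels are invented for illustration.

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Count the records where the prediction disagrees with the actual class.
wrong = sum(a != p for a, p in zip(actual, predicted))

error_rate = wrong / len(actual)
accuracy = 1 - error_rate
```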
Define lift in the context of modeling.
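A measure of the change in concentration of a particular class when the model is used to select a group from the general population, relative to that class's concentration in the population as a whole.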
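Lift can be computed as the class concentration in the model-selected group divided by the concentration overall. The scores, labels, and the choice of the top 30% are illustrative assumptions.

```python
# (model_score, actual_class) pairs; 1 marks the class of interest.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.50, 0), (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0)]

# Concentration of the class in the whole population.
baseline = sum(y for _, y in scored) / len(scored)

# Concentration of the class among the top 30% of records by model score.
top = sorted(scored, reverse=True)[:3]
top_rate = sum(y for _, y in top) / len(top)

lift = top_rate / baseline
```

A lift above 1 means the selected group is richer in the class of interest than a random selection of the same size would be expected to be.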
What are ROC charts used for?
Measuring the performance of binary response models by plotting the true positive rate against the false positive rate across classification thresholds.
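The points of an ROC chart can be computed by sweeping a classification threshold. The scores, labels, and thresholds below are made up for the sketch.

```python
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
y_true  = [1, 1, 0, 1, 0, 0]

def roc_point(threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    preds = [s >= threshold for s in y_score]
    tp = sum(1 for p, y in zip(preds, y_true) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, y_true) if p and y == 0)
    return fp / y_true.count(0), tp / y_true.count(1)

# Lowering the threshold moves the point from (0, 0) toward (1, 1).
roc_points = [roc_point(t) for t in (1.0, 0.75, 0.5, 0.0)]
```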
What is the goal of the communicate results phase?
To deliver clear, articulate results, methodology, and business value to stakeholders.
List the key terms associated with the communicate results phase.
- Success and failure criteria
- Stakeholders
- Speculative analysis
- Statistical significance
- Model refinement
- Control group
What is the operationalize phase?
The phase where new analytical methods or models are deployed to a production environment.
What is a pilot project?
A small-scale deployment of the model in a live setting to evaluate performance and manage risk.
What are key performance metrics?
Business-relevant measures communicated on a dashboard for ongoing monitoring and decision support.
What is model accuracy monitoring?
The ongoing process of checking the model’s performance and retraining it if its accuracy degrades.
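One way such monitoring might be wired up is sketched below. The threshold, the three-check window, and the function name `needs_retraining` are hypothetical choices, not a standard.

```python
RETRAIN_THRESHOLD = 0.80  # hypothetical minimum acceptable accuracy

def needs_retraining(recent_accuracies, window=3):
    """Flag the model when mean accuracy over the last `window` checks is too low."""
    recent = recent_accuracies[-window:]
    return sum(recent) / len(recent) < RETRAIN_THRESHOLD

# Illustrative accuracy measurements from successive periodic checks.
weekly_accuracy = [0.91, 0.90, 0.88, 0.82, 0.79, 0.75]

flag_early = needs_retraining(weekly_accuracy[:4])  # window: 0.90, 0.88, 0.82
flag_now = needs_retraining(weekly_accuracy)        # window: 0.82, 0.79, 0.75
```

Averaging over a window rather than reacting to a single check makes the trigger less sensitive to one noisy measurement.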
What is an out-of-bounds operation?
A situation where inputs to the model are outside the range it was trained on, causing inaccurate outputs.
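A simple guard against out-of-bounds inputs compares each feature with the range seen during training. The feature names and ranges here are illustrative assumptions.

```python
# Ranges observed in the training data (illustrative values).
training_ranges = {"age": (18, 75), "income": (12_000, 250_000)}

def out_of_bounds(record):
    """Return the names of features whose values fall outside the training range."""
    return [
        name
        for name, (low, high) in training_ranges.items()
        if not low <= record[name] <= high
    ]

in_range = out_of_bounds({"age": 30, "income": 50_000})
flagged = out_of_bounds({"age": 82, "income": 50_000})
```

Records that come back with a non-empty list could be routed for review instead of being scored directly.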
What are the six phases of the data analytics lifecycle?
- Discovery
- Data preparation
- Model planning
- Model building
- Communicate results
- Operationalize
List the seven key roles necessary for a successful analytics project.
- End user
- Project sponsor
- Project manager
- Business intelligence analyst
- Database administrator
- Data engineer
- Data scientist
What is the focus of the data discovery phase?
Examining the problem, investigating available data sources, and formulating initial hypotheses.
What is the purpose of data preparation?
To select appropriate data, comprehend it, and prepare it for analysis.
What is the significance of establishing specific data mining goals?
To address well-defined business problems effectively.
What is the purpose of creating an analytic sandbox?
To enable data exploration without affecting live production databases.
What does ELT stand for?
Extract, Load, Transform.
What is the model building phase?
The phase in which an analytical model is developed and fitted on the training data.
What does model deployment involve?
Transitioning models from the data mining environment to the production environment.
Why is reflection on project challenges valuable?
To learn from past events and improve future performance.
What is the importance of continuous monitoring of model accuracy?
To ensure the model remains effective and retrain if its accuracy degrades.