Section 2 Flashcards
____ focus on the benefits and implications of findings, while ____ focus on the business impact, risks, and return on investment
Business users, project sponsors
A situation in which the inputs to the model are outside the range it was trained on, potentially causing inaccurate or invalid outputs
Out-of-bounds operation
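One way to catch an out-of-bounds operation is to record the range of each input during training and check new inputs against it. The sketch below is only an illustration: the function names, the (age, income) features, and the sample values are assumptions, not part of the course material.

```python
def fit_ranges(training_rows):
    """Record the minimum and maximum seen for each feature in the training data."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]


def out_of_bounds(row, ranges):
    """Return the indexes of features whose value falls outside the training range."""
    return [
        i
        for i, (value, (lo, hi)) in enumerate(zip(row, ranges))
        if value < lo or value > hi
    ]


# Hypothetical training data: (age, income) pairs.
training_rows = [(25, 40000), (40, 85000), (58, 120000)]
ranges = fit_ranges(training_rows)

inside = out_of_bounds((30, 50000), ranges)   # both features within range
outside = out_of_bounds((72, 50000), ranges)  # age is above the trained maximum
```

An empty list means the input is within the trained range; a non-empty list names the features that should trigger a warning before the model's output is trusted.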
The system where the model is deployed and integrated with existing business processes as opposed to a sandbox or testing environment
Production environment
A small-scale deployment of the model in a live setting, allowing the data science team to manage risk, evaluate performance, and make adjustments before a full-scale deployment
Pilot project
What is data? What is information?
Data is the raw material used by analysts, while information refers to processed or organized data
What order does the data analytics lifecycle follow?
Discovery phase, Data preparation phase, Model planning phase, Model execution phase, Communicate results phase, Operationalize phase
The data analytics team familiarizes itself with the business domain, examines relevant historical data, and assesses available resources. It also involves framing the business problem as an analytics challenge and formulating initial hypotheses to test and explore the data
Discovery phase
Requires the establishment of an analytic sandbox where the team can work with data and perform analytics throughout the project
Data preparation
The team determines the methods, techniques, and workflow to be used during the subsequent model building phase
Model planning
The team develops datasets for testing, training, and production purposes, builds and executes models based on the planning phase, and evaluates the need for more robust tools or environments for executing models and workflows
Model execution
Involves determining the project’s success or failure based on the criteria developed in the discovery phase. The team identifies key findings, quantifies the business value, and develops a narrative to summarize and communicate the results to stakeholders
Communicate results
The team delivers reports, briefings, code, and technical documents. A pilot project may be implemented to test the models in a production environment, ensuring that the results are framed effectively and demonstrate clear value to stakeholders
Operationalization
Refers to the vast amount of information collected, stored, and analyzed by businesses and organizations; its unique aspects can differ between organizations and include up to 7 characteristics; however, for this course, we will focus on the main 4: variety, velocity, veracity, and volume
Big data
The diverse types of data, including structured, semi-structured, and unstructured formats; big data comes from numerous sources
Variety
The speed at which data is produced, collected, and processed; in the context of big data, velocity refers to the need for quick analysis and decision-making based on the data gathered
Velocity
The accuracy, reliability, and quality of the data collected and analyzed; ensuring data ____ is essential for gaining valuable insights and making informed decisions
Veracity
The sheer amount of data generated and handled by businesses; big data involves dealing with enormous quantities of data ranging from terabytes to petabytes and beyond, which can be challenging in terms of storage and processing
Volume
By the end of this phase, the project team should have a clear understanding of the business problem and the data available and should be ready to move forward to the analysis phase
Discovery phase
Items necessary for a successful project; can include items such as technology, tools, systems, data, and people
Resources
The process of stating the data analytics problem to be solved
Framing
Involves data mining, which refers to the process of discovering hidden patterns, trends, and insights in large datasets that an organization can then use to make informed decisions
Data preparation phase
The extract, load, transform (ELT) process is a key aspect of ____, which combines data transformation flexibility with data preservation
Data preparation
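The extract, load, transform order can be sketched with an in-memory SQLite database standing in for the analytic sandbox. This is a minimal illustration only: the table names, columns, and sample rows are hypothetical. The raw values are loaded unchanged first, and the cleaned table is derived from them afterwards, so the original data is preserved.

```python
import sqlite3

# An in-memory database stands in for the analytic sandbox (assumption).
conn = sqlite3.connect(":memory:")

# Extract: rows as they arrive from a hypothetical source system, all as text.
raw_rows = [("1001", " 19.99 ", "2024-01-05"), ("1002", "5.00", "2024-01-06")]

# Load: store the raw values unchanged so the original data is preserved.
conn.execute("CREATE TABLE raw_sales (order_id TEXT, amount TEXT, sold_on TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: build a cleaned table inside the sandbox from the raw table.
conn.execute(
    """
    CREATE TABLE clean_sales AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           sold_on
    FROM raw_sales
    """
)

clean = conn.execute(
    "SELECT order_id, amount FROM clean_sales ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
```

Because the transform runs after the load, the team can revise the cleaning rules later and rebuild clean_sales without going back to the source system.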
Programming language and software framework for statistical analysis and graphics, available under the GNU General Public License
R
Emphasizes identifying appropriate models for clustering, classification, or uncovering relationships that correspond with the hypotheses established in the discovery phase
Model planning phase
Is a technique used in data analytics to group similar objects or data points together based on their characteristics or attributes
Clustering
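As a rough illustration of clustering, the sketch below groups 2-D points with a bare-bones k-means loop. K-means is only one of many clustering techniques and is chosen here as an assumption; the sample points and the naive choice of the first k points as starting centroids are also illustrative.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to its nearest centroid."""
    centroids = list(points[:k])  # naive start: use the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

Points with similar coordinates end up in the same cluster, which is the "similar characteristics or attributes" idea from the definition.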
This phase includes evaluating the structure of datasets, ensuring analytical techniques align with business objectives, deciding on a single model or a series of techniques, and examining existing approaches to similar problems
Model planning phase
What can contribute to efficient model planning?
R, SQL Analysis Services, Python, Apache Spark, RapidMiner, and KNIME
Data organized in a specific format or schema, making it easier to analyze
Structured data
Data that lacks a specific format or structure often requiring additional processing before analysis
Unstructured data
Is dedicated to developing datasets for various purposes, such as training, testing, and production. Initially, training data is created for model development, while hold-out data is set aside for model evaluation
Model execution phase
A separate dataset, also called hold-out data, used to evaluate the model's performance and accuracy on unseen data
Test data
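Setting aside hold-out data can be sketched as a shuffled split of the available records. The function name, the 25% test fraction, and the fixed seed below are assumptions made for illustration.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and set aside a fraction as hold-out (test) data."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))  # hypothetical record IDs
train, test = train_test_split(records)
```

The model is fitted on train only; test stays unseen until evaluation, so the measured accuracy reflects performance on new data.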
Includes developing and fitting an analytical model on the training data
Model building phase
The process of evaluating the technical merits of a model, such as accuracy, comprehensibility, and confidence in predictions
Model assessment
The percentage of records classified correctly or incorrectly, used to measure the accuracy of a model
Error rate
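A small worked example of the error rate, assuming the common convention that it counts the share of records classified incorrectly (so accuracy is one minus the error rate). The labels and predictions are made-up values.

```python
def error_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical actual classes and model predictions for eight records.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

rate = error_rate(actual, predicted)  # 2 of 8 records are misclassified
accuracy = 1 - rate
```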
Measure that indicates the change in concentration of a particular class when the model is used to select a group from the general population
Lift
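Lift can be sketched as the rate of the target class in the group the model selects, divided by its rate in the general population. The function name, scores, and labels below are hypothetical.

```python
def lift(actual, scores, n_selected):
    """Ratio of the positive rate among the top-scored records to the overall positive rate."""
    ranked = sorted(zip(scores, actual), reverse=True)  # highest scores first
    selected = [label for _, label in ranked[:n_selected]]
    selected_rate = sum(selected) / len(selected)
    overall_rate = sum(actual) / len(actual)
    return selected_rate / overall_rate


# Hypothetical data: 3 positives among 10 records, with model scores.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6, 0.3, 0.4, 0.05, 0.15]

top3_lift = lift(actual, scores, n_selected=3)
```

Here two of the three top-scored records are positive, against three in ten overall, so the selected group is a little over twice as concentrated as the general population. A lift of 1 would mean the model selects no better than chance.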
A performance measurement for binary response models, comparing the true positive rate with the false positive rate
ROC Charts
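The points on an ROC chart can be computed by sweeping a decision threshold over the model's scores and recording the false positive rate and true positive rate at each step. This is a minimal sketch with made-up labels and scores, not a replacement for a library routine.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs for each score threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical binary outcomes and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
```

Plotting these pairs with the false positive rate on the x-axis gives the ROC curve; the closer it bends toward the top-left corner, the better the model separates the two classes.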
Analysts compare outcomes to success and failure criteria, articulate findings for stakeholders and assess the significance of their results
Communicate results phase
Individuals or groups interested in the project and its outcomes
Stakeholders
Measures whether the observed results are likely to have occurred by chance or if they indicate a genuine relationship between variables
Statistical significance
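One way to check whether an observed difference could have occurred by chance is a permutation test, sketched below. The choice of test, the function name, the number of permutations, and the sample values are all assumptions for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling gives a mean difference at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign records to the two groups
        a, b = pooled[: len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / n_permutations


# Hypothetical measurements for two groups.
group_a = [12, 15, 14, 16, 13, 17]
group_b = [22, 25, 24, 26, 23, 27]

p_value = permutation_p_value(group_a, group_b)
```

A small p-value suggests the difference between the groups is unlikely to be due to chance alone; a large one means the data do not rule chance out.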
This phase marks the first time most analytics teams deploy new analytical methods or models to a production environment
Operationalize
This approach allows the team to evaluate the model’s performance and make necessary adjustments in a live setting before implementing it across the enterprise
Operationalize phase