Data Science Methodology Flashcards
What is the PPDAC model
The PPDAC model is a structured approach used to carry out investigative research in general. It works well for data science because of its exploratory and inquisitive nature
What does PPDAC stand for
Problem, Plan, Data, Analysis, Conclusion
What is the first step in PPDAC model
The first step is identifying the Problem or question that needs to be answered
What is the second step in PPDAC model
The second step is to create a Plan for how to approach the problem and gather data
What is the third step in PPDAC model
The third step is gathering and cleaning Data that is relevant to the problem being investigated
What is the fourth step in PPDAC model
The fourth step is conducting an Analysis of the data to draw insights and identify patterns
What is the final step in PPDAC model
The final step is drawing a Conclusion based on the analysis and using it to solve the original problem or answer the original question
What kinds of questions are asked to understand the problem
How much/many…, Which category does this belong to…, Are there similar kinds…, Is this strange…, What action should I take when…, …
What are some academic planning considerations in the PPDAC model
Identifying what data is needed, determining how much data will be needed and evaluating the solution
What are some practical considerations in the PPDAC model
Some practical considerations in the PPDAC model include determining where the data will come from and whether it needs to be collected, deciding where and how the data will be stored and considering any legal or ethical implications
What is the importance of identifying what data is needed in the PPDAC model
Identifying what data is needed is important in the PPDAC model because it helps ensure that the analysis is relevant and meaningful to the problem being investigated
Why is it important to consider legal and ethical implications in the PPDAC model
It is important to consider legal and ethical implications in the PPDAC model to ensure that the investigation is conducted in a responsible and lawful manner and that any potential harm or negative impact is minimized
What is the purpose of evaluating the solution in the PPDAC model
The purpose of evaluating the solution in the PPDAC model is to determine whether the solution effectively addresses the original problem or question and to identify any areas for improvement
What is the role of data storage in the PPDAC model
The role of data storage in the PPDAC model is to ensure that the data is easily accessible, organised, and secure throughout the investigation and analysis process
Why is it important to determine whether data needs to be collected in the PPDAC model
It is important to determine whether data needs to be collected in the PPDAC model to ensure that the investigation is conducted efficiently and effectively and that the analysis is based on relevant and accurate data
What are the steps involved in data processing in the PPDAC model
Obtaining the data, conducting quality checks, cleaning the data, and addressing any missing values
What is the role of data management in the PPDAC model
Determining how the data will be represented or stored, ensuring proper storage and maintenance of the data, and managing access rights to the data
What are some common ways that data is represented or stored in the PPDAC model
Tabular formats (2D table with each row as an observation and each column as a measurement), structured formats (each observation is represented by a dictionary of keys and values), and semi-structured formats (not all records are represented by the same keys)
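The three representations above can be compared side by side. This is a minimal sketch: the observations, the column names, and the helper `shared_keys` are all invented for illustration.

```python
# Hypothetical observations, invented purely for illustration.

# Tabular: every row is an observation, every column a measurement.
columns = ["name", "height_cm", "weight_kg"]
table = [
    ["Ada", 170, 65],
    ["Ben", 182, 80],
]

# Structured: each observation is a dictionary with the same keys.
structured = [dict(zip(columns, row)) for row in table]

# Semi-structured: records do not all share the same keys.
semi_structured = [
    {"name": "Ada", "height_cm": 170},
    {"name": "Ben", "weight_kg": 80, "notes": "self-reported"},
]


def shared_keys(records):
    """Return the set of keys that appear in every record."""
    return set.intersection(*(set(record) for record in records))


print(shared_keys(structured))       # all three column names
print(shared_keys(semi_structured))  # only "name" is common to both records
```

In the structured case every key is shared, so the records map cleanly onto table columns. In the semi-structured case only one key is guaranteed, which is why such data usually needs an extra step before analysis.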
What is the purpose of conducting quality checks on data in the PPDAC model
To ensure that the data is accurate, complete, and reliable for analysis purposes
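A quality check might look roughly like the sketch below, which flags missing required fields and out-of-range values. The function `quality_report`, the field names, and the allowed age range are assumptions made for this example, not part of the PPDAC model itself.

```python
def quality_report(records, required, ranges):
    """Return a list of (row index, field, problem) tuples."""
    issues = []
    for index, record in enumerate(records):
        # Completeness: every required field must be present and not None.
        for field in required:
            if record.get(field) is None:
                issues.append((index, field, "missing"))
        # Accuracy: numeric fields must fall inside a plausible range.
        for field, (low, high) in ranges.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                issues.append((index, field, "out of range"))
    return issues


# Hypothetical survey records.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": None},
]

issues = quality_report(records, required=["id", "age"], ranges={"age": (0, 120)})
for issue in issues:
    print(issue)
```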
Why is it important to address missing values in the data cleaning step of the PPDAC model
To ensure that the analysis is based on complete and accurate data and to avoid any potential biases or errors in the result
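Two common ways of addressing missing values are dropping them and filling them in (imputation). The sketch below shows both on a hypothetical list of readings where `None` marks a missing value. Mean imputation is only one possible choice and is used here as an assumption.

```python
from statistics import mean

# Hypothetical readings; None marks a missing value.
readings = [12.0, None, 15.0, 9.0, None]


def drop_missing(values):
    """Keep only the observed values."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


print(drop_missing(readings))  # [12.0, 15.0, 9.0]
print(impute_mean(readings))   # [12.0, 12.0, 15.0, 9.0, 12.0]
```

Dropping shrinks the dataset, while imputing keeps its size but can hide how much was actually observed. Either choice can bias the result, so it is worth recording which one was used.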
What is the importance of managing access rights in the PPDAC model
To ensure that the data is only accessed by authorized individuals and to protect the privacy and confidentiality of the data
What is a 2D table in the context of data representation
A 2D table is a data representation format where each row represents an observation or a single data point, and each column represents a variable, attribute, or feature of the data. The table contains a set of observations of the same kind or purpose, and it is relatively easy to set up and populate
What are some common characteristics of 2D tables as a data representation format
An easy-to-parse format for existing tools, a clear and organised presentation of data, and suitability for representing homogeneous data
What are some potential drawbacks of using 2D tables as a data representation format
They rest on the idealistic assumption that data is homogeneous and fits neatly into a table structure, which may not always be the case. In some scenarios, 2D tables may not be a realistic choice for representing data due to the data’s complexity or variability
What is a dictionary of keys and values in the context of data representation
Each observation is represented by a well-defined structure, typically in the form of a dictionary with keys and values. This format allows for effective storage, retrieval and usage of data
What are some advantages of using a dictionary of keys and values as a data representation format
A well-defined structure that allows for effective storage, retrieval and usage of data, and suitability for industrial settings where proprietary data is collected. The format is defined via a schema, which provides rules for what can be stored and how, and if planned well, the data capture can feed directly into this schema and then be stored within a relational database such as MySQL
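The idea of dictionaries feeding directly into a relational schema can be sketched as follows. The card names MySQL; this example swaps in SQLite through Python's built-in `sqlite3` module so that it runs without a server. The `product` table, its columns, and the records are hypothetical.

```python
import sqlite3

# In-memory SQLite database, standing in for a relational database such as MySQL.
conn = sqlite3.connect(":memory:")

# The schema states what can be stored and how.
conn.execute(
    "CREATE TABLE product ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL)"
)

# Captured observations as dictionaries whose keys match the schema.
records = [
    {"id": 1, "name": "pen", "price": 1.5},
    {"id": 2, "name": "notebook", "price": 3.0},
]
conn.executemany(
    "INSERT INTO product (id, name, price) VALUES (:id, :name, :price)",
    records,
)

rows = conn.execute("SELECT name, price FROM product ORDER BY id").fetchall()
print(rows)

# A record that breaks the schema (name is NULL) is rejected by the database.
try:
    conn.execute("INSERT INTO product (id, name, price) VALUES (3, NULL, 2.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("invalid record rejected:", rejected)
```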
What is a schema in the context of data representation
A schema is a set of rules that defines the structure and format of a data representation format, such as a dictionary of keys and values. It provides guidelines for what can be stored and how, and ensures consistency and organization of the data
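A schema can also be checked in code before data is stored. The following is a minimal sketch, assuming a schema is written as a mapping from key to expected type. `SCHEMA` and `conforms` are hypothetical names, and real schema languages are considerably richer than this.

```python
# Hypothetical schema: each key maps to the type its value must have.
SCHEMA = {"id": int, "name": str, "price": float}


def conforms(record, schema):
    """Return True if the record has exactly the schema's keys with matching types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[key], kind) for key, kind in schema.items())


print(conforms({"id": 1, "name": "pen", "price": 1.5}, SCHEMA))    # True
print(conforms({"id": 1, "name": "pen"}, SCHEMA))                  # False: key missing
print(conforms({"id": "1", "name": "pen", "price": 1.5}, SCHEMA))  # False: wrong type
```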
What is the role of planning in using a dictionary of keys and values as a data representation format
Planning is essential in using a dictionary of keys and values as a data representation format to ensure that the data capture can feed directly into the schema and to ensure that the schema is designed in a way that supports effective storage, retrieval, and usage of data
What is unstructured data in the context of data representation
Unstructured data is a data representation format that does not conform to a defined data model structure. It is prevalent in the real world but can be difficult to parse and use without some intermediary steps to provide at least some structure
What is semi-structured data in the context of data representation
Semi-structured data is a data representation format where not all records are represented by the same keys. It is prevalent in the real world but can be difficult to parse and use without some intermediary steps to provide at least some structure
What are some challenges of using unstructured or semi-structured data as a data representation format
Difficulty of parsing and using the data without some intermediary steps to provide at least some structure. The lack of a defined data model structure can also make it challenging to ensure consistency and organisation of the data
What are some strategies for working with unstructured or semi-structured data
Some strategies for working with unstructured or semi-structured data include using natural language processing (NLP) techniques to extract structured data from unstructured text, using regular expressions or other pattern-matching tools to extract relevant data, and using data wrangling or data munging techniques to transform the data into a more structured format for analysis
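The regular-expression strategy can be illustrated with a short sketch that turns free text into a list of dictionaries. The sentence, the pattern, and the field names `order_id` and `shipped` are assumptions made for this example.

```python
import re

# Hypothetical unstructured text.
text = "Order 1042 shipped on 2023-05-17; order 1043 shipped on 2023-05-19."

# Capture an order number and an ISO-style date from each matching phrase.
pattern = re.compile(r"[Oo]rder (\d+) shipped on (\d{4}-\d{2}-\d{2})")

records = [
    {"order_id": int(match.group(1)), "shipped": match.group(2)}
    for match in pattern.finditer(text)
]

for record in records:
    print(record)
```

Once the text is in this form, each record has the same keys and can be handled like any other structured data.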
What are some benefits of using visualisations in data analysis
Visualisations help us explore and gain insights into distributions, relationships, compositions, and comparisons across datasets. They also help to develop initial questions and communicate findings effectively
What types of visualisations are commonly used to explore data
Bar charts, histograms, scatter plots, …
What is a bar chart used for in data visualisation
A bar chart is commonly used to represent counts of categorical features in a dataset
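The counts behind a bar chart can be computed with `collections.Counter`. The sketch below prints a rough text version of the chart. The colour data and the helper `text_bar_chart` are hypothetical, and a plotting library would normally draw the real chart.

```python
from collections import Counter

# Hypothetical categorical feature.
colours = ["red", "blue", "red", "green", "blue", "red"]

# One count per category: this is what each bar's height represents.
counts = Counter(colours)


def text_bar_chart(counts):
    """Render one line per category, most frequent first."""
    lines = []
    for category, n in counts.most_common():
        lines.append(f"{category:<6} {'#' * n} ({n})")
    return "\n".join(lines)


print(text_bar_chart(counts))
```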
What is a histogram used for in data visualisation
A histogram is commonly used to show the distribution of a continuous variable across a range of values, with values binned into brackets
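Binning is the step that distinguishes a histogram from a bar chart. The sketch below, assuming equal-width bins over a fixed range, counts how many values fall into each bracket. The function `histogram` and the height data are hypothetical.

```python
def histogram(values, low, high, n_bins):
    """Count values in n_bins equal-width bins spanning [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        if low <= value <= high:
            # The top edge belongs to the last bin rather than a new one.
            index = min(int((value - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical heights in centimetres.
heights = [150, 155, 161, 164, 168, 170, 171, 179, 183, 190]

bin_counts = histogram(heights, low=150, high=190, n_bins=4)
print(bin_counts)  # [2, 3, 3, 2] for bins 150-160, 160-170, 170-180, 180-190
```

Changing `n_bins` changes the shape of the histogram, so it is worth trying more than one bin width before drawing conclusions about the distribution.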
What is a scatter plot used for in data visualisation
A scatter plot is commonly used to show the relationship between two variables within multivariate data
Why is it important to use visualisation correctly in data analysis
Visualisations can have a powerful impact on how data is interpreted and understood. Misleading or poorly designed visualisations can lead to incorrect conclusions or miscommunication of results
What is the purpose of creating a model in data analysis
To form a representation or function that can predict or explain the underlying behaviour observed in the data
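As one small example of such a function, a straight line can be fitted to paired observations by ordinary least squares and then used to predict. This is a sketch under the assumption that a linear model is appropriate. The data and the names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to a new input."""
    return slope * x + intercept


# Hypothetical observations that happen to lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)               # 2.0 1.0
print(predict(5, slope, intercept))   # 11.0
```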
What is a function in data analysis
In data analysis, a function is a mathematical expression or algorithm that takes input values and produces output values. It is used to perform various operations on data, such as transforming, manipulating, or summarizing it.
What is a top-down approach to modelling in data analysis
In data analysis, a top-down approach to modelling refers to a methodology where you start with a high-level overview or conceptual understanding of the problem and then gradually break it down into smaller components or sub-models. This approach involves decomposing the problem into manageable parts and developing models for each part, which are then integrated to form a complete solution.
What is a bottom-up approach to modelling in data analysis
In data analysis, a bottom-up approach to modelling refers to a methodology where you start with individual data points or observations and gradually build up to form higher-level insights or conclusions. This approach involves analyzing and aggregating individual data elements to derive patterns, relationships, and models at a higher level.
What are some ways to validate the success of an investigation in data analysis
Some ways to validate the success of an investigation in data analysis include using performance metrics or measures, such as accuracy, the number of failures vs successes, or the distance from the expected output. Other methods may include cross-validation, comparing results to a baseline or benchmark, or conducting a peer review
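Two of the measures mentioned above, accuracy and distance from the expected output, can be computed in a few lines, along with a comparison against a simple baseline. The labels, predictions, and the majority-class baseline below are assumptions chosen for illustration.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of predictions that match the expected labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def mean_absolute_error(predicted, actual):
    """Average distance between predicted and expected numeric outputs."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical classification results.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]

# Baseline: always predict the most common label.
majority = Counter(actual).most_common(1)[0][0]
baseline = [majority] * len(actual)

print(accuracy(predicted, actual))  # 0.8
print(accuracy(baseline, actual))   # 0.6

# Hypothetical numeric predictions.
print(mean_absolute_error([2.5, 4.0, 6.0], [3.0, 4.0, 5.0]))  # 0.5
```

A model is only convincing if it beats the baseline. Here the predictions reach 0.8 accuracy against 0.6 for always guessing the majority label.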
What are some considerations for summarizing and communicating insights in data analysis
When summarizing and communicating insights in data analysis, it is important to reflect on whether the investigation answered the original problem and identify any key insights that were gained. It is also helpful to consider what may be of interest going forward, what can be done differently, and what the next steps might be. It may be useful to use clear, concise language and visual aids to communicate insights effectively. Finally, it is important to remain open to feedback and further investigation as needed.