Data Science Methodology Flashcards

1
Q

What is the PPDAC model

A

The PPDAC model is a structured approach used to carry out investigative research in general. It works well for data science because of its exploratory and inquisitive nature

2
Q

What does PPDAC stand for

A

Problem, Plan, Data, Analysis, Conclusion

3
Q

What is the first step in PPDAC model

A

The first step is identifying the Problem or question that needs to be answered

4
Q

What is the second step in PPDAC model

A

The second step is to create a Plan for how to approach the problem and gather data

5
Q

What is the third step in PPDAC model

A

The third step is gathering and cleaning Data that is relevant to the problem being investigated

6
Q

What is the fourth step in PPDAC model

A

The fourth step is conducting an Analysis of the data to draw insights and identify patterns

7
Q

What is the final step in PPDAC model

A

The final step is drawing a Conclusion based on the analysis and using it to solve the original problem or answer the original question

8
Q

What kinds of questions are asked to understand the problem

A

How much/many…, Which category does this belong to…, Are there similar kinds…, Is this strange…, What action should I take when…, …

9
Q

What are some academic planning considerations in the PPDAC model

A

Identifying what data is needed, determining how much data will be needed and evaluating the solution

10
Q

What are some practical considerations in the PPDAC model

A

Some practical considerations in the PPDAC model include determining where the data will come from and whether it needs to be collected, deciding where and how the data will be stored and considering any legal or ethical implications

11
Q

What is the importance of identifying what data is needed in the PPDAC model

A

Identifying what data is needed is important in the PPDAC model because it helps ensure that the analysis is relevant and meaningful to the problem being investigated

12
Q

Why is it important to consider legal and ethical implications in the PPDAC model

A

It is important to consider legal and ethical implications in the PPDAC model to ensure that the investigation is conducted in a responsible and lawful manner and that any potential harm or negative impact is minimized

13
Q

What is the purpose of evaluating the solution in the PPDAC model

A

The purpose of evaluating the solution in the PPDAC model is to determine whether the solution effectively addresses the original problem or question and to identify any areas for improvement

14
Q

What is the role of data storage in the PPDAC model

A

The role of data storage in the PPDAC model is to ensure that the data is easily accessible, organised, and secure throughout the investigation and analysis process

15
Q

Why is it important to determine whether data needs to be collected in the PPDAC model

A

It is important to determine whether data needs to be collected in the PPDAC model to ensure that the investigation is conducted efficiently and effectively and that the analysis is based on relevant and accurate data

16
Q

What are the steps involved in data processing in the PPDAC model

A

Obtaining the data, conducting quality checks, cleaning the data, and addressing any missing values
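
The four steps above can be sketched in a few lines of Python. This is a minimal illustration only: the CSV text, the column names, and the helper functions are all hypothetical, and dropping incomplete rows is just one of several ways to address missing values.

```python
import csv
import io

# Hypothetical raw data; in practice this would come from a file or an API.
RAW_CSV = """id,height_cm,weight_kg
1,170,65
2,,72
3,165,abc
4,180,80
"""


def load_rows(text):
    """Obtain the data: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def to_float(value):
    """Quality check: return a float, or None if the value is missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean(rows):
    """Clean the data: convert fields to numbers, marking bad values as None."""
    return [
        {
            "id": int(row["id"]),
            "height_cm": to_float(row["height_cm"]),
            "weight_kg": to_float(row["weight_kg"]),
        }
        for row in rows
    ]


def drop_incomplete(rows):
    """Address missing values: here, simply drop rows that contain a None."""
    return [row for row in rows if None not in row.values()]


rows = load_rows(RAW_CSV)
cleaned = clean(rows)
complete = drop_incomplete(cleaned)
print(len(rows), "rows loaded;", len(complete), "complete rows kept")
```

Row 2 has an empty height and row 3 has a non-numeric weight, so only rows 1 and 4 survive the final step.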

17
Q

What is the role of data management in the PPDAC model

A

Determining how the data will be represented or stored, ensuring proper storage and maintenance of the data, and managing access rights to the data

18
Q

What are some common ways that data is represented or stored in the PPDAC model

A

Tabular formats (2D table with each row as an observation and each column as a measurement), structured formats (each observation is represented by a dictionary of keys and values), and semi-structured formats (not all records are represented by the same keys)
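
As a rough illustration, the three formats might look like this in Python. The names, fields, and the `to_table` helper are made up for the example; the last step shows one possible way to flatten semi-structured records into a table by filling absent keys with a placeholder.

```python
# Tabular: every row has the same columns, in the same order.
header = ["name", "age", "city"]
table = [
    ["Ada", 36, "London"],
    ["Grace", 45, "New York"],
]

# Structured: each observation is a dictionary with the same keys.
structured = [dict(zip(header, row)) for row in table]

# Semi-structured: records share some keys but not all of them.
semi_structured = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Grace", "email": "grace@example.com"},
]


def to_table(records, columns, missing=None):
    """Flatten records into rows, filling absent keys with a placeholder."""
    return [[record.get(col, missing) for col in columns] for record in records]


# Collect every key that appears in any record, keeping first-seen order.
all_keys = list(dict.fromkeys(key for record in semi_structured for key in record))
flattened = to_table(semi_structured, all_keys)

print(all_keys)
for row in flattened:
    print(row)
```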

19
Q

What is the purpose of conducting quality checks on data in the PPDAC model

A

To ensure that the data is accurate, complete, and reliable for analysis purposes

20
Q

Why is it important to address missing values in the data cleaning step of the PPDAC model

A

To ensure that the analysis is based on complete and accurate data and to avoid any potential biases or errors in the results
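
A small sketch of how the handling of missing values can bias a result. The readings are invented for the example, and mean imputation is only one possible strategy; the point is that silently treating missing values as zero drags the average down.

```python
from statistics import mean

# Hypothetical temperature readings; None marks a missing value.
readings = [20.0, None, 22.0, None, 24.0]

# Option 1: ignore the missing readings when averaging.
observed = [r for r in readings if r is not None]
mean_observed = mean(observed)

# Option 2 (a common mistake): treat missing readings as zero.
zero_filled = [r if r is not None else 0.0 for r in readings]
mean_zero_filled = mean(zero_filled)

# Option 3: impute missing readings with the mean of the observed ones.
imputed = [r if r is not None else mean_observed for r in readings]

print("mean of observed values:", mean_observed)
print("mean with zeros filled in:", mean_zero_filled)
print("imputed series:", imputed)
```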

21
Q

What is the importance of managing access rights in the PPDAC model

A

To ensure that the data is only accessed by authorized individuals and to protect the privacy and confidentiality of the data

22
Q

What is a 2D table in the context of data representation

A

A 2D table is a data representation format where each row represents an observation or a single data point, and each column represents a variable, attribute, or feature of the data. The table contains a set of observations of the same kind or purpose, and it is relatively easy to set up and populate

23
Q

What are some common characteristics of 2D tables as a data representation format

A

Their easy-to-parse format for existing tools, their clear and organised presentation of data, and their suitability for representing homogeneous data

24
Q

What are some potential drawbacks of using 2D tables as a data representation format

A

Their idealistic assumption that data is homogeneous and fits neatly into a table structure, which may not always be the case. In some scenarios, 2D tables may not be a realistic choice for representing data due to the data’s complexity or variability

25
Q

What is a dictionary of keys and values in the context of data representation

A

Each observation is represented by a well-defined structure, typically in the form of a dictionary with keys and values. This format allows for effective storage, retrieval and usage of data

26
Q

What are some advantages of using a dictionary of keys and values as a data representation format

A

Well-defined structure that allows for effective storage, retrieval and usage of data, and its suitability for industrial settings where proprietary data is collected. The format is defined via a schema, which provides rules for what can be stored and how, and if planned well, the data capture can feed directly into this schema and then be stored within a relational database such as MySQL

27
Q

What is a schema in the context of data representation

A

A schema is a set of rules that defines the structure and format of a data representation, such as a dictionary of keys and values. It provides guidelines for what can be stored and how, and ensures consistency and organization of the data
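
To make the idea concrete, here is a minimal sketch of a schema enforcing rules on stored records. It uses Python's built-in `sqlite3` module as a stand-in for a relational database such as MySQL; the `reading` table, its columns, and the `store` helper are hypothetical.

```python
import sqlite3

# An in-memory database stands in for a real relational database server.
conn = sqlite3.connect(":memory:")

# The schema: which fields exist, their types, and which values are allowed.
conn.execute(
    """
    CREATE TABLE reading (
        sensor_id   TEXT NOT NULL,
        temperature REAL NOT NULL CHECK (temperature > -273.15),
        recorded_at TEXT NOT NULL
    )
    """
)


def store(record):
    """Try to insert one record; return True if the schema accepts it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO reading (sensor_id, temperature, recorded_at) "
                "VALUES (:sensor_id, :temperature, :recorded_at)",
                record,
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = store({"sensor_id": "s1", "temperature": 21.5,
            "recorded_at": "2024-01-01T12:00:00"})
rejected = store({"sensor_id": "s2", "temperature": None,
                  "recorded_at": "2024-01-01T12:05:00"})
stored_count = conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]
print(ok, rejected, stored_count)
```

The second record is refused because its temperature is missing, so only data that satisfies the schema reaches storage.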

28
Q

What is the role of planning in using a dictionary of keys and values as a data representation format

A

Planning is essential in using a dictionary of keys and values as a data representation format to ensure that the data capture can feed directly into the schema and to ensure that the schema is designed in a way that supports effective storage, retrieval, and usage of data

29
Q

What is unstructured data in the context of data representation

A

Unstructured data is a data representation format that does not conform to a defined data model structure. It is prevalent in the real world but can be difficult to parse and use without some intermediary steps to provide at least some structure

30
Q

What is semi-structured data in the context of data representation

A

Semi-structured data is a data representation format where not all records are represented by the same keys. It is prevalent in the real world but can be difficult to parse and use without some intermediary steps to provide at least some structure

31
Q

What are some challenges of using unstructured or semi-structured data as a data representation format

A

Difficulty of parsing and using the data without some intermediary steps to provide at least some structure. The lack of a defined data model structure can also make it challenging to ensure consistency and organisation of the data

33
Q

What are some strategies for working with unstructured or semi-structured data

A

Some strategies for working with unstructured or semi-structured data include using natural language processing (NLP) techniques to extract structured data from unstructured text, using regular expressions or other pattern-matching tools to extract relevant data, and using data wrangling or data munging techniques to transform the data into a more structured format for analysis
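
The regular-expression strategy might be sketched as follows. The free-text notes and the three patterns are invented for illustration, and the email pattern is deliberately simplistic rather than a complete validator.

```python
import re

# Hypothetical free-text notes, i.e. unstructured data.
notes = [
    "Order #1042 shipped on 2024-03-05 to alice@example.com",
    "Customer called about order #1077, no email on file",
    "Refund issued 2024-03-09 (order #1042)",
]

# Each pattern captures the value of interest in group 1.
ORDER_RE = re.compile(r"#(\d+)")
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
EMAIL_RE = re.compile(r"([\w.+-]+@[\w-]+\.[\w.-]+)")


def first_match(pattern, text):
    """Return the first captured value, or None if the pattern is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Turn each note into a record with the same keys (a structured format).
records = [
    {
        "order": first_match(ORDER_RE, note),
        "date": first_match(DATE_RE, note),
        "email": first_match(EMAIL_RE, note),
    }
    for note in notes
]

for record in records:
    print(record)
```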

34
Q

What are some benefits of using visualisations in data analysis

A

Visualisations help us explore and gain insights into distributions, relationships, compositions, and comparisons across datasets. They also help to develop initial questions and communicate findings effectively

35
Q

What types of visualisations are commonly used to explore data

A

Bar charts, histograms, scatter plots, …

36
Q

What is a bar chart used for in data visualisation

A

A bar chart is commonly used to represent counts of categorical features in a dataset

37
Q

What is a histogram used for in data visualisation

A

A histogram is commonly used to show the distribution of a continuous variable across a range of values, with values binned into brackets
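
The numbers behind a bar chart (counts per category) and a histogram (counts per bin) can be computed without a plotting library. This is a sketch with invented data; the `histogram` helper and its equal-width binning are an assumption, since plotting tools offer other binning rules.

```python
from collections import Counter

# Hypothetical data: one categorical feature and one continuous feature.
colours = ["red", "blue", "red", "green", "blue", "red"]
heights = [150.0, 152.5, 161.0, 167.5, 168.0, 171.0, 179.5, 180.0]

# Bar chart data: one count per category.
category_counts = Counter(colours)


def histogram(values, low, high, bins):
    """Count values in equal-width bins over [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if low <= value <= high:
            # The top edge belongs to the last bin.
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return counts


# Histogram data: heights binned into 150-160, 160-170, 170-180.
bin_counts = histogram(heights, low=150, high=180, bins=3)

for category, count in category_counts.most_common():
    print(f"{category:<6}{'#' * count}")
for i, count in enumerate(bin_counts):
    print(f"{150 + 10 * i}-{160 + 10 * i} {'#' * count}")
```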

38
Q

What is a scatter plot used for in data visualisation

A

A scatter plot is commonly used to show the relationship between two variables within multivariate data

39
Q

Why is it important to use visualisation correctly in data analysis

A

Visualisations can have a powerful impact on how data is interpreted and understood. Misleading or poorly designed visualisations can lead to incorrect conclusions or miscommunication of results

40
Q

What is the purpose of creating a model in data analysis

A

To form a representation or function that can predict or explain the underlying behaviour observed in the data
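
As one simple example of such a function, a straight line can be fitted to data by ordinary least squares and then used to predict unseen values. The study-hours data and the function names below are hypothetical, and a linear model is only one of many possible choices.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]

slope, intercept = fit_line(hours, scores)


def predict(x):
    """The fitted model: a function from input to predicted output."""
    return slope * x + intercept


print(f"score = {slope:.1f} * hours + {intercept:.1f}")
print("predicted score for 6 hours:", predict(6))
```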

41
Q

What is a function in data analysis

A

In data analysis, a function is a mathematical expression or algorithm that takes input values and produces output values. It is used to perform various operations on data, such as transforming, manipulating, or summarizing it.

42
Q

What is a top-down approach to modelling in data analysis

A

In data analysis, a top-down approach to modeling refers to a methodology where you start with a high-level overview or conceptual understanding of the problem and then gradually break it down into smaller components or sub-models. This approach involves decomposing the problem into manageable parts and developing models for each part, which are then integrated to form a complete solution.

43
Q

What is a bottom-up approach to modelling in data analysis

A

In data analysis, a bottom-up approach to modeling refers to a methodology where you start with individual data points or observations and gradually build up to form higher-level insights or conclusions. This approach involves analyzing and aggregating individual data elements to derive patterns, relationships, and models at a higher level.

44
Q

What are some ways to validate the success of an investigation in data analysis

A

Some ways to validate the success of an investigation in data analysis include using performance metrics or measures, such as accuracy, the number of failures vs successes, or the distance from the expected output. Other methods may include cross-validation, comparing results to a baseline or benchmark, or conducting a peer review
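
Two of these measures, accuracy and distance from the expected output, can be sketched as follows, together with a comparison against a baseline. The labels and values are invented, and the majority-class baseline is just one common choice.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the expected output exactly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def mean_absolute_error(predicted, actual):
    """Average distance between predictions and the expected output."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical classification results.
actual_labels = ["spam", "ham", "spam", "ham", "ham"]
model_labels = ["spam", "ham", "ham", "ham", "ham"]
# Baseline: always predict the most common class.
baseline_labels = ["ham"] * len(actual_labels)

model_accuracy = accuracy(model_labels, actual_labels)
baseline_accuracy = accuracy(baseline_labels, actual_labels)

# Hypothetical numeric predictions.
actual_values = [10.0, 12.0, 15.0]
predicted_values = [11.0, 12.0, 13.0]
error = mean_absolute_error(predicted_values, actual_values)

print("model accuracy:", model_accuracy)
print("baseline accuracy:", baseline_accuracy)
print("mean absolute error:", error)
```

A model is only worth keeping if it beats the baseline; here it classifies 4 of 5 cases correctly against the baseline's 3 of 5.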

45
Q

What are some considerations for summarizing and communicating insights in data analysis?

A

When summarizing and communicating insights in data analysis, it is important to reflect on whether the investigation answered the original problem and identify any key insights that were gained. It is also helpful to consider what may be of interest going forward, what can be done differently, and what the next steps might be. It may be useful to use clear, concise language and visual aids to communicate insights effectively. Finally, it is important to remain open to feedback and further investigation as needed.