3 Collecting Data Flashcards

1
Q

What is the main focus of this chapter?

A

Sources of data and methods for collecting it

This includes public sources, collecting personal data, and optimizing queries.

2
Q

What are the two main categories of public databases?

A

Government and industry

Government databases are typically run by governmental entities, while industry databases focus on specific industries.

3
Q

What type of data can government-run databases provide?

A
  • Population
  • Agriculture
  • Utilities
  • Public health concerns

The availability of data depends on the government entity and location.

4
Q

What is a downside of using government databases?

A

They often contain missing data and mistakes

This requires significant time for data cleaning.

5
Q

What is Kaggle known for?

A

It is a platform that hosts open-source datasets, which makes it a valuable resource for data analysis

Kaggle also offers courses and competitions for those in the data analytics community.

6
Q

What does an API do?

A

Allows two unrelated computer systems to exchange information

APIs act as intermediaries for communication between systems.

7
Q

What are the two ways information can be passed through APIs?

A
  • Synchronous
  • Asynchronous

Synchronous requests wait for a response before doing anything else, while asynchronous requests let the calling code continue processing while the response is pending.

8
Q

What is web scraping?

A

The process of collecting data directly from a web page

This method differs from accessing data through databases or APIs.
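
As an illustration of pulling data straight out of page markup, here is a minimal sketch using only Python's standard library. The HTML string, the town names, and the numbers are made up for the example; a real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
PAGE = """
<html><body>
  <table>
    <tr><td>Town A</td><td>1200</td></tr>
    <tr><td>Town B</td><td>850</td></tr>
  </table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new table row begins
        elif tag == "td":
            self._in_cell = True      # start capturing cell text

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())


scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)
```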

9
Q

What is a survey?

A

A set of questions given to a sample of individuals

Surveys are used to gather data about a larger population.

10
Q

What is the difference between a population and a sample?

A

A population includes all individuals in a group, while a sample is a smaller subset

Sampling is used to generalize findings to the larger population.

11
Q

What are the main types of survey answers?

A
  • Text-based
  • Single-choice
  • Multiple-choice
  • Drop-down
  • Likert

Each type has its own pros and cons for data analysis.

12
Q

What is a Likert scale used for?

A

Gauging opinions or effectiveness on a specific topic

Respondents indicate their level of agreement with statements.
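
One common way to analyze Likert answers is to code each level of agreement as a number. The sketch below assumes a five-point scale and uses made-up responses; the labels and the 1 to 5 coding are illustrative choices, not something fixed by the card.

```python
from statistics import mean

# Assumed five-point coding; other scale lengths are also in use.
LIKERT_CODES = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

# Made-up responses to a single statement.
responses = ["Agree", "Strongly agree", "Neutral", "Agree", "Disagree"]

# Convert each text answer to its numeric code, then summarize.
scores = [LIKERT_CODES[r] for r in responses]
average = mean(scores)
print(scores, average)
```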

13
Q

What is the challenge with text-based survey answers?

A

They are difficult to analyze due to variability in responses

Natural language processing may be needed for interpretation.

14
Q

What is survey bias?

A

A tendency that skews results and affects data accuracy

Analysts must actively avoid bias to ensure valid results.

15
Q

What is the primary advantage of asynchronous API requests?

A

They allow for faster and more efficient processing

The code does not have to wait for a response before continuing.
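
The idea of not waiting can be sketched with Python's `asyncio`. The function `fake_request` is a hypothetical stand-in for a real API call, and `asyncio.sleep` only simulates network latency; no actual request is made.

```python
import asyncio


async def fake_request(name: str, delay: float, log: list[str]) -> str:
    """Stand-in for an API call; asyncio.sleep simulates network latency."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)  # control returns to the event loop here
    log.append(f"done {name}")
    return f"response from {name}"


async def fetch_all(log: list[str]) -> list[str]:
    # All three requests are in flight at once; none blocks the others.
    return await asyncio.gather(
        fake_request("a", 0.06, log),
        fake_request("b", 0.04, log),
        fake_request("c", 0.02, log),
    )


log: list[str] = []
results = asyncio.run(fetch_all(log))
print(results)  # results come back in the order the requests were listed
print(log)      # every request starts before any of them finishes
```

A synchronous version would finish request "a" before even starting "b", so the total time would be roughly the sum of the three delays rather than the longest one.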

16
Q

What does the term ‘open sources’ refer to?

A

Datasets made available for free by individuals or companies

These datasets cover a wide range of topics.

17
Q

What are the limitations of industry-run public databases?

A

They tend to share only the legal minimum and are not always easy to find

However, they are usually cleaner than government databases.

18
Q

What is the main function of web services?

A

A specific kind of API that exchanges information between computers over a network, typically using web protocols such as HTTP

All web services are APIs, but not all APIs are web services.

19
Q

What is the importance of filtering by license when using datasets?

A

Not all datasets can be used commercially

It’s crucial to check licenses if the data is for business use.

20
Q

What does the acronym ETL stand for?

A

Extract, Transform, Load

It refers to a process for moving and transforming data.

21
Q

What does OLTP stand for?

A

Online Transaction Processing

It involves managing transaction-oriented applications.

22
Q

What does OLAP stand for?

A

Online Analytical Processing

It is used for complex analytical queries and data modeling.

23
Q

What is a common application of surveys?

A

Collecting information on demographics and customer satisfaction

Surveys can be administered electronically or in person.

24
Q

What is a Likert scale?

A

A scale set up as a single-choice question where the question is a statement and answers are on a scale.

25
Q

What does survey bias refer to?

A

The difference between the expected value and the actual value in survey results.

26
Q

What are common types of survey bias?

A
  • Order bias
  • Leading questions
  • Recall bias
27
Q

How can order bias be avoided in surveys?

A

By randomizing the order of questions and answers.
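
Randomization of this kind can be sketched in a few lines. The function name `randomized_survey`, the sample questions, and the per-respondent `seed` are all hypothetical, chosen only to make the example reproducible.

```python
import random


def randomized_survey(
    questions: dict[str, list[str]], seed: int
) -> list[tuple[str, list[str]]]:
    """Return the questions, and each question's options, in shuffled order.

    `seed` is a hypothetical per-respondent value so that each respondent
    gets a different but reproducible ordering.
    """
    rng = random.Random(seed)
    # Shuffle the answer options of each question without touching the input.
    items = [(q, rng.sample(opts, k=len(opts))) for q, opts in questions.items()]
    # Then shuffle the order of the questions themselves.
    rng.shuffle(items)
    return items


questions = {
    "Which color do you prefer?": ["Red", "Green", "Blue"],
    "Which device do you use most?": ["Phone", "Laptop", "Tablet"],
    "Which meal do you skip most often?": ["Breakfast", "Lunch", "Dinner"],
}
order_1 = randomized_survey(questions, seed=1)
print(order_1)
```

Ordered scales such as Likert options are usually left in their natural order; this sketch is meant for unordered answer options.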

28
Q

What is recall bias?

A

Occurs when respondents are asked to remember details from the past that they may not accurately recall.

29
Q

What should be included as an answer option to avoid recall bias?

A

‘I don’t know’.

30
Q

What is the process of observation in research?

A

Witnessing something and recording it.

31
Q

What is an A-B study?

A

A study comparing two designs to see which performs better in terms of a specific goal.
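
As a rough sketch of the comparison, assume the goal is a conversion (for example, a sign-up) and that visitor and conversion counts were recorded for each design. All numbers below are made up.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Share of visitors who completed the goal."""
    return conversions / visitors


# Made-up counts for the two designs.
design_a = {"visitors": 1000, "conversions": 50}
design_b = {"visitors": 1000, "conversions": 65}

rate_a = conversion_rate(design_a["conversions"], design_a["visitors"])
rate_b = conversion_rate(design_b["conversions"], design_b["visitors"])

# Pick the design with the higher observed rate.
better = "B" if rate_b > rate_a else "A"
print(rate_a, rate_b, better)
```

A real study would also check whether the observed difference is statistically significant before declaring a winner.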

32
Q

What are the three steps of a data pipeline?

A
  • Extraction
  • Transformation
  • Loading
33
Q

What does ETL stand for?

A

Extract, Transform, Load.

34
Q

Describe the ETL process.

A

Data is extracted, transformed, and then loaded to the destination.
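
The three steps can be sketched as three small functions. The CSV text, the `sales` table, and the cleaning rules are assumptions made for the example, and an in-memory SQLite database stands in for the destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data with inconsistent spacing and casing.
RAW_CSV = "name,amount\n alice ,10.5\nBOB,3\n"


def extract(text: str) -> list[dict]:
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict]) -> list[tuple[str, float]]:
    """Transform: tidy the names and convert amounts to numbers."""
    return [(r["name"].strip().title(), float(r["amount"])) for r in rows]


def load(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> None:
    """Load: write the cleaned rows to the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)
```

In an ELT pipeline the raw rows would be loaded first and the cleaning would run inside the destination instead.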

35
Q

What does ELT stand for?

A

Extract, Load, Transform.

36
Q

How does the ELT process differ from ETL?

A

Data is extracted, loaded, and then transformed in the destination environment.

37
Q

What is a delta load?

A

A loading method that only loads data that has changed since the last load.
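
One way to implement this is to remember a "watermark", the latest change already loaded, and copy only rows newer than it. The sketch below assumes every source row carries an `updated_at` counter; the row layout and the `delta_load` function are hypothetical.

```python
# Hypothetical source rows; `updated_at` is an assumed change counter.
source = [
    {"id": 1, "value": "a", "updated_at": 1},
    {"id": 2, "value": "b", "updated_at": 2},
]
destination: dict[int, dict] = {}


def delta_load(source, destination, last_loaded):
    """Copy only rows changed after `last_loaded`.

    Returns the new watermark and the number of rows copied.
    """
    changed = [row for row in source if row["updated_at"] > last_loaded]
    for row in changed:
        destination[row["id"]] = dict(row)  # insert or overwrite by id
    new_mark = max((row["updated_at"] for row in changed), default=last_loaded)
    return new_mark, len(changed)


# First run: nothing loaded yet, so every row counts as changed.
watermark, first_count = delta_load(source, destination, last_loaded=0)

# The source changes: row 2 is updated and row 3 is added.
source[1] = {"id": 2, "value": "b2", "updated_at": 3}
source.append({"id": 3, "value": "c", "updated_at": 4})

# Second run: only the two changed rows are copied; row 1 is skipped.
watermark, second_count = delta_load(source, destination, watermark)
print(first_count, second_count, watermark)
```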

38
Q

What is OLTP?

A

Online Transaction Processing, which automatically stores and processes data from online transactions.

39
Q

What is OLAP?

A

Online Analytical Processing, which analyzes data collected by OLTP.

40
Q

What is the main difference between OLTP and OLAP?

A

OLTP is for data collection, while OLAP is for data analysis.

41
Q

What is a query in data analytics?

A

A request for information from a database.

42
Q

What is filtering in the context of querying?

A

The process of being selective about which data is queried using conditional logic.

43
Q

What are subsets in data querying?

A

Smaller sections of a dataset created by filtering.

44
Q

How can filtering improve query efficiency?

A

By reducing the amount of data pulled, leading to faster processing.
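
A small illustration using Python's built-in `sqlite3` module: the `WHERE` clause applies the conditional logic, so only the matching subset leaves the database. The `orders` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 20.0), (2, "south", 75.0), (3, "north", 120.0), (4, "south", 15.0)],
)

# Unfiltered: every row and every column is pulled.
all_rows = conn.execute("SELECT id, region, total FROM orders").fetchall()

# Filtered: only the rows (and columns) that are actually needed.
subset = conn.execute(
    "SELECT id, total FROM orders WHERE region = 'north' AND total > 50"
).fetchall()

print(len(all_rows), subset)
```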

45
Q

What is indexing in database management?

A

Assigning a unique ascending number to every entry in a table.

46
Q

What is sorting in the context of data?

A

Arranging rows in a different order based on specific logic.

47
Q

True or False: ETL is generally preferred for simple transformations.

48
Q

True or False: ELT is faster than ETL when dealing with complicated transformations.

49
Q

What are the main loading methods in data pipelines?

A
  • Full load
  • Delta load
50
Q

Define full load in data pipelines.

A

Every time the pipeline is run, the entire dataset is extracted, transformed, and loaded.

51
Q

What is the role of automated observations?

A

To conveniently collect data without physical observation.

52
Q

What is one disadvantage of automated observations?

A

They are limited in the types of information they can collect.

53
Q

What is sorting in the context of data management?

A

Sorting is simply arranging the rows in a different order according to some logic.

54
Q

How can sorting improve the efficiency of filtering?

A

Sorting can save time by placing all the rows you want next to each other in the table.

55
Q

True or False: Sorting always improves processing efficiency.

56
Q

What is the primary role of indexing?

A

Query optimization: an index lets the database locate matching rows without scanning the entire table.

57
Q

What is parameterization in database queries?

A

Parameterization is the use of a prewritten query in which the user supplies only specific parameter values.

58
Q

What is the main reason for using parameterization?

A

The biggest reason is cybersecurity: parameterized queries protect the database from SQL injection attacks.
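
The protection can be seen in a short sketch with Python's `sqlite3` module, where `?` is the placeholder for a parameter. The `users` table and the `find_user` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", "admin"), ("ben", "analyst")],
)


def find_user(conn: sqlite3.Connection, name: str) -> list[tuple[str, str]]:
    # The ? placeholder keeps user input as data, never as SQL text.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


normal = find_user(conn, "ana")
# A classic injection attempt is treated as an ordinary (non-matching) name.
attack = find_user(conn, "x' OR '1'='1")
print(normal, attack)
```

Had the query been built by pasting the input into the SQL string, the second call would have matched every row.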

59
Q

What are temporary tables?

A

Temporary tables are tools that save the results of a query as their own table.

60
Q

What is a key limitation of temporary tables?

A

They are temporary and do not update when the source data updates.

61
Q

Fill in the blank: A temporary table can be created using the command ______.

A

CREATE TEMPORARY TABLE.
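
A sketch of the command in use, run through Python's `sqlite3` module. The `sales` table and the `region_totals` name are made up; the last query also shows that the temporary table is a snapshot that does not follow later changes to the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 5.0)],
)

# Save the result of an aggregate query as its own temporary table.
conn.execute(
    "CREATE TEMPORARY TABLE region_totals AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# Reuse the saved result several times without re-running the aggregation.
top = conn.execute(
    "SELECT region FROM region_totals ORDER BY total DESC LIMIT 1"
).fetchone()[0]
count = conn.execute("SELECT COUNT(*) FROM region_totals").fetchone()[0]

# The source changes, but the temporary table keeps its old values.
conn.execute("INSERT INTO sales VALUES ('south', 100.0)")
south_total = conn.execute(
    "SELECT total FROM region_totals WHERE region = 'south'"
).fetchone()[0]
print(top, count, south_total)
```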

62
Q

What are subqueries?

A

Subqueries are queries embedded inside another query.
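
For example, an inner query can compute a value that the outer query then filters on. The sketch uses Python's `sqlite3` module with a made-up `sales` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 10.0), ("b", 20.0), ("c", 60.0)],
)

# The inner SELECT computes the average; the outer SELECT keeps rows above it.
above_avg = conn.execute(
    "SELECT name, amount FROM sales "
    "WHERE amount > (SELECT AVG(amount) FROM sales) "
    "ORDER BY name"
).fetchall()
print(above_avg)
```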

63
Q

True or False: Subqueries are generally considered more efficient than using temporary tables.

64
Q

What is an execution plan?

A

An execution plan shows how a query will be executed and may include a graphical representation.

65
Q

What is the difference between an estimated execution plan and an actual execution plan?

A

An estimated execution plan gives a rough idea, while an actual execution plan provides specifics after a query is run.
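
As a small illustration, SQLite exposes its plan through `EXPLAIN QUERY PLAN`, which describes the planned strategy without running the query, so it is closest to an estimated plan. The `orders` table, the index name, and the `plan` helper are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")

QUERY = "SELECT total FROM orders WHERE customer = 'ana'"


def plan(conn: sqlite3.Connection, sql: str) -> list[str]:
    # Each plan row has four columns; the last one is the readable description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index the planner has to scan the whole table.
before = plan(conn, QUERY)

# With an index on the filtered column it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = plan(conn, QUERY)

print(before)
print(after)
```

Comparing the two plans is a quick way to confirm that an index is actually being used by a query.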

66
Q

What is OLAP?

A

OLAP is the process of aggregating and analyzing data stored by OLTP and moving it to a data warehouse.

67
Q

What is the purpose of filtering in query optimization?

A

Filtering helps narrow down the rows returned from a query to improve efficiency.

68
Q

What is the consequence of using specific past events in survey questions?

A

It introduces recall bias.

69
Q

When should you use a temporary table instead of running a long query multiple times?

A

When you need several pieces of information from a complicated query.

70
Q

What is the best practice when conducting surveys?

A

Avoid asking about specific incidents in the past.