3 Collecting Data Flashcards
What is the main focus of this chapter?
Sources of data and methods for collecting it
This includes public sources, collecting personal data, and optimizing queries.
What are the two main categories of public databases?
Government and industry
Government databases are typically run by governmental entities, while industry databases focus on specific industries.
What type of data can government-run databases provide?
- Population
- Agriculture
- Utilities
- Public health concerns
The availability of data depends on the government entity and location.
What is a downside of using government databases?
They often contain missing data and mistakes
This requires significant time for data cleaning.
What is Kaggle known for?
It is a platform that hosts open source datasets and is a valuable resource for data analysis
Kaggle also offers courses and competitions for those in the data analytics community.
What does an API do?
Allows two unrelated computer systems to exchange information
APIs act as intermediaries for communication between systems.
What are the two ways information can be passed through APIs?
- Synchronous
- Asynchronous
A synchronous request waits for the response before doing anything else, while an asynchronous request lets the code keep processing until the response arrives.
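A minimal sketch of the difference, using Python's asyncio. The function `fake_api_call` is a hypothetical stand-in that sleeps instead of contacting a real API.

```python
import asyncio
import time

async def fake_api_call(name: str, delay: float) -> str:
    # Hypothetical stand-in for a network request: sleeps instead of calling a real API.
    await asyncio.sleep(delay)
    return f"{name} done"

async def run_sequentially():
    # Synchronous style: wait for each response before sending the next request.
    return [await fake_api_call("a", 0.1), await fake_api_call("b", 0.1)]

async def run_concurrently():
    # Asynchronous style: both requests are in flight at the same time.
    return list(await asyncio.gather(fake_api_call("a", 0.1), fake_api_call("b", 0.1)))

def timed(coro_fn):
    # Run a coroutine function to completion and report how long it took.
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

sequential_result, sequential_seconds = timed(run_sequentially)
concurrent_result, concurrent_seconds = timed(run_concurrently)
```

Both versions return the same results, but the concurrent version should finish in roughly half the time because the two waits overlap.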
What is web scraping?
The process of collecting data directly from a web page
This method differs from accessing data through databases or APIs.
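As an illustration, the sketch below pulls table cells out of raw HTML with Python's standard `html.parser` module. The page content and class names are made up; a real scraper would first download the HTML from the site.

```python
from html.parser import HTMLParser

# Hypothetical page content with made-up values.
PAGE = """
<html><body>
  <table>
    <tr><td class="city">Alpha</td><td class="pop">100</td></tr>
    <tr><td class="city">Beta</td><td class="pop">200</td></tr>
  </table>
</body></html>
"""

class CellCollector(HTMLParser):
    """Collects the text of every <td> cell together with its class attribute."""

    def __init__(self):
        super().__init__()
        self.current_class = None
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self.current_class = None

    def handle_data(self, data):
        # Only keep non-empty text that appears inside a classed <td>.
        if self.current_class and data.strip():
            self.cells.append((self.current_class, data.strip()))

parser = CellCollector()
parser.feed(PAGE)
cities = [text for cls, text in parser.cells if cls == "city"]
populations = [int(text) for cls, text in parser.cells if cls == "pop"]
```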
What is a survey?
A set of questions given to a sample of individuals
Surveys are used to gather data about a larger population.
What is the difference between a population and a sample?
A population includes all individuals in a group, while a sample is a smaller subset
Sampling is used to generalize findings to the larger population.
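A small sketch of the idea, using randomly generated ages as a hypothetical population: the mean of a random sample should land close to the mean of the whole population.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up ages between 18 and 90.
population = [random.randint(18, 90) for _ in range(10_000)]

# A sample is a smaller subset drawn from the population.
sample = random.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
```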
What are the main types of survey answers?
- Text-based
- Single-choice
- Multiple-choice
- Drop-down
- Likert
Each type has its own pros and cons for data analysis.
What is a Likert scale used for?
Gauging opinions or effectiveness on a specific topic
Respondents indicate their level of agreement with statements.
What is the challenge with text-based survey answers?
They are difficult to analyze due to variability in responses
Natural language processing may be needed for interpretation.
What is survey bias?
A tendency that skews results and affects data accuracy
Analysts must actively avoid bias to ensure valid results.
What is the primary advantage of asynchronous API requests?
They allow for faster and more efficient processing
The code does not have to wait for a response before continuing.
What does the term ‘open sources’ refer to?
Datasets made available for free by individuals or companies
These datasets cover a wide range of topics.
What are the limitations of industry-run public databases?
They tend to share only the legal minimum and are not always easy to find
However, they are usually cleaner than government databases.
What is a web service?
A specific kind of API that exchanges information between computers over a network
All web services are APIs, but not all APIs are web services.
What is the importance of filtering by license when using datasets?
Not all datasets can be used commercially
It’s crucial to check licenses if the data is for business use.
What does the acronym ETL stand for?
Extract, Transform, Load
It refers to a process for moving and transforming data.
What does OLTP stand for?
Online Transaction Processing
It involves managing transaction-oriented applications.
What does OLAP stand for?
Online Analytical Processing
It is used for complex analytical queries and data modeling.
What is a common application of surveys?
Collecting information on demographics and customer satisfaction
Surveys can be administered electronically or in person.
What is a Likert scale?
A single-choice question in which the prompt is a statement and the answer options form an ordered scale, such as strongly disagree to strongly agree.
What does survey bias refer to?
The difference between the expected value and the actual value in survey results.
What are common types of survey bias?
- Order bias
- Leading questions
- Recall bias
How can order bias be avoided in surveys?
By randomizing the order of questions and answers.
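One possible way to do this, sketched with made-up survey content. The function name `randomized_copy` is an assumption for illustration.

```python
import random

# Hypothetical survey content, used only for illustration.
survey = {
    "Which product do you use most?": ["Phone", "Laptop", "Tablet"],
    "Where do you usually shop?": ["Online", "In store"],
}

def randomized_copy(questions, rng):
    """Return the questions in random order, each with its answers shuffled."""
    order = list(questions)
    rng.shuffle(order)
    # rng.sample with k=len(...) returns a shuffled copy and leaves the original intact.
    return [(q, rng.sample(questions[q], k=len(questions[q]))) for q in order]

# Each respondent would get their own randomized view of the same survey.
respondent_view = randomized_copy(survey, random.Random(42))
```

Shuffling answers suits unordered options like these; an ordered scale would normally keep its order.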
What is recall bias?
A bias that occurs when respondents are asked to remember details from the past that they may not recall accurately.
What should be included as an answer option to avoid recall bias?
‘I don’t know’.
What is the process of observation in research?
Witnessing something and recording it.
What is an A-B study?
A study comparing two designs to see which performs better in terms of a specific goal.
What are the three steps of a data pipeline?
- Extraction
- Transformation
- Loading
Describe the ETL process.
Data is extracted, transformed, and then loaded to the destination.
What does ELT stand for?
Extract, Load, Transform.
How does the ELT process differ from ETL?
Data is extracted, loaded, and then transformed in the destination environment.
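The two orderings can be sketched side by side with an in-memory SQLite database. The table names and raw rows are hypothetical; the only difference is where the cleanup happens.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
raw_rows = [("  alice ", "10"), ("BOB", "25")]

def etl(rows):
    """ETL: transform the rows in Python first, then load the finished result."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
    transformed = [(name.strip().lower(), int(amount)) for name, amount in rows]
    db.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
    return db.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()

def elt(rows):
    """ELT: load the raw rows as-is, then transform them inside the destination."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    db.execute(
        "CREATE TABLE sales AS "
        "SELECT lower(trim(name)) AS name, CAST(amount AS INTEGER) AS amount "
        "FROM raw_sales"
    )
    return db.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()

etl_result = etl(raw_rows)
elt_result = elt(raw_rows)
```

Both functions end with the same cleaned table; ELT simply defers the transformation until the data is already in the destination.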
What is a delta load?
A loading method that only loads data that has changed since the last load.
What is OLTP?
Online Transaction Processing, which automatically stores and processes data from online transactions.
What is OLAP?
Online Analytical Processing, which analyzes data collected by OLTP.
What is the main difference between OLTP and OLAP?
OLTP is for data collection, while OLAP is for data analysis.
What is a query in data analytics?
A request for information from a database.
What is filtering in the context of querying?
The process of being selective about which data is queried using conditional logic.
What are subsets in data querying?
Smaller sections of a dataset created by filtering.
How can filtering improve query efficiency?
By reducing the amount of data pulled, leading to faster processing.
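For example, a `WHERE` clause applies conditional logic so that only the matching subset is pulled. The table and values below are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 20.0), (2, "west", 75.0), (3, "east", 120.0)],
)

# Unfiltered: every row is pulled.
all_rows = db.execute("SELECT id, region, total FROM orders").fetchall()

# Filtered: only rows meeting both conditions are pulled.
subset = db.execute(
    "SELECT id, region, total FROM orders WHERE region = 'east' AND total > 50"
).fetchall()
```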
What is indexing in database management?
Assigning a unique ascending number to every entry in a table.
What is sorting in the context of data?
Arranging rows in a different order based on specific logic.
True or False: ETL is generally preferred for simple transformations.
True.
True or False: ELT is faster than ETL when dealing with complicated transformations.
True.
What are the main loading methods in data pipelines?
- Full load
- Delta load
Define full load in data pipelines.
Every time the pipeline is run, the entire dataset is extracted, transformed, and loaded.
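The two loading methods can be contrasted in a short sketch. It assumes each source row carries an `updated_at` date, which is one common way to detect changes, not the only one; the table and function names are hypothetical.

```python
import sqlite3

def full_load(dest, source_rows):
    """Full load: wipe the destination and reload every source row."""
    dest.execute("DELETE FROM customers")
    dest.executemany("INSERT INTO customers VALUES (?, ?, ?)", source_rows)
    return len(source_rows)

def delta_load(dest, source_rows, last_load):
    """Delta load: insert or replace only rows changed after the previous load."""
    changed = [row for row in source_rows if row[2] > last_load]
    dest.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", changed)
    return len(changed)

dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")

# First run: everything is loaded.
source_v1 = [(1, "Ann", "2024-01-01"), (2, "Ben", "2024-01-01")]
full_loaded = full_load(dest, source_v1)

# Later run: one row changed and one row is new, so only those two are loaded.
source_v2 = [
    (1, "Ann", "2024-01-01"),
    (2, "Benjamin", "2024-02-01"),
    (3, "Cy", "2024-02-01"),
]
delta_loaded = delta_load(dest, source_v2, last_load="2024-01-01")

final_rows = dest.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```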
What is the role of automated observations?
To collect data conveniently without a person having to observe and record it.
What is one disadvantage of automated observations?
They are limited in the types of information they can collect.
How can sorting improve the efficiency of filtering?
Sorting can save time by placing all the rows you want next to each other in the table.
True or False: Sorting always improves processing efficiency.
False.
What is the primary role of indexing?
Query optimization: an index lets the database locate matching rows without scanning the entire table.
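A rough illustration using SQLite, whose `EXPLAIN QUERY PLAN` statement reports how a query would be executed. The table, data, and index name are hypothetical, and other database systems expose plans differently.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, f"c{i % 100}") for i in range(1000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite-specific; the last column describes each step.
    return " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM orders WHERE customer = 'c7'"

plan_without_index = plan(query)  # no index yet, so the table must be scanned
db.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_with_index = plan(query)     # the plan can now use the new index
```

Comparing the two plan strings shows the query switching from a table scan to a search that uses `idx_orders_customer`.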
What is parameterization in database queries?
Parameterization is the use of a prewritten query in which the user supplies specific values as parameters.
What is the main reason for using parameterization?
The biggest reason is cybersecurity: parameterized queries protect data from SQL injection attacks.
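A minimal sketch of why this matters, using Python's `sqlite3` module with a made-up `users` table and a made-up malicious input.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, role TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [("ann", "admin"), ("ben", "viewer")])

# Hypothetical malicious input that tries to rewrite the query's logic.
malicious = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text, so the OR clause matches every row.
unsafe_rows = db.execute(
    f"SELECT name FROM users WHERE name = '{malicious}'"
).fetchall()

# Parameterized: the ? placeholder treats the input purely as a value to compare.
safe_rows = db.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()
```

The unsafe query returns every user, while the parameterized query returns nothing because no user has that literal name.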
What are temporary tables?
Temporary tables are tools that save the results of a query as their own table.
What is a key limitation of temporary tables?
They are temporary and do not update when the source data updates.
Fill in the blank: A temporary table can be created using the command ______.
CREATE TEMPORARY TABLE.
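The sketch below, run against SQLite with hypothetical table names and values, saves an aggregate query as a temporary table, reuses it twice, and then shows that it does not update when the source data changes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", [("east", 10), ("east", 30), ("west", 5)])

# Save the result of a query as its own temporary table.
db.execute(
    "CREATE TEMPORARY TABLE region_totals AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# Reuse the saved result several times without rerunning the aggregation.
top_region = db.execute(
    "SELECT region FROM region_totals ORDER BY total DESC LIMIT 1"
).fetchone()[0]
grand_total = db.execute("SELECT SUM(total) FROM region_totals").fetchone()[0]

# The source changes, but the temporary table still holds the old snapshot.
db.execute("INSERT INTO sales VALUES ('west', 100)")
stale_total = db.execute("SELECT SUM(total) FROM region_totals").fetchone()[0]
```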
What are subqueries?
Subqueries are queries embedded inside another query.
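For instance, the inner query below computes an average that the outer query then filters on. The `employees` table and its values are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
db.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("ann", 50), ("ben", 70), ("cy", 90)],
)

# The subquery in parentheses runs inside the outer query's WHERE clause.
above_average = db.execute(
    "SELECT name FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees) "
    "ORDER BY name"
).fetchall()
```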
True or False: Subqueries are generally considered more efficient than using temporary tables.
False.
What is an execution plan?
An execution plan shows how a query will be executed and may include a graphical representation.
What is the difference between an estimated execution plan and an actual execution plan?
An estimated execution plan predicts how a query will run without executing it, while an actual execution plan reports the specifics of what happened after the query is run.
What is OLAP?
OLAP is the process of aggregating and analyzing data stored by OLTP and moving it to a data warehouse.
What is the purpose of filtering in query optimization?
Filtering helps narrow down the rows returned from a query to improve efficiency.
What is the consequence of asking about specific past events in survey questions?
It introduces recall bias.
When should you use a temporary table instead of running a long query multiple times?
When you need several pieces of information from a complicated query.
What is the best practice when conducting surveys?
Avoid asking about specific incidents in the past.