3 Collecting Data Flashcards
What is the main focus of this chapter?
Sources of data and methods for collecting it
This includes public sources, collecting personal data, and optimizing queries.
What are the two main categories of public databases?
Government and industry
Government databases are typically run by governmental entities, while industry databases focus on specific industries.
What type of data can government-run databases provide?
- Population
- Agriculture
- Utilities
- Public health concerns
The availability of data depends on the government entity and location.
What is a downside of using government databases?
They often contain missing data and mistakes
This requires significant time for data cleaning.
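A minimal sketch of what that cleaning step can look like, assuming a handful of made-up rows (the county names and the `population` field are hypothetical, not taken from any real government dataset). It drops duplicate records and rows whose value is blank or a text placeholder:

```python
# Hypothetical rows, as they might arrive from a public CSV export.
rows = [
    {"county": "Adams", "population": "20412"},
    {"county": "Baker", "population": ""},       # missing value
    {"county": "Clark", "population": "n/a"},    # placeholder text
    {"county": "Adams", "population": "20412"},  # duplicate record
]


def clean(rows):
    """Drop duplicates and rows whose population is not a whole number."""
    seen = set()
    cleaned = []
    for row in rows:
        value = row["population"].strip()
        key = (row["county"], value)
        if not value.isdigit() or key in seen:
            continue
        seen.add(key)
        cleaned.append({"county": row["county"], "population": int(value)})
    return cleaned


cleaned_rows = clean(rows)
print(cleaned_rows)
```

Dropping rows is only one option; depending on the analysis, missing values might instead be flagged or filled in.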
What is Kaggle known for?
A platform that hosts open-source datasets and is a valuable resource for data analysis
Kaggle also offers courses and competitions for those in the data analytics community.
What does an API do?
Allows two unrelated computer systems to exchange information
APIs act as intermediaries for communication between systems.
What are the two ways information can be passed through APIs?
- Synchronous
- Asynchronous
A synchronous request makes the caller wait until the response arrives; an asynchronous request lets the caller keep working while the response is pending.
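The difference can be sketched without a real API. In the example below, `fetch_sync` and `fetch_async` are hypothetical stand-ins for API calls, and the 0.2-second sleeps simulate network latency. Two synchronous calls run back to back, while the two asynchronous calls overlap their waiting time:

```python
import asyncio
import time


def fetch_sync(name, delay):
    """Simulated blocking request: nothing else runs while it waits."""
    time.sleep(delay)
    return f"{name} done"


async def fetch_async(name, delay):
    """Simulated non-blocking request: other tasks may run while it waits."""
    await asyncio.sleep(delay)
    return f"{name} done"


def run_sync():
    start = time.perf_counter()
    results = [fetch_sync("a", 0.2), fetch_sync("b", 0.2)]
    return results, time.perf_counter() - start


async def _gather_both():
    # Start both requests, then wait for both to finish.
    return await asyncio.gather(fetch_async("a", 0.2), fetch_async("b", 0.2))


def run_async():
    start = time.perf_counter()
    results = asyncio.run(_gather_both())
    return list(results), time.perf_counter() - start


sync_results, sync_elapsed = run_sync()
async_results, async_elapsed = run_async()
print(f"sync: {sync_elapsed:.2f}s, async: {async_elapsed:.2f}s")
```

Both versions return the same results; only the total waiting time differs.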
What is web scraping?
The process of collecting data directly from a web page
This method differs from accessing data through databases or APIs.
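As a rough illustration, the sketch below pulls table cell text out of an HTML string using only Python's standard library. The `page` markup and the `CellCollector` class are invented for this example; a real scraper would first download the page and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text found inside <td> elements."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())


# Hypothetical page content standing in for a downloaded web page.
page = (
    "<table>"
    "<tr><td>Widget</td><td>4.99</td></tr>"
    "<tr><td>Gadget</td><td>12.50</td></tr>"
    "</table>"
)

parser = CellCollector()
parser.feed(page)
cells = parser.cells
print(cells)
```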
What is a survey?
A set of questions given to a sample of individuals
Surveys are used to gather data about a larger population.
What is the difference between a population and a sample?
A population includes all individuals in a group, while a sample is a smaller subset
Sampling is used to generalize findings to the larger population.
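A small sketch of the idea, assuming a made-up population of 10,000 ages generated at random (the numbers are illustrative only). A random sample of 200 is drawn and its mean is compared with the mean of the full population:

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: the ages of 10,000 individuals.
population = [random.randint(18, 90) for _ in range(10_000)]

# A simple random sample of 200 individuals, drawn without replacement.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {population_mean:.1f}, sample mean: {sample_mean:.1f}")
```

The sample mean will usually land close to the population mean, which is what makes generalizing from a sample reasonable.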
What are the main types of survey answers?
- Text-based
- Single-choice
- Multiple-choice
- Drop-down
- Likert
Each type has its own pros and cons for data analysis.
What is a Likert scale used for?
Gauging opinions or effectiveness on a specific topic
Respondents indicate their level of agreement with statements.
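One common way to analyze Likert answers is to map each label to a number. The sketch below assumes a five-point agreement scale and a few invented responses; the labels and the `SCALE` mapping are illustrative, not prescribed by the chapter.

```python
import statistics
from collections import Counter

# Assumed five-point agreement scale.
SCALE = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

# Hypothetical responses to one statement.
responses = ["Agree", "Strongly agree", "Neutral", "Agree", "Disagree"]

scores = [SCALE[r] for r in responses]
counts = Counter(responses)
median_score = statistics.median(scores)
print(counts.most_common(1), median_score)
```

The median is used here because Likert answers are ordered categories, so the distance between neighboring options is not guaranteed to be equal.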
What is the challenge with text-based survey answers?
They are difficult to analyze due to variability in responses
Natural language processing may be needed for interpretation.
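The variability problem can be seen with a few invented free-text answers to "Which city do you live in?". The sketch below applies basic normalization plus a hand-written alias table (`ALIASES` is hypothetical). This is far simpler than natural language processing, but it shows why raw text cannot be counted directly:

```python
import string
from collections import Counter

# Hypothetical free-text answers that mean the same thing in different ways.
answers = ["New York", "new york city", "NYC", " N.Y.C. ", "Boston"]

# Hand-maintained mapping from known variants to one canonical label.
ALIASES = {"nyc": "new york", "new york city": "new york"}


def normalize(text):
    """Lowercase, strip punctuation and extra spaces, then apply aliases."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    cleaned = " ".join(cleaned.split())
    return ALIASES.get(cleaned, cleaned)


raw_counts = Counter(answers)
counts = Counter(normalize(a) for a in answers)
print(len(raw_counts), dict(counts))
```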
What is survey bias?
A systematic tendency in how a survey is designed, worded, or administered that skews results and reduces data accuracy
Analysts must actively avoid bias to ensure valid results.
What is the primary advantage of asynchronous API requests?
They allow for faster and more efficient processing
The code does not have to wait for a response before continuing.
What does the term ‘open sources’ refer to?
Datasets made available for free by individuals or companies
These datasets cover a wide range of topics.
What are the limitations of industry-run public databases?
They tend to share only the legal minimum and are not always easy to find
However, they are usually cleaner than government databases.
What is a web service?
A specific kind of API that is accessed over a network, typically using web protocols such as HTTP
All web services are APIs, but not all APIs are web services.
What is the importance of filtering by license when using datasets?
Not all datasets can be used commercially
It’s crucial to check licenses if the data is for business use.
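A minimal sketch of a license filter, assuming a small hand-written catalog (the dataset names are hypothetical). The `COMMERCIAL_OK` set lists two Creative Commons licenses that permit commercial use; the NC ("NonCommercial") license is excluded. The actual license text should always be checked before relying on a filter like this.

```python
# Hypothetical catalog entries with their license identifiers.
datasets = [
    {"name": "city-trees", "license": "CC0-1.0"},
    {"name": "retail-reviews", "license": "CC-BY-NC-4.0"},
    {"name": "transit-times", "license": "CC-BY-4.0"},
]

# Licenses in this example that allow commercial use.
COMMERCIAL_OK = {"CC0-1.0", "CC-BY-4.0"}

usable = [d["name"] for d in datasets if d["license"] in COMMERCIAL_OK]
print(usable)
```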
What does the acronym ETL stand for?
Extract, Transform, Load
It is a process that extracts data from source systems, transforms it into a consistent format, and loads it into a destination such as a data warehouse.
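The three steps can be sketched end to end with Python's standard library. The CSV text, the `payments` table, and the in-memory SQLite database are all assumptions chosen to keep the example self-contained; a production pipeline would read from real sources and load into a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent casing and stray spaces.
raw_csv = "name,amount\nalice, 10.5 \nbob,3\n"

# Extract: read the rows out of the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: tidy the names and convert amounts to numbers.
records = [(r["name"].strip().title(), float(r["amount"])) for r in rows]

# Load: write the cleaned records into the destination table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (name TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
conn.commit()

loaded = conn.execute(
    "SELECT name, amount FROM payments ORDER BY name"
).fetchall()
print(loaded)
```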
What does OLTP stand for?
Online Transaction Processing
It involves managing transaction-oriented applications.
What does OLAP stand for?
Online Analytical Processing
It is used for complex analytical queries and data modeling.
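The contrast between the two workloads can be sketched on one small table. SQLite is used here only for convenience, and the `orders` table and its rows are invented; in practice OLTP and OLAP usually run on separate systems tuned for each job. The first part performs many small transactional writes (OLTP-style), and the second runs one aggregate query over all rows (OLAP-style):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL)"
)

# OLTP-style: each order is a small write in its own transaction.
new_orders = [("east", 20.0), ("west", 35.0), ("east", 15.0)]
for region, total in new_orders:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, total) VALUES (?, ?)",
            (region, total),
        )

# OLAP-style: one read-only query that summarizes many rows.
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(total) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(summary)
```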
What is a common application of surveys?
Collecting information on demographics and customer satisfaction
Surveys can be administered electronically or in person.
What is a Likert scale?
A single-choice question in which the prompt is a statement and the answer options form an ordered scale, for example from "strongly disagree" to "strongly agree".