Data Sources, Collection & Validation Flashcards
What is the difference between primary and secondary data sources?
Primary data is first-hand information collected directly from sources (e.g., surveys, interviews), while secondary data is pre-existing data used for analysis (e.g., government records, books).
What are real-time data sources, and why are they useful?
Real-time data is collected and processed instantly, allowing immediate decision-making. Examples include stock market tracking, IoT monitoring, and traffic navigation.
What are three key considerations in data collection?
Accuracy (data correctness), reliability (consistency over time), and ethics (ensuring privacy and compliance).
What are five common data collection methods?
Surveys and questionnaires, interviews, sensors, transaction records, and web scraping.
What are the six stages of the data processing cycle?
Collection, preparation, input, processing, output, and storage.
How does batch processing differ from real-time processing?
Batch processing handles large amounts of accumulated data at scheduled times (e.g., payroll systems), while real-time processing handles each piece of data immediately as it is received (e.g., fraud detection).
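The contrast can be sketched in a few lines of Python. This is only an illustration: the transaction records, the FRAUD_THRESHOLD value, and the function names are assumptions made up for the example, and the list stands in for a live feed.

```python
# Hypothetical threshold, chosen only for illustration.
FRAUD_THRESHOLD = 1000


def process_batch(transactions):
    """Batch style: handle the whole accumulated set in one scheduled run."""
    return sum(t["amount"] for t in transactions)


def process_stream(transactions):
    """Real-time style: react to each record as soon as it arrives."""
    for t in transactions:  # the list stands in for a live feed
        if t["amount"] > FRAUD_THRESHOLD:
            yield f"alert: transaction {t['id']}"


# Made-up sample data.
transactions = [
    {"id": 1, "amount": 250},
    {"id": 2, "amount": 4800},
    {"id": 3, "amount": 90},
]

total = process_batch(transactions)           # one result after the full run
alerts = list(process_stream(transactions))   # one decision per record
```

The batch function only produces a result once all records are available, while the streaming function can raise an alert on a single record without waiting for the rest.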
What is data validation, and why is it important?
Data validation ensures data accuracy, completeness, and consistency before storage or processing, reducing errors and improving reliability.
What are five types of data validation?
Format validation (ensures data is in the correct format), range validation (checks that numerical values fall within limits), presence check (ensures required fields are filled), consistency check (confirms data matches related information), and uniqueness check (ensures values are distinct).
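Several of the checks listed above could look roughly like this in Python. The field names, the email pattern, and the 0 to 120 age limit are hypothetical choices for the example, not rules from these flashcards. The consistency check is left out for brevity.

```python
import re

# Hypothetical rules for an illustrative customer record.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_FIELDS = ("id", "email", "age")


def validate_record(record, seen_ids):
    """Return a list of validation errors for one record (empty list = valid)."""
    errors = []

    # Presence check: every required field must be filled in.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"missing {field}")

    # Format validation: the email must look like name@domain.tld.
    email = record.get("email")
    if email and not EMAIL_PATTERN.match(email):
        errors.append("bad email format")

    # Range validation: age must fall within the assumed limits.
    age = record.get("age")
    if isinstance(age, int) and not 0 <= age <= 120:
        errors.append("age out of range")

    # Uniqueness check: the id must not have appeared in an earlier record.
    record_id = record.get("id")
    if record_id is not None:
        if record_id in seen_ids:
            errors.append("duplicate id")
        seen_ids.add(record_id)

    return errors
```

A caller would keep one seen_ids set for the whole dataset, for example seen = set(), and pass it to validate_record for every record so that duplicates are caught across records.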
What are common disadvantages of data processing?
Inaccurate data can lead to misleading results, data breaches pose security risks, and maintaining infrastructure can be costly.
How does cloud storage improve data collection and processing?
Cloud storage allows remote access, scalability, automatic backups, and integration with AI and big data tools for efficient processing.
What are the four main types of data storage, and how do they differ in purpose and functionality?
Databases – Used for storing structured data, allowing efficient organization and retrieval.
Data Warehouses – Large repositories that store vast datasets from multiple sources, primarily for data analysis.
Cloud Storage – Stores data on remote servers accessible via the internet, offering scalability and flexibility.
Local Storage – Stores data on physical devices like hard drives or solid-state drives, providing direct access.
What is the difference between online, distributed, and cloud-based processing?
Online processing is the interactive processing of data as it is input by users.
Distributed processing involves distributing data processing work across numerous computers.
Cloud-based processing runs data processing on remote computing resources provided over the internet rather than on local machines.
What are the four steps of the data validation process?
Analyzing Data – Understanding business requirements, choosing the right analysis technique, and processing results.
Sampling – Testing a small subset of data before validating the full dataset to save time and resources.
Validating Database – Ensuring database data is relevant by comparing source and target data fields.
Comparison – Handling incomplete data and verifying output accuracy against expected results.
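The sampling and comparison steps above can be sketched as follows. This is a minimal illustration, not a complete validation tool: the function names, the "id" key, and the row layout (a list of dictionaries) are assumptions made for the example.

```python
import random


def sample_rows(rows, k, seed=0):
    """Sampling: pick a small random subset to check before the full dataset."""
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    return rng.sample(rows, min(k, len(rows)))


def compare_source_target(source, target, key="id"):
    """Comparison: list keys missing from the target or differing from the source."""
    target_by_key = {row[key]: row for row in target}
    missing, mismatched = [], []
    for row in source:
        other = target_by_key.get(row[key])
        if other is None:
            missing.append(row[key])      # row never reached the target
        elif other != row:
            mismatched.append(row[key])   # row arrived but fields differ
    return missing, mismatched


# Made-up source and target data.
source = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
target = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Rob"},
]

missing, mismatched = compare_source_target(source, target)
```

Running the comparison on a sample first, for example compare_source_target(sample_rows(source, 2), target), gives a quick signal before the full dataset is checked.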