Chapter 2: Data Structures, Types, and Formats Flashcards
What is the main focus of this chapter?
Data storage and the various formats of data.
What are the two main categories of databases?
- Structured
- Unstructured
What defines a structured database?
It follows a standardized format with a clear and logical structure.
What are the two main archetypes of structured databases?
- Defined rows/columns
- Key-value pairs
How are defined rows and columns organized?
In tables or spreadsheets where columns represent variables and rows represent data points.
What do key-value pairs represent in a structured database?
Data objects where each object has the same set of keys with different values.
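As a rough illustration of the two structured archetypes, the same records can be held either as rows and columns or as key-value objects. The field names and values below are invented for the example.

```python
# Hypothetical sample data; names and values are invented for illustration.

# Defined rows/columns: the column list names the variables,
# and every row is one data point.
columns = ["name", "age", "city"]
rows = [
    ["Ana", 31, "Lisbon"],
    ["Ben", 45, "Oslo"],
]

# Key-value pairs: each object carries the same keys with different values.
records = [dict(zip(columns, row)) for row in rows]

# Every object shares one key set, which is what keeps the data structured.
same_keys = all(set(record) == set(columns) for record in records)
```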
What characterizes unstructured data?
It has no predefined organization and is often stored as individual files.
What are the two groups of unstructured data?
- Undefined fields
- Machine data
What types of file formats are included in undefined fields?
- Text files
- Audio files
- Video files
- Images
- Social media data
- Emails
What is machine data?
Data automatically generated by software without human intervention.
What is the difference between relational and non-relational databases?
Relational databases store information and relationships, while non-relational databases store information only.
What language is primarily used for querying relational databases?
Structured Query Language (SQL)
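A minimal sketch of an SQL query against a relational database, using Python's built-in sqlite3 module. The tables, columns, and values are hypothetical; the point is that the stored relationship between the two tables is what the query joins on.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "total REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.5), (3, 2, 12.0)],
)

# The relationship (orders.customer_id -> customers.id) lets SQL combine tables.
query = (
    "SELECT c.name, SUM(o.total) "
    "FROM customers AS c JOIN orders AS o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
)
totals = conn.execute(query).fetchall()
conn.close()
```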
True or False: All SQL databases are structured and relational.
True
True or False: All non-relational databases are unstructured.
False. Key-value databases, for example, are non-relational but still structured.
What are the two most basic types of data schemas covered in this chapter?
- Star schema
- Snowflake schema
What is the structure of a star schema?
A central key table (often called a fact table) with dimension tables connected directly to it.
What are the pros of a star schema?
- Simple
- Fewer joins required
- Easier to understand
What are the cons of a star schema?
- High redundancy
- Denormalized
What distinguishes a snowflake schema from a star schema?
It has two or more levels of dimension tables instead of one.
What are the pros of a snowflake schema?
- Low redundancy
- Normalized
What are the cons of a snowflake schema?
- More complicated
- More joins required
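The trade-off between the two schemas can be sketched with sqlite3. All table and column names are made up for this example. In the star layout the category name is repeated on every product row (redundant, but one join away); in the snowflake layout it lives in a second-level dimension table (no repetition, but two joins away).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the product dimension repeats the category name on every row.
CREATE TABLE star_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Snowflake: the category moves to a second-level dimension table.
CREATE TABLE snow_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE snow_product (product_id INTEGER PRIMARY KEY, name TEXT,
                           category_id INTEGER REFERENCES snow_category(category_id));
-- Central table shared by both layouts in this sketch.
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);

INSERT INTO star_product VALUES (1, 'Pen', 'Office'), (2, 'Stapler', 'Office');
INSERT INTO snow_category VALUES (10, 'Office');
INSERT INTO snow_product VALUES (1, 'Pen', 10), (2, 'Stapler', 10);
INSERT INTO sales VALUES (100, 1, 2.0), (101, 2, 8.0);
""")

# Star: one join reaches the category.
star_total = conn.execute(
    "SELECT p.category, SUM(s.amount) FROM sales s "
    "JOIN star_product p ON p.product_id = s.product_id "
    "GROUP BY p.category"
).fetchall()

# Snowflake: two joins are needed for the same answer.
snow_total = conn.execute(
    "SELECT c.category, SUM(s.amount) FROM sales s "
    "JOIN snow_product p ON p.product_id = s.product_id "
    "JOIN snow_category c ON c.category_id = p.category_id "
    "GROUP BY c.category"
).fetchall()
conn.close()
```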
What is a data warehouse?
A database used for structured relational tables, holding large amounts of processed transactional data.
What is a data mart?
A specialized subset of a data warehouse holding processed information on a specific topic.
What is a data lake?
A storage system for large amounts of raw, unprocessed data, which can be structured, unstructured, or a combination.
Fill in the blank: A data mart is designed to be _______ enough for analysts or customer support employees to access by themselves.
[self-service]
What is a data mart?
A data mart is a subset of a data warehouse that contains customer-facing data and is designed for self-service access by analysts or customer support employees.
Data marts prioritize ease of use and often follow a star schema.
What is a data lake?
A data lake stores large amounts of raw, unprocessed data, which can be structured, unstructured, or a combination of both.
Data lakes are often used by data scientists and do not follow any specific schema.
Who typically creates data warehouses and data lakes?
Data warehouses and data lakes are usually created by specialized data engineers, and many companies have only one.
They are often created through third-party services or software.
What are some popular data warehouse tools?
- Snowflake
- Hevo
- Amazon Web Services (AWS) data warehouse tools
- Microsoft Azure data warehouse tools
- Google data warehouse tools
What are the two options for updating a current value in a dataset?
- Overwrite historical values
- Keep historical values
What is the benefit of overwriting historical values?
It keeps your dataset smaller and simpler, but the historical data is lost.
Historical data is required for trend analysis.
What columns are added to keep historical values?
- Active Record
- Active Start
- Active End
What does the Active Record column indicate?
It indicates whether the specified value is the most current value, marked as Yes or No.
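One way to picture the keep-historical-values option is the sketch below. The update_city helper and the customer data are hypothetical; only the three column names come from the cards above. An update closes the current row and appends a new one, so nothing is overwritten.

```python
from datetime import date

# Hypothetical customer-address history; all values are invented.
history = [
    {"customer": "Ana", "city": "Lisbon", "active_record": "Yes",
     "active_start": date(2020, 1, 1), "active_end": None},
]

def update_city(rows, customer, new_city, change_date):
    """Close the customer's current row, then append the new value."""
    for row in rows:
        if row["customer"] == customer and row["active_record"] == "Yes":
            row["active_record"] = "No"       # no longer the current value
            row["active_end"] = change_date   # record when it stopped applying
    rows.append({"customer": customer, "city": new_city, "active_record": "Yes",
                 "active_start": change_date, "active_end": None})

update_city(history, "Ana", "Porto", date(2023, 6, 1))

# Only rows flagged "Yes" hold the most current value.
current = [row["city"] for row in history if row["active_record"] == "Yes"]
```

The old row stays available for trend analysis, at the cost of a larger dataset.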
What are the consequences of changing the number of variables being recorded?
It creates null values in the dataset.
Null values occur whether you are adding new columns or removing existing ones.
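A small sketch of how nulls appear in both directions, assuming the dataset keeps every column it has ever had. The sensor readings are invented, and Python's None stands in for a null value.

```python
# Hypothetical readings; None stands in for a null value.
old_rows = [{"id": 1, "temp_c": 21.5}, {"id": 2, "temp_c": 22.0}]
new_rows = [{"id": 3, "temp_c": 20.8, "humidity": 40}]

# Adding a column: historical rows have no value for it, so they become null.
all_columns = ["id", "temp_c", "humidity"]
combined = [{col: row.get(col) for col in all_columns} for row in old_rows + new_rows]

# Removing a column: rows recorded after the change are null in that column.
later_row = {"id": 4, "humidity": 42}  # temp_c is no longer recorded
combined.append({col: later_row.get(col) for col in all_columns})
```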
What are the four common data types that everyone working with data should know?
- Date
- Numeric
- Alphanumeric
- Currency
How should dates be formatted according to ISO recommendations?
- YYYY-MM-DD
- YYYY-MM-DD HH:MI:SS
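The two formats above can be produced with Python's datetime module, as in this sketch. The timestamp is arbitrary, and MI in the card's notation is read here as minutes (%M in Python's format codes).

```python
from datetime import datetime

# Arbitrary example timestamp.
moment = datetime(2024, 3, 9, 14, 5, 30)

date_only = moment.strftime("%Y-%m-%d")            # YYYY-MM-DD
date_time = moment.strftime("%Y-%m-%d %H:%M:%S")   # YYYY-MM-DD HH:MI:SS

# Parsing the text back gives the original value.
parsed = datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S")

# Zero-padded, year-first text sorts in chronological order.
dates = ["2024-03-09", "2023-12-31", "2024-01-15"]
sorted_dates = sorted(dates)
```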
What is numeric data?
Numeric data is made up of numbers, which can be whole numbers or decimals.
What does alphanumeric data include?
Alphanumeric data includes numbers and letters, except for values in scientific notation.
What is currency data?
Currency data includes monetary values, typically denoted with a currency symbol.
What are the two types of numeric data?
- Discrete
- Continuous
What defines discrete variables?
Discrete variables are counts, usually expressed as whole numbers (integers).
What are continuous variables?
Continuous variables can represent an infinite number of values between two points and are often measured as decimals.
What are the three main types of categorical variables?
- Binary
- Nominal
- Ordinal
What distinguishes independent variables from dependent variables?
Independent variables are manipulated directly, while dependent variables are measured and depend on independent variables.
What file types are commonly encountered by data analysts?
- Text (TXT)
- Image (JPEG)
- Audio (MP3)
- Video (MP4)
- Flat (CSV)
- Website (HTML)
What are flat files?
Flat files contain a simple two-dimensional dataset or spreadsheet, such as TSV or CSV.
What is the difference between TSV and CSV?
TSV values are separated by tabs, while CSV values are separated by commas.
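A short sketch with Python's csv module shows that only the separator differs. The file contents are hypothetical and held in strings rather than read from disk.

```python
import csv
import io

# Hypothetical file contents; the same two-dimensional table in both formats.
csv_text = "name,age\nAna,31\nBen,45\n"
tsv_text = "name\tage\nAna\t31\nBen\t45\n"

# csv.reader splits on commas by default; delimiter="\t" switches it to tabs.
csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

Both readers return every value as text, so numbers such as the ages still need converting after loading.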
What are the common image file types?
- JPG/JPEG
- PNG
- GIF
- BMP
- RAW
What formats do audio files commonly use?
- MP3
- WAV
- WMA
- AAC
- ALAC
What are some popular video file formats?
- MP4
- WMV
- MOV
- FLV
- AVI
What are the website file types recognized by the exam?
- HTML
- XML
- JSON
What is the purpose of HTML?
HTML is used to structure websites and store information between tags.
What distinguishes XML from HTML?
XML tags have no pre-determined meanings and can be customized, while HTML tags have specific meanings.
What is HTML primarily used for?
Website structure and occasionally passing information
How is information stored in HTML?
Between tags that create elements with specific meanings
What is XML similar to?
HTML
In XML, what is unique about the tags?
They have no pre-determined meanings
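To illustrate custom tags, here is a sketch that parses a small XML document with Python's xml.etree.ElementTree. The tag names (readings, reading, sensor, temp_c) are invented for the example; XML gives them no built-in meaning, so the reader of the file decides what they stand for.

```python
import xml.etree.ElementTree as ET

# The tag names below are invented; XML attaches no built-in meaning to them.
xml_text = (
    "<readings>"
    "<reading><sensor>kitchen</sensor><temp_c>21.5</temp_c></reading>"
    "<reading><sensor>garage</sensor><temp_c>14.0</temp_c></reading>"
    "</readings>"
)

root = ET.fromstring(xml_text)

# The information sits between the opening and closing tags.
sensors = [reading.findtext("sensor") for reading in root.findall("reading")]
temps = [float(reading.findtext("temp_c")) for reading in root.findall("reading")]
```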
What is a key feature of JSON?
Specializes in storing and passing information
How does a JSON file structure data?
Contains a list of data objects using key-value pairs
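A minimal sketch of that structure, using Python's json module. The JSON text and its keys are hypothetical.

```python
import json

# Hypothetical JSON text: a list of data objects built from key-value pairs.
json_text = '[{"name": "Ana", "age": 31}, {"name": "Ben", "age": 45}]'

# Parsing turns the text into Python lists and dictionaries.
people = json.loads(json_text)
names = [person["name"] for person in people]

# Serializing the objects again yields text that can be passed on.
round_trip = json.loads(json.dumps(people))
```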
What is the main difference between JSON and HTML/XML?
JSON does not contribute to website structure
What types of databases were covered in this chapter?
Structured and unstructured databases
What are the two types of databases discussed?
Relational and non-relational databases
What types of schemas were mentioned?
Star and snowflake schemas
What are the three types of data storage mentioned?
Data warehouses, data marts, and data lakes
What is a characteristic of a data lake?
Focuses on raw, unprocessed data
True or False: JSON can have pre-determined tag meanings like HTML.
False
Fill in the blank: A smart thermometer sends data to a ______.
local database
What type of schema is most appropriate for non-technical client-facing agents?
Star schema
What will historic values be for a newly added column in a dataset?
Null
What type of data does a file with the ‘.png’ extension contain?
Image