Unit 3 Flashcards
Primary Key
A field that uniquely identifies a record; for example, a student ID or employee ID.
Foreign Key
A field in one relational table that refers to the primary key of another table, linking the two tables.
Schema
The blueprint for how a database is constructed; it describes the tables, fields, and relationships.
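The Primary Key, Foreign Key, and Schema cards fit together in one piece of SQL: the CREATE TABLE statements are the schema, the student ID is the primary key, and the enrollment table points back at it with a foreign key. A minimal sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# The CREATE TABLE statements are the schema: the blueprint of tables and fields.
conn.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies each record
        name       TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE enrollment (
        enrollment_id INTEGER PRIMARY KEY,
        student_id    INTEGER NOT NULL,
        course        TEXT NOT NULL,
        -- foreign key: refers to the primary key of another table
        FOREIGN KEY (student_id) REFERENCES student (student_id)
    )
""")

conn.execute("INSERT INTO student VALUES (1, 'Ada Lovelace')")
conn.execute("INSERT INTO enrollment VALUES (10, 1, 'Databases')")
conn.commit()
```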
Big Data
Extremely large and varied data sets drawn from many sources, including smartphone metadata, internet usage records, social media activity, computer usage records, and countless others, that are sifted for patterns and trends.
Four Vs of Big Data
Volume (amount), Variety (various forms), Veracity (quality and trustworthiness), Velocity (speed).
Data mining/Data discovery
The examination of huge data sets to find patterns, connections, outliers, and hidden relationships. It is a business intelligence (BI) tool used for decision making.
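As a toy illustration of the "outliers" part of this card, the sketch below flags unusually large values in a list of purchase amounts with a simple z-score test; the numbers and the 2-standard-deviation threshold are invented.

```python
from statistics import mean, stdev

# Hypothetical purchase amounts; the last value is an outlier planted for illustration.
amounts = [21.50, 19.99, 22.75, 20.10, 23.40, 18.60, 250.00]

avg = mean(amounts)
spread = stdev(amounts)

# Flag values more than 2 standard deviations from the mean (an arbitrary threshold).
outliers = [a for a in amounts if abs(a - avg) / spread > 2]
print(outliers)  # [250.0]
```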
Structured data
Data that reside in fixed formats, are well labeled, and are easily queried and searched.
Unstructured data
Unorganized data (e.g., social media posts and other big data) that cannot be easily read or processed by a computer; not stored in rows and columns.
Semi-structured data
Data that fall between structured and unstructured data; they can be read, but it takes work. Email is a common example.
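Email works as the example because part of a message is labeled (the headers) and part is free text (the body). A small sketch with Python's standard email module; the message itself is invented.

```python
from email import message_from_string

# An invented example message: headers are labeled fields, the body is free text.
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the numbers look good this quarter. Talk soon.
"""

msg = message_from_string(raw)
print(msg["From"])        # the structured part: labeled header fields
print(msg["Subject"])
print(msg.get_payload())  # the unstructured part: free-form body text
```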
Big data tools
ETL (extract, transform, load) and Hadoop.
Extract
Once you have determined where your data are coming from and where you want them to reside, you can start extracting. Source data usually come from CRM or ERP systems.
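A minimal sketch of the extract step, assuming the CRM has already exported its records to a CSV file; here an in-memory string stands in for that file so the example runs on its own, and the column names are invented.

```python
import csv
import io

# Stand-in for a CSV export from a CRM or ERP system; columns are invented.
crm_export = io.StringIO(
    "order_id,customer,amount\n"
    "1001,Acme Corp,\"$1,250.00\"\n"
    "1002,Globex,\"$980.50\"\n"
)

# Extract: pull the raw records out of the source system.
rows = list(csv.DictReader(crm_export))
print(rows[0])  # first extracted record, with the amount still formatted as text
```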
Transform
Once data have been extracted, they need to be transformed to fit the destination database table. This may involve removing decimals and dollar signs from financial transactions so the values fit the structured table.
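Continuing the sketch, the transform step cleans the extracted values so they fit the destination table, which here means stripping the dollar sign and thousands separator from the amount field. The field names carry over from the invented extract example above.

```python
def transform(row):
    """Strip currency formatting so the amount fits a numeric column."""
    cleaned = dict(row)
    cleaned["amount"] = float(row["amount"].replace("$", "").replace(",", ""))
    cleaned["order_id"] = int(row["order_id"])
    return cleaned

# Example: a raw record as it comes out of the extract step.
raw_row = {"order_id": "1001", "customer": "Acme Corp", "amount": "$1,250.00"}
print(transform(raw_row))  # {'order_id': 1001, 'customer': 'Acme Corp', 'amount': 1250.0}
```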
Load
Once data are transformed, they are ready to be loaded into the data warehouse or data mart. The more often this is done, the more up-to-date the analytic reports can be.
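To finish the sketch, the load step writes the transformed rows into the warehouse; sqlite3 stands in for the data warehouse, and the table name and columns are still invented.

```python
import sqlite3

# sqlite3 stands in for the data warehouse; table and columns are invented.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

transformed_rows = [
    (1001, "Acme Corp", 1250.00),
    (1002, "Globex", 980.50),
]

# Load: the more often this runs, the fresher the reports built on the warehouse.
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed_rows)
warehouse.commit()

print(warehouse.execute("SELECT SUM(amount) FROM sales_fact").fetchone())  # (2230.5,)
```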
Hadoop
An infrastructure for storing and processing large data sets across multiple servers. Instead of centralizing files in one place like a data warehouse or data mart, Hadoop uses a distributed file system that allows files to be stored across many servers.
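Hadoop's processing model splits work into a map step that runs where each chunk of a file is stored and a reduce step that combines the partial results. The sketch below imitates that pattern locally in plain Python to show the idea; it does not use Hadoop itself, and the text blocks are invented.

```python
from collections import Counter

# Each string stands in for a block of a large file stored on a different server.
blocks = [
    "big data needs distributed storage",
    "hadoop stores data across many servers",
    "data is processed where it is stored",
]

# Map: each "server" counts words in its own block independently.
partial_counts = [Counter(block.split()) for block in blocks]

# Reduce: combine the partial results into one answer.
total = Counter()
for partial in partial_counts:
    total.update(partial)

print(total.most_common(2))  # [('data', 3), ('is', 2)]
```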
Which restriction applies to the data in the primary key field of a database?
The primary key has to be unique.
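A quick sqlite3 demonstration of that restriction: a second record that reuses an existing primary key value is rejected. The table is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO student VALUES (1, 'Ada Lovelace')")

try:
    # Reusing primary key 1 violates the uniqueness restriction.
    conn.execute("INSERT INTO student VALUES (1, 'Grace Hopper')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)  # Rejected: UNIQUE constraint failed: student.student_id
```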
A data set contains missing, misplaced, or duplicate data that the data analyst needs to remove.
Which process can this data analyst use to remove such data?
Normalization
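Whatever name is used for it, the cleanup the question describes (dropping records with missing fields and dropping duplicates) can be sketched in plain Python; the records below are invented.

```python
# Invented records: one has a missing customer, and one order_id appears twice.
records = [
    {"order_id": 1001, "customer": "Acme Corp", "amount": 1250.00},
    {"order_id": 1002, "customer": None, "amount": 980.50},          # missing value
    {"order_id": 1001, "customer": "Acme Corp", "amount": 1250.00},  # duplicate
]

cleaned = []
seen_ids = set()
for record in records:
    if any(value is None for value in record.values()):
        continue  # drop records with missing data
    if record["order_id"] in seen_ids:
        continue  # drop duplicate records
    seen_ids.add(record["order_id"])
    cleaned.append(record)

print(cleaned)  # only the first record survives
```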