Course1-M3 Flashcards
Data Collection and Data Wrangling
Non-relational databases can be queried using ____ or ____ query tools. Some non-relational databases come with their own query languages, such as CQL for Cassandra and Cypher for Neo4j.
SQL
SQL-like
Data Exchange platforms allow the exchange of data between data providers and data consumers. They provide data licensing workflows, de-identification and protection of personal information, legal frameworks, and a quarantined analytics environment. Examples of popular data exchange platforms include ____, ____, ____, and ____.
AWS Data Exchange
Crunchbase
Lotame
Snowflake
Semi-structured data is data that has some organizational properties but not a rigid schema, such as data from emails, XML, zipped files, binary executables, and TCP/IP protocols. Semi-structured data can be stored in NoSQL clusters. ____ and ____ are commonly used for storing and exchanging semi-structured data.
XML
JSON
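Example (a minimal sketch, not part of the original cards): the same record written once as JSON and once as XML, then parsed with Python's standard library. The record, its field names, and the tag names are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, expressed once as JSON and once as XML.
json_text = '{"name": "Ada", "email": "ada@example.com", "tags": ["admin", "dev"]}'
xml_text = (
    "<user><name>Ada</name><email>ada@example.com</email>"
    "<tag>admin</tag><tag>dev</tag></user>"
)

# JSON maps directly onto dicts and lists.
record_from_json = json.loads(json_text)

# XML is a tree of elements; pull the fields out by tag name.
root = ET.fromstring(xml_text)
record_from_xml = {
    "name": root.findtext("name"),
    "email": root.findtext("email"),
    "tags": [tag.text for tag in root.findall("tag")],
}
```

Neither format forces every record to carry the same fields, which is what makes them a fit for semi-structured data.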
Unstructured data is data that does not have a structure and cannot be organized into a schema, such as data from web pages, social media feeds, images, videos, documents, media logs, and surveys. ____ and ____ provide a good option to store and manipulate large volumes of unstructured data.
NoSQL databases
Data Lakes
Data lakes can accommodate all data types and schema.
____ and ____ provide automated functions that facilitate the process of importing data. Tools such as Talend and Informatica, and programming languages such as Python and R, and their libraries, are widely used for importing data.
ETL tools
data pipelines
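Example (a minimal sketch, not part of the original cards): importing delimited data with Python's built-in csv module. The column names and values are invented, and an in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from open("some_file.csv").
raw = io.StringIO("id,name,amount\n1,Ada,25.0\n2,Grace,40.0\n")

rows = []
for row in csv.DictReader(raw):
    # csv yields every field as text, so convert types explicitly on import.
    rows.append({
        "id": int(row["id"]),
        "name": row["name"],
        "amount": float(row["amount"]),
    })
```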
Joins combine ____ (columns/rows). Unions combine ____ (columns/rows).
Columns: When two tables are joined together, columns from the first source table are combined with columns from the second source table—in the same row. So, each row in the resultant table contains columns from both tables.
Rows: Rows of data from the first source table are combined with rows of data from the second source table into a single table. Each row in the resultant table is from one source table or another.
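Example (a minimal sketch, not part of the original cards): a join and a union run against an in-memory SQLite database. The table names, columns, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    CREATE TABLE customers_archive (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 25.0), (2, 40.0);
    INSERT INTO customers_archive VALUES (3, 'Edsger');
""")

# Join: each result row carries columns from both tables.
joined = conn.execute(
    "SELECT c.id, c.name, o.total FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# Union: same columns, rows stacked from both tables.
unioned = conn.execute(
    "SELECT id, name FROM customers "
    "UNION ALL SELECT id, name FROM customers_archive ORDER BY id"
).fetchall()
```

Here the join returns two rows of three columns each, while the union returns three rows of two columns each.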
Normalization: cleaning the database of unused data and reducing redundancy and inconsistency. Data coming from transactional systems, for example, where a number of insert, update, and delete operations are performed on an ongoing basis, are highly normalized.
Denormalization: combining data from multiple tables into a single table so that it can be queried faster. For example, normalized data coming from transactional systems is typically denormalized before running queries for reporting and analysis.
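Example (a minimal sketch, not part of the original cards): two normalized tables are combined into one denormalized reporting table in an in-memory SQLite database, so the report query needs no join. All table names, columns, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: product names are stored once and referenced by id.
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, quantity INTEGER);
    INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO sales VALUES (10, 1, 3), (11, 2, 1), (12, 1, 2);

    -- Denormalized: the product name is copied onto every sale row.
    CREATE TABLE sales_report AS
    SELECT s.sale_id, p.product_name, s.quantity
    FROM sales s JOIN products p ON p.product_id = s.product_id;
""")

# The reporting query reads a single table.
totals = conn.execute(
    "SELECT product_name, SUM(quantity) FROM sales_report "
    "GROUP BY product_name ORDER BY product_name"
).fetchall()
```

The trade-off is that the product name is now repeated on every sale row, which is the redundancy normalization removes.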
Popular data wrangling software and tools include ____, ____, ____, ____, ____, ____, ____, and ____.
Excel
Power Query / Spreadsheets and Add-ins
OpenRefine
Google Dataprep
Watson Studio Refinery
Trifacta Wrangler
Python
R