Data Science for Dummies Flashcards
What are the 3 types of data and examples of each?
Structured - stored, processed, and manipulated in a traditional relational database management system (RDBMS), e.g. a MySQL database that uses tabular data
Unstructured - data generated from human activities that doesn't fit into a structured database format, e.g. emails, Word docs, AV files
Semistructured - data that doesn't fit into a structured database system but can be organized by tags that create a form of order/hierarchy. Examples include XML files (which store data as hierarchical elements) and JSON files (which store simple data structures and objects in JavaScript Object Notation, a standard data interchange format primarily used for transmitting data between a web application and a server; JSON files are lightweight, text-based, human-readable, and editable in a text editor)
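As a quick illustration of the semistructured idea, here is a hypothetical JSON record parsed with Python's standard library (the field names are made up for the example):

```python
import json

# A hypothetical semistructured record: the tags (field names) impose
# a hierarchy without requiring a fixed table schema.
record = '{"user": {"name": "Ada", "emails": ["ada@example.com"]}, "active": true}'

parsed = json.loads(record)
print(parsed["user"]["name"])
```

The nested keys act like the tags described above: they give the data order and hierarchy even though it would not fit neatly into one relational table.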
What is big data?
Data that exceeds the processing capacity of conventional database systems because it’s too big or lacks the structural requirements of a traditional database architecture
What does it mean to query data?
Write commands to extract relevant datasets from data storage systems (usually you do this with SQL, structured query language)
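A minimal sketch of querying, using Python's built-in sqlite3 module against a hypothetical in-memory table (the table and column names are invented for the example):

```python
import sqlite3

# Build a tiny example table so there is something to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# The SQL query extracts a relevant subset: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
conn.close()
```

The same SELECT/GROUP BY pattern applies whether the storage system is SQLite, MySQL, or a cloud data warehouse.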
What is Hadoop?
A platform for batch-processing and storing large volumes of data. It was designed to boil down big data into smaller datasets that are more manageable for data scientists to analyze. Its popularity has been declining since 2015
What are the 3 characteristics that define big data?
Volume, velocity, and variety. Because these 3 Vs are ever expanding, newer, more innovative data technologies must be developed to manage big data problems
What’s the size of big data volume?
As low as 1 terabyte, with no upper limit. If your org owns at least 1 terabyte of data, the data technically qualifies as big data
What does it mean that most big data is “low value?”
Big data is composed of a huge number of very small transactions in many formats that only have value once they're aggregated (by data engineers) and analyzed (by data scientists)
Why is data velocity important?
A lot of big data is created using automated processes, and a lot of it is low value. You need systems that are able to ingest a lot of it quickly and generate timely and valuable insights
What is data velocity?
Data velocity is data volume per unit of time. Big data enters an average system at velocities between 30 kilobytes per second and 30 gigabytes per second
What does latency refer to?
Related to data velocity. Latency is a system's delay in moving data after it's been instructed to do so (every data system has some). Many engineered data systems are required to have latency of less than 100 milliseconds from the time data is created to the time the system responds
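A toy sketch of how latency is measured in practice: time the gap between issuing a request and receiving the response (the handler here is a stand-in, not a real data system):

```python
import time

def handle(request):
    # Stand-in for real data movement; a production system would do I/O here.
    return request.upper()

# Latency = elapsed time from request to response.
start = time.perf_counter()
response = handle("payload")
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.3f} ms")
```

A sub-100 ms latency requirement would mean `latency_ms` must stay below 100 even under production load.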
What is throughput?
Related to data velocity. A characteristic describing a system’s capacity for work per unit of time. The capabilities of data handling and processing technologies limit data velocities
What are some of the tools (3) that intake data into a system (data ingestion)?
Apache Sqoop - quickly transfers data back and forth between a relational database system and the Hadoop Distributed File System (HDFS)
Apache Kafka
Apache Flume
What does data variety refer to?
High variety data comes from a multitude of sources with different underlying structures (structured, unstructured, or semistructured)
What is a data lake?
A nonhierarchical data storage system used to hold huge volumes of multistructured raw data within a flat storage architecture. In other words, a collection of records that come in uniform format and are not cross-referenced in any way. HDFS can be used as a data lake storage repository, as can the AWS S3 platform (one of the more popular cloud architectures for storing big data)
What is a data warehouse?
A data warehouse is a centralized data repository that you can use to store and access only structured data. A more traditional warehouse system is a data mart, a storage system for structured data that holds one particular focus area of data belonging to one line of business in the company
What is machine learning?
The practice of applying algorithms to learn from and make automated predictions from data
What is a machine learning engineer?
Hybrid between a software engineer and a data scientist (NOT data engineer). A software engineer who is skilled enough in data science to deploy advanced data science models within the applications they build, bringing ML models into production in a live environment like a SaaS product or a webpage
What is a data engineer?
Build and maintain data systems for overcoming data processing bottlenecks and data handling problems that arise from handling the high volume, velocity, and variety of big data. They use SWE to design systems for and solve problems with handling and manipulating big data sets. They often have experience working with and designing real time processing frameworks and massively parallel processing platforms as well as RDBMSs.
What programming languages do data engineers code in?
Java, C++, Scala, or Python. They also know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big datasets into more manageable sizes
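The MapReduce pattern mentioned above can be sketched in plain Python (this is a toy illustration of the map/shuffle/reduce phases, not actual Hadoop or Spark code):

```python
from collections import defaultdict

# Toy word count in the MapReduce style.
lines = ["big data big", "data science"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'science': 1}
```

In a real Hadoop or Spark deployment, the map and reduce phases run in parallel across many machines, which is what refines big datasets into manageable sizes.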
What is the purpose of data engineering?
Engineer large scale data solutions by building coherent, modular, scalable data processing platforms that data scientists can use to derive insights
True or false: Data engineering involves engineering a built system
False. It involves the designing, building, and implementing of software solutions to problems in the data world.
What’s the difference between a data engineer, ML engineer, and data scientist?
The data engineer will store, migrate, and process your data, data scientist will make sense of the data, and ML engineer will bring ML models into production
What are the big cloud data storage services?
AWS, Google cloud, Microsoft azure
Why is cloud data storage more beneficial than on premise data storage?
-cloud service providers take care of the work to configure and maintain computing resources, which makes the data easier to use
-more flexibility - you can turn off cloud services you no longer need, vs. having unused servers on premise
-more secure
What is serverless computing?
Computing done in the cloud instead of on a desktop or on-premise hardware. A physical server still exists, but it is maintained by the cloud computing company you retain
What is FaaS and some examples?
Function as a service. It’s a containerized cloud computing service that makes it easier to execute code in a cloud environment without needing to set up code infrastructure (data science model runs directly in the container). Examples: AWS Lambda, Google cloud functions, azure functions
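A minimal sketch of what FaaS code looks like, in the shape AWS Lambda's Python runtime expects (the event fields here are hypothetical; the platform, not you, provisions the infrastructure and invokes the handler):

```python
import json

def handler(event, context):
    # `event` carries the request payload; the "name" field is made up
    # for this example. `context` holds runtime metadata from the platform.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"greeting": f"hello, {name}"}),
    }

# Locally you can invoke the handler directly; in the cloud, the FaaS
# platform calls it for you whenever a triggering event arrives.
result = handler({"name": "data"}, None)
print(result["body"])
```

The point of FaaS is that only this function is yours to write and deploy; servers, scaling, and routing are handled by the provider.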
What is Kubernetes?
An open source software suite that manages and coordinates the deployment, scaling, and management of containerized applications across clusters of worker nodes. It helps software developers build and scale apps quickly, but it needs data engineering expertise to set up
What does it mean for a system to be fault tolerant?
Built to continue operating successfully even if one of its subcomponents fails. It has redundancy in its computing nodes
What does it mean for a system to be extensible?
It can be extended or shrunk in size without disrupting its operations
What is parallel processing?
Data is processed quickly because the work required to process it is distributed across multiple nodes in a system. This configuration allows for simultaneous processing of tasks across multiple nodes
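A small sketch of the idea: split the data into chunks, process the chunks concurrently, then combine the partial results. Real big data systems spread the chunks across separate machines (nodes); here a local thread pool stands in for those workers:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each worker handles one partition of the data independently.
    return sum(chunk)

# Hypothetical dataset already partitioned into chunks.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process_chunk, chunks))

# Combine the partial results into the final answer.
total = sum(partials)
print(partials, total)  # [6, 15, 24] 45
```

The split/process/combine shape is the same one MapReduce and Spark use, just scaled up to clusters of machines instead of threads.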
Name 3 cloud warehouse solutions.
Amazon Redshift - big data warehousing service running on data sitting in the cloud
Snowflake - SaaS solution providing parallel processing for structured and semistructured data stored in the cloud on Snowflake's servers
Google BigQuery
True or false: an RDBMS can handle big data
False; an RDBMS is only for tabular data that can be queried with SQL