1. Intro to Data Curation Flashcards
What is data curation?
The process of preparing data for analytics
What is data science?
A multidisciplinary field that combines skills in computer science and statistics with domain experience. This combination of skills and experience is used to support the end-to-end analysis of large and diverse data sets, ultimately uncovering value for an organization and then communicating that value to stakeholders as actionable results.
What are the 6 phases of the Data Curation Lifecycle?
Finding
Exploring
Structuring
Cleansing
Updating
Archiving
What is a CPU?
The place where all the work or processing takes place on the computer. The CPU can be thought of as the brain of the computer. It executes instructions supplied by programs and applications.
What is RAM?
Random Access Memory - the component that stores data for immediate use in CPU processing. RAM is volatile memory, meaning that when you turn your computer off, data in memory is lost. Memory serves as the intermediary between data stored physically on disk and the processing of that data.
What is structured data?
Data that has clearly defined columns and data types. Rows of data are stored in logical records where the fields or entries in each record pertain to a specific entity.
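A minimal Python sketch of this idea, assuming a hypothetical Customer entity (the class name, field names, and sample values are illustrative and not from the course material). Each record has the same named, typed fields, and every field describes one specific customer.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    # Hypothetical columns; each one has a fixed name and data type.
    customer_id: int
    name: str
    balance: float


# Each row is one logical record describing a single customer.
rows = [
    Customer(1, "Ada", 120.50),
    Customer(2, "Grace", 75.00),
]

# The column names are defined alongside the data, not guessed from it.
column_names = [f.name for f in fields(Customer)]

for row in rows:
    print(row.customer_id, row.name, row.balance)
```

Because the schema is declared up front, any program reading these records knows which columns exist and what type each one holds.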
What is unstructured data?
Data that does not have a defined data model or schema. The column names, data types, and lengths are not defined and stored with the data. Examples include social media data, audio and video files, and raw data.
What is Hadoop?
Hadoop is an open-source software framework that uses a cluster of computers for distributed storage and parallel processing of data.
What is a computer cluster?
A computer cluster is a grouping of multiple computers, connected by a local area network.
What is a node?
A computer in a cluster
What is distributed storage?
Distributed storage of data means that the data is stored in pieces across your computer cluster. Instead of having to fit an entire file on one disk on one computer, the file is broken into pieces and distributed across the nodes.
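The splitting described above can be sketched in Python as a toy model. The function names, the round-robin placement, and the node names are assumptions made for illustration; real systems such as Hadoop use their own block sizes and placement policies.

```python
def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def distribute(blocks, node_names):
    """Assign blocks to nodes round-robin (an assumed, simplified policy)."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        node = node_names[index % len(node_names)]
        # Keep the block's position so the file can be rebuilt later.
        placement[node].append((index, block))
    return placement


def reassemble(placement):
    """Collect the pieces from every node and restore the original order."""
    indexed = [pair for node_blocks in placement.values() for pair in node_blocks]
    return b"".join(block for _, block in sorted(indexed))


file_bytes = b"abcdefghij"
blocks = split_into_blocks(file_bytes, block_size=4)
placement = distribute(blocks, ["node1", "node2"])
restored = reassemble(placement)
```

No single node holds the whole file, yet the stored block positions are enough to rebuild it.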
What is a data lake?
A data lake is a storage repository that holds structured and unstructured data together. It does not require your data to fit a certain structure or schema, so you can store a large variety and volume of data in one place. With data lakes, the data can be dumped into storage as is and curated later in the process.
What is cloud storage?
Cloud storage enables you to store your data in a location that you cannot physically access, but you can still access easily through the internet. Your data isn’t sitting on a server in the basement of your office or on the hard drive of your desktop computer, but instead, it is stored on your cloud provider’s servers.
What are the 4 major resources of a computing environment?
CPU, Memory, Storage, and Network
What is parallel processing?
The concept of breaking jobs into tasks that run simultaneously
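As a rough illustration, the sketch below breaks one job (summing a list of numbers) into tasks and hands them to a pool of workers from Python's standard concurrent.futures module. The function names and the default task count are assumptions. Threads keep the example simple; for CPU-heavy work in CPython, a ProcessPoolExecutor is the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def split_job(numbers, task_count):
    """Break the job's input into roughly equal chunks, one per task."""
    size = max(1, -(-len(numbers) // task_count))  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def run_job(numbers, task_count=4):
    """Run each task on a worker from the pool, then combine the results."""
    tasks = split_job(numbers, task_count)
    with ThreadPoolExecutor(max_workers=task_count) as pool:
        partial_sums = list(pool.map(sum, tasks))
    return sum(partial_sums)


total = run_job(list(range(1, 101)))
```

The pattern is the same at any scale: split the work, run the pieces concurrently, then merge the partial results.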
What is grid computing?
Grid computing enables us to expand the resources that are available for processing jobs beyond a single computer. Computer grids are created by connecting multiple computers via a network in order to take advantage of all the processing power and resources available on those computers.
What is cloud computing?
A broad term that refers to immediate access to computing resources hosted over the internet. These resources can include software, data storage, processing power, and more.
What are the 3 broad service types of cloud computing?
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
What is IaaS?
Providers of Infrastructure as a Service supply the infrastructure, which includes the basic computing resources and storage, and the users then build everything else that they need. When companies rely on IaaS providers, it can be thought of as renting servers, and their users can install operating systems and programs on the servers.
What is PaaS?
With PaaS, a provider offers more of the application stack than IaaS providers, adding operating systems, middleware (such as databases), and other runtimes into the cloud environment. Users can develop applications without worrying about installing the operating system or dealing with maintenance or updates.