1. Intro to Data Curation Flashcards

1
Q

What is data curation?

A

The process of preparing data for analytics

2
Q

What is data science?

A

A multidisciplinary field that combines skills in computer science and statistics with domain experience. This combination of skills and experience is used to support the end-to-end analysis of large and diverse data sets, ultimately uncovering value for an organization and then communicating that value to stakeholders as actionable results.

3
Q

What are the 6 phases of the Data Curation Lifecycle?

A

Finding
Exploring
Structuring
Cleansing
Updating
Archiving

4
Q

What is a CPU?

A

The place where all the work or processing takes place on the computer. The CPU can be thought of as the brain of the computer. It executes instructions supplied by programs and applications.

5
Q

What is RAM?

A

Random Access Memory - the component that stores data for immediate use in CPU processing. RAM is volatile memory, meaning that when you turn your computer off, data in memory is lost. Memory serves as the intermediary between data stored physically on disk and the processing of that data.

6
Q

What is structured data?

A

Data that has clearly defined columns and data types. Rows of data are stored in logical records where the fields or entries in each record pertain to a specific entity.

7
Q

What is unstructured data?

A

Data that does not have a defined data model or schema. The column names, data types, and lengths are not defined and stored with the data. Examples include social media data, audio and video files, and raw data.

8
Q

What is Hadoop?

A

Hadoop is an open-source software framework that utilizes a cluster of computers for distributed storage and parallel processing of data.

9
Q

What is a computer cluster?

A

A computer cluster is a grouping of multiple computers, connected by a local area network.

10
Q

What is a node?

A

A computer in a cluster

11
Q

What is distributed storage?

A

Distributed storage of data means that the data is stored in pieces across your computer cluster. Instead of having to fit an entire file on one disk on one computer, the file is broken into pieces and distributed across the nodes.
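
The idea can be illustrated with a minimal Python sketch. The function name `distribute_blocks`, the node names, and the round-robin placement are assumptions made for illustration only; real systems such as Hadoop also handle details like replication that are omitted here.

```python
def distribute_blocks(data: bytes, nodes: list[str], block_size: int) -> dict[str, list[bytes]]:
    """Split data into fixed-size blocks and assign them to nodes in round-robin order."""
    placement: dict[str, list[bytes]] = {node: [] for node in nodes}
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        node = nodes[block_index % len(nodes)]  # hypothetical placement policy
        placement[node].append(data[offset:offset + block_size])
    return placement


file_bytes = b"abcdefghijklmnopqrstuvwxyz"
placement = distribute_blocks(file_bytes, ["node1", "node2", "node3"], block_size=4)
```

No single node holds the whole file; each one stores only some of its pieces.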

12
Q

What is a data lake?

A

Data lakes are useful for storing structured and unstructured data. They do not require your data to fit a certain structure or schema, and they enable you to store a large variety and volume of data together. With data lakes, the data can be dumped into storage as is and curated later in the process.

13
Q

What is cloud storage?

A

Cloud storage enables you to store your data in a location that you cannot physically access, but you can still access easily through the internet. Your data isn’t sitting on a server in the basement of your office or on the hard drive of your desktop computer, but instead, it is stored on your cloud provider’s servers.

14
Q

What are the 4 major resources of a computing environment?

A

CPU, Memory, Storage, and Network

15
Q

What is parallel processing?

A

The concept of breaking jobs into tasks that run simultaneously
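
As a rough sketch in Python, a word-count job can be broken into one task per chunk of text. The chunks and the `count_words` helper are made up for illustration, and threads stand in for the workers here; for CPU-bound work, `ProcessPoolExecutor` would typically be used instead.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk: str) -> int:
    """One task: count the words in a single chunk of text."""
    return len(chunk.split())


# The job (count all words) is broken into one task per chunk.
chunks = ["the quick brown fox", "jumps over", "the lazy dog"]

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# Combine the partial results into the final answer.
total_words = sum(partial_counts)
```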

16
Q

What is grid computing?

A

Grid computing enables us to expand the resources that are available for processing and jobs beyond a single computer. Computer grids are created by connecting multiple computers together via a network in order to take advantage of all the processing power and resources available on those computers.

17
Q

What is cloud computing?

A

A broad term that refers to immediate access to computing resources hosted over the internet. These resources can include software, data storage, processing power, and more.

18
Q

What are the 3 broad service types of cloud computing?

A

Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)

19
Q

What is IaaS?

A

Providers of Infrastructure as a Service supply the infrastructure, which includes the basic computing resources and storage, and the users then build everything else that they need. When companies rely on IaaS providers, it can be thought of as renting servers, and their users can install operating systems and programs on the servers.

20
Q

What is PaaS?

A

With PaaS, a provider offers more of the application stack than IaaS providers, adding operating systems, middleware (such as databases), and other runtimes into the cloud environment. Users can develop applications without worrying about installing the operating system or dealing with maintenance or updates.

21
Q

What is SaaS?

A

With Software as a Service, cloud providers host software applications. These applications are available to customers via the internet.

22
Q

What are the 4 tiers of SAS 9.4?

A

Data Sources,
SAS Servers,
Middle Tier,
Client Applications

23
Q

What is the SAS Server Tier?

A

The SAS servers are the software components of your SAS deployment that receive requests from client applications and perform requested operations.

24
Q

What is the Client Applications Tier?

A

The client applications provide users with various programming or point-and-click interfaces to the SAS Platform. Some of the client applications are SAS Data Integration Studio, DataFlux Data Management Studio, SAS Event Stream Processing Studio, and SAS Studio.

25
Q

What is the Middle Tier?

A

The middle tier contains software components that enable SAS users to work with web applications, such as SAS Studio. These web applications are hosted on the middle tier and send data to and from users who interact with these hosted applications via a web browser.

26
Q

What is the SAS Grid Control Server?

A

The SAS Grid Control Server controls the distribution of jobs to the grid. If a client application wants to submit code to the grid, the request is sent to the Grid Control Server, and the request is queued and dispatched based on policies set by the grid administrator.

27
Q

What is SAS Viya?

A

SAS Viya is a cloud-enabled, in-memory analytics engine that uses SAS Cloud Analytics Services, or CAS.

28
Q

What is CAS?

A

CAS is a server that provides the run-time environment for data management and analytics with SAS, enabling you to tackle your analytics problems and gain insights from data.

29
Q

What is Machine Learning (ML)?

A

Machine learning involves the creation of computer programs and algorithms that learn from the data itself and adjust accordingly as new data becomes available. This means that as the model is given more data to work with, it can adjust and predict based on this new data.

30
Q

What is data velocity?

A

How quickly new data is being generated

31
Q

What should you consider about data variety?

A

Ask yourself whether you have sampled a diverse and representative population.

32
Q

What should you consider about data veracity?

A

Consider the accuracy of the data. Does the identified data contribute to answering the identified question? Is the data precise, trusted, and reliable?

33
Q

What are some common data exploration tasks as part of the curation life cycle?

A
  • Plotting the data gives a visual overview of both categorical and continuous variables.
  • Calculate some basic statistics. For example, with numerical data, you can look at the range, minimum, maximum, and frequency of values. You might also want to look at measures of central tendency such as the mean, median, and mode.
  • Determine where data needs to be cleansed and standardized.
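
The basic statistics named above can be sketched with Python's standard library. The `ages` values are invented sample data, not taken from any real data set.

```python
import statistics
from collections import Counter

# Hypothetical numerical column used only for illustration.
ages = [23, 35, 35, 41, 29, 35, 52]

summary = {
    "min": min(ages),
    "max": max(ages),
    "range": max(ages) - min(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
}

# Frequency of each distinct value.
frequencies = Counter(ages)
```

Unexpected minimums, maximums, or frequencies in a summary like this are often the first hint of where cleansing is needed.
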
34
Q

What are some common steps during the data structuring and cleansing phase of the data curation lifecycle?

A

All the issues identified in data exploration (casing, spelling, missing values, extreme values, and more) need to be handled in an appropriate way.

Aggregations.
Structuring = Standardizing
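
A minimal Python sketch of handling casing and missing values follows. The `standardize_city` helper, the sample values, and the choice of "Unknown" as the replacement for missing entries are all assumptions for illustration; the appropriate treatment depends on the data and the analysis.

```python
# Hypothetical raw values showing inconsistent casing, stray spaces, and missing entries.
raw_cities = ["  new york", "New York ", "NEW YORK", None, "boston", ""]


def standardize_city(value, default="Unknown"):
    """Trim whitespace, apply consistent casing, and fill missing values."""
    if value is None or not value.strip():
        return default
    return value.strip().title()


clean_cities = [standardize_city(city) for city in raw_cities]
```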

35
Q

What is SAS/ACCESS Technology?

A

SAS/ACCESS technology enables users to query and manage data stored in databases and other data sources. Users can manage, update, and query data using SQL that is native to the database or using SAS language.

Processing can be pushed to the database, depending on the methods used. Results can also be brought back to SAS and saved to SAS tables for further processing and analysis.

36
Q

What is SAS Data Integration Studio?

A

SAS Data Integration Studio is a SAS platform application interface that enables users to manage their data integration processes across an organization. Users can create jobs using a drag-and-drop interface. These jobs generate SAS code to access, manipulate, integrate, and store their data across a wide variety of data formats.

37
Q

What is SAS DataFlux Data Management Studio?

A

DataFlux Data Management Studio is a platform application interface designed for data integration and advanced data quality. To perform a wide variety of data quality operations, users leverage an extensive library of data quality rules and algorithms, referred to as the Quality Knowledge Base (QKB), as well as third-party reference data packs. These operations include standardization, entity resolution, address verification, and more. DataFlux Data Management Studio also has built-in tools to profile data and build business rules, enabling data quality stewards to identify and remedy issues in their data. Users can design automated processes to assess data for specific data quality issues and generate alerts when such issues arise.

38
Q

What is SAS Data Loader for Hadoop?

A

SAS Data Loader for Hadoop is a web-based, non-programmatic way for users to interact with data in Hadoop. It can be used to move data in and out of Hadoop; interrogate and profile data for quality issues; transform, transpose, and join data; and more.

39
Q

What is SAS Federation Server?

A

SAS Federation Server is a platform application interface that makes it easier for business users to access secure data for reporting and analysis.

It enables data administrators to define SQL-based views, making the data available to users without physically moving the data.

SAS Federation Server can be used to maintain, configure, and monitor data access from a single point of administration in a web browser interface, improve data access performance, and apply data quality functions such as standardization and parsing. If necessary, administrators can control business user permissions all the way to the row and column level.

40
Q

What is SAS Event Stream Processing?

A

SAS Event Stream Processing Studio provides a graphical interface as well as a code-based interface that enable users to build event stream processing applications. An event stream is the continuous flow of data points from a sensor or other application. SAS Event Stream Processing Studio can be used to ingest, filter, join, and aggregate event streams, as well as to execute external routines against event streams and to detect patterns in event streams.

41
Q

What is the SAS QKB?

A

The SAS QKB is a collection of files and algorithms that store data and logic for defining data management operations such as data cleansing and standardization.

The SAS QKB is used in SAS Data Integration Studio, DataFlux Data Management Studio, SAS Data Loader for Hadoop, SAS Federation Server, SAS Event Stream Processing Studio, and more.

42
Q

What is AI?

A

AI can be thought of as the ability of your computer to mimic human intelligence.

43
Q

What is SAS’ definition of ML?

A

It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.

44
Q

What are the 4 categories of ML algorithms?

A

supervised,
semi-supervised,
unsupervised,
reinforcement.

45
Q

What is Training Data?

A

In supervised machine learning, the selected algorithm and data are used by machine learning tools to build a model. The data used to initially build the model from the machine learning algorithm selected is considered the training data.

46
Q

What is a target in terms of training data?

A

An appropriate answer, known as a target, must be present in the training data. If financial data is being used to determine whether a customer missed a credit card payment, the answer yes (they did miss a payment) or no (they did not miss a payment) would need to be present in the training data.
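
The credit card example might look like this in Python. The column names and values are invented for illustration; the point is only that each training row carries its known answer, which is separated from the inputs before a model is built.

```python
# Hypothetical training rows: each one includes the known answer (the target).
training_rows = [
    {"balance": 1200.0, "late_fees": 2, "missed_payment": "yes"},
    {"balance": 300.0, "late_fees": 0, "missed_payment": "no"},
    {"balance": 950.0, "late_fees": 1, "missed_payment": "yes"},
]

TARGET = "missed_payment"

# Inputs the model learns from, with the target column removed.
features = [{k: v for k, v in row.items() if k != TARGET} for row in training_rows]

# The answers the model is trained to predict.
targets = [row[TARGET] for row in training_rows]
```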

47
Q

What is data federation?

A

Data Federation is the ability to use data across multiple source systems without physically having to move the data. Access to the data is provided via SQL views, and these views populate data only when the view is accessed.
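
The behavior of such a view can be sketched with Python's built-in `sqlite3` module. This is not how SAS Federation Server is implemented; two attached in-memory databases (`sales` and `crm`, both hypothetical names) simply stand in for separate source systems, and the table and view names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two attached databases stand in for two separate source systems.
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")

# The view stores only the query, not a copy of the data.
conn.execute("""
    CREATE TEMP VIEW customer_totals AS
    SELECT c.name AS name, SUM(o.amount) AS total
    FROM crm.customers AS c
    JOIN sales.orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
""")

conn.executemany("INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany("INSERT INTO sales.orders VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])


def totals():
    """Query the view; the underlying tables are read at this moment."""
    return conn.execute("SELECT name, total FROM customer_totals ORDER BY name").fetchall()


before = totals()
conn.execute("INSERT INTO sales.orders VALUES (2, 2.5)")
after = totals()  # reflects the new order without any refresh step
```

Because the view is evaluated on access, the second query picks up the new order even though nothing was copied or reloaded.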

48
Q

What is Data Virtualization?

A

Data Virtualization is the process of accessing and manipulating data from disparate systems through a common data-access approach that hides the complexity of data access from the end user. This includes how the data is formatted, where it is located, database security, database schemas or table names, and so on, as well as how data across multiple sources fits together.

49
Q

What is Data Disclosure Control?

A

Data Disclosure Control is modifying data so that no sensitive information remains.

The challenge of Data Disclosure Control is in the ability to share information with users, while at the same time, protecting personally identifiable information (for example, account numbers, addresses, phone numbers, and taxpayer IDs) from the end user.

It requires that the data elements are masked from the end user in some way.
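
One simple form of masking can be sketched in Python. The `mask_value` helper, the sample record, and the rule of keeping the last four characters visible are assumptions for illustration; actual masking policies vary by organization and data type.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of a sensitive value."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


# Hypothetical record and list of sensitive fields.
record = {"name": "Pat Lee", "account_number": "1234567890", "phone": "555-0100"}
sensitive_fields = {"account_number", "phone"}

masked_record = {
    key: mask_value(value) if key in sensitive_fields else value
    for key, value in record.items()
}
```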

50
Q

What is SAS Data Governance?

A

SAS Data Governance enables you to address the scope of data governance, including the business data glossary, metadata lineage, and data monitoring. As you learn about SAS Data Governance, you are exposed to SAS Business Data Network, SAS Lineage, and DataFlux Data Management Studio.

51
Q

What should you consider about data volume?

A

Ask yourself whether you have enough data to train a machine learning model or make a statistically significant claim.