Exam 1 Terms Flashcards
Datasets that are too large and complex for businesses’ existing systems and traditional capabilities to capture, store, manage, and analyze.
Big Data
A data approach that attempts to assign each unit in a population into a few categories, potentially to help with predictions.
Classification
A data approach that attempts to divide individuals (like customers) into groups (or clusters) in a useful way.
Clustering
A data approach that attempts to discover associations between individuals based on transactions involving them.
Co-occurrence grouping
The process of evaluating data with the purpose of drawing conclusions to address business questions.
Data Analytics
Centralized repository of descriptions for all of the data attributes of the dataset.
Data dictionary
A data approach that attempts to reduce the amount of information that needs to be considered to focus on the most critical items.
Data reduction
A data approach that attempts to predict a relationship between two data items.
Link Prediction
A variable that predicts or explains another variable.
Predictor variable.
A data approach that attempts to characterize the “typical” behavior of an individual, group, or population by generating summary statistics about the data.
Profiling
A data approach that attempts to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model.
Regression
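A minimal sketch of the regression idea, using only the standard library and made-up numbers (the data and function name are illustrative, not part of the card set):

```python
# Simple linear regression via ordinary least squares, sketched by hand.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]               # illustrative data: y = 2x exactly
slope, intercept = fit_line(x, y)
predicted = slope * 6 + intercept  # estimate the value for a new unit, x = 6
```

The fitted model (slope and intercept) is then used to estimate the numerical value of the response variable for new units.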
A variable that responds to, or is dependent on, another.
Response variable
A data approach that attempts to identify similar individuals based on data known about them.
Similarity matching
Data that are organized and reside in a fixed field within a record or a file.
Structured data
Data that do not adhere to a predefined data model in a tabular format.
Unstructured data
A system that records, processes, reports, and communicates the results of business transactions to provide financial and nonfinancial information for decision-making purposes.
Accounting information system
A special case of a primary key that exists in linking tables; it is made up of the primary keys of the two tables being linked.
Composite primary key
An information system for managing all interactions between the company and its current and potential customers.
Customer Relationship Management system (CRM)
A method for obtaining data if you do not have access to obtain the data directly yourself.
Data request form
Attributes that exist in relational databases that are neither primary nor foreign keys; they provide business information.
Descriptive attributes
A category of business management software that integrates applications from throughout the business into one system.
Enterprise Resource Planning system (ERP)
The extract, transform, and load process that is integral to mastering the data.
ETL
A means of storing data in one place, such as in an Excel spreadsheet, as opposed to storing the data in multiple tables, such as in a relational database.
Flat file
An attribute that exists in a relational database in order to carry out the relationship between two tables. It does not serve as the “unique identifier” for each record in its table.
Foreign Key
An information system for managing all interactions between the company and its current and potential employees.
Human Resource Management system (HRM)
The second step in the IMPACT cycle; it involves identifying and obtaining the data needed for solving the data analysis problem, as well as cleaning and preparing the data for analysis.
Mastering the data
An attribute that is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table.
Primary key
A means of storing data in order to ensure that the data are complete and not redundant, and to help enforce business rules, supporting communication and integration of business processes.
Relational Database
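Several of the cards above (primary key, foreign key, relational database) can be sketched with the standard library’s sqlite3 module; the table and column names below are made up for illustration:

```python
import sqlite3

# A hypothetical two-table schema showing a primary key, a foreign key,
# and a relationship enforced between the tables.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("""CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- unique identifier for each record
    name        TEXT)""")
con.execute("""CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),  -- foreign key
    amount      REAL)""")
con.execute("INSERT INTO customer VALUES (1, 'Acme')")
con.execute("INSERT INTO sale VALUES (100, 1, 250.0)")
# Join the two tables through the primary-key/foreign-key relationship.
row = con.execute("""SELECT c.name, s.amount
                     FROM sale s JOIN customer c
                       ON s.customer_id = c.customer_id""").fetchone()
```

A flat file, by contrast, would hold the customer name redundantly on every sale row.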
An information system that helps manage all the company’s interactions with suppliers.
Supply Chain Management system (SCM)
The opposite of the null hypothesis, or a potential result that the analyst may expect.
Alternative hypothesis
The principle that in any large, randomly produced set of natural numbers, there is an expected distribution of the first, or leading, digit with 1 being the most common.
Benford’s Law
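The expected leading-digit distribution on this card follows directly from the formula P(d) = log10(1 + 1/d); a short stdlib sketch (the function names are illustrative):

```python
import math

# Expected frequency of each leading digit d (1-9) under Benford's Law.
def benford_expected(d: int) -> float:
    return math.log10(1 + 1 / d)

def leading_digit(n: float) -> int:
    # Strip any leading zeros and decimal point, then take the first digit.
    return int(str(abs(n)).lstrip("0.")[0])

expected = {d: benford_expected(d) for d in range(1, 10)}
# Digit 1 is the most common (~30.1%); digit 9 the least (~4.6%).
```

Auditors compare the observed leading-digit frequencies of a dataset (e.g., invoice amounts) against `expected` to flag possible fabrication.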
A data approach similar to regression, but used to test for cause and effect relationships between multiple variables.
Causal modeling
Technique used to mark the split between one class and another.
Decision boundaries
An information system that supports decision-making activity within a business by combining data and expertise to solve problems and perform calculations.
Decision support system
Tool used to divide data into smaller groups.
Decision tree
Procedures that summarize existing data to determine what has happened in the past.
Descriptive analytics
Procedures that explore the current data to determine why something has happened the way it has, typically comparing the data to a benchmark.
Diagnostic analytics
An interactive report showing the most important metrics to help users understand how a company or an organization is performing.
Digital Dashboard
A numerical value (1 or 0) used to represent categorical data in statistical analysis (e.g., 1 = attribute present, 0 = attribute absent).
Dummy variable
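A minimal sketch of dummy coding; the records and the coding scheme (1 = domestic, 0 = otherwise) are made up for illustration:

```python
# Encode a categorical attribute as a 0/1 dummy variable.
customers = [
    {"id": 1, "region": "domestic"},
    {"id": 2, "region": "foreign"},
    {"id": 3, "region": "domestic"},
]
dummies = [1 if c["region"] == "domestic" else 0 for c in customers]
```

The resulting 0/1 column can then enter a regression model alongside numeric predictors.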
Used in addition to statistical significance in statistical testing. Demonstrates the magnitude of the difference between groups.
Effect size
A measure of variability: the range of the middle 50% of the data, from the first quartile to the third quartile.
Interquartile range (IQR)
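The IQR can be computed with the standard library’s `statistics.quantiles`; the data below are made up for illustration:

```python
from statistics import quantiles

# The IQR is the spread of the middle 50% of the observations:
# third quartile minus first quartile.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
```

Because it ignores the top and bottom quarters of the data, the IQR is robust to outliers, unlike the range.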
A modeling error when the derived model too closely fits a limited set of data points.
Overfitting
Procedures used to generate a model that can be used to determine what is likely to happen in the future.
Predictive analytics
Procedures that work to identify the best possible options given constraints or changing conditions.
Prescriptive analytics
Describe the location, spread, shape, and dependence of a set of observations.
Summary Statistics
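The location and spread measures on this card map directly onto the standard library’s `statistics` module; the dataset is made up for illustration:

```python
import statistics as st

# Summary statistics describing location and spread of a small dataset.
data = [2, 4, 4, 4, 5, 5, 7, 9]
location = {"mean": st.mean(data),
            "median": st.median(data),
            "mode": st.mode(data)}
spread = {"range": max(data) - min(data),
          "stdev": st.pstdev(data)}   # population standard deviation
```

Profiling (above) is essentially the practice of generating statistics like these for a group or population.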
Approach used to learn more about the basic relationships between independent and dependent variables that are hypothesized to exist.
Supervised approach
A discriminating classifier defined by a separating hyperplane chosen to maximize the margin between the classes.
Support Vector Machines
A set of data used to assess the degree and strength of a predicted relationship established by the analysis of training data.
Test data
A predictive analytics technique used to predict future values based on past values of the same variable.
Time series analysis
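One of the simplest time-series techniques is a moving-average forecast: predict the next value as the mean of the last few observations. A sketch with made-up monthly data (the function name and window size are illustrative):

```python
# Naive time-series forecast: the next value is the mean of the
# last k observations (a simple moving average).
def moving_average_forecast(series, k=3):
    window = series[-k:]
    return sum(window) / len(window)

monthly_sales = [100, 110, 120, 130, 140, 150]
next_month = moving_average_forecast(monthly_sales)  # mean of last 3 values
```

Real time-series models (trend, seasonality, autoregression) build on the same idea of predicting future values from past values of the same variable.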
Existing data that have been manually evaluated and assigned a class, which assists in classifying the test data.
Training data
A modeling error when the derived model poorly fits a limited set of data points.
Underfitting
Approach used for data exploration looking for potential patterns of interest.
Unsupervised approach
A global standard for exchanging financial reporting information that uses XML.
XBRL
One way to categorize quantitative data, as opposed to discrete data; can take any value within a range. Examples: height or weight.
Continuous data
Made when the aim of your project is to declare or present your findings to an audience. Charts made after the data analysis has been completed.
Declarative visualizations
One way to categorize quantitative data, as opposed to continuous data; takes whole-number values, like points in a basketball game.
Discrete data
Made when the lines between steps “P” (perform test plan), “A” (address and refine results), and “C” (communicate results) are not as clearly divided as they are in a declarative visualization.
Exploratory visualization
The third most sophisticated type of data on the scale of nominal, ordinal, interval, and ratio. Quantitative data with meaningful differences between values but no meaningful zero.
Interval data
The least sophisticated type of data on the scale of nominal, ordinal, interval, and ratio. You can only count, group, or take a proportion. Examples: hair color, gender, and ethnic groups.
Nominal data
A type of distribution in which the median, mean, and mode are all equal, so half of all the observations fall below the mean and the other half fall above the mean.
Normal distribution
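The equal-mean-and-median property on this card can be checked with the standard library’s `statistics.NormalDist`; the mean and standard deviation below are made-up illustrative parameters:

```python
from statistics import NormalDist

# For a normal distribution the mean equals the median:
# exactly half of the probability mass lies below the mean.
dist = NormalDist(mu=50, sigma=10)
below_mean = dist.cdf(50)                    # cumulative probability at the mean
within_one_sd = dist.cdf(60) - dist.cdf(40)  # mass within one standard deviation
```

The symmetric bell shape is what makes `below_mean` exactly 0.5 and puts roughly 68% of observations within one standard deviation of the mean.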
The second most sophisticated type of data on the scale of nominal, ordinal, interval and ratio. Can be counted and categorized like nominal data. Gold, silver, and bronze medals. Includes ranking.
Ordinal data
The primary statistic used with qualitative data. Calculated by counting the number of items in a group, then dividing that number by the total.
Proportion
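The count-then-divide calculation on this card in one line, with made-up survey responses:

```python
# Proportion: count the items in a category, divide by the total count.
responses = ["yes", "no", "yes", "yes", "no"]
p_yes = responses.count("yes") / len(responses)
```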
Categorical data. All you can do with these data is count and group, and in some cases, rank.
Qualitative data
More complex than qualitative data. Can be further defined in 2 ways: interval and ratio. Can have mean, median, STD dev.
Quantitative data
The most sophisticated type of data on the scale of nominal, ordinal, interval, and ratio. Can be counted and grouped, the differences between data points are meaningful like interval data, and there is a meaningful zero, so ratios between values are meaningful.
Ratio data
A special case of the normal distribution used for standardizing data.
Standard normal distribution
The method used for comparing two datasets that follow the normal distribution. By using a formula, every normal distribution can be transformed into the standard normal distribution.
Standardization
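The transformation on this card is the z-score formula, z = (x − mean) / standard deviation; a stdlib sketch with made-up data (the function name is illustrative):

```python
from statistics import mean, pstdev

# Standardize observations into z-scores, so any normal distribution
# is transformed into the standard normal (mean 0, standard deviation 1).
def standardize(data):
    m, s = mean(data), pstdev(data)
    return [(x - m) / s for x in data]

z = standardize([2, 4, 6, 8])
```

After standardizing, two datasets that follow the normal distribution can be compared on the same scale.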