Lecture 2 Flashcards
Data?
A collection of facts
How is data obtained?
As the result of experiences, observations, or experiments
What does data consist of?
Numbers
Words
Images
Data source reliability?
Confidence and belief in this data source
Data content accuracy?
The right data for the job
Data accessibility?
Can we easily get to the data when we need to?
Data security and privacy?
Access is restricted to authorized people only
Data richness?
All the required data elements are included in the data set
Data consistency?
Accurately collected and combined/merged
Data currency?
Up to date
Data granularity?
The variables are defined at the lowest level of detail needed for the intended use of the data
Data validity?
Match/mismatch between the actual and expected data values of a given variable
Data relevancy?
The variables in the data set are all relevant to the study being conducted
Structured Data?
Targeted for computers to process
Numeric versus Categorical
Unstructured/Textual Data?
Targeted for humans to process/digest
Semi-Structured Data?
XML
HTML
Log files
Categorical Structured Data?
Nominal
Ordinal
Numerical Structured Data?
Interval
Ratio
Unstructured/Semi-Structured Data contents?
Textual
Multimedia
XML/JSON
What does data preprocessing include?
Data consolidation
Data cleaning
Data transformation
Data reduction
Variables?
Dimensional Reduction
Variable Selection
Cases/Samples?
Sampling
Balancing
Data consolidation subtasks?
Access and collect the data
Select and filter the data
Integrate and unify the data
Data consolidation popular methods?
SQL queries
Software agents
Web services
Domain expertise
Data cleaning subtasks?
Handle missing values in the data
Identify and reduce noise in the data
Find and eliminate erroneous data
Data cleaning, handling missing data popular methods?
Fill in the missing values with the most appropriate values
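A minimal sketch of one way to fill in missing values, assuming missing entries are stored as None and that the mean (numeric data) or the mode (categorical data) counts as the "most appropriate" value. The function name fill_missing and the sample lists are hypothetical.

```python
from statistics import mean, mode

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)      # suitable for numeric variables
    elif strategy == "mode":
        fill = mode(observed)      # suitable for categorical variables
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical sample data
filled_ages = fill_missing([20, None, 30, 40])                        # [20, 30, 30, 40]
filled_colors = fill_missing(["red", None, "blue", "red"], "mode")    # None -> "red"
```

Other choices (median, a model-based estimate, or dropping the record) may fit better depending on the variable.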
Data cleaning, identifying and reducing noise in the data popular methods?
Identify the outliers in data with simple statistical techniques or with cluster analysis
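One simple statistical technique for spotting outliers is the interquartile-range (IQR) rule. The sketch below assumes the common 1.5 x IQR cutoff; the function name iqr_outliers and the readings are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # first, second, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one suspicious value
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)   # [95]
```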
Data cleaning, finding and eliminating erroneous data popular methods?
Identify the erroneous values in data, such as odd values, inconsistent class labels, odd distributions
Data transformation subtasks?
Normalize the data
Discretize or aggregate the data
Construct new attributes
Data transformation, normalizing data popular methods?
Reduce the range of values in each numerically valued variable to a standard range by using a variety of normalization or scaling techniques
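Min-max scaling is one of the normalization techniques this card refers to. A minimal sketch, assuming a target range of 0 to 1; min_max_scale and the income values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: every value maps to the lower bound
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

# Hypothetical incomes
scaled = min_max_scale([20000, 50000, 80000])   # [0.0, 0.5, 1.0]
```

Z-score standardization (subtract the mean, divide by the standard deviation) is a common alternative when outliers would compress a min-max range.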
Data transformation, discretize or aggregate data popular methods?
Convert the numeric variables into discrete representations using range-based or frequency-based binning techniques
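The two binning styles can be sketched as follows: range-based (equal-width) bins split the value range into equal intervals, while frequency-based (equal-frequency) bins put roughly the same number of records in each bin. The function names and the ages are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index based on equal-sized value ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indexes so each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical ages
ages = [18, 22, 25, 31, 47, 90]
width_bins = equal_width_bins(ages, 3)       # [0, 0, 0, 0, 1, 2]
freq_bins = equal_frequency_bins(ages, 3)    # [0, 0, 1, 1, 2, 2]
```

With skewed data such as these ages, equal-width bins leave most records in the first bin, which is why frequency-based binning is often preferred.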
Data transformation, construct new attributes popular methods?
Derive new and more informative variables from the existing ones using a wide range of mathematical functions
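As an illustration of attribute construction, the sketch below derives a ratio from two existing variables. The field names (income, debt, debt_to_income) and the helper add_derived are hypothetical, chosen only for the example.

```python
def add_derived(record):
    """Return a copy of the record with a derived debt-to-income ratio."""
    enriched = dict(record)   # leave the original record untouched
    enriched["debt_to_income"] = record["debt"] / record["income"]
    return enriched

# Hypothetical loan applicant
applicant = {"income": 50000, "debt": 10000}
enriched = add_derived(applicant)
```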
Data reduction subtasks?
Reduce number of attributes
Reduce number of records
Balance skewed data
Data reduction, reducing the number of attributes popular methods?
Principal component analysis
Independent component analysis
Chi-square testing
Correlation analysis
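Of the methods listed, correlation analysis is the simplest to sketch: when two attributes are almost perfectly correlated, one of them adds little information and can be dropped. The 0.95 threshold, the function names, and the sample columns are assumptions for illustration; the sketch also assumes no column is constant (that would make the correlation undefined).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data set: height is recorded twice in different units
data = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "shoe_size": [40, 37, 44, 39],
}
kept = drop_correlated(data)   # ["height_cm", "shoe_size"]
```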
Data reduction, reducing the number of records popular methods?
Random sampling
Stratified sampling
Expert-knowledge-driven purposeful sampling
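Stratified sampling draws the same fraction from every subgroup so that the sample keeps the class proportions of the full data set. A minimal sketch, assuming records are tuples whose first element is the stratum label; stratified_sample and the records are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the given fraction from each stratum, at least one record per stratum."""
    rng = random.Random(seed)          # fixed seed for a repeatable sample
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 80 records of class "A", 20 of class "B"
records = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
# The sample holds 8 "A" records and 2 "B" records, preserving the 80/20 split
```

Plain random sampling would use rng.sample(records, k) on the whole list and could, by chance, under-represent the smaller class.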
Data reduction, balancing skewed data popular methods?
Oversample the less represented or undersample the more represented classes
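Random oversampling can be sketched as duplicating randomly chosen minority-class records until every class is as large as the biggest one. The function name random_oversample, the label_of parameter, and the fraud/ok data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(records, label_of, seed=0):
    """Duplicate records of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(label_of(r) for r in records)
    target = max(counts.values())
    balanced = list(records)
    for label, count in counts.items():
        pool = [r for r in records if label_of(r) == label]
        # Draw with replacement; k is 0 for the largest class
        balanced.extend(rng.choices(pool, k=target - count))
    return balanced

# Hypothetical skewed data: 10 "fraud" records versus 90 "ok" records
records = [("fraud", i) for i in range(10)] + [("ok", i) for i in range(90)]
balanced = random_oversample(records, label_of=lambda r: r[0])
balanced_counts = Counter(r[0] for r in balanced)   # 90 of each class
```

Undersampling is the mirror image: randomly discard records from the larger class instead of duplicating the smaller one.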
Statistics?
A collection of mathematical techniques to characterize and interpret data
Descriptive statistics?
Describing the data as it is
Inferential statistics?
Drawing inferences about the population based on sample data
Mean Absolute Deviation?
Average absolute deviation from the mean
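In symbols, MAD = (1/n) * sum of |x_i - mean| over all n values. A short sketch of that calculation; the function name and the numbers are hypothetical.

```python
def mean_absolute_deviation(values):
    """Average of the absolute distances between each value and the mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

# Mean is 5; absolute deviations are 3, 1, 1, 3; their average is 2
mad = mean_absolute_deviation([2, 4, 6, 8])   # 2.0
```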
Regression?
A part of inferential statistics
The most widely known and used analytics technique in statistics
Used to characterize the relationship between explanatory and response variables
What can regression be used for?
Hypothesis testing
Forecasting
Correlation vs Regression?
Correlation is a single statistic that summarizes how strongly two variables move together, whereas regression is an entire equation fitted to all of the data points and represented by a line
How to develop linear regression models?
Scatter plots
Ordinary least squares method
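For a single explanatory variable, ordinary least squares picks the line y = b0 + b1 * x that minimizes the sum of squared residuals, which gives b1 = Sxy / Sxx and b0 = mean(y) - b1 * mean(x). A minimal sketch; fit_line and the data points are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))   # co-variation of x and y
    sxx = sum((a - mx) ** 2 for a in x)                    # variation of x
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical points lying exactly on y = 1 + 2x
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
intercept, slope = fit_line(x, y)
prediction = intercept + slope * 5   # forecast for x = 5
```

Plotting the points as a scatter plot first is a quick way to check whether a straight line is a reasonable model before fitting one.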
Regression Modelling Assumptions?
Linearity
Independence (of errors)
Normality (of errors)
Constant variance (of errors)
No multicollinearity (among explanatory variables)
What is a report?
Any communication artifact prepared to convey specific information
Functions that a report can fulfill?
To ensure proper departmental functioning
To provide information
To provide the results of an analysis
To persuade others to act
To create an organizational memory
What is a business report?
A written document that contains information regarding business matters
Purpose of business report?
To improve managerial decisions
Source of business report?
Data from inside and outside the organization
Format of business report?
Text + tables + graphs/charts
Distribution of business report?
In-print
Email
Portal
Steps of business report distribution?
Data acquisition -> Information generation -> Decision making -> Process management
Types of Business Reports?
Metric Management Reports
Dashboard-Type Reports
Balanced Scorecard-Type Reports
Data Visualization?
The use of visual representations to explore, make sense of, and communicate data
Information visualization?
Aggregation, summarization, and contextualization of data
Types of dimension reduction?
Variable Selection
Principal Components
Multi-dimensional scaling