09 - Data Analysis Flashcards

1
Q

What is data analysis?

A

Data analysis is the process of collecting, transforming, cleaning, and modeling data with the goal of discovering the required information.

The idea is to separate a whole into its component parts. For example, in criminal data analysis:

  • Build crime patterns to detect and prevent crimes
  • Break down crime patterns to evaluate when, where, and why crimes occur
2
Q

What are data limitation issues?

A
  • Missing data
    • Missing altogether
    • Specific items
  • Altered data
    • At rest, in transit, or while in someone's possession
    • Intentional vs. unintentional
  • Different definitions of the same data (possibility of misunderstanding)
  • Non-existent data (the data of interest does not exist or is not in a usable form)
  • Data is available in different forms
3
Q

What are the four steps of data analysis?

A
  1. Planning
    • Identification of data, data availability & quality
  2. Data Collection (Inventory / Pre-Processing)
    • Improving data quality
    • Information gathering
  3. Data Preparation
    • Structure data for analysis tool
    • Data cleaning
    • Depends on type of data
  4. Data Analysis
    • Visualization
    • Statistics
    • Relationship linking
4
Q

What are the major tasks of data pre-processing?

A
  1. Data summarisation
    • Understand the distribution of data
  2. Data cleaning
    • Fill in missing values, remove outliers / noisy data
  3. Data integration
    • Integrate multiple data sources
  4. Data transformation
    • Aggregation of data
  5. Data reduction
    • Keep only what is relevant
  6. Data discretisation
    • Reduction focused on numeric values (e.g. mapping ranges of numbers to intervals)
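
Two of the tasks above, data cleaning and data discretisation, can be illustrated with a minimal Python sketch. The function names, the choice of the median as the fill value, and the age bins are assumptions made for illustration only; they are not part of the original card.

```python
from statistics import median

def fill_missing(values):
    """Data cleaning: replace None with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

def discretise(values, edges, labels):
    """Data discretisation: map each number to the label of its bin.

    `edges` holds the exclusive upper bound of every bin except the last one.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                result.append(label)
                break
        else:
            # No upper bound matched, so the value falls into the last bin.
            result.append(labels[-1])
    return result

# Hypothetical sample: ages with one missing entry.
ages = [17, None, 35, 70, 42]
cleaned = fill_missing(ages)
groups = discretise(cleaned, edges=[18, 65], labels=["minor", "adult", "senior"])
print(cleaned)
print(groups)
```

Filling with the median rather than the mean is one possible choice; it keeps a single extreme value from distorting the filled-in entries.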
5
Q

What are the four measures of central tendency?

A
  • Mean (sensitive to extreme values)
  • Median (robust to outliers)
  • Midrange ((min + max) / 2)
  • Mode (value that occurs most frequently)
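
The four measures above can be computed with Python's standard `statistics` module. This is a minimal sketch; the helper name `central_tendency` and the sample amounts are made up for illustration.

```python
from statistics import mean, median, multimode

def central_tendency(values):
    """Return mean, median, midrange and mode(s) of a non-empty list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "midrange": (min(values) + max(values)) / 2,
        "modes": multimode(values),  # a list, because several values can tie
    }

# Hypothetical sample with one extreme value (400).
amounts = [20, 25, 25, 30, 400]
summary = central_tendency(amounts)
print(summary)
```

In this sample the single extreme value pulls the mean and the midrange far above the typical amount, while the median and the mode stay at 25.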
6
Q

What is the five number summary?

A

The five-number summary is a set of descriptive statistics that provides information about a dataset. It consists of the five most important sample percentiles:

minimum, Q1 (first quartile), median, Q3 (third quartile), maximum

The five-number summary can be used, for example, to examine the distribution of credit card data and to flag outliers as potentially fraudulent values.
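
As a minimal sketch, the summary and a simple outlier check can be written with `statistics.quantiles`. The 1.5 × IQR fence (Tukey's rule), the "inclusive" quantile method, and the sample amounts are assumptions chosen for illustration; the card itself does not specify how outliers are determined.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, assumed here)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical transaction amounts with one suspicious value.
amounts = [10, 12, 14, 15, 18, 20, 22, 25, 300]
print(five_number_summary(amounts))
print(iqr_outliers(amounts))
```

A flagged value is only a candidate for closer inspection; whether it is actually fraudulent has to be established separately.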

7
Q

What is a data warehouse?

A

A data warehouse (DW), also known as an enterprise data warehouse, is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources.

8
Q

What is behaviour profiling?

A

Behaviour profiling is the detection of suspicious behaviour by analyzing databases. It provides the capability to recognize patterns of criminal activity and to predict when and where crimes are likely to take place.

9
Q

What is data mining and what is the ultimate goal?

A

Data mining (also called knowledge discovery in databases) is the extraction of interesting (hidden) information or patterns from data in large databases.

The ultimate goal of data mining is the prediction of human behaviour.

10
Q

What are potential applications for data mining?

A
  • Database analysis and decision support
  • Market analysis and management
  • Risk analysis and management
  • Fraud detection and management
  • Text mining (newsgroups, email, documents)
  • Web analysis
  • Intelligent query answering
11
Q

How can data mining help to detect fraud?

A

Data mining can help detect fraud by using historical data to build models of fraudulent behavior and then applying data mining techniques to identify similar instances.
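
The idea of building a model from historical data and then matching new instances against it can be sketched with a deliberately tiny classifier. Everything here is a toy assumption: the nearest-class-mean rule, the function names, and the labelled history are invented for illustration and are far simpler than a real fraud model.

```python
from statistics import mean

def fit_class_means(history):
    """Build a toy model: the mean transaction amount per label."""
    by_label = {}
    for amount, label in history:
        by_label.setdefault(label, []).append(amount)
    return {label: mean(amounts) for label, amounts in by_label.items()}

def predict(model, amount):
    """Assign the label whose mean amount is closest to the new amount."""
    return min(model, key=lambda label: abs(amount - model[label]))

# Hypothetical labelled history: (amount, label) pairs.
history = [(20, "legit"), (35, "legit"), (50, "legit"), (900, "fraud"), (1100, "fraud")]
model = fit_class_means(history)

print(predict(model, 40))
print(predict(model, 950))
```

A realistic model would use many more features than the amount alone, but the two phases, learning from labelled history and scoring new instances, stay the same.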

12
Q

Name a few different data mining techniques.

A
  • Link analysis
  • Intelligent agents
  • Text mining
  • Neural Networks
  • Machine-learning algorithms
13
Q

What is clustering?

A

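Clustering is a technique to group a set of data objects into clusters, so that objects in the same cluster are more similar to each other than to objects in other clusters.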
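
One common way to do this is k-means, shown here as a minimal one-dimensional sketch. The choice of k-means, the function name `kmeans_1d`, the starting centroids, and the sample points are all assumptions for illustration; the card does not name a specific clustering algorithm.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around the nearest centroid (plain k-means sketch)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical data with two obvious groups.
points = [1, 2, 3, 50, 51, 52]
clusters, centroids = kmeans_1d(points, centroids=[1, 52])
print(clusters)
print(centroids)
```

The result of k-means depends on the starting centroids, which is why they are passed in explicitly here instead of being chosen at random.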

14
Q

What are the three main components of link analysis?

A
  • Entities
  • Events
  • Associations
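
The three components can be represented as a small data structure: entities and events are names, and each association links one entity to one event. The sketch below is illustrative only; the names, the events, and the `linked_entities` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical associations: (entity, event) pairs.
associations = [
    ("Alice", "burglary-01"),
    ("Bob", "burglary-01"),
    ("Bob", "fraud-07"),
    ("Carol", "fraud-07"),
]

def linked_entities(associations, entity):
    """Return the entities that share at least one event with `entity`."""
    by_event = defaultdict(set)
    for who, event in associations:
        by_event[event].add(who)
    linked = set()
    for members in by_event.values():
        if entity in members:
            linked |= members - {entity}
    return linked

print(sorted(linked_entities(associations, "Bob")))
print(sorted(linked_entities(associations, "Alice")))
```

Here Bob connects the two events, so he is linked to both Alice and Carol, while Alice is directly linked only to Bob.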
15
Q

Name a few data mining tools.

A
  • Open-Source
    • Weka
    • Orange
    • RapidMiner
    • NLTK
  • Commercial
    • SAS
    • IBM SPSS
  • Web site
    • kdnuggets.com
16
Q

What is the data mining process?

A
  1. Creating a target data set
  2. Data Cleaning & Pre-Processing
  3. Data reduction & transformation
  4. Data Mining
  5. Knowledge Presentation
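
The five steps can be sketched as a chain of small functions, one per step. This is a toy pipeline under stated assumptions: the records, the `city` field, the frequency-count "mining" step, and the `min_support` threshold are all invented for illustration.

```python
from collections import Counter

def select_target(records, field):
    """1. Create a target data set: pick the field of interest."""
    return [r.get(field) for r in records]

def clean(values):
    """2. Cleaning and pre-processing: drop missing values, normalise text."""
    return [v.strip().lower() for v in values if v and v.strip()]

def reduce_and_transform(values):
    """3. Reduction and transformation: aggregate values into counts."""
    return Counter(values)

def mine(counts, min_support):
    """4. Data mining: keep values that occur at least `min_support` times."""
    return {v: n for v, n in counts.items() if n >= min_support}

def present(patterns):
    """5. Knowledge presentation: readable lines, most frequent first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{v}: {n}" for v, n in ordered]

# Hypothetical raw records with inconsistent spelling and a missing value.
records = [
    {"city": "Bern "}, {"city": "bern"}, {"city": "Zurich"},
    {"city": None}, {"city": "BERN"}, {"city": "zurich"}, {"city": "Basel"},
]
counts = reduce_and_transform(clean(select_target(records, "city")))
report = present(mine(counts, min_support=2))
print(report)
```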