Week 1 Flashcards
L 1.1.x
What are some alternative names for the process of data mining?
- Knowledge Discovery in Databases
- Knowledge Extraction
- Data/Pattern Analysis
- Data Archeology
- Data Dredging
- Information Harvesting
Why do we need data mining?
To handle the explosive growth of data. By 2025, roughly 100 zettabytes of data are projected to be generated worldwide.
1 zettabyte is 1 billion terabytes.
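The unit conversion can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units, where 1 terabyte is 10^12 bytes and 1 zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: SI, not binary, units).
TERABYTE = 10**12
ZETTABYTE = 10**21

# How many terabytes fit in one zettabyte?
terabytes_per_zettabyte = ZETTABYTE // TERABYTE
print(terabytes_per_zettabyte)  # 1000000000, i.e. 1 billion

# The ~100 ZB estimate from the card, restated in terabytes.
estimated_terabytes = 100 * terabytes_per_zettabyte
print(estimated_terabytes)
```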
What is one of the purposes of data mining?
To extract knowledge from the data.
What are Jim Gray’s four paradigms of Science?
- Empirical/Experimental Science (before the 1600s)
- Theoretical Science (1600s to 1950s)
- Computational Science (1950s to 1990s)
- Data-Intensive Science, “eScience” (2000s onward)
What is another name for data mining?
Knowledge Discovery from Data
Sunita Sarawagi’s definition of data mining?
The process of semi-automatically analyzing large databases to find patterns that are:
• Valid: hold on new data with some certainty
• Novel: non-obvious to the system
• Useful: should be possible to act on the item
• Understandable: humans should be able to interpret the pattern.
Jiawei Han’s definition of data mining?
The extraction of interesting [non-trivial, implicit, previously unknown, potentially useful] patterns or knowledge from huge amounts of data.
Vipin Kumar’s definition of data mining?
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
Not everything is data mining. Give some examples of tasks that are not.
- Looking up a phone number in a phone directory.
- Querying a search engine for data.
- Deductive or Expert Systems.
L 1.2.x
Name concepts related to but different from data mining.
- Machine Learning and Pattern Recognition
  - Techniques utilized in data mining processes.
- RDBMS, Data Warehouses
  - The systems that support data mining.
- Big Data Analytics, Data Science
  - Data mining is a key component of these broad fields.
- Business Intelligence
  - The application of data mining in business.
What is the database view of data mining?
The processes and techniques that connect data warehouses to the discovery of patterns.
- Transactional/Operational databases to
- Data Cleaning to
- Data Warehousing to
- Task-Relevant Data/Data Selection to
- DATA MINING to
- Pattern discovery to
- Evaluation to
- Knowledge
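The chain above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the records, the store names, the function names, and the choice of pair counting as the "data mining" step are all hypothetical and not taken from the lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical operational records: (store, items). None marks a broken record.
raw_transactions = [
    ("north", ["bread", "milk"]),
    ("north", ["bread", "milk", "eggs"]),
    ("south", ["tea"]),
    ("north", None),
    ("north", ["milk", "eggs"]),
]

def clean(records):
    """Data cleaning: drop records with a missing or empty item list."""
    return [(store, items) for store, items in records if items]

def select(records, store):
    """Data selection: keep only the task-relevant baskets, as sets."""
    return [set(items) for s, items in records if s == store]

def mine_pairs(baskets):
    """Data mining: count how often each pair of items co-occurs."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support):
    """Evaluation: keep pairs whose support meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}

warehouse = clean(raw_transactions)            # stands in for the data warehouse
baskets = select(warehouse, store="north")     # task-relevant data
patterns = mine_pairs(baskets)                 # discovered patterns
knowledge = evaluate(patterns, len(baskets), min_support=0.6)
print(knowledge)
```

Each function maps to one stage of the chain, so the output of one stage is the input of the next.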
What is the machine learning view?
- Input Data
- Pre-Processing
  - Data Integration
  - Extraction
  - Normalization
  - Feature Selection
  - Dimension Reduction
- Model Training
  - Supervised Learning
  - Unsupervised Learning
  - Semi-supervised Learning
  - Reinforcement Learning
- Post-Processing
  - Model testing/evaluation
  - Model selection
  - Model interpretation
  - Model visualization
- Classification/Prediction/Ranking
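A small sketch of the stages listed above: pre-processing (normalization), model training (supervised), and post-processing (evaluation on held-out data). The data, the feature names, and the nearest-centroid classifier are assumptions chosen to keep the example dependency-free; the lecture does not prescribe a particular model.

```python
# Hypothetical labelled data: (height_cm, weight_kg) -> label.
train = [((150.0, 50.0), "small"), ((155.0, 55.0), "small"),
         ((185.0, 90.0), "large"), ((190.0, 95.0), "large")]
held_out = [((152.0, 52.0), "small"), ((188.0, 92.0), "large")]

def fit_min_max(rows):
    """Pre-processing: learn per-feature (min, max) from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def normalize(row, bounds):
    """Scale each feature to [0, 1] using the training bounds."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(row, bounds))

def train_centroids(rows, labels):
    """Model training (supervised): average the points of each class."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: tuple(sum(col) / len(col) for col in zip(*points))
            for label, points in grouped.items()}

def predict(row, centroids):
    """Classification: pick the class whose centroid is nearest."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(row, centroids[label]))

bounds = fit_min_max([x for x, _ in train])
centroids = train_centroids([normalize(x, bounds) for x, _ in train],
                            [y for _, y in train])

# Post-processing: evaluate the model on data it has not seen.
predictions = [predict(normalize(x, bounds), centroids) for x, _ in held_out]
accuracy = sum(p == y for p, (_, y) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

Note that the normalization bounds are learned from the training rows only and then reused on the held-out rows, so no information leaks from evaluation data into training.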
What is the Business Intelligence view?
Business Intelligence (bottom up)
- Lower level
  - Data sources
  - Preprocessing/Integration/Warehousing
- Middle level
  - Data Mining
- Higher level
  - Knowledge evaluation
  - Presentation
  - Business Decisions
What is the Human-Centered view (as presented by UM)?
- For the People
  - Data mining for social good
  - End-user applications
- Of the People
  - Bring users into the loop
  - Wisdom of the crowd
- By the People
  - Information about people
  - Generated data
  - Process owned by people
• Selection, Detection, Characterization, Explanation, Prediction, Intervention
What is the first dimension of data mining?
The Data to be mined (inputs).
- type/representation: vectors/matrices, sequences, time-series, spatiotemporal, data streams, or graphs
- genre/application: transactional data, text and web, multimedia, social and information networks, biological data, or user behaviors
What is the second dimension of data mining?
Knowledge to be discovered
- Data mining functionalities
  - Lower-level output, such as patterns of data, the similarity of data, or association of data.
  - Decision-driven output, such as classification, clustering, trend/deviation, prediction, and outlier analysis.
• Can be descriptive or predictive
What is the third dimension of data mining?
Techniques Utilized
- Data cubing
- Machine learning
- Statistics
- Pattern recognition
- User modeling
- Visualization
- Data-intensive computing
What is the fourth dimension of data mining?
Application of Data Mining
- Retail: advertising, marketing
- Telecommunications: spam call detection
- Banking: loan approvals, credit score estimates
- Social Networks: Facebook, Twitter
- Scientific Discovery: Biological data mining
- Web Search: smart question answering
- Stock Market Analysis: stock picking
- Text Mining: Natural Language Processing
- Clinical: health informatics
L 1.3.x
What is the name of the object that bridges the gap between typical data structures and those required for data mining?
Data Representation (DR).
DR is a mathematical way to represent data.
What are the three V’s of big data?
Volume, Variety, and Velocity
What else should be noted about Data Formulation?
- There are more data science applications than one might expect.
- There are not so many basic data types.
- How do we abstract, formulate, and represent the data in real applications?
What questions does a Data Scientist look to answer when observing data?
- What is the basic object of information?
- What are the properties/attributes of the data object?
- How are the attributes structured?
- How are values assigned to the attributes?
- How are the different data objects related?
A Data Scientist must be able to answer these questions in a mathematical way!
• This is the task and purpose of data representation.
Name several types of data representation.
- Item Set
- Vector / Matrix
- Sequences
- Time Series
- Spatial
- Spatiotemporal
- Graph / Network
- Stream
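To make the types above concrete, here is one possible way to hold each of them in plain Python built-ins. All values, names, and coordinates are made up for illustration; real systems would typically use dedicated libraries for matrices, graphs, and streams.

```python
from itertools import islice

# Item set: unordered, duplicates ignored.
item_set = {"bread", "milk", "eggs"}

# Vector / matrix: fixed-length numeric attributes, one row per data object.
vector = [5.1, 3.5, 1.4]
matrix = [[5.1, 3.5, 1.4], [4.9, 3.0, 1.3]]

# Sequence: order matters, but there are no timestamps.
sequence = ["home", "search", "product", "checkout"]

# Time series: values indexed by time, as (time step, value) pairs.
time_series = [(0, 20.1), (1, 20.4), (2, 21.0)]

# Spatial: values tied to a location, keyed by (latitude, longitude).
spatial = {(42.28, -83.74): "library"}

# Spatiotemporal: location plus time, as (time step, latitude, longitude).
spatiotemporal = [(0, 42.28, -83.74), (1, 42.29, -83.73)]

# Graph / network: adjacency list mapping each node to its neighbours.
graph = {"ann": ["bob", "cat"], "bob": ["ann"], "cat": ["ann"]}

# Stream: possibly unbounded, consumed one element at a time.
def sensor_stream():
    reading = 0
    while True:
        yield reading
        reading += 1

# Only a finite prefix of a stream can ever be materialised.
first_readings = list(islice(sensor_stream(), 3))
print(first_readings)
```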
Describe an Item Set.
- Data Object: shopping basket, text, BoD
- Item xi: a product, a word (bag of words, word cloud), a person
- X = {x1, x2, x3, …, xk}
- An item xi belongs to X if and only if that categorical item appears in the set.
- Order and count do not matter.
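Python's built-in set behaves like the item set described here, so it makes a convenient sketch. The baskets and the sentence below are invented examples.

```python
# Two hypothetical shopping baskets with different order and repeated items.
basket_a = ["milk", "bread", "milk", "eggs"]
basket_b = ["eggs", "bread", "milk"]

# As item sets, order and count are discarded.
X_a = set(basket_a)
X_b = set(basket_b)

print(X_a == X_b)      # True: same items, so the same item set
print("milk" in X_a)   # True: membership means the item appears in the set
print("tea" in X_a)    # False

# Text treated as an item set: each distinct word is an item.
text = "the cat saw the other cat"
words = set(text.split())
print(len(words))      # 4 distinct words
```

If counts do matter (for example, word frequencies), an item set is no longer the right representation and a vector of counts is a better fit.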