Exam Flashcards
What is data science?
Data science is an interdisciplinary field focused on extracting knowledge and insights from data. It combines computation, statistics, scientific methods, and domain knowledge. (Domain knowledge is the understanding of a specific industry, discipline, or activity.)
Findings in data science draw on different domain areas and are often used to drive business decisions.
What is Exploration, Inference and Prediction in Data science?
Exploration
Initial phase where focus is gaining a preliminary understanding of the dataset.
Identifying patterns in information
Uses Visualizations.
Inference
Inference uses data analysis and statistics to draw conclusions from observed data, quantifying whether observed patterns are reliable.
Uses randomization
Prediction
Making informed guesses
Uses machine learning.
Relations: What is association?
“Any relation” == association
If phenomenon x has any relation to y there is an association.
Relations: What is causality?
“lead to” == Causality
If phenomenon x leads to y, there is a causal relationship. While correlation indicates that there is a statistical relationship, it doesn’t necessarily imply causation. Causation implies a direct influence of one variable on another. Data science is very much about looking for cause and effect.
What are the 4 V’s of big data?
Big data is usually characterized by the 4 V’s:
Volume
Size, amount of data. Exceeds the processing capacity of conventional databases. The ability to handle and process large volumes of data efficiently is a fundamental aspect of big data analytics.
Velocity
Represents the speed at which data is generated, collected and processed.
Variety
Refers to the diversity of data types and sources. Is the data structured? Unstructured? Semi-structured? Solutions must be designed to handle this diversity.
Veracity
Deals with the quality and reliability of the data.
What is data mining?
“Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis. Data mining techniques and tools enable enterprises to predict future trends and make more-informed business decisions.”
KDD refers to the extraction of implicit, previously unknown, and potentially useful information from data. What are the 5 steps of KDD?
1. Data selection
Selecting appropriate data from various sources.
2. Data pre-processing
Cleaning the data: removing errors and removing irrelevant data.
3. Transformation
Transforming the data into a format usable by the data mining method, e.g. normalization. Normalization is a technique used in DM to transform the values of a dataset onto a common scale. If a dataset has multiple attributes whose values are on different scales, this may lead to poor models when performing data mining operations.
4. Data mining
The actual application of the appropriate DM methods.
5. Interpretation and analysis
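The normalization mentioned in the transformation step can be sketched as min-max scaling. A minimal example in plain Python (the function name and sample data are illustrative, not from any particular library):

```python
# Min-max normalization: rescale an attribute to the [0, 1] range so that
# attributes measured on different scales contribute comparably to a model.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Income in thousands and age in years live on very different scales,
# but both map onto the same [0, 1] range after normalization.
incomes = [30, 60, 90, 120]
ages = [20, 30, 40, 50]
print(min_max_normalize(incomes))
print(min_max_normalize(ages))
```

Both lists print the same normalized values, which is exactly the point: the original scale no longer dominates.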
What does it mean if a task is predictive?
Predictive
To predict the value of an attribute (the target variable) based on the values of other attributes. Methods for predictive tasks usually fall under the category of supervised learning.
What does it mean if a task is descriptive?
Descriptive
Detects patterns that summarize (describe) the underlying relationships in the data. Methods for descriptive tasks usually fall under the category of unsupervised learning.
What is unsupervised learning?
Unsupervised learning explores patterns and relationships in unlabeled data. While primarily descriptive (it uncovers hidden structures), its findings can indirectly be used in predictive tasks. There is no explicit feedback on the correctness of predictions. Objective: the primary goal is to explore the inherent structure of the data, seeking patterns, groupings, or relationships without relying on predefined output labels. Use cases include, for example, clustering and dimensionality reduction.
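The clustering use case can be sketched with a toy k-means loop on one-dimensional data. This is a minimal illustration in plain Python (the function, the data, and the initial centroid guesses are all made up for the example); note that no labels are ever used:

```python
# A minimal k-means sketch (k = 2, one-dimensional points): repeatedly assign
# each point to the nearest centroid, then recompute each centroid as the mean
# of its cluster. The algorithm only ever sees the unlabeled points.
def kmeans_1d(points, c1, c2, iters=10):
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if a:
            c1 = sum(a) / len(a)
        if b:
            c2 = sum(b) / len(b)
    return sorted([c1, c2])

# Two obvious groups, around 2 and around 10; the initial guesses are arbitrary.
print(kmeans_1d([1, 2, 3, 9, 10, 11], c1=0.0, c2=5.0))  # [2.0, 10.0]
```

The algorithm discovers the two groupings on its own, which is the descriptive, structure-finding character of unsupervised learning.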
What is supervised learning?
Supervised learning. In supervised learning, the algorithm is trained on a labeled dataset, where each input is paired with the corresponding correct output or label. The algorithm learns the mapping between inputs and outputs and receives feedback during training. Objective: the primary goal is to learn a mapping or relationship between inputs and outputs based on the labeled training data; the learned model can then make predictions on new, unseen data. Use cases include classification and regression (predicting a continuous output).
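The labeled-data idea can be illustrated with the simplest supervised classifier there is, 1-nearest-neighbour. A minimal sketch in plain Python (the function name and the labelled pairs are invented for the example):

```python
# 1-nearest-neighbour classification: the "training data" is a list of
# labelled (input, label) pairs; prediction returns the label of the
# closest training input. The labels are the supervision signal.
def predict_1nn(training, x):
    nearest = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Labelled 1-D training data: small values are "low", large values are "high".
labelled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
print(predict_1nn(labelled, 1.5))  # "low"
print(predict_1nn(labelled, 8.5))  # "high"
```

Unlike the clustering example, the correct answers ("low"/"high") are given up front, and the model's job is to generalize that mapping to new inputs.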
What is CRISP-DM
CRoss-Industry Standard Process for Data Mining: Open standard and can be used freely. Intended as a model for best practice. Modeled as an ongoing, iterative cycle.
The Model outlines the stages involved in a typical data mining project. CRISP-DM provides a structured framework for guiding organizations and data scientists through the data mining process. The process is iterative.
What are the 6 steps of CRISP-DM?
1. Business Understanding
Understand the business objectives and goals of the DM project: define the problem, understand the requirements and scope. Produce a plan.
2. Data Understanding
Build an understanding of the available data: explore the nature of the data, its relationships, and potential issues. This includes initial data preprocessing and verifying data quality.
3. Data Preparation
Transform the data to make it suitable for analysis: merge data from different sources and deal with missing data.
4. Modeling
Various DM techniques are applied to build and train models based on the dataset.
5. Evaluation
Assess the performance of the models against the business objectives. This step is closely tied to step 1, Business Understanding.
6. Deployment
The successful models are implemented into the operational environment. Plan monitoring and maintenance, and produce documentation.
what is the DIKW pyramid?
“Refers loosely to a class of models for representing purported structural and/or functional relationships between data, information, knowledge, and wisdom. Typically information is defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge.” - Wikipedia
What is ordinal (qualitative) data?
Ordinal (qualitative)
Ordinal attributes represent categories with a clear order or ranking, but the intervals between the categories may not be uniform or meaningful.
What is nominal (qualitative) data?
Nominal (qualitative)
Nominal data could be, for example, labels or names, even when denoted by numerical values. Operations based on arithmetic are not applicable here. It can be binary as well, e.g. TRUE/FALSE. When transforming the data, labels can be freely changed, e.g. green = 1, blue = 2.
Datasets. Terminology:
Dimensionality
The number of attributes, Dimension reduction may occur in some processing.
Datasets. Terminology:
Sparsity
Sparsity:
Often associated with asymmetric features -> refers to the proportion of zero or empty values in a dataset. Sparsity increases as more values are zero or missing.
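Sparsity can be computed directly as the fraction of zero cells. A small sketch in plain Python (the function name and sample matrix are illustrative):

```python
# Sparsity as the fraction of zero entries in a (row-major) data matrix.
def sparsity(matrix):
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)

# A document-term-style matrix where most counts are zero is highly sparse.
data = [[0, 0, 3, 0],
        [1, 0, 0, 0]]
print(sparsity(data))  # 6 of 8 cells are zero -> 0.75
```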
Datasets. Terminology:
Resolution
Resolution:
Level of detail -> e.g. higher resolution could mean more decimal places, enabling more precise calculations.
Datasets. Terminology:
Record Data
Record Data:
When each record represents a distinct unit of information.
Datasets. Terminology:
Web scraping
Web scraping: Web scraping allows us to programmatically extract data from public web pages, and usually provides semi-structured data. “Social listening” can potentially give early warnings of events that are not yet reported in the media, by scraping user-generated information off the internet.
Datasets. Terminology:
Data exhaust:
Data exhaust: Data exhaust is the trail of activity, or residual data, left behind by some other kind of business or computing process, e.g. transactions, calls, locations.
Terminology
Document term matrix:
A matrix that represents the frequency of terms/words in a collection of documents: one row per document, one column per term.
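Building such a matrix is mechanical: collect the vocabulary, then count each term per document. A small sketch in plain Python (the two example documents are invented):

```python
# Build a document-term matrix: rows are documents, columns are the terms of
# the (sorted) vocabulary, and each cell counts how often a term occurs.
from collections import Counter

docs = ["data mining finds patterns",
        "data science uses data"]

vocab = sorted({word for doc in docs for word in doc.split()})
matrix = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Note how sparse the result already is for just two short documents; real corpora produce matrices that are overwhelmingly zeros, which ties back to the sparsity term above.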
What is
Machine learning:
Machine learning is about methods that can be used to improve the performance of an intelligent agent over time, based on stimuli (data) from the environment.
what is the difference between supervised and unsupervised learning?
In supervised learning, the algorithm is trained on a labeled dataset, where each input has a corresponding output. The goal is to learn a mapping from inputs to outputs. In unsupervised learning, the algorithm is given unlabeled data and must find patterns or relationships within the data without explicit guidance on the output.
What is linear regression (supervised learning)?
Fitting a line to describe the relationship between variables. “The goal of linear regression is to find the best-fitting linear relationship that can be used for making predictions.”
Main idea: find the line that best fits the data. (If classes of points can be separated by a line, a linear model can also be used to classify data points, but that is linear classification rather than regression.)
Linear regression is best suited for problems where the goal is to predict a continuous value.
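For a single predictor, the best-fitting line has a closed form. A minimal sketch in plain Python (the function name and sample points are illustrative):

```python
# Ordinary least squares for one predictor: fit y = slope * x + intercept
# using the closed-form formulas based on means and deviations.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# These points lie exactly on y = 2x + 1, so the fit recovers those parameters.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

With noisy data the recovered slope and intercept would only approximate the underlying relationship, and the fitted line is then used for prediction.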