Data Science Flashcards
What is data science?
Data science is a field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves the use of statistical analysis, machine learning, and data visualization techniques to explore and understand data, and to discover patterns and relationships that can be used to make informed decisions.
What are the 3 main knowledge areas used in data science?
- Math and statistics
- Computer science
- General knowledge of the domain/area you work in.
What are the 5 steps in the data science process?
- Data exploration and preparation
- Data representation and transformation
- Computing with data
- Data modeling
- Data visualization and presentation
What does the first step of the data science process consist of?
Data exploration and preparation
Many datasets contain anomalies and artifacts. Exploratory data analysis, noise removal, missing-value treatment, outlier identification, and correcting data inconsistencies are all part of the process of data exploration and preparation.
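These steps can be sketched in a few lines of pandas. This is a minimal illustration, not a recipe: the dataset, the column names, and the 0–120 plausible-age rule are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset (all values made up for illustration)
df = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 300],            # a missing value and an outlier
    "city": ["Oslo", "oslo", "Bergen", "Oslo ", "Bergen"],
})

# Missing-value treatment: fill with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Outlier identification: keep only plausible ages (0-120 is an assumed rule)
df = df[df["age"].between(0, 120)]

# Correcting inconsistencies: normalize whitespace and casing
df["city"] = df["city"].str.strip().str.title()
```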
What does the second step of the data science process consist of?
Data representation and transformation
Data transformation is restructuring the originally given data into a new and more revealing form.
Processes such as data integration, data migration, data warehousing, and data wrangling all may involve data transformation.
An enterprise can choose among a variety of ETL tools that automate the process of data transformation. Data analysts, data engineers, and data scientists also transform data using scripting languages such as Python or domain-specific languages like SQL.
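A common transformation is reshaping a long table into a wide, more revealing form. A small sketch with pandas (the sales data is made up; a GROUP BY or PIVOT in SQL would do the same job inside a database):

```python
import pandas as pd

# Hypothetical long-format sales records (values assumed for illustration)
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# Restructure: one row per region, one column per quarter
wide = sales.pivot(index="region", columns="quarter", values="revenue")
```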
What does the third step of the data science process consist of?
Computing with data
Every data scientist should know and use several languages for data analysis and data processing. These can include popular languages like R and Python, but also specific languages for transforming and manipulating text, and for managing complex computational pipelines.
Cluster and cloud computing and the ability to run massive numbers of jobs on such clusters has become an overwhelmingly powerful ingredient of the modern computational landscape.
Machine learning is a popular way to compute with data.
What does the fourth step of the data science process consist of?
Data modeling
Data science involves predictive modeling, in which one constructs methods that predict well over some given data universe, i.e. a very specific, concrete dataset. This roughly coincides with modern machine learning and its industrial offshoots.
What does the fifth step of the data science process consist of?
Data visualization and presentation
Data visualization at one extreme overlaps with the very simple plots of EDA - histograms, scatterplots, time series plots - but in modern practice it can be taken to much more elaborate extremes.
Data Visualization and presentation involves creating plots, graphs, and dashboards for monitoring data processing pipelines that access streaming or widely distributed data.
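The simple EDA end of that spectrum can be sketched in a few lines. This is a minimal example using matplotlib and synthetic random data; the file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data standing in for a real dataset
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)                  # histogram: the classic EDA plot
ax1.set_title("Histogram")
ax2.scatter(values[:-1], values[1:], s=5)  # scatter plot of consecutive values
ax2.set_title("Scatter plot")
fig.savefig("eda_plots.png")
```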
What are the 5 processes in a data science project?
- Defining the question/problem you’re solving
- Exploratory data analysis
- Model building and evaluation, including splitting the data into training and test sets
- Review of data and conclusions: categorize, manipulate, and summarize the information in order to answer critical questions
- Communicating results and telling the story the data tells
What is a linear regression?
Linear regression is a statistical method used to model the linear relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.
A scatter plot is a very common way to visualize the relationship between the dependent and independent variables.
In its simplest form, the fitted line is y = mx + b, where m is the slope and b is the y-intercept.
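The line y = mx + b can be fitted with the closed-form ordinary-least-squares formulas. A minimal sketch in plain Python, using made-up points that lie exactly on y = 2x + 1:

```python
# Fit y = m*x + b by ordinary least squares (closed-form solution)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up data, exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by variance of x
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
# Intercept: the line passes through the point of means
b = mean_y - m * mean_x
```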
What is AI (artificial intelligence)?
Artificial intelligence (AI) refers to the ability of a machine or computer system to perform tasks that would normally require human intelligence, such as understanding language, recognizing patterns, learning from experience, and making decisions.
What is unsupervised learning in machine learning?
Unsupervised learning is a type of machine learning where the model is not given any labeled training data and is instead asked to discover patterns and relationships in the data on its own. It is called unsupervised learning because the model is not given any guidance or supervision on what to look for in the data.
Unsupervised learning is often used to explore and understand complex datasets, and to identify hidden patterns and relationships in the data. It can be used to discover clusters or groups of similar data points, or to detect anomalies or outliers in the data.
Some common techniques used in unsupervised learning include clustering, dimensionality reduction, and anomaly detection.
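As a minimal illustration of clustering, here is a tiny 1-D k-means with k = 2 in plain Python. The points and the naive initialization are assumptions for the example; note that no labels are provided, the groups emerge from the data alone.

```python
# Tiny 1-D k-means sketch (k = 2): two obvious groups, no labels given
points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.9]
centers = [points[0], points[3]]  # naive initialization

for _ in range(10):  # a few fixed iterations is enough here
    # Assignment step: each point joins its nearest center
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update step: move each center to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]
```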
What is supervised learning in machine learning?
Supervised learning is a type of machine learning where the model is given labeled training data and is asked to make predictions or decisions based on that data. The training data consists of a set of input examples and the corresponding correct output labels, and the goal of supervised learning is to learn a function that can map the input examples to the correct output labels.
For example, in a supervised learning task to classify emails as spam or not spam, the input examples might be the content of the emails, and the output labels would be “spam” or “not spam.” The model would be trained on a large dataset of labeled emails, and then would be able to predict whether a new, unseen email is spam or not based on its content.
Supervised learning is one of the most widely used types of machine learning, and it has a wide range of applications, including image and speech recognition, natural language processing, and predictive modeling.
Regression and classification are common supervised learning models.
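A minimal sketch of a supervised classifier: 1-nearest-neighbour on labelled 2-D points (the training data and labels are made up). It does nothing beyond memorizing the labelled examples, but it shows the supervised pattern of mapping inputs to known output labels:

```python
# Labelled training examples: (features, label) pairs, values made up
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

def predict(x, y):
    """Return the label of the closest training example (1-NN)."""
    def sq_dist(example):
        (px, py), _ = example
        return (px - x) ** 2 + (py - y) ** 2
    return min(train, key=sq_dist)[1]
```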
What are the main differences between supervised and unsupervised learning?
In supervised learning, the model is given labeled training data, which consists of a set of input examples and the corresponding correct output labels. The goal of supervised learning is to learn a function that can map the input examples to the correct output labels.
In unsupervised learning, on the other hand, the model is not given any labeled training data. Instead, it is asked to discover patterns and relationships in the data on its own.
In supervised learning, the model is trained to predict a specific output label based on the input data. For example, a supervised learning model might be trained to predict whether an email is spam or not spam based on its content.
In unsupervised learning, the model is not given any specific output labels to predict. Instead, it is asked to discover patterns and relationships in the data and to group similar data points together. For example, an unsupervised learning model might be used to cluster a dataset of customer data into different groups based on shared characteristics.
What is data wrangling?
Data wrangling is the process of cleaning, transforming, and organizing data for analysis. It is a crucial step in the data science workflow, as it involves preparing the raw data for analysis and modeling, and ensuring that it is in a suitable format and structure for these tasks.
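A small wrangling sketch with pandas (the messy export and column names are assumed): currency strings are cleaned and cast to a numeric type so the column becomes usable for analysis:

```python
import pandas as pd

# Hypothetical messy export: prices stored as formatted strings
raw = pd.DataFrame({
    "item": ["laptop", "monitor"],
    "price": ["$1,200", "$950"],
})

# Wrangling: strip currency symbols and thousands separators, then cast
raw["price"] = (raw["price"]
                .str.replace("$", "", regex=False)
                .str.replace(",", "", regex=False)
                .astype(int))
```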