Data Science Flashcards
What is data science?
Data science is a field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves the use of statistical analysis, machine learning, and data visualization techniques to explore and understand data, and to discover patterns and relationships that can be used to make informed decisions.
What are the 3 main knowledge areas used in data science?
- Math and statistics
- Computer science
- General knowledge of the domain/area you work in.
What are the 5 steps in the data science process?
- Data exploration and preparation
- Data representation and transformation
- Computing with data
- Data modeling
- Data visualization and presentation
What does the first step of the data science process consist of?
Data exploration and preparation
Many datasets contain anomalies and artifacts. Exploratory data analysis, noise removal, missing-value treatment, outlier identification, and correcting data inconsistencies are all part of the process of data exploration and preparation.
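These steps can be sketched in a few lines of pandas. This is a minimal illustration, not a recipe: the dataset, the column names, and the 0–120 plausible-age rule are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset (all values made up for illustration)
df = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 300],            # a missing value and an outlier
    "city": ["Oslo", "oslo", "Bergen", "Oslo ", "Bergen"],
})

# Missing-value treatment: fill with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Outlier identification: keep only plausible ages (0-120 is an assumed rule)
df = df[df["age"].between(0, 120)]

# Correcting inconsistencies: normalize whitespace and casing
df["city"] = df["city"].str.strip().str.title()
```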
What does the second step of the data science process consist of?
Data representation and transformation
Data transformation is restructuring the originally given data into a new and more revealing form.
Processes such as data integration, data migration, data warehousing, and data wrangling all may involve data transformation.
An enterprise can choose among a variety of ETL tools that automate the process of data transformation. Data analysts, data engineers, and data scientists also transform data using scripting languages such as Python or domain-specific languages like SQL.
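A common transformation is reshaping a long table into a wide, more revealing form. A small sketch with pandas (the sales data is made up; a GROUP BY or PIVOT in SQL would do the same job inside a database):

```python
import pandas as pd

# Hypothetical long-format sales records (values assumed for illustration)
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# Restructure: one row per region, one column per quarter
wide = sales.pivot(index="region", columns="quarter", values="revenue")
```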
What does the third step of the data science process consist of?
Computing with data
Every data scientist should know and use several languages for data analysis and data processing. These can include popular languages like R and Python, but also specific languages for transforming and manipulating text, and for managing complex computational pipelines.
Cluster and cloud computing and the ability to run massive numbers of jobs on such clusters has become an overwhelmingly powerful ingredient of the modern computational landscape.
Machine learning is a popular way to compute with data.
What does the fourth step of the data science process consist of?
Data modeling
Data science involves predictive modeling, in which one constructs methods that predict well over some given data universe, i.e. a very specific, concrete dataset. This roughly coincides with modern machine learning and its industrial offshoots.
What does the fifth step of the data science process consist of?
Data visualization and presentation
Data visualization at one extreme overlaps with the very simple plots of EDA - histograms, scatterplots, time series plots - but in modern practice it can be taken to much more elaborate extremes.
Data Visualization and presentation involves creating plots, graphs, and dashboards for monitoring data processing pipelines that access streaming or widely distributed data.
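The simple EDA end of that spectrum can be sketched in a few lines. This is a minimal example using matplotlib and synthetic random data; the file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data standing in for a real dataset
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)                  # histogram: the classic EDA plot
ax1.set_title("Histogram")
ax2.scatter(values[:-1], values[1:], s=5)  # scatter plot of consecutive values
ax2.set_title("Scatter plot")
fig.savefig("eda_plots.png")
```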
What are the 5 processes in a data science project?
- Defining the question/problem you’re solving
- Exploratory data analysis
- Model building and evaluation, including splitting the data into training and test sets
- Review of data and conclusions: categorize, manipulate, and summarize the information in order to answer critical questions
- Communicating results and telling the story the data tells
What is a linear regression?
Linear regression is a statistical method used to model the linear relationship between a dependent variable and one or more independent variables. It is used to predict the value of the dependent variable based on the values of the independent variables.
A scatter plot is a very common way to visualize the relationship between the dependent and independent variables.
In its simplest form, the fitted line is y = mx + b, where m is the slope and b is the y-intercept.
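The line y = mx + b can be fitted with the closed-form ordinary-least-squares formulas. A minimal sketch in plain Python, using made-up points that lie exactly on y = 2x + 1:

```python
# Fit y = m*x + b by ordinary least squares (closed-form solution)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up data, exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by variance of x
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
# Intercept: the line passes through the point of means
b = mean_y - m * mean_x
```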
What is AI (artificial intelligence)?
Artificial intelligence (AI) refers to the ability of a machine or computer system to perform tasks that would normally require human intelligence, such as understanding language, recognizing patterns, learning from experience, and making decisions.
What is unsupervised learning in machine learning?
Unsupervised learning is a type of machine learning where the model is not given any labeled training data and is instead asked to discover patterns and relationships in the data on its own. It is called unsupervised learning because the model is not given any guidance or supervision on what to look for in the data.
Unsupervised learning is often used to explore and understand complex datasets, and to identify hidden patterns and relationships in the data. It can be used to discover clusters or groups of similar data points, or to detect anomalies or outliers in the data.
Some common techniques used in unsupervised learning include clustering, dimensionality reduction, and anomaly detection.
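As a minimal illustration of clustering, here is a tiny 1-D k-means with k = 2 in plain Python. The points and the naive initialization are assumptions for the example; note that no labels are provided, the groups emerge from the data alone.

```python
# Tiny 1-D k-means sketch (k = 2): two obvious groups, no labels given
points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.9]
centers = [points[0], points[3]]  # naive initialization

for _ in range(10):  # a few fixed iterations is enough here
    # Assignment step: each point joins its nearest center
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update step: move each center to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]
```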
What is supervised learning in machine learning?
Supervised learning is a type of machine learning where the model is given labeled training data and is asked to make predictions or decisions based on that data. The training data consists of a set of input examples and the corresponding correct output labels, and the goal of supervised learning is to learn a function that can map the input examples to the correct output labels.
For example, in a supervised learning task to classify emails as spam or not spam, the input examples might be the content of the emails, and the output labels would be “spam” or “not spam.” The model would be trained on a large dataset of labeled emails, and then would be able to predict whether a new, unseen email is spam or not based on its content.
Supervised learning is one of the most widely used types of machine learning, and it has a wide range of applications, including image and speech recognition, natural language processing, and predictive modeling.
Regression and classification are common supervised learning models.
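A minimal sketch of a supervised classifier: 1-nearest-neighbour on labelled 2-D points (the training data and labels are made up). It does nothing beyond memorizing the labelled examples, but it shows the supervised pattern of mapping inputs to known output labels:

```python
# Labelled training examples: (features, label) pairs, values made up
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

def predict(x, y):
    """Return the label of the closest training example (1-NN)."""
    def sq_dist(example):
        (px, py), _ = example
        return (px - x) ** 2 + (py - y) ** 2
    return min(train, key=sq_dist)[1]
```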
What are the main differences between supervised and unsupervised learning?
In supervised learning, the model is given labeled training data, which consists of a set of input examples and the corresponding correct output labels. The goal of supervised learning is to learn a function that can map the input examples to the correct output labels.
In unsupervised learning, on the other hand, the model is not given any labeled training data. Instead, it is asked to discover patterns and relationships in the data on its own.
In supervised learning, the model is trained to predict a specific output label based on the input data. For example, a supervised learning model might be trained to predict whether an email is spam or not spam based on its content.
In unsupervised learning, the model is not given any specific output labels to predict. Instead, it is asked to discover patterns and relationships in the data and to group similar data points together. For example, an unsupervised learning model might be used to cluster a dataset of customer data into different groups based on shared characteristics.
What is data wrangling?
Data wrangling is the process of cleaning, transforming, and organizing data for analysis. It is a crucial step in the data science workflow, as it involves preparing the raw data for analysis and modeling, and ensuring that it is in a suitable format and structure for these tasks.
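A small wrangling sketch with pandas (the messy export and column names are assumed): currency strings are cleaned and cast to a numeric type so the column becomes usable for analysis:

```python
import pandas as pd

# Hypothetical messy export: prices stored as formatted strings
raw = pd.DataFrame({
    "item": ["laptop", "monitor"],
    "price": ["$1,200", "$950"],
})

# Wrangling: strip currency symbols and thousands separators, then cast
raw["price"] = (raw["price"]
                .str.replace("$", "", regex=False)
                .str.replace(",", "", regex=False)
                .astype(int))
```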