Lectures (ALL lectures) Flashcards
I wrote the questions myself.
Explain the 4 V's characteristics of big data:
- Volume: the size of the data.
- Velocity: the speed at which the data is being created. High velocity can result in a high volume of data.
- Variety: the type of data. How do we combine data from different sources to process it? For example, say we have two different sources that collect weather data such as temperature from around the world; some of them work with Celsius and some with Fahrenheit.
- Veracity: the accuracy. How do we deal with accuracy when we're combining data? If we collect temperature data, there might be one or two sensors that are wrong and give us incorrect numbers.
What does data science typically involve?
- Exploration: identifying patterns in information. For example, collecting data on the prices in a supermarket; if you collect data for a year you might have enough data to show that the prices change each day, and you can use it as evidence. Uses visualisations.
- Inference: quantifying whether the identified patterns are reliable. Uses randomization, which considers what would have happened under all possible random assignments, not just the one that happened to be selected for the experiment. This reduces bias in sample data.
- Prediction: making informed guesses; reliably making an educated guess and predicting with confidence. Uses previously gathered data to make informed guesses, which can be done with several different techniques. Uses machine learning.
What is Causality?
Why one thing impacts another: cause and effect. Measured by thinking of the data in terms of an experiment.
What does Association mean?
- Association:
Identifying and observing an effect. Example: Is there any relation between chocolate consumption and heart disease?
What types of groups have been discussed in the course when it comes to making comparisons?
When making comparisons you have a treatment/target group and a control/reference group (those who don’t receive the treatment).
Explain User DNA
User DNA is created when you link together click data, social media (like LinkedIn for jobs), advertisements, online shopping and Google searches. The data collected from, for example, Google is then combined with data from other services, so we can do even more with it.
This is a specific characteristic of big data and enables us to understand the users even better.
Once you’ve collected the data and want to combine it, what problems might you face?
The V's come in here: we have a dataset that is combined with other data, which is a volume problem, and there might be data streaming in from elsewhere, which is a velocity problem.
Can we trust the accuracy of the data (veracity)? We might have to do some data quality testing.
How do we load data into a dataframe?
To load the data into a DataFrame object we use pandas and store it in a variable. When we inspect the variable we get the full table with all the data. To inspect the column names, we use the columns attribute.
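A minimal sketch, assuming a CSV file (the file name data.csv and its contents are hypothetical):

```python
import pandas as pd

# Load the data into a DataFrame object and store it in a variable.
dataset = pd.read_csv("data.csv")  # hypothetical file name

# Inspecting the variable shows the full table with all the data.
print(dataset)

# The columns attribute gives the column names.
print(dataset.columns)
```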
What do we use head(), shape and describe() for?
- The head() function can be used to view the first rows of a dataset.
- The shape attribute gives us the length and width of the dataset, e.g. dataset.shape.
- The describe() function gives us summary statistics of the dataset.
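For example (the DataFrame values below are made up for illustration):

```python
import pandas as pd

dataset = pd.DataFrame({"height": [170, 165, 180, 175],
                        "weight": [65, 60, 80, 72]})

print(dataset.head())      # the first rows (first 5 by default)
print(dataset.shape)       # (rows, columns) -- an attribute, so no parentheses
print(dataset.describe())  # summary statistics: count, mean, std, min, quartiles, max
```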
To get general information about the data set, such as how many values are not empty, what function can we use?
To get general information about the data set, such as how many values are not empty, use the info() function. With dataset.info() we get information about each column's data type, such as object, int or float.
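A small sketch (the data is made up; the missing name shows up in the non-null counts):

```python
import pandas as pd

dataset = pd.DataFrame({"name": ["Ada", "Bo", None],
                        "age": [34, 28, 41]})

# info() prints the number of non-null entries per column and each
# column's data type (object, int64, float64, ...).
dataset.info()
```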
What is Big data?
- Big data:
Has to do with computing hardware, data storage and data collection. When we handle a very large amount of data, we refer to it as big data. This enables us to look beyond the data in our own business and try to combine it with other types of data. Big data has had a massive impact on the whole industry, and a lot of companies and applications work solely with it.
How is big data any different from regular data?
The data in big data isn't just data that comes in nice tables; it's a mix of different kinds of structures. Big data is a mix of structured, semi-structured and unstructured data:
- Structured data includes tables with columns and meaningful rows of records.
- Semi-structured data can be data from web sources, where there is some structure and a way of extracting the data if you want to.
- Unstructured data (a misleading term, since there's always some structure to data) includes things like images, videos and audio. So big data is about how we can try to solve business problems by combining these different types of data.
Big data is usually characterised by the 4 V's. Name and describe them.
- Volume: the size of the data we're trying to handle; the amount of data is unprecedented. The amount is always changing, so it's difficult to say how big "big" is.
- Velocity: the speed at which the data is being created, i.e. the rate at which new data is generated by a computer system. Think of how Amazon logs every single click a user makes on the website; scale that up to a million users and that is a lot of data. This can vary and result in a high volume of data.
- Variety: has to do with different types of data; there is a variety of data formats, sources and systems. This brings a whole lot of problems: how do we combine data from different sources to process it? For example, say we have two different sources that collect weather data such as temperature from around the world, and some of them work with Celsius while others work with Fahrenheit. There are different kinds of scales, so how do we deal with them when they report completely different numbers?
- Veracity: is about accuracy. If we collect temperature data, there might be one or two sensors that are wrong and give us incorrect numbers. So how do we deal with accuracy when we combine all the data? Maybe it doesn't matter if only two sensors report wrong numbers?
What is CRISP-DM (CRoss-Industry Standard Process for Data Mining)?
An open standard which can be freely used. Modelled as an ongoing, iterative cycle, as follows:
1. Business Understanding
- Determine business objectives (background, goals, criteria).
- Assess the situation (resources, requirements, risks).
- Determine data mining goals.
- Produce a plan (assessment of tools and methods).
2. Data Understanding
- Collect initial data (sample data, integration).
- Describe the data (types, quantities, properties).
- Explore the data (initial analysis, statistics, visualisation).
- Verify data quality (outliers, corrupted data, missing data).
3. Data Preparation
- Select data (which data and why).
- Clean data (handle missing data).
- Integrate data (merge from different sources).
- Format data (according to requirements).
4. Modeling
- Select modeling techniques (depending on the goals).
- Decision tree modeling (classification, k-nearest neighbor for clustering).
- Generate a test design (how to test the result).
- Build the model.
5. Evaluation
- Evaluate the results (against the success criteria).
- Review the process (was anything missed, did anything fail, were there problems?).
- Determine the next steps.
6. Deployment
- Plan deployment (strategy).
- Plan monitoring and maintenance (e.g. changed requirements).
- Produce a final report (documentation).
- Evaluate the project (what went well/badly?).
What is information and what is data?
- Data: raw facts with no meaning and no context; we haven't interpreted them yet.
- Information: think of data that has been transformed into something more useful. Maybe we've asked questions: what does the data concern, what things have been measured, where was it collected from? Adding more contextual data around the raw data.
Describe what data is.
Data refers to raw facts, information, or observations that are typically collected, stored, and analyzed for a specific purpose. It can take various forms, including numbers, text, images, audio, video, and more. Data is the foundation of information, knowledge, and decision-making processes.
Attributes can be categorised based upon the mathematical operations they support. What are the two categories?
- Qualitative:
Distinctiveness: =, ≠
Order: <, <=, >, >=
- Quantitative:
Addition/subtraction: +, -
Multiplication/division: *, /
Describe the nominal (qualitative) scale.
Giving something a name (labelling things). Nominal scales are categorical data for grouping data objects.
- For example, we might label the colour of someone's hair as black, blonde, gray, etc. We can say something about distinctiveness: black hair is not the same as blonde hair. But we can't say anything about order, such as black hair being "heavier" than blonde hair or brown hair being "better" than black hair. We are just saying that they differ, not how they differ. Distinct values can be counted, like frequency. Binary is a special case of nominal scale data with only two possible categories, e.g. yes or no, true or false.
Describe the ordinal (qualitative) scale.
Ordered data with meaningful ranking, but distances are not necessarily uniform. We can say something about the order but not about HOW different they are. E.g. grades, opinion data.
Distinct and ordered, so order and count-based operations can be used, in addition to those for the nominal scale. Operations like rank order, median, percentiles, rank correlation.
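A small sketch of the two scales in pandas (the hair colours and grades are made up for illustration):

```python
import pandas as pd

# Nominal: distinct categories with no meaningful order (hair colour).
hair = pd.Series(["black", "blonde", "gray", "black"], dtype="category")
print(hair.value_counts())  # count-based operations, e.g. frequency

# Ordinal: ordered categories (grades), but the distances are not uniform.
grades = pd.Series(pd.Categorical(["C", "A", "B", "A"],
                                  categories=["C", "B", "A"], ordered=True))
print(grades.min(), grades.max())  # order-based operations now make sense
```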
What is data exhaust?
- Data exhaust:
Data exhaust is the trail of activity, or residual data, left behind by some other kind of business or computing process. Examples include mobile phone data (calls and locations), financial data (transactions), residuals of Internet users' activity (online searches, server access logs) and administrative data (organisational transactions, record keeping). You could, for example, do data mining on how many phone calls someone makes every month to try to infer their financial status: if you make many calls, perhaps you have a better economy?
What is linear regression? (Give the formula.)
- Linear Regression:
LinearRegression is a built-in model from the sklearn Python package, which we use to build a linear regression model. With linear regression we predict the value of one variable based on the value of another variable. In the first example of the lab, we predict the variable weight based on the variable height. In the lecture we predict someone's debt based on their income.
The linear regression equation is of the form: y = mx + b
Where:
- y is the dependent variable (e.g. examination score).
- x is the independent variable (e.g. hours studied).
- m is the slope.
- b is the y-intercept.
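A minimal sketch with sklearn, using the lab's height/weight example (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: heights in cm, weights in kg.
heights = np.array([[160], [170], [180], [190]])  # X must be two-dimensional
weights = np.array([55, 65, 75, 85])

model = LinearRegression()
model.fit(heights, weights)

print(model.coef_[0])    # the slope m
print(model.intercept_)  # the y-intercept b

# Predict the weight of a person who is 175 cm tall.
print(model.predict([[175]]))
```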
What is a pandas Series?
- Series:
A pandas Series is like a column in a table; it is a one-dimensional array holding data of any type.
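For example:

```python
import pandas as pd

# A one-dimensional array of values with an index.
s = pd.Series([1.75, 1.68, 1.82], name="height")
print(s)

# A DataFrame column is itself a Series.
df = pd.DataFrame({"height": [1.75, 1.68, 1.82]})
print(type(df["height"]))  # <class 'pandas.core.series.Series'>
```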
What is the purpose of the intercept in data mining?
- The intercept is the constant term b in the regression equation y = mx + b: the predicted value of the dependent variable when the independent variable is zero. It anchors the regression line; without it, the line would be forced through the origin, which would distort the slope for most datasets. In sklearn it is available as model.intercept_ after fitting, as in the sketch above.
What is the classification method?
- Classification:
This is a supervised machine learning method where the model tries to predict the correct label of a given input data. The model gets fully trained using the training data, and then it is evaluated on test data before being used to perform prediction on new unseen data. These two datasets (training and test) are kept separate during the training process. The content of the test data set should not be included in the training process.
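A minimal sketch of the workflow (the iris dataset and the k-nearest neighbors classifier are chosen here for illustration, not taken from the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Keep training and test data separate: the model never sees the
# test set during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier()
model.fit(X_train, y_train)          # train on the training data only
print(model.score(X_test, y_test))   # evaluate on unseen test data
```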
What is mean imputation?
- Mean imputation:
When you replace a missing observation with the mean value of its column. The mean acts like a default value: we take the mean (average) of the other entries in the column. Imputation means replacing missing values.
- Here's a step-by-step explanation (see the sketch after this list):
- Identify missing values: find the entries in your dataset that contain missing values.
- Calculate the mean: for each variable or column with missing values, calculate the mean of the available data points in that column.
- Replace missing values: substitute the missing values in each column with the mean calculated for that column.
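A minimal sketch in pandas (the temperature values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing temperature reading.
df = pd.DataFrame({"temperature": [21.0, 23.0, np.nan, 22.0]})

# Replace the missing value with the mean of the available entries.
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
print(df)  # the NaN is now 22.0, the mean of 21.0, 23.0 and 22.0
```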
When should you use the mean and when the median?
There is no single rule that always tells you to use one or the other; it depends on the data.
If you have a few rows of data with big differences, so that the mean and the median are very different from each other, this could give us distorted results.
Symmetry of the Data:
- Use the mean when the data is approximately symmetrically distributed and does not have extreme outliers.
- Use the median when the data is skewed or contains outliers. The median is less affected by extreme values.
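A quick illustration (the income values are made up):

```python
import pandas as pd

# Made-up incomes with one extreme outlier.
incomes = pd.Series([30_000, 32_000, 35_000, 31_000, 1_000_000])

print(incomes.mean())    # 225600.0 -- dragged up by the outlier
print(incomes.median())  # 32000.0  -- barely affected by the outlier
```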
Terminology: describe the terms sample, feature, label and model.
- Sample: some incoming data that will be analysed, for example a JPEG picture.
- Feature: some kind of quantifiable data from the sample, in the JPEG picture example this could be colour, height, width, pixel data, etc.
- Label: some useful information about the sample that we wish to categorise, e.g. looking at a picture we can tell whether it shows a person, a cat, a dog, etc.
- Model: the output of some learning algorithm. Machine learning programs start out with uninitialized parameters; they are blank. As the algorithm learns, it adjusts these parameters until the model starts giving us the predictions we want (the desired output).