L1: Course Introduction Flashcards
What are the three Vs that distinguish big data from just data?
Volume
Variety
Velocity
One of the three Vs that distinguish big data from just data is Volume. What does this mean?
Big data is defined by its volume. Owing to the digitalisation of life as we know it, immense amounts of data are now being captured.
One of the three Vs that distinguish big data from just data is Variety. What does this mean?
Big data comes in a variety of forms, captured by a multitude of different sensors and stored in a variety of formats. It goes beyond numbers and also comprises images, videos and more.
One of the three Vs that distinguish big data from just data is Velocity. What does this mean?
Velocity refers to the speed at which data is generated and transmitted; i.e., big data is not just large sets of various types of data, but is often in motion and constantly changing.
Some critics argue that two additional Vs characterise big data. Which?
A) Veracity: difficulty of assessing the data’s reliability, completeness, or trustworthiness
B) Visualisability: ability to visualise meaningful information
D) Value: data’s ability to fuel business applications
A) Veracity: difficulty of assessing the data’s reliability, completeness, or trustworthiness
D) Value: data’s ability to fuel business applications
Although BDA is increasingly adopted in organisations, some limitations hinder this adoption. Provide some examples.
Budget: expensive to implement
Data security concerns: how to store it responsibly
Integration challenges: shortage of technical expertise
Generally, when working with and analysing big data, the following methods can be considered. Which? (Select all correct)
A) Chunk and pull
B) Split and search
C) Push compute to data
D) Sample and model
A) Chunk and pull
C) Push compute to data
D) Sample and model
One of the methods that can be applied when working with big data is CHUNK AND PULL. Which statements are NOT true?
A) The method splits the dataset into smaller chunks, allowing a local device to handle them
B) Chunks are typically logical and structured rather than based on randomized separation
C) After the data is split, each chunk can be pulled individually for analysis
D) When all chunks are analyzed, the results are aggregated to reach a conclusion
E) Poorly suited for parallelization
F) Not all data is appropriate for chunking logically, posing a limitation to the method
FALSE: E) Poorly suited for parallelization
Chunk and pull is well suited to parallelization and ultimately allows you to analyze large sets of data with lower computational power
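The chunk-and-pull steps can be illustrated with a minimal sketch. It assumes a hypothetical list of prices and, for simplicity, splits it into fixed-size consecutive chunks rather than logical ones; the helper names `iter_chunks`, `summarise_chunk` and `chunked_mean` are made up for this illustration.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def summarise_chunk(chunk):
    """Analyse one chunk on its own: return its partial (sum, count)."""
    return sum(chunk), len(chunk)


def chunked_mean(rows, chunk_size):
    """Pull each chunk, analyse it, then aggregate the partial results."""
    total, count = 0.0, 0
    for chunk in iter_chunks(rows, chunk_size):
        partial_sum, partial_count = summarise_chunk(chunk)
        total += partial_sum
        count += partial_count
    return total / count


# Hypothetical data: 1,000 prices, processed 100 at a time.
prices = range(1, 1001)
mean_price = chunked_mean(prices, chunk_size=100)
```

Each `summarise_chunk` call depends only on its own chunk, which is why the partial results could in principle be computed in parallel before being aggregated.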
One of the methods that can be applied when working with big data is PUSH COMPUTE TO DATA. Which statements are NOT true?
A) Entails compressing the big dataset in the database where it is stored
B) Once compression is complete, the data can be pulled into a local device to analyze the compressed dataset
C) The disadvantage is that it relies on database speed and functionalities
D) The advantage is that the entire dataset is used at once and that it can be faster than CHUNK AND PULL
None are false: all statements are correct
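A minimal sketch of pushing compute to the data, reading "compressing" as aggregating inside the database. It uses an in-memory SQLite database as a stand-in for wherever the big dataset is stored; the table `sales`, its columns and the rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for the remote store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Aarhus", 100.0), ("Aarhus", 200.0), ("Odense", 300.0)],
)

# The database does the heavy lifting: it reduces all rows to one
# summary row per city, and only that summary is pulled locally.
summary = conn.execute(
    "SELECT city, COUNT(*), AVG(price) FROM sales GROUP BY city ORDER BY city"
).fetchall()
conn.close()
```

Only the small `summary` list crosses over to the local device, so the speed of the step depends on the database, as statement C notes.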
One of the methods that can be applied when working with big data is SAMPLE AND MODEL. Which statements are NOT true?
A) Entails taking a sample from a big dataset (a volume that can be handled by a local device)
B) Essentially we downsample the dataset to a more convenient size
C) Advantages: data can be modelled by standard software packages; allows for rapid prototyping with different techniques
D) Disadvantages: must ensure that sample is valid and representative; potential scalability issues
E) Not the focus of the course
WRONG: E) Not the focus of the course
SAMPLE AND MODEL is the method used in the course
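A minimal sketch of sample and model, under stated assumptions: the "big" dataset is simulated (hypothetical property sizes and prices with a true slope of 20), the sample is a simple random sample, and the model is an ordinary least-squares line fitted by the illustrative helper `fit_line`.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical big dataset: (size, price) pairs, price = 20 * size + noise.
population = []
for _ in range(100_000):
    size = rng.uniform(50, 200)
    population.append((size, 20.0 * size + rng.gauss(0, 100)))

# Sample: downsample to a volume a local device can handle.
sample = rng.sample(population, k=500)


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Model: fit on the sample only, not on the full dataset.
intercept, slope = fit_line(sample)
```

The fitted slope should land close to the simulated value of 20 only because the sample is random; a non-representative sample would not give that guarantee, which is the disadvantage named in statement D.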
You want to estimate the price at which a given property will sell under given conditions.
Do you need a predictive or explanatory model for this purpose?
Predictive - you want to know if, when, where, and how much of something will happen.
For this purpose, we are only interested in the predictive accuracy of the model - not causal effects
You want to investigate why properties sell for more or less; e.g., the causal effect of the size of the property, the location, material design etc. on the price
Do you need a predictive or explanatory model for this purpose?
Explanatory - an explanatory model identifies cause-and-effect relationships. If you want to know WHY or HOW something happens, you are interested in explanation
Predictive models explain correlations between the independent variables and the dependent variable
TRUE/FALSE
TRUE
In predictive models, you will get correlation coefficients estimating the association between the independent and dependent variables
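As an illustration of what such a coefficient measures, here is a minimal sketch of the Pearson correlation coefficient. The helper `pearson_r` and the property data are hypothetical; the result describes association only and says nothing about causation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: property size (square metres) and price (millions).
sizes = [50, 80, 100, 120, 150]
prices = [1.1, 1.6, 2.1, 2.4, 3.0]
r = pearson_r(sizes, prices)  # close to 1: strong positive association
```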
Explanatory models explain causations between the independent variables and the dependent variable
TRUE/FALSE
TRUE
What is true about overfitting? (Select all correct)
A) One of the most common pitfalls in BDA and ML
B) Means that the model has been fitted too tightly to the training dataset and fails to generalise
C) It can be mitigated if the dataset is sufficiently large and by using cross-validation
D) It is a particularly big problem in very large datasets
A) One of the most common pitfalls in BDA and ML
B) Means that the model has been fitted too tightly to the training dataset and fails to generalise
C) It can be mitigated if the dataset is sufficiently large and by using cross-validation
WRONG: D) Very large datasets allow for cross-validation with fewer folds (faster computation), since each fold includes an ample number of observations
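A minimal sketch of how cross-validation can expose overfitting. It assumes simulated data and a deliberately overfitted model, a 1-nearest-neighbour predictor that memorises the training set; `one_nn_predict`, `mean_abs_error` and `k_fold_mae` are illustrative helpers, not part of any library.

```python
import random


def one_nn_predict(train, x):
    """Return the y of the training point whose x is closest to the query."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def mean_abs_error(train, test):
    """Mean absolute error of the 1-NN model on the test points."""
    return sum(abs(one_nn_predict(train, x) - y) for x, y in test) / len(test)


def k_fold_mae(data, k):
    """Average held-out error over k folds."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors.append(mean_abs_error(train, test))
    return sum(errors) / k


rng = random.Random(0)
# Hypothetical data: y = 2 * x plus noise, with unique x values.
data = [(x, 2.0 * x + rng.gauss(0, 10)) for x in range(100)]
rng.shuffle(data)

train_mae = mean_abs_error(data, data)  # 0.0: every point finds itself
cv_mae = k_fold_mae(data, k=5)          # error on observations not seen in training
```

The gap between `train_mae` and `cv_mae` is the symptom described in statement B: the model fits the training data perfectly but does worse on held-out observations.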