Week 1: Big Data Introduction Flashcards
Where does machine-generated data come from? (3 types)
- Internet of things
- smart devices
- sensing
Where does people-generated data come from? (4 examples) (BOSE)
- Blogging
- Online photo/video sharing
- Social networks
- Emails
Describe big data generated by people (3 formats)
- Typically unstructured and text heavy
- Multiple data formats:
- Web pages
- images
- XML
Describe big data generated by organisations (4 points)
- Typically structured
- stored in RDBMS
- Often siloed by department
- Limited infrastructure to share and integrate this data
Define: big data
- Big data is high volume, high velocity and/or high variety information assets
- require new forms of processing
- to enable enhanced decision making
- insight discovery
- process optimisation.
Characteristics of big data volume
- Refers to the vast amount of data that is generated every second, minute, hour and day
- the dimension of big data and its exponential growth
Characteristics of big data velocity
- Refers to the speed at which data is being generated
- increasing speed at which the data needs to be stored and analysed
- Processing of data in real-time to match its production rate
Characteristics of big data variety
- Refers to the ever-increasing range of forms that big data can come in, such as
- text
- images
- voice
- spatial data
Characteristics of big data veracity
Limited value if data is not accurate
Refers to the quality of the data, free from
- biases
- noise
- abnormalities
Data must be:
- accurate
- trustworthy
- reliable
- and have context within analysis
5 steps in the data science process
- Acquire
- Prepare
- Analyse
- Report
- Act
What are 3 ways we can acquire data?
- Finding the right data sources
- Conventional data may exist in RDBMS
- NoSQL storage system can be used to manage variety of data types in big data
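As a minimal sketch of the conventional case, structured data can be acquired straight from an RDBMS with SQL. The table, columns and values below are made up for illustration, using Python's built-in sqlite3:

```python
import sqlite3

# Hypothetical acquisition step: conventional structured data often
# lives in an RDBMS, so acquiring it can be a simple SQL query.
# The "sales" table and its values are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 40.0)])

# Pull the rows we need into Python for the later "Prepare" step.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('north', 160.0), ('south', 80.0)]
conn.close()
```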
What does preparing data involve?
- Preliminary investigations
- Correlations,
- general trends,
- outliers
- Summary statistics
- Mean
- Median
- Range
- std dev
- Visualisation
- Histogram
- scatter plots
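The summary statistics above can be sketched with Python's standard library; the sample values here are made up:

```python
import statistics

# Hypothetical sample of one numeric feature.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)            # 5.0
median = statistics.median(values)        # 4.5
value_range = max(values) - min(values)   # 7.0
std_dev = statistics.pstdev(values)       # population std dev: 2.0
```

For a quick look at distributions (histograms) and relationships (scatter plots), libraries such as matplotlib or pandas are the usual next step.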
Why do we pre-process data?
- To clean the data and address data quality issues
Why is domain knowledge important?
To make informed decisions on how to handle incomplete or incorrect data
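A minimal sketch of that decision in plain Python, with a made-up sensor series where None marks a missing reading; whether to drop or impute is exactly the kind of call that needs domain knowledge:

```python
# Hypothetical incomplete data: None marks a missing reading.
readings = [21.5, None, 22.0, None, 23.5]

# Option 1: drop incomplete records entirely.
dropped = [r for r in readings if r is not None]

# Option 2: impute missing values with the mean of the observed ones.
observed_mean = sum(dropped) / len(dropped)
imputed = [r if r is not None else observed_mean for r in readings]
```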
Methods of pre-processing data
- Scaling
- Transformation
- Feature selection
- Dimensionality reduction
- Data manipulation
- One-hot encoding (categorical data)
- Normalise