L1 - Data Characteristics Flashcards
5 V's Types of data
What are the 5 V’s of Data? Define each…
Veracity : Refers to the quality and trustworthiness of the data, including how accessible and usable it is.
Velocity : Refers to the speed at which data is generated, streamed and processed.
Volume : Refers to the scale of the data.
Variety: There are different types of data such as structured, semi-structured and unstructured.
Value : All data can be used to extract value.
Define Data Variety…
Refers to the different types of data such as medical records, geo-spatial data, animal behaviour etc.
There are 3 types of variety: structured, semi-structured, unstructured.
Give examples of structured, unstructured and semistructured data…
Structured : Medical records, shop purchases, anything that adheres to a model and is in tabular format.
Semi-structured : Emails, binary executables. Any data that doesn’t adhere to a model, but whose records still have sortable properties and can be grouped via overlapping or similar attributes.
Unstructured : Geo-spatial data, audio, visual. Any data that doesn’t adhere to a model and is not initially sortable.
Define structured data…
Data that adheres to a data model and can be stored in tabular format (tables, rows and relationships). For example, any data that can be held in a SQL database.
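As a minimal sketch of what "adheres to a data model" means in practice, the example below stores a few shop purchases in an in-memory SQLite table using Python's standard `sqlite3` module. The table name, column names and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema is the "data model": every row must have these columns.
# Table and column names are hypothetical.
conn.execute(
    "CREATE TABLE purchases ("
    "id INTEGER PRIMARY KEY, customer TEXT, item TEXT, price_pence INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases (customer, item, price_pence) VALUES (?, ?, ?)",
    [("alice", "bread", 150), ("bob", "milk", 90), ("alice", "cheese", 320)],
)

# Because every row follows the same schema, aggregation is a one-line query.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(price_pence) "
    "FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
```

The fixed schema is what makes the `GROUP BY` query possible without first inspecting each record.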
Give advantages and disadvantages of Structured Data…
Advantage:
- Structure enables more efficient computational analysis.
- Efficient querying.
- Structure provides context to the data for easier human interpretation.
Disadvantage:
- Limited type of data that can be stored in this format.
- Less flexible in how data can be stored.
- Computational cost of inserting data.
Define Semi-structured data…
Data that doesn’t adhere to a data model, but whose records show similarities, which means similar records can be grouped together based on shared or similar properties. For example, emails won’t share the same data, but they do all share properties such as a sender, receiver, subject line etc.
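The email example can be sketched as follows. The records below are hypothetical: each is a Python dictionary with no fixed schema, yet a few keys recur in every record, and those shared keys are what allow grouping.

```python
from collections import defaultdict

# Hypothetical email records: fields vary from record to record.
emails = [
    {"sender": "a@example.com", "receiver": "b@example.com",
     "subject": "Hi", "body": "Hello"},
    {"sender": "c@example.com", "receiver": "b@example.com",
     "subject": "Report", "attachments": ["q1.pdf"]},
    {"sender": "a@example.com", "receiver": "d@example.com",
     "subject": "Re: Hi", "cc": ["b@example.com"]},
]

# Properties that every record happens to share.
shared_keys = set.intersection(*(set(e) for e in emails))

# Group subjects by one of the shared properties.
by_sender = defaultdict(list)
for e in emails:
    by_sender[e["sender"]].append(e["subject"])

print(sorted(shared_keys))
print(dict(by_sender))
```

Fields such as `body`, `attachments` and `cc` appear only in some records, which is why this data would not fit a single fixed table without extra work.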
Give advantages and disadvantages of Semi-structured data…
Advantage:
- Not constrained to a schema.
- Flexible in the way in which the data can be stored.
Disadvantage:
- No schema can make interpreting data relationships difficult.
- Less efficient querying than Structured due to no fixed schema.
Define Unstructured Data…
Data that doesn’t adhere to a model or schema. For example, audio, visual, geo-spatial.
This is commonly estimated to account for 80 to 90% of the data collected, and can be created by both humans and computers.
Give advantages and disadvantages of Unstructured Data…
Advantage:
- The absence of a schema or model removes constraints on data collection.
- Data is collected quickly and easily.
- Larger data sets can be acquired easily.
Disadvantage:
- Hard to process, analyse and interpret.
- Hard to query due to lack of relationships and structure.
- Computationally expensive to query.
Explain the relationship between unstructured data, the questions we ask about our data, and structured data…
When we collect data, usually it is unstructured. However, it can be considered structured at the lowest level. For example, visual data such as a picture can be considered a set of pixels.
What is unstructured are the questions we ask of the data.
Thus, when querying data, we can treat the unstructured data as structured at a low level and our query as truly unstructured. The response to our query should be a structured subset of the original data set, represented at a higher level than the original.
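One way to picture this is with a toy sketch: the "picture" below is a hypothetical 3x3 greyscale image, which at the lowest level is just a grid of pixel intensities. The question "which parts are bright?" is informal, and the threshold used to answer it is an assumption made for illustration. The answer is a structured subset of the original pixels.

```python
# Hypothetical 3x3 greyscale image: a grid of intensities from 0 to 255.
image = [
    [10, 200, 30],
    [220, 40, 250],
    [15, 25, 210],
]

THRESHOLD = 128  # assumed cut-off for what counts as "bright"


def bright_pixels(img, threshold):
    """Return (row, col, value) for each pixel brighter than threshold."""
    return [
        (r, c, v)
        for r, row in enumerate(img)
        for c, v in enumerate(row)
        if v > threshold
    ]


# The response is a structured list of records drawn from the raw pixels.
result = bright_pixels(image, THRESHOLD)
print(result)
```

Turning the informal question into a concrete rule (here, a threshold) is where the interpretation happens; the pixel grid itself was already structured.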