Creating an Analytic Dataset Flashcards
What are the three structures of data?
Structured Data
Structured data is data with a high degree of organization, typically arranged in columns and rows as in a spreadsheet. Columns are also called fields and rows are also called records; these terms may be used interchangeably throughout the course. Each column represents a variable and each row represents a record of data. Structured data is often stored in databases or in files such as spreadsheets, and it is usually easy to access and, most importantly, easy to use.
Unstructured Data
Unstructured data may have no structure at all. Because the data isn’t organized into columns and rows, it can be time-consuming to work with: you have to extract the information you want from it. Examples of this type of data are a resume, a tweet, or a contract document.
Semi-structured Data
Semi-structured data has some structure but still requires work to put it into a structured format of columns and rows. An example is a computer system log file, which has to be parsed and manipulated into a format that makes the data easier to analyze.
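As an illustration of turning semi-structured data into rows and columns, here is a minimal Python sketch that parses log lines into records. The log layout (date, time, level, message), the sample lines, and the names `LOG_PATTERN` and `parse_log_line` are all assumptions made up for this example; real log files will need their own pattern.

```python
import re

# Assumed (hypothetical) log layout: "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)

def parse_log_line(line):
    """Return one record (a dict of field -> value), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Made-up sample lines, for illustration only
raw_lines = [
    "2024-01-15 10:32:01 ERROR Disk usage above threshold",
    "2024-01-15 10:32:05 INFO Backup completed",
]

# Each dict is a row; its keys (timestamp, level, message) are the columns
records = [parse_log_line(line) for line in raw_lines]
for record in records:
    print(record)
```

Each parsed record now has the same fields, so the list can be treated like a small table.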
What are the processes involved in organising a dataset?
- Understanding data
  - what are the various types of data
  - why is formatting important
- Data issues
  - problems we may encounter when working with data
  - we may have dirty data, missing values, outliers
- Data formatting
  - format data so it is usable / useful
  - data is often not in the format we need
- Data blending
  - blend data together from disparate datasets, as we will rarely have all the data we need for analysis in one file
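To make the blending step concrete, here is a small Python sketch that combines two datasets on a shared key. The datasets, the field names (`customer_id`, `name`, `amount`), and the helper `blend` are hypothetical, and the sketch assumes the key is unique in the first dataset; it is one simple way to join data, not the only one.

```python
# Two made-up datasets that share a "customer_id" field
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Ben"},
]
orders = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 1, "amount": 12.5},
    {"customer_id": 3, "amount": 99.0},
]

def blend(left, right, key):
    """Inner join: keep rows of `right` whose key also appears in `left`.

    Assumes `key` is unique within `left`.
    """
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right if row[key] in index]

blended = blend(customers, orders, "customer_id")
for row in blended:
    print(row)
```

Rows without a match on either side (customer 2 and the order for customer 3) are dropped here; whether to keep them is a choice to make per analysis.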
How do you identify an outlier?
A value is an outlier if it lies more than 1.5 times the Interquartile Range (IQR) below the first quartile or above the third quartile.
To calculate the upper fence and the lower fence, here are the exact steps:
1. Calculate the 1st quartile Q1 and the 3rd quartile Q3 of the dataset. You can use the Excel function QUARTILE.INC or QUARTILE.EXC.
2. Calculate the Interquartile Range: IQR = Q3 - Q1
3. Add 1.5 * IQR to Q3 to get the upper fence: Upper Fence = Q3 + 1.5 * IQR
4. Subtract 1.5 * IQR from Q1 to get the lower fence: Lower Fence = Q1 - 1.5 * IQR
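The four steps above can be sketched in Python with the standard library. The function names `iqr_fences` and `find_outliers` and the sample values are made up for illustration. The `method` argument of `statistics.quantiles` is assumed here to play the same role as the choice between QUARTILE.INC ("inclusive") and QUARTILE.EXC ("exclusive") in Excel.

```python
from statistics import quantiles

def iqr_fences(values, method="inclusive"):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # Step 1: first and third quartiles (the middle cut point is the median)
    q1, _median, q3 = quantiles(values, n=4, method=method)
    # Step 2: interquartile range
    iqr = q3 - q1
    # Steps 3 and 4: upper and lower fences
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, method="inclusive"):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(values, method)
    return [v for v in values if v < lower or v > upper]

# Made-up sample data: Q1 = 5, Q3 = 13, IQR = 8
sample = [1, 3, 5, 7, 9, 11, 13, 15, 100]
lower_fence, upper_fence = iqr_fences(sample)
outliers = find_outliers(sample)
print(lower_fence, upper_fence, outliers)  # -7.0 25.0 [100]
```

With this sample, only 100 lies beyond the upper fence of 25, so it is the single value flagged as an outlier.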
Determining outliers is not an exact science, so there is no single definition of an outlier. We’ve decided to cover the Interquartile Range method, but people also use other methods, such as z-scores (distance from the mean measured in standard deviations).