M7 U3 - Data Types and Sources - Q2 Flashcards
What decides the type of data that will be used in the project? (2)
The tasks and methods that were defined at the same time as the project’s business and analytic objectives.
The type of data will influence the source and data collection techniques.
List the categories of numeric/categorical data (4)
What’s one interesting thing about qualitative data?
Qualitative data is sometimes transformed to enable it to be used in certain machine learning modeling techniques that require quantitative data.
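One common way to do this transformation is one-hot encoding, where each category becomes its own 0/1 column. The sketch below is a minimal illustration in plain Python, not something taken from the course material; the `one_hot_encode` helper and the sample survey answers are made up for the example.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicators."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


# Hypothetical qualitative column: free-choice survey answers.
answers = ["happy", "neutral", "unhappy", "happy"]
categories, encoded = one_hot_encode(answers)

print(categories)
for answer, row in zip(answers, encoded):
    print(answer, row)
```

Each row now holds only numbers, so it can be passed to a modeling technique that requires quantitative input. Libraries such as pandas and scikit-learn provide their own encoders for the same purpose.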
Describe structured vs unstructured data and examples of their storage backbones
Structured Data
- Fixed format (usually a row and column structure)
- Easy to extract
- Requires a predefined schema
- Examples: spreadsheets, relational databases and other repositories in the row and column format.
Unstructured Data
- Most difficult to extract
- Doesn’t fit row and column structure
- Cannot be maintained in a uniform format
- Doesn’t need a predefined schema
- Examples: text, multimedia files, server log files, and NoSQL databases
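The contrast in extraction effort can be sketched as follows. This is an illustrative example only: the `orders` table, the customer names, and the log lines are invented, and the regular expression is just one guess at a pattern that happens to fit these sample lines.

```python
import re
import sqlite3

# Structured: the schema is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Ada", 19.99), (2, "Grace", 5.50)],
)
# Extraction is a plain query against known columns.
structured_totals = [
    row[0] for row in conn.execute("SELECT total FROM orders ORDER BY order_id")
]
conn.close()

# Unstructured: free-form server log lines with no declared schema (made-up sample).
log_lines = [
    "2024-01-05 10:02:11 checkout ok for Ada, charged 19.99",
    "2024-01-05 10:07:45 WARN payment retry; Grace finally charged 5.50",
]
# Extraction depends on a pattern inferred from the text itself.
amount_pattern = re.compile(r"charged (\d+\.\d+)")
unstructured_totals = []
for line in log_lines:
    match = amount_pattern.search(line)
    if match:
        unstructured_totals.append(float(match.group(1)))

print(structured_totals)
print(unstructured_totals)
```

Both paths recover the same amounts here, but the structured path relies on the predefined schema, while the unstructured path breaks as soon as a log line is worded differently.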
What are some ways to classify data? (4)
- Data type: Numeric vs categorical and subtypes of each
- Qualitative and Quantitative Data
- Structured and Unstructured Data
- Internal and External Data
What’s a secondary data source?
- Secondary data sources: gathered from sources external to an organization
What’s a primary data source?
Primary data sources: collected and processed by an organization and housed internally
What kinds of sources can internal data come from?
Can come from a primary or secondary data source
What data sources can an organization’s data governance framework affect?
Both primary and secondary sources
Any data used by the organization.
What’s the key to distinguishing between internal and external data sources?
I believe: If the data is stored in a company’s DB and completely controlled by that company, it’s internal. Otherwise, external.
The data does not have to be about things within the company to be internal (but I think it usually is).
What’s the key to distinguishing between primary and secondary data sources?
Whether or not you collected it yourself. If so, it’s primary.
Is primary data internal or external? What about secondary data?
- Primary data can be collected from internal or external sources
- Secondary data will usually come from external sources.
What do we know about secondary data?
It’s often used by others too, i.e. it’s usually not your own.
Give examples of each of the 4 groupings of data sources
- Primary Internal: Data scientist conducts questionnaires and focus groups with employees of their own company.
- Primary External: Data scientist conducts questionnaires and focus groups with customers.
- Secondary Internal: Your company purchases potential client data from data brokers (an external source). That data has now become your company’s data, which will be used for marketing, advertising, etc. (it has now become internal data).
- Secondary External: Data used in a Kaggle competition or a dataset from the popular UCI Machine Learning Repository. You did not collect that data, and it has been used by others.
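The four groupings can be read as two independent yes/no questions: did we collect the data ourselves (primary vs secondary), and is it treated as internal (internal vs external)? The sketch below encodes that reading. The `DataSource` class, its field names, and the `grouping` function are hypothetical, and the internal/external flag is supplied by hand for each example rather than derived from any rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    description: str
    collected_by_us: bool  # True -> primary, False -> secondary
    internal: bool         # True -> internal, False -> external


def grouping(source: DataSource) -> str:
    """Combine the two yes/no answers into one of the four labels."""
    origin = "Primary" if source.collected_by_us else "Secondary"
    location = "Internal" if source.internal else "External"
    return f"{origin} {location}"


# The four examples from the card above, flagged by hand.
sources = [
    DataSource("Employee questionnaires and focus groups", True, True),
    DataSource("Customer questionnaires and focus groups", True, False),
    DataSource("Client data purchased from a data broker", False, True),
    DataSource("Kaggle or UCI Machine Learning Repository dataset", False, False),
]
labels = [grouping(s) for s in sources]

for s, label in zip(sources, labels):
    print(f"{label}: {s.description}")
```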
What data does data governance affect?
Any data that is used by the organization for decision making