M7 U3 - Data Types and Sources - Q2 Flashcards
What decides the type of data that will be used in the project? (2)
The tasks and methods that were defined at the same time as the project’s business and analytic objectives.
The type of data will influence the source and data collection techniques.
List the categories of numeric/categorical data (4)

What’s one interesting thing about qualitative data?
Qualitative data is sometimes transformed to enable it to be used in certain machine learning modeling techniques that require quantitative data.
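One common transformation of this kind is one-hot encoding, where each category label becomes a 0/1 indicator column. The following is only a minimal sketch in plain Python; the `one_hot` helper and the `colors` list are made up for illustration and are not from the course material.

```python
def one_hot(values):
    """Map each category label to a 0/1 indicator vector."""
    # Sort the distinct labels so the column order is deterministic.
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical qualitative feature (illustrative data only).
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each row now holds only numbers, so it can be passed to a technique that requires quantitative input.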
Describe structured vs unstructured data and examples of their storage backbones
Structured Data
- Fixed format (usually a row-and-column structure)
- Easy to extract
- Requires a predefined schema
- Examples: spreadsheets, relational databases and other repositories in the row and column format.
Unstructured Data
- Most difficult to extract
- Doesn’t fit row and column structure
- Cannot be maintained in a uniform format
- Doesn’t need a predefined schema
- Examples: text, multimedia files, server log files, NoSQL databases
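To make the contrast concrete, here is a rough sketch of turning a server log line (unstructured text) into a row with fixed fields (structured). The log format, the `LOG_PATTERN` regex, and the `parse_log_line` helper are assumptions for illustration; real logs vary widely.

```python
import re

# Assumed log layout (hypothetical): "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(\S+) (\S+) (\w+) (.*)$")


def parse_log_line(line):
    """Return a dict with fixed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


# Illustrative log line, not taken from a real system.
row = parse_log_line("2024-01-15 10:32:07 ERROR Disk quota exceeded for user 42")
print(row)
```

The extra parsing step is why unstructured data is harder to extract: the schema has to be imposed after the fact, and lines that do not fit it are rejected.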
What are some ways to classify data? (4)
- Data type: Numeric vs categorical and subtypes of each
- Qualitative and Quantitative Data
- Structured and Unstructured Data
- Internal and External Data
What’s a Secondary data source?
- Secondary data sources: gathered from sources external to an organization
What’s a primary data source?
Primary data sources: collected and processed by an organization and housed internally
From which sources can internal data be collected?
It can come from a primary or a secondary data source.
What data sources can an organization’s data governance framework affect?
Both primary and secondary sources
Any data used by the organization.
What’s the key to distinguishing between internal and external data sources?
I believe: If the data is stored in a company’s DB and completely controlled by that company, it’s internal. Otherwise, external.
The data does not have to be about things within the company to be internal (but I think it usually is).
What’s the key to distinguishing between primary and secondary data sources?
Whether or not you collected it yourself. If so, it’s primary.
Is primary data internal or external? What about secondary data?
- Primary data can be collected from internal or external sources
- Secondary data will usually come from external sources.
What do we know about secondary data?
It’s often used by others too, i.e., it’s usually not your own.
Give examples of each of the 4 groupings of data sources
- Primary Internal: Data scientist conducts questionnaires and focus groups with employees of their own company.
- Primary External: Data scientist conducts questionnaires and focus groups with customers.
- Secondary Internal: Your company purchases potential client data from data brokers (an external source). That data has now become your company’s data that will be used for marketing, advertising, etc. (It has now become internal data.)
- Secondary External: Data used in a Kaggle competition or a dataset from the popular UCI Machine Learning Repository. You did not collect that data, and it has been used by others.
What data does data governance affect?
Any data that is used by the organization for decision making
What is data collection?
Data collection is the process of gathering and analyzing data that can meet defined business and analytic objectives.
When data is collected, it can be for one of three purposes. List them. (3)
- Data collected to define business and analytic objectives
- Data collected to define business requirements
- Data needed for developing an analytic solution
What are the traditional data collection methods?
- Questionnaires and surveys
- Interviews
- Observations
- Focus groups
What are obtrusive vs. unobtrusive data collection methods? Give examples.
- Obtrusive: Participants of the data collection exercise are aware that data is being collected from them for a purpose. E.g. the 4 traditional data collection methods.
- Unobtrusive: Can be done without the knowledge of the subject of the study. E.g. web sources of data, social media data, data sets from a data repository.
The traditional data collection processes are similar to the requirements gathering techniques. What is the difference between the data collected during both processes?
The data collected during the requirements phase is useful in determining what data is collected in the data gathering phase.
Where is the first place you should start looking for data during the data collection process?
Start from within the organization. No matter how small, you should start collecting data from within your client organization.
What’s the main idea of how data collection fits into the project?
When you defined the project’s business and analytic objectives, the tasks and methods were proposed as well. Those tasks and methods drive the type of data that will be used in the project, and the type of data in turn influences the source and the data collection techniques.
Give examples of external data (4)
- Statistics from surveys
- Questionnaires
- Research
- Customer feedback.
What influences the source and data collection techniques?
The type of data
What do we know about nominal data? (2)
Can’t be:
- ordered
- measured
List examples of internal data (4)
Data about:
- Operations
- Maintenance
- Personnel
- Finance
Data collected from Twitter by a presidential candidate’s election campaign team is considered which of the following?
- Internal
- External
- Primary
External data