Data Flashcards
First Party Data
Data collected by an individual or group using their own resources.
Second Party Data
Data collected by a group directly from its audience and then sold.
Third Party Data
Data collected from outside sources who did not collect it directly.
Population
All possible data values in a certain dataset.
Sample
A part of a population that is representative of the population. This is useful when looking to analyse data on an entire population as collecting such a massive amount of data would be challenging.
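As a minimal sketch of drawing a sample from a population, the snippet below uses simple random sampling, which is one common way to aim for a representative sample. The population of integers, the seed, and the sample size of 50 are assumptions made purely for illustration.

```python
import random

# Hypothetical population: the integers 1..1000 stand in for every value in a dataset.
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample = rng.sample(population, k=50)  # 50 distinct values drawn without replacement

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {population_mean}")
print(f"sample mean:     {sample_mean:.1f}")
```

The sample mean will usually land near the population mean, but it is only an estimate and will vary from one draw to the next.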
Time Series Data
This is data that includes dates and is useful when looking to analyse trends over time.
Qualitative Data
This is data that can’t be counted, measured or easily expressed using numbers. Examples include names, categories and descriptions.
Quantitative Data
This is data which can be measured, counted and then expressed as a number. This is data with a certain quantity, amount or range.
Discrete Data [Quantitative Data]
This is data that’s counted and has a limited number of values. It isn’t fractional; it’s made up of whole numbers or points such as 10, 50 and 365.
Continuous Data [Quantitative Data]
Data that is measured and can have almost any numerical value. An example is 110.0356 minutes.
Nominal Data [Qualitative Data]
A type of qualitative data that’s categorized without a set order. This type of data doesn’t have a sequence.
Ordinal Data [Qualitative Data]
This is a type of qualitative data with a set order or scale.
Internal Data
Data that lives within a company’s own systems. This is usually more reliable and easier to collect.
External Data
Data that lives and is generated outside of an organisation.
Structured Data
Data that’s organised in a certain format such as rows and columns. Spreadsheets and relational databases can store data in a structured way.
Unstructured Data
This is data that is not organised in any easily identifiable manner such as audio and video files.
Data Model
A model that is used for organising data elements and how they relate to one another.
Data Elements
Pieces of information, such as people’s names, account numbers, and addresses.
Data Modelling
Data modelling is the process of creating diagrams that visually represent how data is organised and structured. These visual representations are called data models.
3 Most Common Types of Data Modelling
- Conceptual data modelling
- Logical data modelling
- Physical data modelling
Conceptual Data Modelling
Conceptual data modelling gives a high-level view of the data structure, such as how data interacts across an organisation. A conceptual data model doesn’t contain technical details.
Logical Data Modelling
Logical data modelling focuses on the technical details of a database such as relationships, attributes, and entities. It doesn’t spell out the actual names of database tables. That’s the job of a physical data model.
Physical Data Modelling
Physical data modelling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
Spreadsheet Data Types
- Number
- Text or string
- Boolean [a result that can only have one of two possible values: true or false.]
Record
This is data contained in a row.
Field
This is data contained in a column.
Wide Data
Data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject.
Long Data
Long data is data in which each row is one time point per subject, so each subject will have data in multiple rows.
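The difference between wide and long data can be sketched by reshaping one into the other. The function name `wide_to_long`, the field names, and the sample values below are hypothetical and chosen only for illustration.

```python
# Hypothetical wide data: one row per subject, one column per year.
wide = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def wide_to_long(rows, id_field, key_name="year", value_name="value"):
    """Return one row per (subject, time point) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_field:
                continue  # the identifier is repeated on every long row instead
            long_rows.append({id_field: row[id_field], key_name: key, value_name: value})
    return long_rows


long_data = wide_to_long(wide, id_field="country")
for row in long_data:
    print(row)
```

Each of the two subjects in the wide table becomes two rows in the long table, one per year.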
Data Transformation
This is the process of changing the data’s format, structure, or values.
Data Transformation Involves:
Adding, copying, or replicating data
Deleting fields or records
Standardising the names of variables
Renaming, moving or combining columns in a database
Joining one set of data with another
Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (CSV) file.
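Several of the steps listed above can be sketched in a few lines: standardising names, combining columns, joining with a second dataset, and saving in a different format. The records, field names, and the `standardise` helper are hypothetical, and the CSV is written to an in-memory buffer rather than a real file.

```python
import csv
import io

# Hypothetical records with inconsistent field names.
customers = [
    {"Customer ID": 1, "First Name": "Ada", "Last Name": "Lovelace"},
    {"Customer ID": 2, "First Name": "Alan", "Last Name": "Turing"},
]
orders = {1: 3, 2: 5}  # second dataset: customer_id -> number of orders


def standardise(name):
    """Lower-case a field name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


transformed = []
for record in customers:
    row = {standardise(key): value for key, value in record.items()}
    # Combine two columns into one.
    row["full_name"] = f"{row.pop('first_name')} {row.pop('last_name')}"
    # Join with the second dataset on customer_id.
    row["order_count"] = orders.get(row["customer_id"], 0)
    transformed.append(row)

# Save in a different format: write the rows out as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "full_name", "order_count"])
writer.writeheader()
writer.writerows(transformed)
csv_text = buffer.getvalue()
print(csv_text)
```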
Goals For Data Transformation:
Data organisation: better organised data is easier to use
Data compatibility: different applications or systems can then use the same data
Data migration: data with matching formats can be moved from one system to another
Data merging: data with the same organisation can be merged together
Data enhancement: data can be displayed with more detailed fields
Data comparison: apples-to-apples comparisons of the data can be made
Kaggle
This is an online community of people passionate about data.
Bias
A preference in favour of or against a person, group of people, or thing.
Data Bias
Type of error that systematically skews results in a certain direction.
Sampling Bias
When a sample isn’t representative of the population as a whole.
Observer Bias (experimenter or research bias)
The tendency for different people to observe things differently.
Interpretation Bias
The tendency to always interpret ambiguous situations in a positive or negative way.
Confirmation Bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs.
R.O.C.C.C Process
This process can be used to identify good data sources.
Reliable - Good data sources are reliable.
Original - Be sure to validate data with the original source.
Comprehensive - The best data sources contain all critical information needed to answer the question or find a solution.
Current - The usefulness of data decreases as time passes. The best data sources are current and relevant to the task at hand.
Cited - Who created the dataset? Is it part of a credible organisation? When was the data last refreshed? Your source has to be cited and vetted.
Ethics
Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues.
Data Ethics
Well-founded standards of right and wrong that dictate how data is collected, shared, and used.
Aspects of data ethics
Ownership
Transaction transparency
Consent
Currency
Privacy
Openness
Ownership
Individuals own the raw data they provide and they have primary control over its usage, how it’s processed, and how it’s shared.
Transaction Transparency
All data-processing activities should be completely explainable and understood by the individual who provides their data.
Consent
An individual’s right to know explicit details about how and why their data will be used before agreeing to provide it.
Currency
Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
Privacy
Preserving a data subject’s information and activity anytime a data transaction occurs.
Data Anonymisation
Data anonymisation is the process of protecting people’s private or sensitive data by eliminating that kind of information. Typically, data anonymisation involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
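Blanking, masking, and hashing can be sketched as follows. The record, the helper names `mask_phone` and `hash_value`, and the salt are hypothetical. A salted hash on its own is not a complete anonymisation scheme, so treat this as an illustration of the three techniques rather than a recommended implementation.

```python
import hashlib


def mask_phone(phone, visible=2):
    """Replace every digit except the last `visible` digits with '*'."""
    total_digits = sum(ch.isdigit() for ch in phone)
    digits_seen = 0
    masked = []
    for ch in phone:
        if ch.isdigit():
            digits_seen += 1
            masked.append(ch if digits_seen > total_digits - visible else "*")
        else:
            masked.append(ch)  # keep separators such as '-'
    return "".join(masked)


def hash_value(value, salt):
    """Replace a value with a fixed-length SHA-256 hex code."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


record = {"name": "Jane Doe", "phone": "555-0134", "email": "jane@example.com"}

anonymised = {
    "name": "",                                              # blanking
    "phone": mask_phone(record["phone"]),                    # masking
    "email": hash_value(record["email"], salt="demo-salt"),  # hashing
}
print(anonymised["phone"])  # ***-**34
```

The hashed email is always 64 hexadecimal characters long regardless of the input, which is the fixed-length code the definition refers to.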
De-identification
This is the process used to wipe data clean of all personally identifiable information (PII). This is commonly done in the healthcare and financial industries.
Data that’s often anonymised includes:
Telephone numbers
Names
License plates and license numbers
Social security numbers
IP addresses
Medical records
Email addresses
Photographs
Account numbers
Openness (or open data)
Free access, usage and sharing of data
CSV
Comma Separated Values files save data in a table format. They use plain text, and their values are delimited by a character such as a comma.
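A short sketch of reading CSV text with Python’s standard `csv` module. The column names and values are made up for illustration, and the text is read from an in-memory string rather than a file.

```python
import csv
import io

# Hypothetical CSV content: a header line followed by one line per record.
csv_text = "name,city,age\nAda,London,36\nAlan,Manchester,41\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    print(row["name"], row["city"], row["age"])
```

Because CSV is plain text, every value is read back as a string, so `row["age"]` is `"36"` until it is converted to a number.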
Sorting Data
Arranging data into a meaningful order to make it easier to understand, analyse, and visualise.
Data Governance
A process to ensure the formal management of a company’s data assets.
Filtering
Showing only the data that meets specific criteria while hiding the rest.
Multiple Criteria Sorting
This allows you to sort by more than one column at the same time, for example by region and then by sales within each region.
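Filtering and a sort on two criteria can be sketched with plain Python lists. The rows, field names, and the threshold of 300 are hypothetical.

```python
# Hypothetical sales records.
rows = [
    {"region": "West", "rep": "Kim", "sales": 300},
    {"region": "East", "rep": "Lee", "sales": 450},
    {"region": "West", "rep": "Ana", "sales": 500},
    {"region": "East", "rep": "Bo", "sales": 200},
]

# Filtering: keep only the rows that meet the criterion, leaving the rest out.
high_sales = [row for row in rows if row["sales"] >= 300]

# Sorting on two criteria: region ascending, then sales descending within a region.
sorted_rows = sorted(rows, key=lambda row: (row["region"], -row["sales"]))

for row in sorted_rows:
    print(row["region"], row["rep"], row["sales"])
```

Negating the numeric field in the sort key is a simple way to mix ascending and descending order in a single pass.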
Best Practices for Organising Data
- Naming conventions
- Foldering
- Archiving older files
- Aligning your naming and storage practices with your team
- Developing metadata practices
Naming Conventions
Consistent guidelines that describe the content, date, or version of the file in its name.
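One possible convention, sketched as a helper that puts the content, date, and version into the file name. The pattern `project_YYYYMMDD_vNN.ext` and the function name `build_filename` are assumptions; a team may well agree on a different layout.

```python
from datetime import date


def build_filename(project, file_date, version, extension="csv"):
    """Build a name made of the content, the date, and a zero-padded version."""
    safe_project = project.strip().lower().replace(" ", "_")
    return f"{safe_project}_{file_date:%Y%m%d}_v{version:02d}.{extension}"


filename = build_filename("Sales Report", date(2024, 1, 31), 2)
print(filename)  # sales_report_20240131_v02.csv
```

Putting the date in year-month-day order means that sorting the file names alphabetically also sorts them chronologically.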
Data Security
Protecting data from unauthorised access or corruption by adopting safety measures.
Encryption
Encryption uses a unique algorithm to alter data and make it unusable by users and applications that don’t know the algorithm. This algorithm is saved as a key which can be used to reverse the encryption.
Tokenisation
This process replaces the data elements you want to protect with randomly generated data referred to as a token. The original data is stored in a separate location and mapped to the tokens.
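The mapping between tokens and original values can be sketched with an in-memory dictionary. The `TokenVault` class is hypothetical, and a real system would keep the mapping in a separate, secured store rather than in the same process.

```python
import secrets


class TokenVault:
    """Hypothetical vault that maps random tokens back to the original values."""

    def __init__(self):
        self._token_to_value = {}

    def tokenise(self, value):
        """Store the value and return a randomly generated token in its place."""
        token = secrets.token_hex(8)  # 16 random hexadecimal characters
        while token in self._token_to_value:  # regenerate on the rare collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        return token

    def detokenise(self, token):
        """Look up the original value for a token."""
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4111 1111 1111 1111"  # placeholder value for illustration
token = vault.tokenise(card_number)

print(token)                    # random, so it differs on every run
print(vault.detokenise(token))  # the original value
```

Unlike a hash or an encrypted value, the token is not derived from the original data, so it reveals nothing without access to the vault.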
Mentor
A professional who shares their knowledge, skills and experience to help you develop and grow. A mentor helps you to skill up.
Sponsor
A professional advocate who’s committed to moving a sponsee’s career forward within an organisation. A sponsor helps you to move up.