Introduction Flashcards
Data
Found everywhere and generated at an unprecedented rate; a fundamental component of society.
Data in our lives
- Smart Home Devices
- Fitness Trackers (track biometric data)
- Financial Transactions
Types of Data
Structured, Unstructured, Semi-Structured
Structured Data
Well organized, consistently formatted, and easily searchable, e.g., financial records. Usually stored in an RDBMS or in files such as CSV.
Unstructured Data
Unorganized and unformatted, appearing in many different forms, e.g., social media posts, emails. Usually stored in file systems or content management systems that preserve the original format.
Semi-Structured Data
A combination of the two: partially organized data that doesn't follow a rigid format but still has some level of structure, mixing fixed and variable fields. Commonly found in XML or JSON files.
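As a minimal illustration (hypothetical records, not from the source), the Python sketch below parses two JSON records that share fixed fields (id, name) while the remaining fields vary:

```python
import json

# Hypothetical semi-structured records: "id" and "name" are fixed fields,
# while "attributes" differs from record to record.
records = [
    '{"id": 1, "name": "Alice", "attributes": {"age": 34, "city": "Milan"}}',
    '{"id": 2, "name": "Bob", "attributes": {"newsletter": true}}',
]

for raw in records:
    rec = json.loads(raw)  # parse the JSON text into a Python dict
    print(rec["id"], rec["name"], rec.get("attributes", {}))
```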
Qualitative Data example
Gender, Nationality
Quantitative Data example
Height, Weight
Raw Data
The original form of the data as obtained from the source; hard to use directly for analysis. Raw data may only need to be processed once.
Processed Data
Data that is ready for analysis; processing can include merging, subsetting, transforming, etc. All processing steps should be recorded.
Raw Data Example
Machine-generated ASCII or binary files, unformatted Excel files, raw API responses.
Accuracy
The measure of data quality that ensures data is correct, free from errors, and represents the real-world value accurately.
Completeness
Indicates whether all required data is recorded or if some is missing/unavailable.
Consistency
Ensures uniformity across data. Examples of issues include partially modified records or dangling updates.
Timeliness
Refers to whether data is updated promptly to reflect the current state.
Believability
Assesses how trustworthy or credible the data is.
Interpretability
Reflects how easily the data can be understood by users.
Data Consolidation Process
- Moving data: Ensuring all data is gathered from different sources into a unified location.
- Making it consistent: Aligning data formats and resolving inconsistencies.
- Cleaning data: Removing errors, duplicates, and filling in missing values where possible.
Why is Data Consolidation Needed?
- Data is often stored in different formats, making integration difficult.
- Data is frequently inconsistent across sources, causing discrepancies.
- Data may be dirty due to:
- Internal inconsistencies (e.g., conflicting records).
- Missing values or blank fields.
- Potentially incorrect data caused by:
- Faulty instruments.
- Human or system errors.
- Transmission errors during data movement.
Disparate Data
Data is often stored in diverse locations and formats, which may include:
- Relational Databases: Used in operational systems for structured data.
- XML Files: Common in web services for hierarchical data.
- Desktop Databases: Such as Microsoft Access.
- Spreadsheets: Examples include Microsoft Excel.
- JSON: Popular for semi-structured or API-related data.
Challenges with Disparate Data
- Data may reside on different operating systems
- Databases may operate on varying hardware platforms
- Integration of such varied data requires specialized tools and processes.
Causes of Inconsistencies
- Faulty instruments, human/computer errors, transmission errors, or business requirements.
- Examples of discrepancies:
- Two plants use different part numbers for the same item.
- Systems using different formats for True/False (e.g., 1/0, T/F, Y/N).
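A minimal sketch of normalizing such mixed True/False encodings, assuming pandas and a hypothetical column named active:

```python
import pandas as pd

# Hypothetical data mixing 1/0, T/F, and Y/N encodings of the same flag.
df = pd.DataFrame({"active": ["1", "0", "T", "F", "Y", "N"]})

# Map every known representation onto a single canonical boolean.
truth_map = {"1": True, "0": False, "T": True, "F": False, "Y": True, "N": False}
df["active"] = df["active"].map(truth_map)

print(df)
```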
Other Data Quality Issues
- Free-form text entry:
- Example: Same city entered as Louisville, Lewisville, and Luisville.
- Cleaning routines must handle variations in bad data.
- Incomplete Data: Missing attribute values or only containing aggregate data.
- Example: Occupation = “” or Gender = “Unknown”.
- Noisy Data: Contains noise, errors, or outliers.
- Example: Salary = -10 (invalid).
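A hedged sketch of a cleaning routine for the issues above, using pandas; the city spellings and the negative-salary check come from this card, while the column names are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Louisville", "Lewisville", "Luisville", "Louisville"],
    "salary": [52000, 61000, -10, 48000],
})

# Free-form text: collapse known spelling variants onto a canonical value.
city_fixes = {"Lewisville": "Louisville", "Luisville": "Louisville"}
df["city"] = df["city"].replace(city_fixes)

# Noisy data: flag values that violate a simple domain rule (salary >= 0).
df["salary_valid"] = df["salary"] >= 0

print(df)
```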
Causes of Missing Data
- Equipment malfunction.
- Deleted data due to inconsistencies.
- Misunderstandings or assumptions during data entry.
- Data not considered important at the time.
- Lack of historical records or change tracking.
Handling Missing Data
- Ignore the tuple: Common for classification tasks when the class label is missing. Not effective when the percentage of missing values varies considerably across attributes.
- Fill in missing values:
- Manually: Labor-intensive and impractical for large datasets.
- Automatically:
- Global constant: e.g., "unknown".
- Attribute mean: Fill with the attribute's mean value.
- Class-specific attribute mean: Mean of samples from the same class.
- Most probable value: Using inference (e.g., Bayesian formula, decision trees).
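A minimal pandas sketch of the automatic fill-in strategies listed above (global constant, attribute mean, class-specific attribute mean); the column names and values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "class":      ["A", "A", "B", "B", "B"],
    "occupation": ["engineer", None, "nurse", None, "teacher"],
    "income":     [50.0, None, 42.0, 38.0, None],
})

# Global constant: replace missing categorical values with "unknown".
df["occupation"] = df["occupation"].fillna("unknown")

# Attribute mean: fill missing numeric values with the overall mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Class-specific attribute mean: fill with the mean of the same class only.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```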
Handling Noisy Data
- Binning
- Regression
- Clustering
- Hybrid Methods
The Data Cleaning Process
- Data Discrepancy Detection
- Data Migration and Integration
Binning
- Sort data into bins, then smooth using:
- Bin means.
- Bin medians.
- Bin boundaries.
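A small sketch of equal-frequency binning followed by smoothing by bin means, in plain Python; the sample values are illustrative:

```python
# Sorted sample values (illustrative), split into equal-frequency bins of 4.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bin_size = 4

for i in range(0, len(values), bin_size):
    bin_values = values[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Smoothing by bin means: every value in the bin is replaced by the bin mean.
    smoothed = [round(mean, 1)] * len(bin_values)
    print(bin_values, "->", smoothed)
```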
Regression
Fit data to regression functions for smoothing.
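As a rough sketch (one possible approach, not necessarily the course's method), a simple least-squares linear fit with numpy can replace noisy y values with fitted values:

```python
import numpy as np

# Illustrative noisy observations of an approximately linear relationship.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.3, 7.8, 10.4, 11.9])

# Fit y = a*x + b by least squares, then smooth y with the fitted line.
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b

print(np.round(y_smoothed, 2))
```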
Clustering
Detect and remove outliers as small clusters.
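A hedged sketch using scikit-learn's DBSCAN, where points that fall outside any cluster (label -1) are treated as outliers; the data and parameters are assumptions, not tuned values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one far-away outlier (illustrative data).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
              [20.0, 20.0]])

# DBSCAN labels points that belong to no cluster as -1.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
outliers = X[labels == -1]

print("labels:", labels)
print("outliers:", outliers)
```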
Hybrid Methods
Combine automated detection with human inspection (e.g., flag suspicious values for manual review).
Data Discrepancy Detection
- Use metadata (e.g., domain, range, dependency rules).
- Verify field overloading, uniqueness rules, null values, etc.
- Use tools for auditing:
- Data scrubbing: Leverage domain knowledge (e.g., postal codes, spell-check).
- Data auditing: Discover rules and relationships (e.g., correlations, clustering).
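A small sketch of rule-based discrepancy checks of the kind described above (range rules, uniqueness rules, null/format checks), using pandas; the rules and column names are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "age":         [34, -3, 51, None],
    "postal_code": ["20121", "2012", "20154", "20100"],
})

# Uniqueness rule: customer_id must not repeat.
duplicate_ids = df[df["customer_id"].duplicated(keep=False)]

# Range rule: age must lie in a plausible domain (missing values are flagged too).
bad_age = df[~df["age"].between(0, 120)]

# Null/format rule: postal codes must be present and exactly five digits.
bad_postal = df[~df["postal_code"].str.fullmatch(r"\d{5}", na=False)]

print(duplicate_ids, bad_age, bad_postal, sep="\n\n")
```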
Data Migration and Integration
- Data Migration Tools: Specify transformations for uniformity.
- ETL Tools: Graphical interfaces to extract, transform, and load data.
- Integrated Processes: Iterative and interactive approaches to improve accuracy and usability.
Sharing Data
To effectively share data, ensure the following components are included:
- Raw Data: The original dataset in its unaltered form.
- Tidy Dataset: A processed version of the raw data with organized and meaningful structure.
- Code Book: A document describing variables, values, and methods in the tidy dataset.
- Step-by-Step Procedure: Detailed instructions to transform raw data into the tidy dataset and code book.
The Code Book
Essential to describe the variables and provide context for the dataset.
Sections of the Code Book
- Variable Metadata:
- Describe all variables in the raw dataset, including their units.
- Document any summary choices made (e.g., averages, medians).
- Study Design:
- Thoroughly explain how the data was collected.
- Include experimental study design details if applicable.
- This section must be comprehensive and clear.
- File Format:
- Provide the code book as a standard document (e.g., .txt, .docx, .pdf).
The Instruction List
A step-by-step guide to replicate the data cleaning and tidying process.
Components of the Instruction List
- Automated Scripts: Prefer scripts to automate the procedure for accuracy and reproducibility.
- Inputs and Outputs:
- Input: Raw data.
- Output: Processed, tidy data.
- Manual Steps:
- Clearly explain any manual interventions.
- Provide verification tasks and checklists to ensure correctness at each stage.
- Intermediate Files:
- Explain the purpose of any intermediate files generated during the process.
The FASTA Format
- The header line begins with > and contains:
- gi: GenBank Identifier (e.g., 9626243).
- ref: NCBI Reference Sequence (e.g., NC_001416.1).
- Description: Information about the sequence (e.g., "complete genome of Enterobacteria phage lambda virus").
A typical genome sequence consists of:
- A: Adenosine
- C: Cytidine
- G: Guanine
- T: Thymidine
- N: Any nucleotide (A, G, C, T).
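A minimal sketch of parsing a FASTA header and counting nucleotides in plain Python; the header mirrors the identifiers above, and the sequence fragment is illustrative rather than the real genome:

```python
# A tiny FASTA fragment (header as in the card, sequence shortened for illustration).
fasta = """>gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCG
"""

lines = fasta.strip().splitlines()
header, sequence = lines[0], "".join(lines[1:])

# Header fields are separated by '|': gi, GenBank ID, ref, RefSeq ID, description.
_, genbank_id, _, refseq_id, description = header[1:].split("|")

print("GenBank ID:", genbank_id)
print("RefSeq ID: ", refseq_id)
print("Description:", description.strip())
print("Counts:", {base: sequence.count(base) for base in "ACGTN"})
```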
Transcribing the Code Book for Genome Data
Your Code Book should include:
- Organism Name: The name of the organism (e.g., Enterobacteria phage lambda).
- GenBank Identifier: If available, provide the GenBank ID (e.g., 9626243).
- NCBI Reference Sequence: Reference sequence, if available (e.g., NC_001416.1).
- Subsequences: For subsequences, reference UCSC/Ensembl exon databases if applicable.
- Study Variables:
- Clearly list variables relevant to the study.
- Example: For patient data, include attributes such as age, conditions (list), treatments (name, dose, unit).
Study Design Section
Include a description of the technique used to collect data (e.g., sequencing methods).
Instruction List for Data Preparation
- Include scripts or manual steps used for tidying the data.
- Specify the file formats used to share data (e.g., .csv, .fasta).
- Ensure reproducibility by providing detailed explanations of every processing step.
Call Detail Records (CDRs) Key Observations
- Data lacks headers.
- Missing data and gaps are present.
- Entire records may be omitted if no activity exists for a combination of SquareID, Time Interval, and Country Code.
Code Book Variables
- Square ID: ID of a square in the Milano GRID. (Type: numeric)
- Time Interval: Start of time interval (milliseconds since Unix Epoch). End = Start + 600,000 ms (10 mins). (Type: numeric)
- Country Code: Phone country code. Context varies by activity. (Type: numeric)
- SMS-In Activity: Received SMS in the Square ID during the interval from the nation in Country Code. (Type: numeric)
- SMS-Out Activity: Sent SMS in the Square ID during the interval to the nation in Country Code. (Type: numeric)
- Call-In Activity: Received calls in the Square ID during the interval from Country Code. (Type: numeric)
- Call-Out Activity: Issued calls in the Square ID during the interval to Country Code. (Type: numeric)
- Internet Traffic Activity: Internet traffic generated in the Square ID during the interval by the nation in Country Code. (Type: numeric)
Study Design
- File Format: TSV (Tab-Separated Values).
- Missing Data: If no activity exists, the value is omitted. Empty fields indicate no record.
- Dataset Origin: Aggregated from Telecom Italia cellular network’s CDRs in Milano for:
- Received/Sent SMS.
- Incoming/Outgoing Calls.
- Internet activity (a record is generated when a connection starts or ends, or when the 15-minute or 5 MB limit is reached).
- Purpose: Measure mobile phone network activity by SMS, call, and internet usage. SMS and call data are comparable; internet traffic is not.
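A hedged sketch of loading one such headerless TSV file with pandas, assigning the column names from the code book above and converting the time interval from milliseconds since the Unix epoch; the file name is a placeholder:

```python
import pandas as pd

# Column order as described in the code book; the raw files have no header row.
columns = [
    "square_id", "time_interval", "country_code",
    "sms_in", "sms_out", "call_in", "call_out", "internet_traffic",
]

# Placeholder path; missing activity values simply appear as empty fields (NaN).
df = pd.read_csv("sms-call-internet-mi.txt", sep="\t", header=None, names=columns)

# Interval start is in milliseconds since the Unix epoch; each interval lasts 10 minutes.
df["interval_start"] = pd.to_datetime(df["time_interval"], unit="ms")
df["interval_end"] = df["interval_start"] + pd.Timedelta(minutes=10)

print(df.head())
```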