Introduction Flashcards

1
Q

Data

A

Found everywhere and generated at an unprecedented rate; a fundamental component of society.

2
Q

Data in our lives

A
  • Smart Home Devices
  • Fitness Trackers (Track Biometric data)
  • Financial Transactions
3
Q

Types of Data

A

Structured, Unstructured, Semi-Structured

4
Q

Structured Data

A

Well organized, consistently formatted, and easily searchable (e.g., financial records). Usually stored in an RDBMS or in files such as CSV.

5
Q

Unstructured Data

A

Unorganized data with no predefined format, appearing in many forms (e.g., social media posts, emails). Usually stored in file systems or content management systems that preserve the original form.

6
Q

Semi-Structured Data

A

A combination of the two: data that doesn’t follow a rigid format but still has some level of structure, typically a mix of fixed and variable fields. Commonly found in XML or JSON files.
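As a small illustration of the "mix of fixed and variable fields" idea, the sketch below inspects two made-up JSON records: both share the fixed fields `id` and `name`, while the remaining fields vary per record.

```python
import json

# Hypothetical semi-structured records: every record has the fixed fields
# "id" and "name", but the remaining fields differ from record to record.
records = [
    json.loads('{"id": 1, "name": "Ana", "email": "ana@example.com"}'),
    json.loads('{"id": 2, "name": "Ben", "phone": "555-0100", "city": "Milan"}'),
]

fixed = {"id", "name"}
for rec in records:
    variable = set(rec) - fixed          # fields not shared by all records
    print(rec["id"], sorted(variable))
```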

7
Q

Qualitative Data example

A

Gender, Nationality

8
Q

Quantitative Data example

A

Height, Weight

9
Q

Raw Data

A

Data in its original form, hard to use directly for analysis. Raw data may only need to be processed once.

10
Q

Processed Data

A

Data that is ready for analysis, processing can include merging, subsetting, transforming etc. All steps should be recorded.

11
Q

Raw Data Example

A

Machine-generated ASCII or binary files, unformatted Excel files, API responses.

12
Q

Accuracy

A

The measure of data quality that ensures data is correct, free from errors, and represents the real-world value accurately.

13
Q

Completeness

A

Indicates whether all required data is recorded or if some is missing/unavailable.

14
Q

Consistency

A

Ensures uniformity across data. Examples of issues include partially modified records or dangling updates.

15
Q

Timeliness

A

Refers to whether data is updated promptly to reflect the current state.

16
Q

Believability

A

Assesses how trustworthy or credible the data is.

17
Q

Interpretability

A

Reflects how easily the data can be understood by users.

18
Q

Data Consolidation Process

A
  1. Moving data: Ensuring all data is gathered from different sources into a unified location.
  2. Making it consistent: Aligning data formats and resolving inconsistencies.
  3. Cleaning data: Removing errors, duplicates, and filling in missing values where possible.
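The three steps above can be sketched in plain Python; the two source lists, field names, and the "unknown" fill value are all made up for illustration.

```python
# Hypothetical records from two source systems with inconsistent casing,
# a duplicate, and a missing value.
source_a = [{"id": "1", "city": "Louisville"}, {"id": "2", "city": "Lexington"}]
source_b = [{"id": "2", "city": "LEXINGTON"}, {"id": "3", "city": None}]

# 1. Moving data: gather everything into a unified location.
combined = source_a + source_b

# 2. Making it consistent: align formats (here, city name casing).
for row in combined:
    if row["city"]:
        row["city"] = row["city"].title()

# 3. Cleaning: drop duplicates and fill missing values where possible.
seen, cleaned = set(), []
for row in combined:
    key = (row["id"], row["city"])
    if key not in seen:
        seen.add(key)
        cleaned.append({"id": row["id"], "city": row["city"] or "unknown"})

print(cleaned)
```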
19
Q

Why is Data Consolidation Needed?

A
  • Data is often stored in different formats, making integration difficult.
  • Data is frequently inconsistent across sources, causing discrepancies.
  • Data may be dirty due to:
    • Internal inconsistencies (e.g., conflicting records).
    • Missing values or blank fields.
    • Potentially incorrect data caused by:
      • Faulty instruments.
      • Human or system errors.
      • Transmission errors during data movement.
20
Q

Disparate Data

A

Data is often stored in diverse locations and formats, which may include:
- Relational Databases: Used in operational systems for structured data.
- XML Files: Common in web services for hierarchical data.
- Desktop Databases: Such as Microsoft Access.
- Spreadsheets: Examples include Microsoft Excel.
- JSON: Popular for semi-structured or API-related data.

21
Q

Challenges with Disparate Data

A
  • Data may reside on different operating systems
  • Databases may operate on varying hardware platforms
  • Integration of such varied data requires specialized tools and processes.
22
Q

Causes of Inconsistencies

A
  • Faulty instruments, human/computer errors, transmission errors, or business requirements.
  • Examples of discrepancies:
    • Two plants use different part numbers for the same item.
    • Systems using different formats for True/False (e.g., 1/0, T/F, Y/N).
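The True/False discrepancy above is commonly resolved with a normalization map; the function name and mapping below are a hypothetical sketch, not a standard library API.

```python
# Reconcile the different True/False encodings (1/0, T/F, Y/N) used by
# different systems into Python booleans.
TRUTHY = {"1": True, "T": True, "Y": True,
          "0": False, "F": False, "N": False}

def normalize_flag(value):
    """Map a system-specific flag value to a canonical boolean."""
    return TRUTHY[str(value).strip().upper()]

print(normalize_flag("y"), normalize_flag(0), normalize_flag("T"))
```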
23
Q

Other Data Quality Issues

A
  • Free-form text entry:
    • Example: Same city entered as Louisville, Lewisville, and Luisville.
    • Cleaning routines must handle variations in bad data.
  • Incomplete Data: Missing attribute values or only containing aggregate data.
    • Example: Occupation = “” or Gender = “Unknown”.
  • Noisy Data: Contains noise, errors, or outliers.
    • Example: Salary = -10 (invalid).
24
Q

Causes of Missing Data

A
  • Equipment malfunction.
  • Deleted data due to inconsistencies.
  • Misunderstandings or assumptions during data entry.
  • Data not considered important at the time.
  • Lack of historical records or change tracking.
25
Q

Handling Missing Data

A
  1. Ignore the tuple: Common for classification tasks when class labels are missing. Not effective when missing values vary significantly.
  2. Fill in missing values:
    • Manually: Labor-intensive and impractical for large datasets.
    • Automatically:
      • Global constant: e.g., "unknown".
      • Attribute mean: Fill with mean value.
      • Class-specific attribute mean: Mean of samples from the same class.
      • Most probable value: Using inference (e.g., Bayesian formula, decision trees).
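The attribute-mean and class-specific-mean strategies can be sketched as below; the dataset, class labels, and salary values are invented for illustration.

```python
# Hypothetical rows with one missing "salary" value to fill automatically.
rows = [
    {"class": "A", "salary": 50},
    {"class": "A", "salary": 70},
    {"class": "B", "salary": 90},
    {"class": "A", "salary": None},   # missing value
]

# Attribute mean: mean over all non-missing values.
known = [r["salary"] for r in rows if r["salary"] is not None]
attr_mean = sum(known) / len(known)

# Class-specific attribute mean: only samples from the same class ("A").
same = [r["salary"] for r in rows if r["class"] == "A" and r["salary"] is not None]
class_mean = sum(same) / len(same)

rows[3]["salary"] = class_mean
print(attr_mean, rows[3]["salary"])
```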
26
Q

Handling Noisy Data

A
  1. Binning
  2. Regression
  3. Clustering
  4. Hybrid Methods
27
Q

The Data Cleaning Process

A
  1. Data Discrepancy Detection
  2. Data Migration and Integration
28
Q

Binning

A
  • Sort data into bins, then smooth using:
    - Bin means.
    - Bin medians.
    - Bin boundaries.
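Smoothing by bin means can be sketched as follows, assuming equal-frequency bins of three values; the sample data is made up.

```python
# Sort values, partition into equal-frequency bins, then replace each
# value with its bin's mean.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))   # every value becomes the bin mean

print(smoothed)
```

Smoothing by bin medians or bin boundaries follows the same pattern, replacing the mean with the median or the nearer boundary value.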
29
Q

Regression

A

Fit data to regression functions for smoothing.

30
Q

Clustering

A

Detect and remove outliers as small clusters.

31
Q

Hybrid Methods

A

Combine automated detection with human inspection (e.g., flag suspicious values for manual review).

32
Q

Data Discrepancy Detection

A
  • Use metadata (e.g., domain, range, dependency rules).
  • Verify field overloading, uniqueness rules, null values, etc.
  • Use tools for auditing:
    • Data scrubbing: Leverage domain knowledge (e.g., postal codes, spell-check).
    • Data auditing: Discover rules and relationships (e.g., correlations, clustering).
33
Q

Data Migration and Integration

A
  • Data Migration Tools: Specify transformations for uniformity.
  • ETL Tools: Graphical interfaces to extract, transform, and load data.
  • Integrated Processes: Iterative and interactive approaches to improve accuracy and usability.
34
Q

Sharing Data

A

To effectively share data, ensure the following components are included:

  1. Raw Data: The original dataset in its unaltered form.
  2. Tidy Dataset: A processed version of the raw data with organized and meaningful structure.
  3. Code Book: A document describing variables, values, and methods in the tidy dataset.
  4. Step-by-Step Procedure: Detailed instructions to transform raw data into the tidy dataset and code book.
35
Q

The Code Book

A

Essential for describing variables and providing context for the dataset.

36
Q

Sections of the Code Book

A
  1. Variable Metadata:
    • Describe all variables in the raw dataset, including their units.
    • Document any summary choices made (e.g., averages, medians).
  2. Study Design:
    • Thoroughly explain how the data was collected.
    • Include experimental study design details if applicable.
    • This section must be comprehensive and clear.
  3. File Format:
    • Provide the code book as a standard document (e.g., .txt, .docx, .pdf).
37
Q

The Instruction List

A

A step-by-step guide to replicate the data cleaning and tidying process.

38
Q

Components of the Instruction List

A
  1. Automated Scripts: Prefer scripts to automate the procedure for accuracy and reproducibility.
  2. Inputs and Outputs:
    • Input: Raw data.
    • Output: Processed, tidy data.
  3. Manual Steps:
    • Clearly explain any manual interventions.
    • Provide verification tasks and checklists to ensure correctness at each stage.
  4. Intermediate Files:
    • Explain the purpose of any intermediate files generated during the process.
39
Q

The FASTA Format

A
  • The header line begins with > and contains:
    • gi: GenBank Identifier (e.g., 9626243).
    • ref: NCBI Reference Sequence (e.g., NC_001416.1).
    • Description: Information about the sequence (e.g., “complete genome of Enterobacteria phage lambda virus”).
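A header with the fields above can be pulled apart by splitting on the `|` delimiter; the sketch below assumes the common NCBI-style `>gi|…|ref|…| description` layout.

```python
# Parse a pipe-delimited NCBI-style FASTA header line.
header = ">gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome"

fields = header[1:].split("|")     # drop the leading ">", split on "|"
gi = fields[1]                     # GenBank identifier
ref = fields[3]                    # NCBI reference sequence
description = fields[4].strip()    # free-text description

print(gi, ref, description)
```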
40
Q

What a typical genome sequence consists of

A
  • A: Adenosine
  • C: Cytidine
  • G: Guanine
  • T: Thymidine
  • N: Any nucleotide (A, G, C, T).
41
Q

Transcribing the Code Book for Genome Data

A

Your Code Book should include:

  1. Organism Name: The name of the organism (e.g., Enterobacteria phage lambda).
  2. GenBank Identifier: If available, provide the GenBank ID (e.g., 9626243).
  3. NCBI Reference Sequence: Reference sequence, if available (e.g., NC_001416.1).
  4. Subsequences: For subsequences, reference UCSC/Ensembl exon databases if applicable.
  5. Study Variables:
    • Clearly list variables relevant to the study.
    • Example: For patient data, include attributes such as age, conditions (list), treatments (name, dose, unit).
42
Q

Study Design Section

A

Include a description of the technique used to collect data (e.g., sequencing methods).

43
Q

Instruction List for Data Preparation

A
  • Include scripts or manual steps used for tidying the data.
  • Specify the file formats used to share data (e.g., .csv, .fasta).
  • Ensure reproducibility by providing detailed explanations of every processing step.
44
Q

Call Detail Records (CDRs) Key Observations

A
  • Data lacks headers.
  • Missing data and gaps are present.
  • Entire records may be omitted if no activity exists for a combination of SquareID, Time Interval, and Country Code.
45
Q

Code Book Variables

A
  • Square ID: ID of a square in the Milano GRID. (Type: numeric)
  • Time Interval: Start of time interval (milliseconds since Unix Epoch). End = Start + 600,000 ms (10 mins). (Type: numeric)
  • Country Code: Phone country code. Context varies by activity. (Type: numeric)
  • SMS-In Activity: Received SMS in the Square ID during the interval from the nation in Country Code. (Type: numeric)
  • SMS-Out Activity: Sent SMS in the Square ID during the interval to the nation in Country Code. (Type: numeric)
  • Call-In Activity: Received calls in the Square ID during the interval from Country Code. (Type: numeric)
  • Call-Out Activity: Issued calls in the Square ID during the interval to Country Code. (Type: numeric)
  • Internet Traffic Activity: Performed internet traffic in the Square ID during the interval by Country Code. (Type: numeric)
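The Time Interval variable (milliseconds since the Unix epoch, 600,000 ms per interval) can be converted to human-readable timestamps as sketched below; the start value is a made-up example.

```python
from datetime import datetime, timezone

# Convert a Time Interval start (ms since Unix epoch) to a UTC datetime
# and compute the interval end (+600,000 ms = 10 minutes).
start_ms = 1383260400000                     # hypothetical example value
start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
end = datetime.fromtimestamp((start_ms + 600_000) / 1000, tz=timezone.utc)

print(start.isoformat(), end.isoformat())
```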
46
Q

Study Design

A
  • File Format: TSV (Tab-Separated Values).
  • Missing Data: If no activity exists, the value is omitted. Empty fields indicate no record.
  • Dataset Origin: Aggregated from Telecom Italia cellular network’s CDRs in Milano for:
    • Received/Sent SMS.
    • Incoming/Outgoing Calls.
    • Internet activity (generated on start, end, or when 15 mins/5 MB limits reached).
  • Purpose: Measure mobile phone network activity by SMS, call, and internet usage. SMS and call data are comparable; internet traffic is not.
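Since the file is a headerless TSV with omitted values for inactive combinations, a reader must supply the column names from the code book and treat empty fields as missing. The sketch below uses the standard library `csv` module on one made-up record; the column names are taken from the code book above.

```python
import csv
import io

# Column names supplied from the code book (the file itself has no header).
COLUMNS = ["square_id", "time_interval", "country_code",
           "sms_in", "sms_out", "call_in", "call_out", "internet"]

# Made-up record with gaps: empty fields mean "no activity recorded".
sample = "1\t1383260400000\t39\t0.2\t\t0.1\t\t\n"

reader = csv.DictReader(io.StringIO(sample), fieldnames=COLUMNS, delimiter="\t")
for row in reader:
    # Keep only the activity columns, mapping empty strings to None.
    activity = {k: float(v) if v else None for k, v in row.items()
                if k not in ("square_id", "time_interval", "country_code")}
    print(row["square_id"], activity)
```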