Introduction Flashcards

1
Q

Data

A

Found everywhere and generated at an unprecedented rate; a fundamental component of society.

2
Q

Data in our lives

A
  • Smart Home Devices
  • Fitness Trackers (Track Biometric data)
  • Financial Transactions
3
Q

Types of Data

A

Structured, Unstructured, Semi-Structured

4
Q

Structured Data

A

Well organized, consistently formatted, and easily searchable (e.g., financial records). Usually stored in an RDBMS or in files such as CSV.

5
Q

Unstructured Data

A

Unorganized data with no predefined format, appearing in many forms (e.g., social media posts, emails). Usually stored in file systems or content management systems that preserve the original form.

6
Q

Semi-Structured Data

A

A combination of the two: data that doesn’t follow a rigid format but still has some level of structure, typically a mix of fixed and variable fields. Commonly found in XML or JSON files.
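As a small illustration of the "mix of fixed and variable fields" idea, the sketch below inspects two made-up JSON records: both share the fixed fields `id` and `name`, while the remaining fields vary per record.

```python
import json

# Hypothetical semi-structured records: every record has the fixed fields
# "id" and "name", but the remaining fields differ from record to record.
records = [
    json.loads('{"id": 1, "name": "Ana", "email": "ana@example.com"}'),
    json.loads('{"id": 2, "name": "Ben", "phone": "555-0100", "city": "Milan"}'),
]

fixed = {"id", "name"}
for rec in records:
    variable = set(rec) - fixed          # fields not shared by all records
    print(rec["id"], sorted(variable))
```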

7
Q

Qualitative Data example

A

Gender, Nationality

8
Q

Quantitative Data example

A

Height, Weight

9
Q

Raw Data

A

Data in its original form, hard to use directly for analysis. Raw data may only need to be processed once.

10
Q

Processed Data

A

Data that is ready for analysis, processing can include merging, subsetting, transforming etc. All steps should be recorded.

11
Q

Raw Data Example

A

Machine-generated ASCII or binary files, unformatted Excel files, API responses.

12
Q

Accuracy

A

The measure of data quality that ensures data is correct, free from errors, and represents the real-world value accurately.

13
Q

Completeness

A

Indicates whether all required data is recorded or if some is missing/unavailable.

14
Q

Consistency

A

Ensures uniformity across data. Examples of issues include partially modified records or dangling updates.

15
Q

Timeliness

A

Refers to whether data is updated promptly to reflect the current state.

16
Q

Believability

A

Assesses how trustworthy or credible the data is.

17
Q

Interpretability

A

Reflects how easily the data can be understood by users.

18
Q

Data Consolidation Process

A
  1. Moving data: Ensuring all data is gathered from different sources into a unified location.
  2. Making it consistent: Aligning data formats and resolving inconsistencies.
  3. Cleaning data: Removing errors, duplicates, and filling in missing values where possible.
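The three steps above can be sketched in plain Python; the two source lists, field names, and the "unknown" fill value are all made up for illustration.

```python
# Hypothetical records from two source systems with inconsistent casing,
# a duplicate, and a missing value.
source_a = [{"id": "1", "city": "Louisville"}, {"id": "2", "city": "Lexington"}]
source_b = [{"id": "2", "city": "LEXINGTON"}, {"id": "3", "city": None}]

# 1. Moving data: gather everything into a unified location.
combined = source_a + source_b

# 2. Making it consistent: align formats (here, city name casing).
for row in combined:
    if row["city"]:
        row["city"] = row["city"].title()

# 3. Cleaning: drop duplicates and fill missing values where possible.
seen, cleaned = set(), []
for row in combined:
    key = (row["id"], row["city"])
    if key not in seen:
        seen.add(key)
        cleaned.append({"id": row["id"], "city": row["city"] or "unknown"})

print(cleaned)
```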
19
Q

Why is Data Consolidation Needed?

A
  • Data is often stored in different formats, making integration difficult.
  • Data is frequently inconsistent across sources, causing discrepancies.
  • Data may be dirty due to:
    • Internal inconsistencies (e.g., conflicting records).
    • Missing values or blank fields.
    • Potentially incorrect data caused by:
      • Faulty instruments.
      • Human or system errors.
      • Transmission errors during data movement.
20
Q

Disparate Data

A

Data is often stored in diverse locations and formats, which may include:
- Relational Databases: Used in operational systems for structured data.
- XML Files: Common in web services for hierarchical data.
- Desktop Databases: Such as Microsoft Access.
- Spreadsheets: Examples include Microsoft Excel.
- JSON: Popular for semi-structured or API-related data.

21
Q

Challenges with Disparate Data

A
  • Data may reside on different operating systems
  • Databases may operate on varying hardware platforms
  • Integration of such varied data requires specialized tools and processes.
22
Q

Causes of Inconsistencies

A
  • Faulty instruments, human/computer errors, transmission errors, or business requirements.
  • Examples of discrepancies:
    • Two plants use different part numbers for the same item.
    • Systems using different formats for True/False (e.g., 1/0, T/F, Y/N).
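The True/False discrepancy above is commonly resolved with a normalization map; the function name and mapping below are a hypothetical sketch, not a standard library API.

```python
# Reconcile the different True/False encodings (1/0, T/F, Y/N) used by
# different systems into Python booleans.
TRUTHY = {"1": True, "T": True, "Y": True,
          "0": False, "F": False, "N": False}

def normalize_flag(value):
    """Map a system-specific flag value to a canonical boolean."""
    return TRUTHY[str(value).strip().upper()]

print(normalize_flag("y"), normalize_flag(0), normalize_flag("T"))
```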
23
Q

Other Data Quality Issues

A
  • Free-form text entry:
    • Example: Same city entered as Louisville, Lewisville, and Luisville.
    • Cleaning routines must handle variations in bad data.
  • Incomplete Data: Missing attribute values or only containing aggregate data.
    • Example: Occupation = “” or Gender = “Unknown”.
  • Noisy Data: Contains noise, errors, or outliers.
    • Example: Salary = -10 (invalid).
24
Q

Causes of Missing Data

A
  • Equipment malfunction.
  • Deleted data due to inconsistencies.
  • Misunderstandings or assumptions during data entry.
  • Data not considered important at the time.
  • Lack of historical records or change tracking.
25
Q

Handling Missing Data

A
  1. Ignore the tuple: Common for classification tasks when class labels are missing. Not effective when missing values vary significantly.
  2. Fill in missing values:
    • Manually: Labor-intensive and impractical for large datasets.
    • Automatically:
      • Global constant: e.g., "unknown".
      • Attribute mean: Fill with mean value.
      • Class-specific attribute mean: Mean of samples from the same class.
      • Most probable value: Using inference (e.g., Bayesian formula, decision trees).
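The attribute-mean and class-specific-mean strategies can be sketched as below; the dataset, class labels, and salary values are invented for illustration.

```python
# Hypothetical rows with one missing "salary" value to fill automatically.
rows = [
    {"class": "A", "salary": 50},
    {"class": "A", "salary": 70},
    {"class": "B", "salary": 90},
    {"class": "A", "salary": None},   # missing value
]

# Attribute mean: mean over all non-missing values.
known = [r["salary"] for r in rows if r["salary"] is not None]
attr_mean = sum(known) / len(known)

# Class-specific attribute mean: only samples from the same class ("A").
same = [r["salary"] for r in rows if r["class"] == "A" and r["salary"] is not None]
class_mean = sum(same) / len(same)

rows[3]["salary"] = class_mean
print(attr_mean, rows[3]["salary"])
```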
26
Q

Handling Noisy Data

A
  1. Binning
  2. Regression
  3. Clustering
  4. Hybrid Methods
27
Q

The Data Cleaning Process

A
  1. Data Discrepancy Detection
  2. Data Migration and Integration
28
Q

Binning

A
  • Sort data into bins, then smooth using:
    - Bin means.
    - Bin medians.
    - Bin boundaries.
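Smoothing by bin means can be sketched as follows, assuming equal-frequency bins of three values; the sample data is made up.

```python
# Sort values, partition into equal-frequency bins, then replace each
# value with its bin's mean.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))   # every value becomes the bin mean

print(smoothed)
```

Smoothing by bin medians or bin boundaries follows the same pattern, replacing the mean with the median or the nearer boundary value.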
29
Q

Regression

A

Fit data to regression functions for smoothing.

30
Q

Clustering

A

Detect and remove outliers as small clusters.

31
Q

Hybrid Methods

A

Combine automated detection with human inspection (e.g., flag suspicious values for manual review).

32
Q

Data Discrepancy Detection

A
  • Use metadata (e.g., domain, range, dependency rules).
  • Verify field overloading, uniqueness rules, null values, etc.
  • Use tools for auditing:
    • Data scrubbing: Leverage domain knowledge (e.g., postal codes, spell-check).
    • Data auditing: Discover rules and relationships (e.g., correlations, clustering).
33
Q

Data Migration and Integration

A
  • Data Migration Tools: Specify transformations for uniformity.
  • ETL Tools: Graphical interfaces to extract, transform, and load data.
  • Integrated Processes: Iterative and interactive approaches to improve accuracy and usability.
34
Q

Sharing Data

A

To effectively share data, ensure the following components are included:

  1. Raw Data: The original dataset in its unaltered form.
  2. Tidy Dataset: A processed version of the raw data with organized and meaningful structure.
  3. Code Book: A document describing variables, values, and methods in the tidy dataset.
  4. Step-by-Step Procedure: Detailed instructions to transform raw data into the tidy dataset and code book.
35
Q

The Code Book

A

Essential for describing variables and providing context for the dataset.

36
Q

Sections of the Code Book

A
  1. Variable Metadata:
    • Describe all variables in the raw dataset, including their units.
    • Document any summary choices made (e.g., averages, medians).
  2. Study Design:
    • Thoroughly explain how the data was collected.
    • Include experimental study design details if applicable.
    • This section must be comprehensive and clear.
  3. File Format:
    • Provide the code book as a standard document (e.g., .txt, .docx, .pdf).
37
Q

The Instruction List

A

A step-by-step guide to replicate the data cleaning and tidying process.

38
Q

Components of the Instruction List

A
  1. Automated Scripts: Prefer scripts to automate the procedure for accuracy and reproducibility.
  2. Inputs and Outputs:
    • Input: Raw data.
    • Output: Processed, tidy data.
  3. Manual Steps:
    • Clearly explain any manual interventions.
    • Provide verification tasks and checklists to ensure correctness at each stage.
  4. Intermediate Files:
    • Explain the purpose of any intermediate files generated during the process.
39
Q

The FASTA Format

A
  • The header line begins with > and contains:
    • gi: GenBank Identifier (e.g., 9626243).
    • ref: NCBI Reference Sequence (e.g., NC_001416.1).
    • Description: Information about the sequence (e.g., “complete genome of Enterobacteria phage lambda virus”).
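A header with the fields above can be pulled apart by splitting on the `|` delimiter; the sketch below assumes the common NCBI-style `>gi|…|ref|…| description` layout.

```python
# Parse a pipe-delimited NCBI-style FASTA header line.
header = ">gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome"

fields = header[1:].split("|")     # drop the leading ">", split on "|"
gi = fields[1]                     # GenBank identifier
ref = fields[3]                    # NCBI reference sequence
description = fields[4].strip()    # free-text description

print(gi, ref, description)
```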
40
Q

What a typical genome sequence consists of

A
  • A: Adenosine
  • C: Cytidine
  • G: Guanine
  • T: Thymidine
  • N: Any nucleotide (A, G, C, T).
41
Q

Transcribing the Code Book for Genome Data

A

Your Code Book should include:

  1. Organism Name: The name of the organism (e.g., Enterobacteria phage lambda).
  2. GenBank Identifier: If available, provide the GenBank ID (e.g., 9626243).
  3. NCBI Reference Sequence: Reference sequence, if available (e.g., NC_001416.1).
  4. Subsequences: For subsequences, reference UCSC/Ensembl exon databases if applicable.
  5. Study Variables:
    • Clearly list variables relevant to the study.
    • Example: For patient data, include attributes such as age, conditions (list), treatments (name, dose, unit).
42
Q

Study Design Section

A

Include a description of the technique used to collect data (e.g., sequencing methods).

43
Q

Instruction List for Data Preparation

A
  • Include scripts or manual steps used for tidying the data.
  • Specify the file formats used to share data (e.g., .csv, .fasta).
  • Ensure reproducibility by providing detailed explanations of every processing step.
44
Q

Call Detail Records (CDRs) Key Observations

A
  • Data lacks headers.
  • Missing data and gaps are present.
  • Entire records may be omitted if no activity exists for a combination of SquareID, Time Interval, and Country Code.
45
Q

Code Book Variables

A
  • Square ID: ID of a square in the Milano GRID. (Type: numeric)
  • Time Interval: Start of time interval (milliseconds since Unix Epoch). End = Start + 600,000 ms (10 mins). (Type: numeric)
  • Country Code: Phone country code. Context varies by activity. (Type: numeric)
  • SMS-In Activity: Received SMS in the Square ID during the interval from the nation in Country Code. (Type: numeric)
  • SMS-Out Activity: Sent SMS in the Square ID during the interval to the nation in Country Code. (Type: numeric)
  • Call-In Activity: Received calls in the Square ID during the interval from Country Code. (Type: numeric)
  • Call-Out Activity: Issued calls in the Square ID during the interval to Country Code. (Type: numeric)
  • Internet Traffic Activity: Performed internet traffic in the Square ID during the interval by Country Code. (Type: numeric)
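The Time Interval variable (milliseconds since the Unix epoch, 600,000 ms per interval) can be converted to human-readable timestamps as sketched below; the start value is a made-up example.

```python
from datetime import datetime, timezone

# Convert a Time Interval start (ms since Unix epoch) to a UTC datetime
# and compute the interval end (+600,000 ms = 10 minutes).
start_ms = 1383260400000                     # hypothetical example value
start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
end = datetime.fromtimestamp((start_ms + 600_000) / 1000, tz=timezone.utc)

print(start.isoformat(), end.isoformat())
```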
46
Q

Study Design

A
  • File Format: TSV (Tab-Separated Values).
  • Missing Data: If no activity exists, the value is omitted. Empty fields indicate no record.
  • Dataset Origin: Aggregated from Telecom Italia cellular network’s CDRs in Milano for:
    • Received/Sent SMS.
    • Incoming/Outgoing Calls.
    • Internet activity (generated on start, end, or when 15 mins/5 MB limits reached).
  • Purpose: Measure mobile phone network activity by SMS, call, and internet usage. SMS and call data are comparable; internet traffic is not.
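Since the file is a headerless TSV with omitted values for inactive combinations, a reader must supply the column names from the code book and treat empty fields as missing. The sketch below uses the standard library `csv` module on one made-up record; the column names are taken from the code book above.

```python
import csv
import io

# Column names supplied from the code book (the file itself has no header).
COLUMNS = ["square_id", "time_interval", "country_code",
           "sms_in", "sms_out", "call_in", "call_out", "internet"]

# Made-up record with gaps: empty fields mean "no activity recorded".
sample = "1\t1383260400000\t39\t0.2\t\t0.1\t\t\n"

reader = csv.DictReader(io.StringIO(sample), fieldnames=COLUMNS, delimiter="\t")
for row in reader:
    # Keep only the activity columns, mapping empty strings to None.
    activity = {k: float(v) if v else None for k, v in row.items()
                if k not in ("square_id", "time_interval", "country_code")}
    print(row["square_id"], activity)
```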