Data Management and Preprocessing Flashcards

1
Q

Challenges to IoT data management -

A

Data volume → organizations need an optimized storage infrastructure for the growing inflows of Big Data.

Time sensitivity (real-time vs. batch processing) → incoming data has to be (re)organized at the storage facility in real time. The current alternative to this approach is batch processing, which brings its own challenges. (Batch processing: process the data in batches; real-time processing: process the data as continuous flows.)

Heterogeneity → there is no standard format; data is harvested and streamed using different protocols and standards.

Data flow controls → keeping track of data transformations is challenging but essential to achieve transparency and a clean data flow.

Data quality, transform for usability → missing data and outliers at the storage facility continue to be an issue today. A maximally transparent process should be implemented, and quality management must be automated.

2
Q

What do we need to do to gain insights from data generated by IoT? -

A

Capture: capture and process data coming from sensors and other devices.

Interoperate: ensure interoperability of data coming from multiple sensors with multiple data formats and multiple protocols.

Analyze: analyze data in real-time to compare it with historical trends.

Act: ensure that appropriate responses are built in to operational application workflows and business processes.

3
Q

Challenges in managing IoT data -

A

Data security, data scalability, and interoperability.

4
Q

Methods for collecting data -

A

Real-time data streaming and batch processing. Different factors influence the choice of data collection method, such as the nature of the application.

5
Q

Data objects -

A

are often referred to as samples, examples, instances, data points, or objects. In a database, data objects are represented as data tuples, where rows correspond to the objects and columns correspond to attributes.

6
Q

What is an attribute? -

A

An attribute is a data field representing a characteristic of a data object. Also called dimension (data warehousing), feature (ML), or variable (statistics). Example: Time, Acc. X, Acc. Y, Gyro. Z from an actigraphy device. Observed values of an attribute are called observations. A feature vector is a set of attributes describing an object.
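A minimal sketch of how one such data object and its feature vector might be represented in Python; the field names mirror the actigraphy example above and are purely illustrative:

```python
from typing import NamedTuple

# One data object (sample) from a hypothetical actigraphy device,
# represented as a feature vector: one value per attribute.
class ActigraphySample(NamedTuple):
    time: float    # seconds since start of recording
    acc_x: float   # accelerometer X axis
    acc_y: float   # accelerometer Y axis
    gyro_z: float  # gyroscope Z axis

sample = ActigraphySample(time=0.02, acc_x=0.11, acc_y=-0.34, gyro_z=1.27)

# A dataset is then a collection of such tuples: rows are objects,
# columns (fields) are attributes.
dataset = [sample, ActigraphySample(0.04, 0.09, -0.31, 1.19)]
```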

7
Q

Different types of attribute data -

A
  • Nominal (categorical): lack a meaningful order; no quantitative relationship between categories.
  • Ordinal: attributes can be rank-ordered, but the distances between values lack significance.
  • Numeric: attributes can be rank-ordered.
    • Distances between values have meaning.
    • Possible to perform mathematical operations.
    • Can be interval-scaled or ratio-scaled.
8
Q

Interval-scaled and ratio-scaled attribute -

A
  1. Interval-Scaled Attributes
    • Measured on a scale with equal intervals (differences make sense).
    • No true zero (zero does not mean “none”).
    • Can be positive, negative, or zero.
    • Examples: Celsius, Fahrenheit (0°C doesn’t mean no temperature).
  2. Ratio-Scaled Attributes
    • Have a true zero (0 means “none”).
    • Allow ordering, differences, and ratio comparisons (e.g., 10 kg is twice as heavy as 5 kg).
    • Examples: Kelvin temperature, years of experience, weight, height, word count.
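A small sketch of why the distinction matters: ratios are only meaningful on a ratio scale, so comparing temperatures as ratios requires converting interval-scaled Celsius to ratio-scaled Kelvin first.

```python
# Interval-scaled: 20 °C is NOT "twice as hot" as 10 °C, because 0 °C
# is not a true zero. Convert to Kelvin (true zero) before taking ratios.
def celsius_to_kelvin(c: float) -> float:
    return c + 273.15

naive_ratio = 20 / 10                                        # 2.0 — misleading
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)   # ≈ 1.035
```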
9
Q

Discrete vs continuous attributes -

A
  1. Discrete Attributes (Fixed or countable values)
    • Have a limited number of possible values (e.g., on/off, 1/2/3).
    • Can be numbers or categories (e.g., device ID, product codes).
    • Examples: Security modes, event counts, battery levels.
  2. Continuous Attributes (Unlimited possible values)
    • Can take any value within a range (e.g., 23.5°C, 50.1% humidity).
    • Usually measured and represented as decimal numbers.
    • Examples: Temperature, humidity, pressure.
10
Q

Statistical description of data -

A
  1. Why It’s Important:
    • Helps understand data properties.
    • Detects noise or outliers (unusual values).
  2. Measuring Central Tendency (Where most values are centered)
    • Mean → Average value.
    • Median → Middle value when sorted.
    • Mode → Most frequent value.
  3. Measuring Dispersion (How spread out data is)
    • Range → Difference between max and min.
    • Quartiles → Divide data into four equal parts.
    • Interquartile Range (IQR) → Spread of the middle 50% (Q3 - Q1).
    • Variance → How far values are from the mean (average squared differences).
    • Standard Deviation → Square root of variance (measures typical spread).
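The measures above can be sketched with Python's standard-library `statistics` module (the sample values are made up for illustration):

```python
import statistics

data = [4, 8, 15, 16, 23, 42, 8]

# Central tendency
mean = statistics.mean(data)            # average value
median = statistics.median(data)        # middle value when sorted
mode = statistics.mode(data)            # most frequent value

# Dispersion
data_range = max(data) - min(data)      # max - min
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles: 4 equal parts
iqr = q3 - q1                           # spread of the middle 50%
variance = statistics.pvariance(data)   # avg squared distance from mean
std_dev = statistics.pstdev(data)       # square root of variance
```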
11
Q

Why preprocess the data -

A

Low-quality data will lead to low-quality mining results. Factors that compromise data quality: accuracy, completeness, consistency, timeliness, believability, and interpretability.

12
Q

Data processing techniques -

A
  1. Data Cleaning (Fixing Errors & Missing Data)
    • Ignore Missing Data: Easy, but you lose information.
    • Enter Manually: Works for small datasets, but slow.
    • Fill with a Global Constant: Simple, but may not be accurate.
    • Fill with the Most Probable Value: most accurate, but computationally expensive.
  2. Data Integration (Merging Data from Different Sources)
    • Entity Identification Problem: Matching the same entity across datasets.
    • Avoiding Redundancy: Use correlation analysis to remove duplicates.
  3. Data Reduction (Making Data Smaller & Faster to Process)
    • Why? Saves storage, speeds up processing, improves quality.
    • Methods:
      • Sampling: select part of the data (random, stratified, systematic, cluster).
      • Dimensionality reduction: shrink the feature set while keeping key information (e.g., PCA, t-SNE, MDS, autoencoders).
  4. Data Transformation (Standardizing Data for Better Results)
    • Normalization: Adjusts values to a common scale (important for machine learning).
    • Methods: Min-Max normalization, Z-score normalization.
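The two normalization methods named above can be sketched with the standard library (the temperature values are illustrative):

```python
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-Max normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Z-score normalization: center on the mean, scale by the std deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

temps = [18.0, 21.5, 19.2, 25.0, 22.3]
scaled = min_max_normalize(temps)   # values now lie in [0, 1]
z = z_score_normalize(temps)        # mean ≈ 0, std ≈ 1
```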
13
Q

How can we remove noise in noisy data? -

A

Binning → smooths data by consulting its neighbors. Methods include smoothing by bin means and smoothing by bin boundaries.

Regression → smoothing by fitting the data to a function: linear regression or non-linear regression (e.g., polynomial regression).

Outlier analysis → clustering; note that outliers may themselves be interesting for analysis.
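Smoothing by bin means can be sketched as follows, assuming equal-frequency bins (each value is replaced by the mean of its bin):

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into equal-frequency bins, and replace
    each value with its bin's mean to smooth out noise."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
# bins: [4, 8, 15], [21, 21, 24], [25, 28, 34] → bin means: 9, 22, 29
smoothed = smooth_by_bin_means(prices, bin_size=3)
```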

14
Q

Feature selection in data processing -

A

Feature selection can help minimize the number of input features and enhance a model’s precision and effectiveness by selecting the most pertinent and predictive information from a dataset. FS techniques: filter-based techniques, wrapper-based techniques, and embedded techniques.
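One simple filter-based technique is a variance threshold; a minimal stdlib sketch (the data matrix and threshold are illustrative):

```python
import statistics

def variance_threshold(rows, threshold=0.0):
    """Filter-based feature selection: keep the indices of columns whose
    variance exceeds the threshold (near-constant features carry little
    predictive information)."""
    n_features = len(rows[0])
    keep = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        if statistics.pvariance(column) > threshold:
            keep.append(j)
    return keep

# Feature 1 is constant across all rows, so a zero-variance filter drops it.
X = [
    [0.0, 1.0, 10.0],
    [0.5, 1.0, 20.0],
    [1.0, 1.0, 30.0],
]
selected = variance_threshold(X)  # → [0, 2]
```

Filter methods like this score features independently of any model; wrapper and embedded techniques instead evaluate feature subsets through the model itself.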

15
Q

How can we identify outliers? -

A

Statistical description techniques and information visualization methods (e.g., scatterplots, boxplots).
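The statistical rule behind a boxplot's whiskers can be sketched directly; the sensor readings below are illustrative:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the same rule a
    boxplot uses to draw points beyond its whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

readings = [21.0, 21.4, 20.9, 21.2, 21.1, 35.0, 21.3]
outliers = iqr_outliers(readings)  # the spike of 35.0 stands out
```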
