Data Management and Preprocessing Flashcards
Challenges to IoT data management -
data volume → organizations need an optimized storage infrastructure for the growing inflows of Big Data.
Time sensitivity, real-time vs. batch processing → incoming data has to be (re)organized at the storage facility in real time. The current alternative to this approach is batch processing, which brings its own challenges. (Batch processing: process the data in batches; real-time processing: process the data as a continuous flow.)
Heterogeneity → there is no standard format; devices harvest and stream data using many different protocols and standards.
Data flow controls → keeping track of data transformations is challenging, yet essential for achieving transparency and a clean data flow.
Data quality, transform for usability → missing data and outliers at the storage facility continue to be an issue today. A maximally transparent process should be implemented, and quality management must be automated.
What do we need to do to gain insights from data generated by IoT? -
Capture: capture and process data coming from sensors and other devices.
Interoperate: ensure interoperability of data coming from multiple sensors with multiple data formats and multiple protocols.
Analyze: analyze data in real-time to compare it with historical trends.
Act: ensure that appropriate responses are built in to operational application workflows and business processes.
Challenges in managing IoT data -
data security, data scalability, interoperability.
Methods for collecting data -
real-time data streaming and batch processing. Different factors influence the choice of data collection method, such as the nature of the application (see the sketch below).
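A minimal sketch of the two methods in Python; the sensor_stream generator is an invented stand-in for a real device feed:

```python
import random

def sensor_stream(n):
    """Yield one hypothetical temperature reading at a time."""
    for _ in range(n):
        yield round(random.uniform(18.0, 26.0), 1)

# Batch processing: collect everything first, then process in one pass.
batch = list(sensor_stream(100))
print("batch mean:", sum(batch) / len(batch))

# Real-time processing: handle each reading the moment it arrives,
# e.g. to raise an alert without waiting for the batch to complete.
total, count = 0.0, 0
for reading in sensor_stream(100):
    total += reading
    count += 1
    if reading > 25.0:
        print("alert: reading", reading, "above threshold")
print("streaming mean:", total / count)
```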
Data objects -
are often referred to as samples, examples, instances, data points, or objects. In a database, data objects are represented as data tuples, where rows correspond to the objects and columns correspond to attributes.
What is an attribute? -
An attribute is a data field representing a characteristic of a data object. Also called dimension (data warehousing), feature (ML), or variable (statistics). Example: Time, Acc. X, Acc. Y, Gyro. Z from an actigraphy device. Observed values of an attribute are called observations. A feature vector is a set of attributes describing an object.
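A concrete illustration in Python; the field names follow the actigraphy example above, and the values are invented:

```python
# One data object (a row / data tuple) described by its attributes.
sample = {
    "time": "2024-01-01T12:00:00",  # attribute: Time
    "acc_x": 0.12,                  # attribute: Acc. X (invented value)
    "acc_y": -0.03,                 # attribute: Acc. Y (invented value)
    "gyro_z": 1.7,                  # attribute: Gyro. Z (invented value)
}

# The feature vector is the set of attribute values describing this object.
feature_vector = [sample["acc_x"], sample["acc_y"], sample["gyro_z"]]
print(feature_vector)
```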
Different types of attribute data -
- Nominal (categorical): values lack a meaningful order; there is no quantitative relationship between categories.
- Ordinal: attributes can be rank-ordered, but the distances between values lack significance.
- Numeric: attributes can be rank-ordered,
• distances between values have meaning,
• it is possible to perform mathematical operations,
• can be interval-scaled or ratio-scaled. (A short pandas sketch follows.)
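These types map directly onto data types in libraries such as pandas; a minimal sketch, assuming pandas is installed:

```python
import pandas as pd

# Nominal: categories without any meaningful order.
color = pd.Categorical(["red", "blue", "red"], ordered=False)
print(list(color.categories))

# Ordinal: categories can be rank-ordered, but distances are meaningless.
size = pd.Categorical(["small", "large", "medium"],
                      categories=["small", "medium", "large"],
                      ordered=True)
print(size.min(), size.max())  # ordering is defined: small large

# Numeric: distances (and, on a ratio scale, ratios) are meaningful.
weight_kg = pd.Series([5.0, 10.0, 7.5])
print(weight_kg.max() / weight_kg.min())  # 2.0 -- twice as heavy
```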
Interval-scaled and ratio-scaled attribute -
- Interval-Scaled Attributes
• Measured on a scale with equal intervals (differences make sense).
• No true zero (zero does not mean “none”).
• Can be positive, negative, or zero.
• Examples: Celsius, Fahrenheit (0 °C does not mean “no temperature”).
- Ratio-Scaled Attributes
• Have a true zero (0 means “none”).
• Allow ordering, differences, and ratio comparisons (e.g., 10 kg is twice as heavy as 5 kg).
• Examples: Kelvin temperature, years of experience, weight, height, word count.
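A quick numeric check of why ratios are meaningful only on a ratio scale (273.15 is the standard Celsius-to-Kelvin offset):

```python
# Ratios are meaningful in Kelvin (true zero) but not in Celsius.
c1, c2 = 10.0, 20.0                 # Celsius: interval-scaled
k1, k2 = c1 + 273.15, c2 + 273.15   # Kelvin: ratio-scaled

print(c2 / c1)  # 2.0, but 20 °C is NOT "twice as hot" as 10 °C
print(k2 / k1)  # ~1.035, the physically meaningful ratio
```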
Discrete vs continuous attributes -
- Discrete Attributes (Fixed or countable values)
• Have a limited number of possible values (e.g., on/off, 1/2/3).
• Can be numbers or categories (e.g., device ID, product codes).
• Examples: Security modes, event counts, battery levels.
- Continuous Attributes (Unlimited possible values)
• Can take any value within a range (e.g., 23.5°C, 50.1% humidity).
• Usually measured and represented as decimal numbers.
• Examples: Temperature, humidity, pressure.
Statistical description of data -
- Why It’s Important:
• Helps understand data properties.
• Detects noise or outliers (unusual values).
- Measuring Central Tendency (Where most values are centered)
• Mean → Average value.
• Median → Middle value when sorted.
• Mode → Most frequent value.
- Measuring Dispersion (How spread out data is)
• Range → Difference between max and min.
• Quartiles → Divide data into four equal parts.
• Interquartile Range (IQR) → Spread of the middle 50% (Q3 - Q1).
• Variance → How far values are from the mean (average squared differences).
• Standard Deviation → Square root of variance (measures typical spread).
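All of these measures are one-liners with NumPy and the standard library; a sketch over made-up readings:

```python
import numpy as np
from statistics import mode

data = np.array([12, 15, 15, 17, 18, 21, 22, 25, 90])  # 90 looks unusual

# Central tendency
print("mean:  ", np.mean(data))
print("median:", np.median(data))
print("mode:  ", mode(data.tolist()))         # most frequent value: 15

# Dispersion
print("range: ", data.max() - data.min())
q1, q3 = np.percentile(data, [25, 75])
print("IQR:   ", q3 - q1)                     # spread of the middle 50%
print("var:   ", np.var(data))                # population variance
print("std:   ", np.std(data))                # square root of variance
```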
Why preprocess the data -
low-quality data will lead to low-quality mining results. Factors that compromise data quality: accuracy, completeness, consistency, timeliness, believability, and interpretability.
Data processing techniques -
- Data Cleaning (Fixing Errors & Missing Data)
• Ignore Missing Data: Easy, but you lose information.
• Enter Manually: Works for small datasets, but slow.
• Fill with a Global Constant: Simple, but may not be accurate.
• Fill with the Most Probable Value: most accurate, but computationally expensive.
- Data Integration (Merging Data from Different Sources)
• Entity Identification Problem: Matching the same entity across datasets.
• Avoiding Redundancy: Use correlation analysis to remove duplicates.
- Data Reduction (Making Data Smaller & Faster to Process)
• Why? Saves storage, speeds up processing, improves quality.
• Methods:
• Sampling: Select part of the data (random, stratified, systematic, cluster).
• Dimensionality Reduction: Shrink feature set while keeping key info.
• Methods: PCA, t-SNE, MDS, autoencoders (see the PCA sketch after this list).
- Data Transformation (Standardizing Data for Better Results)
• Normalization: Adjusts values to a common scale (important for machine learning).
• Methods: Min-Max Normalization, Z-Score Normalization (both shown in the sketch after this list).
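A minimal cleaning-and-normalization sketch in pandas, using an invented temperature column; the mean fill here is a simple stand-in for "most probable value" imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [21.0, np.nan, 23.5, 22.0, np.nan, 19.5]})

# Data cleaning: fill missing values with a global constant, or with the
# column mean as a simple approximation of the most probable value.
df["temp_const"] = df["temp"].fillna(0.0)
t = df["temp"].fillna(df["temp"].mean())

# Min-max normalization: rescale values to the [0, 1] range.
df["minmax"] = (t - t.min()) / (t.max() - t.min())

# Z-score normalization: zero mean, unit standard deviation.
df["zscore"] = (t - t.mean()) / t.std()

print(df)
```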
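And a dimensionality-reduction sketch with PCA, assuming scikit-learn is available; the data is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 objects, 4 features, pairwise highly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + rng.normal(scale=0.1, size=(100, 2))])

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # e.g. (100, 4) -> (100, 2)
print(pca.explained_variance_ratio_)
```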
How can we remove noise from noisy data? -
Binning → smooths data by consulting its neighborhood. Different methods exist, such as smoothing by bin means and smoothing by bin boundaries (a sketch of bin means follows below).
Regression → smoothing by fitting the data to a function: linear regression or non-linear (e.g., polynomial) regression.
Outlier analysis → detect outliers, e.g., via clustering; outliers may themselves be interesting for analysis.
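A minimal sketch of smoothing by bin means over a small sorted sample, using equal-frequency bins of size 3:

```python
# Smoothing by bin means: sort the data, partition it into
# equal-frequency bins, then replace each value by its bin's mean.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([round(mean, 1)] * len(bin_vals))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```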
Feature selection in data processing -
Feature selection can help minimize the number of input features and enhance a model’s precision and effectiveness by selecting the most pertinent and predictive information from a dataset. FS techniques: filter-based techniques, wrapper-based techniques, and embedded techniques.
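A filter-based sketch using scikit-learn's SelectKBest, one of several possible filter techniques; the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter-based FS: score each feature independently (ANOVA F-test)
# and keep the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)            # (200, 10) -> (200, 3)
print("kept feature indices:", selector.get_support(indices=True))
```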
How can we identify outliers? -
statistical description techniques and information visualization methods (e.g., scatterplots, boxplots, etc.), as in the sketch below.
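A common statistical rule flags values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR, the same fences a boxplot draws; a NumPy sketch:

```python
import numpy as np

data = np.array([12, 15, 15, 17, 18, 21, 22, 25, 90])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("fences:", lower, upper)   # values outside are flagged
print("outliers:", outliers)     # 90 falls above the upper fence
```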