Data Management and Preprocessing Flashcards
Challenges to IoT data management -
data volume → organizations need an optimized storage infrastructure for the growing inflows of Big Data.
Time sensitivity, real-time vs. batch processing → incoming data has to be (re)organized at the storage facility in real time. The current alternative to this approach is batch processing, which brings its own challenges. (Batch processing: process the data in batches; real-time processing: process the data as a continuous flow.)
Heterogeneity → there is no standard format; devices harvest and stream data using many different protocols and standards.
Data flow controls → keeping track of data transformations is challenging, yet essential for achieving transparency and a clean data flow.
Data quality, transform for usability → missing data and outliers at the storage facility continue to be an issue today. A maximally transparent process should be implemented, and quality management must be automated.
What do we need to do to gain insights from data generated by IoT? -
Capture: capture and process data coming from sensors and other devices.
Interoperate: ensure interoperability of data coming from multiple sensors with multiple data formats and multiple protocols.
Analyze: analyze data in real-time to compare it with historical trends.
Act: ensure that appropriate responses are built in to operational application workflows and business processes.
Challenges in managing IoT data -
data security, data scalability, interoperability.
Methods for collecting data -
real-time data streaming and batch processing. Different factors influence the choice of data collection method, such as the nature of the application (see the sketch below).
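A minimal sketch of the two methods in Python; the sensor_stream generator is an invented stand-in for a real device feed:

```python
import random

def sensor_stream(n):
    """Yield one hypothetical temperature reading at a time."""
    for _ in range(n):
        yield round(random.uniform(18.0, 26.0), 1)

# Batch processing: collect everything first, then process in one pass.
batch = list(sensor_stream(100))
print("batch mean:", sum(batch) / len(batch))

# Real-time processing: handle each reading the moment it arrives,
# e.g. to raise an alert without waiting for the batch to complete.
total, count = 0.0, 0
for reading in sensor_stream(100):
    total += reading
    count += 1
    if reading > 25.0:
        print("alert: reading", reading, "above threshold")
print("streaming mean:", total / count)
```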
Data objects -
are often referred to as samples, examples, instances, data points, or objects. In a database, data objects are represented as data tuples, where rows correspond to the objects and columns correspond to attributes.
What is an attribute? -
An attribute is a data field representing a characteristic of a data object. Also called dimension (data warehousing), feature (ML), or variable (statistics). Example: Time, Acc. X, Acc. Y, Gyro. Z from an actigraphy device. Observed values of an attribute are called observations. A feature vector is a set of attributes describing an object.
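A concrete illustration in Python; the field names follow the actigraphy example above, and the values are invented:

```python
# One data object (a row / data tuple) described by its attributes.
sample = {
    "time": "2024-01-01T12:00:00",  # attribute: Time
    "acc_x": 0.12,                  # attribute: Acc. X (invented value)
    "acc_y": -0.03,                 # attribute: Acc. Y (invented value)
    "gyro_z": 1.7,                  # attribute: Gyro. Z (invented value)
}

# The feature vector is the set of attribute values describing this object.
feature_vector = [sample["acc_x"], sample["acc_y"], sample["gyro_z"]]
print(feature_vector)
```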
Different types of attribute data -
- Nominal (categorical): values lack a meaningful order; there is no quantitative relationship between categories.
- Ordinal: attributes can be rank-ordered, but the distances between values lack significance.
- Numeric: attributes can be rank-ordered,
• distances between values have meaning,
• it is possible to perform mathematical operations,
• can be interval-scaled or ratio-scaled. (A short pandas sketch follows.)
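These types map directly onto data types in libraries such as pandas; a minimal sketch, assuming pandas is installed:

```python
import pandas as pd

# Nominal: categories without any meaningful order.
color = pd.Categorical(["red", "blue", "red"], ordered=False)
print(list(color.categories))

# Ordinal: categories can be rank-ordered, but distances are meaningless.
size = pd.Categorical(["small", "large", "medium"],
                      categories=["small", "medium", "large"],
                      ordered=True)
print(size.min(), size.max())  # ordering is defined: small large

# Numeric: distances (and, on a ratio scale, ratios) are meaningful.
weight_kg = pd.Series([5.0, 10.0, 7.5])
print(weight_kg.max() / weight_kg.min())  # 2.0 -- twice as heavy
```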
Interval-scaled and ratio-scaled attribute -
- Interval-Scaled Attributes
• Measured on a scale with equal intervals (differences make sense).
• No true zero (zero does not mean “none”).
• Can be positive, negative, or zero.
• Examples: Celsius, Fahrenheit (0 °C does not mean “no temperature”).
- Ratio-Scaled Attributes
• Have a true zero (0 means “none”).
• Allow ordering, differences, and ratio comparisons (e.g., 10 kg is twice as heavy as 5 kg).
• Examples: Kelvin temperature, years of experience, weight, height, word count.
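A quick numeric check of why ratios are meaningful only on a ratio scale (273.15 is the standard Celsius-to-Kelvin offset):

```python
# Ratios are meaningful in Kelvin (true zero) but not in Celsius.
c1, c2 = 10.0, 20.0                 # Celsius: interval-scaled
k1, k2 = c1 + 273.15, c2 + 273.15   # Kelvin: ratio-scaled

print(c2 / c1)  # 2.0, but 20 °C is NOT "twice as hot" as 10 °C
print(k2 / k1)  # ~1.035, the physically meaningful ratio
```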
Discrete vs continuous attributes -
- Discrete Attributes (Fixed or countable values)
• Have a limited number of possible values (e.g., on/off, 1/2/3).
• Can be numbers or categories (e.g., device ID, product codes).
• Examples: Security modes, event counts, battery levels.
- Continuous Attributes (Unlimited possible values)
• Can take any value within a range (e.g., 23.5°C, 50.1% humidity).
• Usually measured and represented as decimal numbers.
• Examples: Temperature, humidity, pressure.
Statistical description of data -
- Why It’s Important:
• Helps understand data properties.
• Detects noise or outliers (unusual values).
- Measuring Central Tendency (Where most values are centered)
• Mean → Average value.
• Median → Middle value when sorted.
• Mode → Most frequent value.
- Measuring Dispersion (How spread out data is)
• Range → Difference between max and min.
• Quartiles → Divide data into four equal parts.
• Interquartile Range (IQR) → Spread of the middle 50% (Q3 - Q1).
• Variance → How far values are from the mean (average squared differences).
• Standard Deviation → Square root of variance (measures typical spread).
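All of these measures are one-liners with NumPy and the standard library; a sketch over made-up readings:

```python
import numpy as np
from statistics import mode

data = np.array([12, 15, 15, 17, 18, 21, 22, 25, 90])  # 90 looks unusual

# Central tendency
print("mean:  ", np.mean(data))
print("median:", np.median(data))
print("mode:  ", mode(data.tolist()))         # most frequent value: 15

# Dispersion
print("range: ", data.max() - data.min())
q1, q3 = np.percentile(data, [25, 75])
print("IQR:   ", q3 - q1)                     # spread of the middle 50%
print("var:   ", np.var(data))                # population variance
print("std:   ", np.std(data))                # square root of variance
```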
Why preprocess the data -
low-quality data will lead to low-quality mining results. Factors that compromise data quality: accuracy, completeness, consistency, timeliness, believability, and interpretability.
Data processing techniques -
- Data Cleaning (Fixing Errors & Missing Data)
• Ignore Missing Data: Easy, but you lose information.
• Enter Manually: Works for small datasets, but slow.
• Fill with a Global Constant: Simple, but may not be accurate.
• Fill with the Most Probable Value: most accurate, but computationally expensive.
- Data Integration (Merging Data from Different Sources)
• Entity Identification Problem: Matching the same entity across datasets.
• Avoiding Redundancy: Use correlation analysis to remove duplicates.
- Data Reduction (Making Data Smaller & Faster to Process)
• Why? Saves storage, speeds up processing, improves quality.
• Methods:
• Sampling: Select part of the data (random, stratified, systematic, cluster).
• Dimensionality Reduction: Shrink feature set while keeping key info.
• Methods: PCA, t-SNE, MDS, autoencoders (see the PCA sketch after this list).
- Data Transformation (Standardizing Data for Better Results)
• Normalization: Adjusts values to a common scale (important for machine learning).
• Methods: Min-Max Normalization, Z-Score Normalization (both shown in the sketch after this list).
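A minimal cleaning-and-normalization sketch in pandas, using an invented temperature column; the mean fill here is a simple stand-in for "most probable value" imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [21.0, np.nan, 23.5, 22.0, np.nan, 19.5]})

# Data cleaning: fill missing values with a global constant, or with the
# column mean as a simple approximation of the most probable value.
df["temp_const"] = df["temp"].fillna(0.0)
t = df["temp"].fillna(df["temp"].mean())

# Min-max normalization: rescale values to the [0, 1] range.
df["minmax"] = (t - t.min()) / (t.max() - t.min())

# Z-score normalization: zero mean, unit standard deviation.
df["zscore"] = (t - t.mean()) / t.std()

print(df)
```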
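And a dimensionality-reduction sketch with PCA, assuming scikit-learn is available; the data is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 objects, 4 features, pairwise highly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + rng.normal(scale=0.1, size=(100, 2))])

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # e.g. (100, 4) -> (100, 2)
print(pca.explained_variance_ratio_)
```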
How can we remove noise from noisy data? -
Binning → smooths data by consulting its neighborhood. Different methods exist, such as smoothing by bin means and smoothing by bin boundaries (a sketch of bin means follows below).
Regression → smoothing by fitting the data to a function: linear regression or non-linear (e.g., polynomial) regression.
Outlier analysis → detect outliers, e.g., via clustering; outliers may themselves be interesting for analysis.
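A minimal sketch of smoothing by bin means over a small sorted sample, using equal-frequency bins of size 3:

```python
# Smoothing by bin means: sort the data, partition it into
# equal-frequency bins, then replace each value by its bin's mean.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_vals = data[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([round(mean, 1)] * len(bin_vals))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```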
Feature selection in data processing -
Feature selection can help minimize the number of input features and enhance a model’s precision and effectiveness by selecting the most pertinent and predictive information from a dataset. FS techniques: filter-based techniques, wrapper-based techniques, and embedded techniques.
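A filter-based sketch using scikit-learn's SelectKBest, one of several possible filter techniques; the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter-based FS: score each feature independently (ANOVA F-test)
# and keep the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)            # (200, 10) -> (200, 3)
print("kept feature indices:", selector.get_support(indices=True))
```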
How can we identify outliers? -
statistical description techniques and information visualization methods (e.g., scatterplots, boxplots, etc.), as in the sketch below.
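A common statistical rule flags values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR, the same fences a boxplot draws; a NumPy sketch:

```python
import numpy as np

data = np.array([12, 15, 15, 17, 18, 21, 22, 25, 90])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("fences:", lower, upper)   # values outside are flagged
print("outliers:", outliers)     # 90 falls above the upper fence
```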