Flashcards (Brainscape)

1
Q

Which of the following is NOT a likely advantage of a persuasive virtual assistant over human persuaders?
1. Can be simultaneously used by lots of users.
2. Ability to continuously monitor the user 24/7 without getting fatigued.
3. Superior understanding of psychological nuances and social contexts.
4. Ability to process large amounts of health data for personalized analysis.

A

3

Context data must be coupled with the ability to interpret it. While virtual assistants can handle data processing and analysis of the context data, they still lack human-like common sense to fully comprehend psychological nuances, social norms and contexts that human persuaders would naturally possess. This is exactly the reason why we need “human-in-the-loop”. Therefore, 3 is the correct answer.

2
Q

Which of the following are given as examples of a context-aware app where the app is designed to automatically execute a command when certain contexts are met? (Select all that apply)
1. Active Badge system triggering reminders based on location
2. Geonotes for leaving location-based annotations
3. Siren app generating alerts for firefighters based on contexts
4. Cyberguide mobile tour guide providing contextual information
5. Automatic brightness adjustment based on ambient light levels

A

1 & 3

In the design space of context-aware applications, an app that automatically executes a command when certain contexts are met falls under “Context-triggered actions”, whose design choices are “Automatic” and “Command”. We therefore look for examples that match those two choices. Options 1 and 3 are correct: the Active Badge system and the Siren app use rules to trigger actions based on contextual conditions, so they sit in the “Automatic”/“Command” cell of the design space. The other options are incorrect because they make a different choice on at least one of the two design dimensions.
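A minimal sketch of what an “Automatic”/“Command” context-triggered action can look like when expressed as rules, assuming hypothetical context fields (role, temperature_c, location) and a stand-in send_alert function rather than any real notification API:

```python
# Context-triggered actions: each rule pairs a condition over the current
# context with a command that runs automatically when the condition holds.

def send_alert(message: str) -> None:
    # Stand-in for a real notification channel (pager, push message, ...).
    print(f"[ALERT] {message}")

RULES = [
    # Siren-style rule: alert firefighters when it gets dangerously hot nearby.
    (lambda ctx: ctx["role"] == "firefighter" and ctx["temperature_c"] > 60,
     lambda ctx: send_alert(f"High heat near {ctx['location']}")),
    # Active Badge-style rule: location-based reminder.
    (lambda ctx: ctx["location"] == "office_entrance",
     lambda ctx: send_alert("Reminder: drop off the badge report")),
]

def on_context_update(ctx: dict) -> None:
    """Evaluate every rule against the new context and run matching commands."""
    for condition, command in RULES:
        if condition(ctx):
            command(ctx)

# The command fires automatically; the user never issues it explicitly.
on_context_update({"role": "firefighter", "temperature_c": 75, "location": "basement"})
```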

3
Q

According to Freeman Dyson, which of the following will contribute most to the new trends in science?
a) The development of new concepts
b) The introduction of new tools
c) Collaboration between different scientific disciplines
d) Government and institutional funding policies

A

b

According to Freeman Dyson, the aspect that most influences new directions in science is the introduction of new tools. Dyson, a renowned theoretical physicist and mathematician, emphasized the significant role that the development of new experimental and computational tools plays in driving scientific progress. While other factors like the development of new concepts, interdisciplinary collaboration, and funding policies are important, Dyson particularly highlighted how new tools can lead to groundbreaking discoveries and open up entirely new fields of study.

4
Q

What is the correct functional triad for the ‘Baby Think It Over’ persuasive technology, designed to help teenage girls understand the challenges of caring for babies?
a) Tool: User behavior tracking; Medium: Baby simulator robot; Social Actor: Teenage girls
b) Tool: Educational messages; Medium: Baby simulator robot; Social Actor: Teenage girls
c) Tool: User behavior tracking; Medium: Interactive real-world simulation; Social Actor: Baby simulator robot
d) Tool: User behavior tracking; Medium: baby simulator robot; Social Actor: School teachers

A

c

The Tool in this case is user behavior tracking, implying that the technology keeps track of how the user interacts with and cares for the simulator. The Medium being an interactive real-world simulation means that the baby simulator provides a realistic experience of caring for a baby, allowing users to understand and respond to various scenarios in real time. The Social Actor as the robot (or the baby simulator) signifies that it is the entity through which the interaction occurs and upon which the user’s actions are focused.

5
Q

Which of the following best describes the key difference between participatory and opportunistic mobile sensing paradigms?
a) Participatory sensing requires users to actively collect the sensor data, whereas opportunistic sensing relies on automatic sensor data collection without user intervention.
b) Participatory sensing only utilizes built-in smartphone sensors, while opportunistic sensing can incorporate external sensor devices.
c) Participatory sensing collects data automatically while opportunistic sensing relies on user involvement for high-quality data collection.
d) Participatory sensing focuses on collecting data for individual use, whereas opportunistic sensing is designed for large-scale data collection across multiple users or devices.

A

a

a) Correct
b) The distinction between participatory and opportunistic sensing is based on user involvement in data collection, not on the use of internal versus external sensors.
c) It is the opposite. Opportunistic sensing collects data automatically, and it is participatory sensing that relies on user involvement for high-quality data collection.
d) Both sensing paradigms can be scaled to individual or community levels. The defining factor of the two paradigms is their method of data collection, not the scale of deployment.

6
Q

20235599
In the context of context-aware computing, which of the following best illustrates the use of context to enhance the effectiveness of persuasive technology?
a) Persuasive technology primarily uses the user’s current activity and environmental conditions to trigger contextually relevant behavioral suggestions.
b) Persuasive technology primarily relies on static user profiles and predetermined schedules for interventions.
c) Persuasive technology needs manual confirmation of the context from the users to be effective.
d) Persuasive technology applications avoid using sensor data from mobile devices to infer context, relying instead on explicit user settings.

A

a

a) Correct
b) Persuasive technology uses both static user profiles and dynamic contextual information to customize interventions, not just predetermined schedules.
c) While manual inputs can enhance context-aware computing, it primarily utilizes automatic sensing and data analysis to determine context in persuasive technology.
d) Persuasive technology often utilizes sensor data from mobile devices to infer context, enhancing interventions beyond what is possible with just explicit user settings.

7
Q

Which of the following explanations is consistent with the definition of the corresponding term?
a) Contexts: Contexts in one scenario can have some overlapped area, like persuasive technology being the overlapped area of computers and persuasion.
b) Proximate selection: This technology is aimed at making the located objects “emphasized” or “being easier to choose.”
c) Context-triggered actions: This technology is faced with one challenge in the accuracy of language for rules.
d) Persuasive technology: Nintendo’s Pocket Pikachu’s medium is raising a virtual pet.

A

b

In a given scenario, different contexts cannot overlap (for instance, for a person who is indoors, “sleeping” and “not sleeping” are mutually exclusive contexts with no overlap). Therefore, option (a) is incorrect. Option (b) reflects the verbatim content from the slides and is accurate. Option (c) refers to the expressiveness of the rule language rather than its accuracy. Regarding option (d), “raising a virtual pet” pertains to the social actor, not the medium, making option (d) incorrect. So the correct answer is b.

8
Q

Which of the following is the correct order of the sensor data processing pipeline?
a) Data collection -> Model Building -> Segmentation -> Evaluation -> Feature Extraction
b) Data collection -> Model Building -> Segmentation -> Feature Extraction -> Evaluation
c) Model Building -> Segmentation -> Feature Extraction -> Data Collection -> Evaluation
d) Data collection -> Segmentation -> Feature Extraction -> Model Building -> Evaluation

A

The correct order in a sensor data processing pipeline is (d), which typically involves collecting data first, followed by segmentation, feature extraction, model building, and finally evaluation. This sequence ensures that raw data is processed and refined before building a model and evaluating its performance.
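A minimal end-to-end sketch of this pipeline on synthetic accelerometer-like data; the window size, step, feature set, and classifier here are illustrative assumptions, not prescribed choices:

```python
# Pipeline sketch: collection -> segmentation -> feature extraction ->
# model building -> evaluation (answer d).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def segment(signal: np.ndarray, window: int = 50, step: int = 25) -> np.ndarray:
    """Cut a 1-D signal into fixed-size windows with 50% overlap."""
    starts = range(0, len(signal) - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])

def extract_features(windows: np.ndarray) -> np.ndarray:
    """Simple per-window statistics (mean, std, min, max) as features."""
    return np.column_stack([windows.mean(1), windows.std(1),
                            windows.min(1), windows.max(1)])

# 1) Data collection: synthetic accelerometer magnitude for two activities.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(1.0, 0.1, 2000),   # "walking"
                         rng.normal(2.5, 0.5, 2000)])  # "running"
labels_per_sample = np.array([0] * 2000 + [1] * 2000)

# 2) Segmentation and 3) Feature extraction.
windows = segment(signal)
X = extract_features(windows)
y = labels_per_sample[::25][:len(X)]   # approximate label at each window's start

# 4) Model building and 5) Evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, model.predict(X_te)))
```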

9
Q

20246104:
Which of the following is incorrect about the advantages of persuasive technology over human persuaders? Persuasive technology:
a) Is less persistent than human beings
b) Offers greater anonymity
c) Scales more easily
d) Goes where humans cannot go or may not be welcomed

A

20246104:
The incorrect option is (a): computer technology is more persistent than human beings, since IT systems can operate 24/7 without any change in their behaviour. Humans can be swayed by other factors, whereas machines are pre-programmed and therefore remain persistent throughout their life cycle unless manually altered.

10
Q

Which of the following statements is incorrect about cyber-physical systems? (select one)
1. Digital transformation is used for managing interconnected systems between their physical assets and computational capabilities.
2. A rich variety of inputs and outputs, such as gesture input, voice commands, and wearable devices, are utilized.
3. The sensors primarily operate indoors to collect real-time data.
4. Data is gathered from cross-domain sensors and IoT devices, enabling data-driven intelligence.
5. Devices are available in diverse form factors, including smartphones, smart bulbs, and smart switches.

A

3

Transformative technologies are employed in cyber-physical systems to manage interconnected systems, integrating their physical assets with computational capabilities. These systems boast a diverse array of devices, leading to a rich variety of inputs and outputs. The vast amount of data collected from these inputs and outputs facilitates data-driven intelligence within cyber-physical systems. Notably, sensors in such systems can be installed both indoors and outdoors, depending on application requirements. Hence, option 3 is the correct answer.

11
Q

Which of the following best describes the functional triads of persuasive technology? (select two)
1. The tool makes the target behavior more difficult to perform.
2. The medium provides users with unrealistic experiences that deter motivation.
3. The social actor performs calculations or measurements that motivate.
4. The medium assists users in exploring cause-and-effect relationships.
5. A social actor rewards users with positive feedback and models target behaviors.

A

4 & 5

The tool facilitates the target behavior, making it easier to perform, while the medium provides users with vicarious experiences that serve as motivation. Additionally, the tool performs calculations or measurements that motivate, and the medium enables users to explore cause-and-effect relationships. Moreover, a social actor rewards people with positive feedback and models target behaviors or attitudes. Therefore, options 4 and 5 are the correct answers.

12
Q

20246104
Which statement is incorrect about the sensing paradigm? (Choose one)
a. Participatory sensing is entirely free from privacy issues.
b. Monitoring urban noise pollution by users measuring and sharing ambient noise (using their phone) is an example of participatory sensing.
c. Sensors installed in public transportation to automatically monitor passenger counts is an example of opportunistic sensing.
d. Opportunistic sensing minimizes user data collection efforts.

20235599
Which statement is incorrect about the Sensing paradigm? (Choose one)
a. “Participatory” sensing is entirely free from privacy issues.
b. Monitoring urban noise pollution by users measuring and sharing ambient noise (using their phone) is an example of “Participatory” sensing.
c. Sensors installed in public transportation to automatically monitor passenger counts is an example of “Opportunistic” sensing.
d. “Opportunistic” sensing minimizes user data collection efforts.

A

a

“Participatory” sensing involves individuals actively engaging in data collection. However, even though users voluntarily collect data in participatory sensing, privacy issues can still arise regarding the protection and usage of the collected data. So statement (a) is false, making it the answer.
“Opportunistic” sensing means automated sensor data collection, which reduces the burden placed on the user, so statements (c) and (d) are true.

13
Q

Question:
In the context of context-aware computing, which of the following scenarios follow the definition of sensor fusion?
a) Utilizing a sophisticated algorithm to interpret data from a single high-precision accelerometer to determine the user’s physical activity.
b) Analyzing high-resolution video feeds from a single camera to deduce the user’s specific actions and environment.
c) Integrating input from a user’s keyboard strokes with application usage data to predict the user’s next task.
d) Synthesizing data from an array of sensors, including an accelerometer, GPS, and a light sensor, alongside ambient sound recordings to construct a detailed understanding of the user’s current context and environment.

A

d

Sensor fusion is defined as a fusion of multiple sensors to infer a user’s context. Here’s the breakdown of each option:
Option A: This option involves using data from only one sensor, an accelerometer, to predict the user’s activity. Since sensor fusion requires the integration of multiple sensor outputs, relying solely on an accelerometer does not qualify as sensor fusion.
Option B: Similar to Option A, this choice uses data from just one source, a camera, to interpret the user’s context. The absence of integration with other sensor data means this approach does not embody sensor fusion.
Option C: This option focuses on user input and collected data from the user’s interactions, which does not involve the integration of various sensor types. Sensor fusion aims to combine different sensory inputs to create a comprehensive context picture, which is not achieved by analyzing user input alone.
Option D: This is the correct choice for illustrating sensor fusion. It involves combining data from multiple sensors such as accelerometers, GPS, light sensors and ambient sound to form a detailed and nuanced understanding of the user’s environment and activities. This multi-sensor integration follows the definition of sensor fusion, leveraging the strengths of each sensor type to enhance context awareness.
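A minimal sketch of the kind of fusion described in option D, on synthetic values; the features, thresholds, and rule are illustrative assumptions rather than a real inference model:

```python
# Sensor fusion: combine simple features from several sensors into one
# context estimate instead of relying on a single sensor.
import numpy as np

def infer_context(accel_mag: np.ndarray, gps_speed_mps: float,
                  light_lux: float, sound_db: float) -> str:
    accel_std = accel_mag.std()   # movement intensity from the accelerometer
    if gps_speed_mps > 5 and accel_std < 0.3:
        return "riding a vehicle"
    if accel_std > 1.0 and light_lux > 1000 and sound_db > 60:
        return "exercising outdoors"
    if accel_std < 0.1 and light_lux < 50 and sound_db < 40:
        return "resting indoors at night"
    return "unknown"

rng = np.random.default_rng(1)
print(infer_context(rng.normal(1.0, 0.05, 100),      # accelerometer magnitude samples
                    gps_speed_mps=12.0, light_lux=300.0, sound_db=55.0))
# -> "riding a vehicle": no single sensor alone would support that conclusion.
```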

14
Q

Question: Which of the following best describes the advantage of persuasive computing over human persuaders? (select two)
A) Persuasive computing technologies cannot scale easily to reach a large audience.
B) They offer greater anonymity and can manage huge volumes of data.
C) They are less persistent than human beings in achieving behavioral changes.
D) They can use various modalities to influence, such as data, graphics, and simulations.

A

B, D

Persuasive computing has several advantages over human persuaders, including the ability to be more persistent, offer greater anonymity, manage vast amounts of data, and use multiple modalities to influence behavior. These technologies can scale easily and operate in environments where humans may not be welcome or cannot reach. The correct answers are B and D, highlighting the capabilities of persuasive computing to handle data and utilize various communication modalities to influence user behavior effectively.

15
Q

Question: Which of the following is a primary type of social cue used by persuasive technology acting as social actors? (select one)
A) Offering discounts and rewards unrelated to user behavior.
B) Providing positive feedback and modeling target behavior or attitude.
C) Relying solely on text-based communication without feedback.
D) Avoiding any interaction that simulates human-like exchanges.

A

B

Persuasive technology can act as a tool, medium, or social actor. When acting as a social actor, it can be persuasive by rewarding users with positive feedback, modeling a target behavior or attitude, and providing social support. This approach leverages social cues, such as language use, social dynamics, and roles, to influence behavior. Therefore, B is the correct answer as it accurately reflects how persuasive technology uses social interactions to encourage changes in behavior or attitude.

16
Q

Which of the answer choices below groups examples that all belong to the same sensing paradigm? (Select one)

A) Citizens voluntarily taking and uploading photos to assess the amount of trash in city parks
B) Software that automatically collects data to analyze users’ web browsing patterns
C) A mobile app where consumers scan the barcodes on food packaging to share nutritional information
D) A system that collects data from sensors installed in cars to monitor traffic conditions in a smart city
E) A feature on smartwatches that automatically collects data on an individual’s daily activity and sleep patterns
————————————————————————————–
1)A, B
2)A, C
3)A, B, D
4)C, E
5)C, D, E

A

2

In this question, A) and C) belong to the participatory sensing paradigm, involving activities where users voluntarily collect and share data. Both cases require active participation from the users.

On the other hand, B), D), and E) are examples of the opportunistic sensing paradigm. These represent methods that automatically collect data through sensors or software, rather than requiring direct user intervention.

Therefore, the correct pairing of options that belong to the same sensing paradigm is 2) A, C.

17
Q

Question: The following are descriptions of elements from the Functional triads of persuasive technology.
Which is the correct order of elements to fill in the blanks?
Providing social support—1)_____
Helping people rehearse a behavior (simulating environment or objects)—2)_____
Making target behavior easier to do—3)_____

A) 1)Social actor, 2)Medium, 3)Tool
B) 1)Medium, 2)Social actor, 3)Tool
C) 1)Tool, 2)Social actor, 3)Medium
D) 1)Social actor, 2)Tool, 3)Medium

A

A

The correct answer is A) 1)Social actor, 2)Medium, 3)Tool. In persuasive technology, a Social actor provides social support, offering encouragement or empathy. A Medium lets people practice behaviors in a simulated setting, preparing them for real-life scenarios. A Tool simplifies the desired behavior, making it more accessible and easier to adopt. Each plays a unique role in influencing and guiding user behavior towards a targeted outcome.

18
Q

Which of the following examples best represents the opportunistic sensing paradigm?

A) Residents using a mobile app to record and report noise levels in their neighborhoods.
B) Citizens collecting water samples and using a testing kit to assess water quality in local water bodies.
C) Automatically collecting GPS location traces from users’ smartphones for traffic analysis.
D) Users taking photos of overflowing garbage cans to actively report and manage waste disposal.

A

C

20235599
The opportunistic sensing paradigm involves automated sensor data collection without requiring active participation from users. Among the given options, automatically collecting GPS location traces from users’ smartphones aligns with this definition. This method uses the built-in GPS capabilities of smartphones to passively gather location data as users move about, without requiring them to actively engage with a specific app or device. This data can then be used for purposes such as traffic analysis, location-based services, or urban planning, making it a prime example of opportunistic sensing.

19
Q

In the activity recognition process, which step involves identifying portions of data that are likely to contain information about activities?

A) Data acquisition and pre-processing
B) Data segmentation
C) Feature extraction
D) Model building and classification

A

B

Data segmentation is the step in the activity recognition process where portions of data likely to contain information about activities are identified. During this step, techniques such as sliding window and energy-based methods are employed to isolate relevant segments of sensor data. These segments are then used for further analysis in subsequent steps, such as feature extraction, to extract meaningful information for activity recognition. Therefore, data segmentation plays a crucial role in identifying and preparing the data for subsequent processing in the activity recognition pipeline.

20
Q

In the context of acquiring context information for context-aware computing, which of the following is NOT listed as a method or tool for acquiring context?
A) Smart environment infrastructure, such as active badge systems for location information.
B) Mobile sensors embedded in devices for sensing motion, light, and other environmental factors.
C) Sensor fusion, combining data from multiple sensors to infer a user’s context.
D) Utilizing social media activity to directly infer a user’s current physical environment.

A

D

While social media can give away information about a user’s location, this information is unreliable: it arrives irregularly (if it exists at all), its accuracy depends heavily on user input and on how actively the user posts, and how up to date it is can vary widely. The other three options (smart environment infrastructure, mobile sensors, and sensor fusion) are the methods listed for acquiring context.

21
Q

Which of the following best describes the sensor data processing pipeline in IoT data science processes?
A) Collect -> Analyze -> Implement -> Monitor
B) Collect -> Segment -> Extract -> Classify
C) Identify -> Process -> Store -> Analyze
D) Sense -> Process -> Actuate -> Feedback

A

B

20246104
First the data needs to be collected. Then the data is segmented into windows. Features are extracted from each window. Finally, a classification algorithm is used to determine the activity.

22
Q

Which of the following statements is NOT true about Mobile Sensing Architecture? (Select one)

  1. Mobile Sensing Architecture involves inform, share, and persuasion stages.
  2. The most labor-intensive work in sensor data science is the integration of sensor data.
  3. Data visualization is one of the representative methods for Share stage
  4. Supervised learning in mobile sensing requires the data to be hand-labeled.
  5. Persuasive technology systems aim to change user behavior by providing tailored feedback.
A

2

Mobile Sensing Architecture involves the sense, learn, inform, share, and persuasion stages. The most labor-intensive work in sensor data science is sensor data and label collection in the sense stage. The representative methods for the Share stage include data visualization, community awareness, social network use, etc. Supervised learning in mobile sensing requires the data to be hand-labeled, and unsupervised learning does not. Persuasive technology’s goal is to change users’ attitudes and behavior with tailored feedback. Therefore, the incorrect answer is number 2.

23
Q

Which of the following pairs correctly match a context-aware application category with its characteristics? (Select two)

  1. Proximate selection: Automatic, Information.
  2. Contextual information: Automatic, Information.
  3. Automatic contextual reconfiguration: Automatic, Information.
  4. Context-triggered actions: Manual, Command.
  5. Contextual commands: Manual, Command.
A

3 & 5

Proximate selection is a user interface technique where nearby located-objects are emphasized or otherwise made easier to choose; in general it involves a “locus” and a “selection”, and it is a manual technique. Contextual information displays information and Contextual commands execute commands, but both are invoked manually by the user (Manual). Automatic contextual reconfiguration detects the user’s context automatically and adjusts information accordingly (Automatic, Information). Context-triggered actions automatically execute a command when certain context conditions are met (Automatic, Command). Therefore, the correct answers are 3 and 5.

24
Q

20246104
Which of the following steps is NOT typically part of the sensor data processing pipeline in mobile sensing with smartphones?
A) Data Collection
B) Segmentation
C) Model Deployment
D) Feature Extraction

20235599
Which of the following steps is NOT part of the sensor data processing pipeline in mobile sensing with smartphones?

A) Data Collection
B) Segmentation
C) Receiving User Feedback
D) Feature Extraction

A

C

The sensor data processing pipeline in mobile sensing with smartphones consists of the following steps:

Data Collection: Gathering sensor data from various sources such as built-in sensors (e.g., accelerometer, GPS) on smartphones.
Segmentation: Organizing the collected data into meaningful segments or chunks for further analysis.
Feature Extraction: Extracting relevant features or characteristics from the segmented data to represent the underlying patterns or trends.
Model Building: Developing machine learning or statistical models using the extracted features to learn from the data.
Evaluation: Assessing the performance and effectiveness of the built models using validation techniques.
However, “Model Deployment” is not typically considered part of the data processing pipeline. Model deployment means putting the trained model into a production environment where it can make predictions or decisions on new data. While it is an important step in the broader process of deploying a system for real-world use, it is not directly involved in the processing of sensor data itself. Likewise, in the second version of the question, “Receiving User Feedback” is not one of the pipeline steps, so C is the answer in both versions.

25
Q

What is the main trade-off when designing energy-efficient algorithms for continuous smartphone sensing?

a) Accuracy vs. speed
b) Accuracy vs. memory usage
c) Accuracy vs. energy consumption
d) Accuracy vs. internet connectivity

A

c

Continuous smartphone sensing involves constantly collecting data from various sensors like accelerometers, gyroscopes, and microphones. This continuous operation drains the phone’s battery significantly.
Energy-efficient algorithms are designed to minimize the energy used by these sensors while still collecting usable data.
Accuracy refers to how well the sensor data reflects the real world.
The key trade-off lies in balancing these two aspects. Here’s how:

More frequent data collection (higher sampling rate) increases accuracy but consumes more energy.
Less frequent data collection (lower sampling rate) conserves energy but might miss important details, reducing accuracy.
Therefore, the goal is to design algorithms that can achieve an acceptable level of accuracy while minimizing the energy consumption of sensors during continuous data collection.
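An illustrative sketch of that trade-off: lowering the sampling rate cuts energy roughly in proportion, but below some rate short events start to be missed entirely. The per-sample energy cost and the burst timing are made-up assumptions:

```python
# Accuracy vs. energy: sample a signal containing short activity bursts at
# several rates and compare how many bursts are caught and what the sampling costs.
import numpy as np

FULL_RATE_HZ = 50
DURATION_S = 600
ENERGY_PER_SAMPLE_MJ = 0.02   # assumed cost of one sensor read, in millijoules

t = np.arange(DURATION_S * FULL_RATE_HZ) / FULL_RATE_HZ
signal = np.zeros_like(t)
for start in np.arange(10) * 60 + 7:          # ten 2-second bursts, one per minute
    signal[(t >= start) & (t < start + 2)] = 1.0

for rate_hz in (50, 5, 1, 0.2):
    step = int(FULL_RATE_HZ / rate_hz)
    sampled = signal[::step]
    # Count rising edges (0 -> 1) that survive the subsampling.
    detected = int((np.diff(np.concatenate(([0.0], sampled))) > 0).sum())
    energy_mj = len(sampled) * ENERGY_PER_SAMPLE_MJ
    print(f"{rate_hz:>4} Hz: caught {detected:>2}/10 bursts, energy ~{energy_mj:6.1f} mJ")
```

With these numbers, the 1 Hz rate still catches every burst at a fraction of the energy of 50 Hz, while 0.2 Hz misses them all: exactly the accuracy-vs-energy balance the answer describes.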

26
Q

The “second-hand smoke problem” in mobile sensing refers to:

a) Sensor data corruption due to physical damage
b) Privacy concerns of users exposed to other people’s sensors
c) Limited battery life of smartphones
d) Incompatibility between different sensor models

A

b

The answer is b) Privacy concerns of users exposed to other people’s sensors.

Here’s why:

The term “second-hand smoke problem” draws an analogy to involuntary exposure. Just like inhaling smoke from someone else’s cigarette, a mobile sensing system might collect data about people nearby without their consent.
This scenario raises privacy concerns because sensor data can potentially reveal personal information about these bystanders.

27
Q

What is the best description of the purpose of data segmentation?
a) It is for exploratory data analysis to gain a better understanding of the data
b) Data segmentation is preprocessing data
c) It is to identify those data segments that are likely to contain information about activities for feature extraction
d) Data segmentation is labelling data for classification

A

c

Data segmentation identifies and prepares the data for feature extraction in the activity recognition pipeline. In this step we can use techniques such as the sliding window or an energy-based approach. The sliding window approach takes a fixed-size window (frame) of samples and slides it along the signal with a fixed overlap, while the energy-based approach exploits the fact that different activities have different “intensities” (energy) in the signal.
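A minimal sketch of the energy-based idea on synthetic data; the window length and threshold are illustrative assumptions:

```python
# Energy-based segmentation: keep only the windows whose signal energy exceeds
# a threshold, assuming active periods carry more energy than rest periods.
import numpy as np

def energy_segments(signal: np.ndarray, window: int = 100, threshold: float = 0.5):
    """Return (start, end) index pairs of windows with high mean squared amplitude."""
    segments = []
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        if np.mean(chunk ** 2) > threshold:        # "energy" of this window
            segments.append((start, start + window))
    return segments

rng = np.random.default_rng(3)
rest = rng.normal(0.0, 0.1, 500)      # low-energy rest period
activity = rng.normal(0.0, 1.5, 300)  # high-energy activity burst
signal = np.concatenate([rest, activity, rest])

print(energy_segments(signal))   # -> [(500, 600), (600, 700), (700, 800)]
```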

28
Q

Context is somewhat vaguely defined terminology in most cases. However, according to Schmidt, Beigl, and Gellersen's model (1999), context can be defined explicitly through four statements. Which of the following is NOT part of their definition of context?
a) A context describes a situation and the environment a device/user is in
b) A context is defined by a unique name
c) For each context a set of features is relevant
d) Context is entirely determined by the user’s preferences and has no relation to the device or environment they are in.
e) For each relevant feature a range of values is determined (implicitly or explicitly) by the context

A

d

Schmidt, Beigl, and Gellersen define a context as follows:
- A context describes a situation and the environment a device/user is in
- A context is defined by a unique name
- For each context a set of features is relevant
- For each relevant feature a range of values is determined (implicitly or explicitly) by the context
Statement (d) contradicts the first point: a context depends on the situation and environment of the device/user, not only on the user’s preferences, so (d) is the wrong definition.

29
Q

Which of the following is NOT a challenge associated with the continuous sensing capability of smartphones for mobile sensing applications?

A) High computation demand
B) High battery consumption
C) Limited sensor programmability due to operating system and sensor variations
D) Effective data anonymization

A

D

Continuous smartphone sensing, especially in the context of mobile applications, faces several technical challenges. While high computation demand and battery consumption are direct consequences of such sensing, and sensor programmability issues arise due to hardware and software diversity, user privacy through data anonymization represents a broader, systemic challenge across mobile sensing domains. It’s not inherent to the continuous sensing feature but is crucial for ethical design and deployment. The focus here is on understanding the specific operational hurdles of continuous data collection and processing on smartphones, distinguishing them from overarching privacy considerations which, while vital, are managed through different mechanisms in the context of IoT data science.

30
Q

Which of the following best exemplifies a context-aware application that utilizes persuasive technology to change a user’s behavior?

A) A GPS application that simply navigates the user from point A to point B
B) A digital calendar that shows the event list of each day
C) A fitness app that tracks the user’s physical activity and encourages more movement based on the user’s location and past behavior
D) A weather app that provides the current weather conditions

A

C

Persuasive technology aims to change a person’s attitudes or behaviors through the use of interactive technology, while context-aware computing tailors software behavior based on the user’s current context, such as location or activity. A fitness app that tracks physical activity and encourages movement integrates both concepts by using the user’s location and past behavior (context) to motivate increased physical activity (persuasion). Unlike the other options, which may use context-awareness (A, B, D), option C specifically leverages context-aware computing to persuasively encourage a change in user behavior, aligning with the objectives of persuasive technology.

31
Q

Which aspect of mobile sensing architecture deals with the challenge of achieving fine-grained control over sensors while ensuring compatibility across different operating systems and sensor models?

A) Data integration
B) User interface design
C) Programmability
D) Energy management

A

C

Programmability focuses on managing smartphone sensors via system APIs, where controlling sensors precisely and ensuring portability across diverse operating systems and sensor models are major challenges.

32
Q

Consider a bike navigator app that uses sensors to monitor the rider’s speed and adherence to regular roads. It ranks riders based on their safety scores on the same routes and rewards points accordingly. Analyze this bike navigator app from the perspective of the functional triads of persuasive technology. Which roles does it fulfill in encouraging safer riding practices? (Select all that apply)

A) A tool by making target behaviors easier to do through navigation aids and safety monitoring
B) A social actor by rewarding riders with positive feedback and creating a competitive ranking system based on safety scores
C) A medium by providing riders with real riding experiences
D) A database by storing records of all types of bikes

A

A, B

A) The app acts as a tool by providing navigation aids and monitoring safety-related behaviors (e.g., speed and route adherence), making it easier for riders to engage in safer riding practices.
B) By creating a ranking system based on safety scores and rewarding points for safe riding, the app serves as a social actor, encouraging positive behavior through competition and social reinforcement.
C) While the app assists in route planning and promotes safety, it does not directly provide simulated experiences or scenarios; its primary function is real-time navigation and safety feedback, not simulating different riding experiences.
D) Even though the app might store data on routes and statistics, its persuasive role is not as a database but rather in its interactive features that encourage safer riding practices.

33
Q

What is a characteristic of the Participatory sensing paradigm?

A) Automated sensor data collection
B) Passive involvement of users
C) Active sensor data collection by users
D) Low burden placed on the user

A

C

Participatory sensing involves active involvement of users in collecting sensor data, as seen in the example of managing garbage cans by taking photos. Users actively participate in data collection, contributing to the complexity of operations but also influencing the quality of data.

34
Q

Which of the following is NOT a current solution for addressing privacy issues in mobile sensing systems?

A) Cryptography
B) Privacy-preserving data mining
C) Publicly sharing collected data
D) Processing data locally versus cloud services

A

C

Current solutions for addressing privacy issues in mobile sensing systems include cryptography, privacy-preserving data mining, and processing data locally versus using cloud services. These solutions aim to protect user privacy by encrypting sensitive information, anonymizing data for analysis, and minimizing the transmission of personal data over the network. However, publicly sharing collected data would contradict the fundamental responsibility of respecting user privacy, as it could lead to unauthorized access or misuse of sensitive information by third parties.

35
Q

Suppose a mobile sensing application utilizes accelerometer data to detect whether a user is walking or running. The application works well in most cases but struggles to differentiate between walking and running when the user is doing a fast-paced walk or a slow jog. What can we improve during the data gathering and pre-training phase that could enhance the application’s performance in these edge cases? (Select two that apply)

A) Increase the sampling rate of the accelerometer
B) Add a feature that allows users to manually input their activity
C) Increase processing power
D) Use data from the gyroscope in addition to the accelerometer

A

A, D

The correct options are:
A) Increase the sampling rate of the accelerometer. By increasing the sampling rate, more detailed data about the user’s movements could be captured and the behavior inference could have a better accuracy.
D) Use data from the gyroscope in addition to the accelerometer. Gyroscope data will help the inference process by providing additional context, such as the user’s body posture (from the device’s orientation).

Additional notes:
Option B could also be considered, but it is a less preferable option compared to options A and D. Relying on the users to manually label their type of activity every time they go for a walk or run is not practical and there is a good chance they will forget to label it.
Option C focuses on reducing training time rather than directly improving accuracy in inferring the user’s type of activity.

36
Q

Suppose your smartphone is part of a city-wide project to measure noise pollution. Which actions align most closely with the opportunistic sensing approach? (Select two that apply)

A) You decide when to start the noise measurement app.
B) The app automatically starts measuring noise levels when you enter a park.
C) The app periodically prompts you to input the noise levels at various intervals.
D) The app also uses the GPS sensor in your phone while measuring the noise levels as you move around the city.

A

B, D

The correct options are B and D, as both show a form of automation in the data collection process, which aligns with the opportunistic sensing paradigm.

Meanwhile, options A and C rely more on a manual action or input by the user, which aligns more closely with the participatory sensing paradigm.

37
Q

Which of the following terms best explains ‘this’? ‘This’ is used for combining information to gain a more comprehensive and accurate understanding of the environment or the user’s situation.

A) Proximate Selection
B) Intervention
C) Sensor Fusion
D) Human-in-the-loop

A

C

The correct answer is C (Sensor Fusion).
A. Proximate Selection emphasizes nearby located-objects so they are easier for the user to choose; it does not combine information sources.
B. Intervention refers to the act of becoming involved in a situation to alter, change, or influence its course or outcome.
C. Sensor fusion means fusing multiple sensors to infer the user’s context.
D. Human-in-the-loop refers to a mode of operation in systems or processes where human involvement is integrated into the workflow.
Therefore, ‘C. Sensor fusion’ best fits the description.

38
Q

What is an example of a context-triggered action?

a. Light triggered display
b. Orientation sensitive display
c. Active badge
d. Geonotes

A

C

Option c, the Active Badge system, is the context-triggered action: rules automatically execute a command when the badge’s location context is met, as in location-triggered reminders.

Options a and b are better described as automatic contextual reconfiguration (the display adapts automatically but no command is issued), and d (Geonotes) relies on manually created annotations, so it is not context-triggered.

39
Q

Which of the following is a part of “Sense” in the Mobile Sensing Architecture?

a. Phone Context
b. Semi supervised learning
c. Profile user preferences
d. Statistical analysis

A

A

The right answer is a. Options b and d are part of “Learn”, and c is part of the “Inform, Share, Persuasion” stage.

40
Q

The following is an explanation of sensing paradigms. Which of the following is incorrect? (select one)

A) Taking photographs of locations or discussing events is an example of participatory sensing.
B) In opportunistic sensing, users may feel less burdened.
C) Opportunistic sensing allows for automatic data collection from the surrounding environment.
D) In participatory sensing, the quality of data is unrelated to the participants.

A

D

The correct answer is D.
To solve this problem, knowledge of participatory sensing and opportunistic sensing is required.
With participatory sensing, users consciously opt to meet an application request out of their own will. Therefore, sensor data is actively collected by the user, and the quality of the data depends on the participants.
On the other hand, with opportunistic sensing, sensor data is automatically collected through methods such as interconnection between devices, which lowers the burden placed on the user.
Therefore, D is incorrect in saying that the quality of data in participatory sensing is not related to the participants.

41
Q

In the field of computer science, which of the following definitions inaccurately describes the term ‘context’?

  1. A context describes both the situation and the environment in which a device/user is situated.
  2. A context does not possess a unique name.
  3. Each context has a set of relevant features.
  4. The context implicitly or explicitly determines a range of values for each relevant feature
A

2

The correct answer is option 2. In the field of computer science, unique identifiers or names are frequently assigned to distinguish between different contexts within various applications. Therefore, option 2 inaccurately defines the term ‘context’ by suggesting it does not possess a unique name.

42
Q

Which type of user interface technique is associated with ‘proximate selection,’ making located objects ‘emphasized’ or ‘easier to choose’?

  1. Siren
  2. Light sensitive display
  3. Orientation-sensitive UI
  4. Nearby printer selection
A

4

The correct answer to the question is option 4. ‘Proximate selection’ is a user interface technique that helps users choose objects or options physically close to their current location or context. In this specific scenario, selecting a nearby printer for printing tasks exemplifies ‘proximate selection’.

43
Q

Which of the following strategy games can be classified as persuasive technology? Select all that apply.

a. A strategy game in which players play against NPCs that get progressively smarter.
b. A strategy game in which players play against random players.
c. A strategy game in which players play against players who are of similar rank.
d. A strategy game in which two players are randomly matched as teammates against NPCs that get progressively smarter.

A

a & c

Persuasive technology is used to purposefully induce a change in behaviour or attitude in the user. Strategy games with enemies that get increasingly harder to beat force the player to think more strategically, so option a can be considered persuasive technology. By the same reasoning, option c is also persuasive technology, while option b is not. For option d, since you are matched with a teammate of random skill, you are not necessarily forced to think more strategically (an extremely skilled teammate may carry the game for you), hence it is not persuasive technology.

44
Q

Which of the following data collection examples subscribe to the participatory sensing paradigm? Select all that apply.

a. Whenever a user performs some physical activity, they have to log it in an app.
b. Whenever a user performs some physical activity, their phone senses it and logs it into an app.
c. Whenever a user logs a physical activity they performed, the logging app will automatically record the time they logged it at as well as the current temperature.
d. Whenever a user throws their phone in the air, the phone records the temperature, humidity, and atmospheric pressure.

A

a & d
(c can be included)

Participatory sensing requires users to manually collect/enter data. Using this definition, option a can be trivially recognized as a participatory collection scheme, and option b can be trivially recognized as an opportunistic collection scheme. In option c, even though the user manually logs something, the collected data (the time of logging and the current temperature) is gathered automatically and is hence opportunistic. In option d, the user must throw their phone to initiate the data collection every single time, hence this is a participatory scheme.

45
Q

In which scenario is the sliding window not necessary?

A) Real-time traffic monitoring
B) Stock market data analysis
C) Facial Recognition
D) IoT sensor data analysis

A

C

Facial recognition here operates on static images, and static image analysis does not require dividing the data into temporal windows for processing. Real-time traffic monitoring, stock market data analysis, and IoT sensor data analysis all deal with streaming time series, which are typically processed with a sliding window.

46
Q

A machine’s bearings are starting to wear out. Which of the following monitoring techniques would be most likely to detect this issue early on?

(A) Oil analysis for viscosity changes
(B) Thermal camera inspection for abnormal heat sources
(C) Vibration analysis for changes in amplitude or frequency patterns
(D) Ultrasonic detection for corrosion

A

C

Bearing wear often leads to increased vibration. Vibration analysis can detect these changes early on, allowing for preventive maintenance before a major breakdown occurs.

47
Q

According to Klaus Schwab, the Fourth Industrial Revolution is characterized by which of the following?

a. Mechanical production systems
b. Electrical mass production systems
c. Cyber-physical systems
d. Electronics, IT, automated production

A

C

Cyber-physical systems - This is the defining characteristic of the Fourth Industrial Revolution, which is the focus of the question. Cyber-physical systems integrate computing, networking, and physical processes. With the advent of the internet of things (IoT), artificial intelligence (AI), and machine learning, these systems enable new ways of creating value and are a step beyond the previous revolution.

48
Q

Question: Which of the following statements about embedded systems and machine learning are false? (Multiple Answers)
a. Embedded systems like Arduino Sense have high resources compared to modern computers.
b. Writing machine learning architecture code is a fraction of the process, with data collection, preprocessing, and feature engineering taking more time.
c. The development of machine learning systems is a non-linear process involving multiple iterations from data collection to deployment.
d. Android’s sensing rate configuration ensures sampling rates by reducing resources allocated to computationally intensive tasks.

A

A & D

a. This statement is false because Arduino Sense, with its 512KB of RAM, cannot support MobileNetv1, which requires 16.9MB. The statement implies that lightweight models such as MobileNetv1 are suitable for deployment on such embedded systems, which is incorrect given the hardware constraints.
d. This statement is false as well. According to the lecture notes, Android’s sensing rate configuration does not guarantee the specified rates; instead, the actual sensing rate is device-dependent and varies based on operating conditions.

49
Q

What role do opportunistic sensing paradigms play in data collection for smart city initiatives?
a) They enable automatic collection of sensor data without user intervention.
b) They rely on active participation of citizens to report environmental data.
c) They primarily utilize external sensor devices for data collection.
d) They focus on individual data collection rather than large-scale analytics.

A

A

b) This statement describes participatory sensing paradigms, where citizens actively contribute data through their involvement.

c) This statement suggests a specific method of data collection, focusing on external sensors rather than automatic collection from various sources.

d) This statement misrepresents the purpose of opportunistic sensing, which aims to gather data from multiple sources for comprehensive analytics, rather than individualized data collection.

50
Q

Which of the following statements accurately describes the role of sensors in mobile phones?
A) Sensors in mobile phones primarily focus on enhancing the processing power of the device.
B) The accelerometer in mobile phones is used solely for capturing photos in the correct orientation.
C) Proximity sensors in mobile phones can be used to turn off the screen during phone calls.
D) GPS sensors in mobile phones are mainly utilized for adjusting the brightness of the screen.
E) The gyroscope in mobile phones is used to detect when the user holds the phone to their face during calls.

A

C

A) False: Processing power is not the main function.

B) False: Accelerometers in mobile phones have multiple uses, not limited to orienting photos.

D) False: GPS sensors in mobile phones are primarily utilized for location-based services, not for adjusting screen brightness.

E) False: Gyroscopes in mobile phones primarily detect device orientation for tasks like gaming and augmented reality, rather than detecting when the phone is held to the user’s face during calls.

51
Q

What is the preferred method for imputing missing values in a time-series dataset where the order of data points is significant, and why?

A) Mean imputation, because it is the simplest method.
B) Mode imputation, because it uses the most frequent value.
C) Winsorizing, because it limits extreme values.
D) Interpolation, because it provides more natural values by considering the temporal order of the data.

A

D

In time-series datasets, where the temporal order and continuity of the data points are important, interpolation is the preferred method for imputing missing values. Unlike mean or mode imputation, which do not account for the time-dependent nature of the data, interpolation uses values from neighboring data points to estimate the missing values. This ensures that the imputed values follow the dataset’s natural flow and variability over time, leading to more accurate and realistic data restoration. Winsorizing is about limiting extreme values rather than imputing missing ones, so it is not suitable for filling gaps in time-series data.
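A minimal pandas sketch of the difference; the six-point temperature series is made up for illustration:

```python
# Compare mean imputation with time-based interpolation on a tiny series.
import numpy as np
import pandas as pd

ts = pd.Series([20.0, 21.0, np.nan, np.nan, 24.0, 25.0],
               index=pd.date_range("2024-01-01", periods=6, freq="D"))

mean_filled = ts.fillna(ts.mean())            # ignores temporal order (fills 22.5 twice)
interpolated = ts.interpolate(method="time")  # respects the order (fills 22, then 23)

print(pd.DataFrame({"raw": ts, "mean": mean_filled, "interpolated": interpolated}))
```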

52
Q

In the context of time series analysis, what is the primary purpose of using a sliding window technique for data segmentation?

A) To increase the computational complexity of the model for better accuracy.
B) To transform qualitative data into quantitative data.
C) To apply a fixed-size window that moves over the data points for feature extraction or pattern recognition.
D) To permanently alter the original time series data for storage efficiency.

A

C

Answer is (C).

A) Increasing computational complexity is not a primary purpose of this technique. The sliding window method is actually a way to manage complexity by analyzing data in manageable, sequential segments.
B) The technique does not inherently transform qualitative data into quantitative data, although it might be used as part of a preprocessing step that includes such transformations.
C) This is correct because the sliding window technique is primarily used to analyze sequential data segments for pattern recognition, feature extraction, or smoothing purposes.
D) The technique does not alter the original time series data; it’s a method for analyzing the data. The original data remains intact.

53
Q

A researcher is analyzing temperature data from a sensor to identify patterns of temperature fluctuation over time. The sensor records temperature every minute. The researcher decides to use a sliding window technique with an overlap to segment the data before analysis.
The window size is set to 10 minutes, and the overlap between consecutive windows is specified to be 5 minutes. Given this setup:

How many unique readings will be included in two consecutive windows?

A) 10 readings
B) 15 readings
C) 20 readings
D) 5 readings

A

B

To solve this, you would calculate the number of readings in one window (10 readings, since the window is 10 minutes and the sampling rate is 1 reading per minute) and then consider the overlap (5 minutes, meaning 5 readings from the end of the first window are also at the beginning of the second window). Thus, the first window has 10 unique readings, and the second window also contains these 5 overlapped readings plus 5 new ones, totaling 15 unique readings in two consecutive windows when the overlap is accounted for.
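A tiny sketch that reproduces the arithmetic with index sets (one reading per minute):

```python
# Two consecutive 10-minute windows with a 5-minute overlap at 1 reading/minute.
window, overlap, rate_per_min = 10, 5, 1

readings_per_window = window * rate_per_min
step = window - overlap                                  # the window advances by 5 minutes
first = set(range(0, readings_per_window))               # minutes 0-9
second = set(range(step, step + readings_per_window))    # minutes 5-14

print(len(first | second))   # 15 unique readings across the two windows
```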

54
Q

20218257
Which of the following correctly matches the methods of data collection for activity and emotion in Ground Truth Labeling? (Select all that apply.)
A) Activity - Direct Elicitation: User is asked directly to label their current activity.
B) Emotion - Naturalistic: Watching “emotional” videos or performing tasks designed to elicit specific emotional states.
C) Activity - Naturalistic: Asking people to label their current activity whenever there is a change in activity.
D) Emotion - Observation: A third person judges a user’s emotion, for example, by watching facial videos and labeling emotions.
E) Activity - Observation: Real-time following by an observer or video recording with subsequent post-hoc labeling.

A

C,D,E

20218257
A) Activity - Direct Elicitation: This is incorrect. While elicitation is used for emotions, here we’re collecting activity data. Option A describes gathering emotional states, not actions.
B) Emotion - Naturalistic: Not quite! Naturalistic tasks aim to evoke specific emotions, which isn’t the same as observing natural emotions. Option B describes eliciting emotions, while we want to observe natural ones.

55
Q

Question: Consider a dataset with the following characteristics:

Mean: 50
Median: 45
Third Quartile (Q3): 60
First Quartile (Q1): 40
Standard Deviation: 10

Determine if the value 78 is considered an outlier based on the following methods:

A. 3σ rule from the mean value
B. Boxplot rule using 1.5 times the Interquartile Range (IQR)

Which of these methods identify the value 78 as an outlier?

1.A only
2.B only
3.Both A and B
4.Neither A nor B

A

4

The 3σ rule sets an upper limit at 80 (mean + 3 * standard deviation). Since 78 falls below this limit, it does not qualify as an outlier by this method.
The boxplot rule sets an upper limit at 90 (Q3 + 1.5 * IQR). Since 78 also falls below this threshold, it is not considered an outlier by this criterion either.
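A quick numeric check of both rules using the summary statistics given in the question:

```python
# Verify the 3-sigma and boxplot (1.5 * IQR) upper fences for the value 78.
mean, std = 50, 10
q1, q3 = 40, 60
x = 78

three_sigma_upper = mean + 3 * std          # 80
iqr_upper = q3 + 1.5 * (q3 - q1)            # 90

print(x > three_sigma_upper)   # False -> not an outlier by the 3-sigma rule
print(x > iqr_upper)           # False -> not an outlier by the boxplot rule
```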

56
Q

Question: Which one of the following outlier detection methods considers the local density around each data point?

A) Chauvenet’s criterion
B) Mixture model
C) Distance-based approach
D) Local Outlier Factor (LOF)

A

D

20218257
Chauvenet’s criterion, Gaussian mixture models, and distance-based approaches detect outliers based on rarity or separation, without considering local data clustering or density.
The Local Outlier Factor (LOF) assesses the degree to which a point is an outlier by comparing its local reachability distance with that of its neighbors, thus effectively detecting outliers in areas with diverse data densities.
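A minimal scikit-learn sketch of LOF on a handful of made-up 2-D points; the lone far-away point is flagged because its local density differs from that of its neighbours:

```python
# Local Outlier Factor: -1 marks points whose local density is much lower
# than their neighbours' local density.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.1, -0.1],
              [5.0, 5.0]])                      # last point sits far from the cluster

lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)                     # -1 = outlier, 1 = inlier
print(labels)                                   # e.g. [ 1  1  1  1 -1]
print(lof.negative_outlier_factor_)             # more negative = more outlying
```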

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
57
Q

Question: In a study investigating infant emotional awareness, researchers employ an EDA (Electro-Dermal Activity) sensor that records data at a 2 KHz frequency. This sensor data is synchronized with video recordings of the infants. After recording, child psychology experts analyze the videos to determine emotions like happiness or sadness, and this analysis is used to label the EDA data accordingly. Considering this setup, which combination of the data fetching method and ground truth labeling technique is being applied in this scenario?

A) Event-based fetching and Naturalistic labeling
B) Polling-based fetching and Elicitation labeling
C) Polling-based fetching and Observation labeling
D) Event-based fetching and Observation labeling

A

C

In this scenario, the EDA sensor is recording data at a consistently high frequency of 2 KHz, which aligns with the concept of polling-based fetching, where data is collected continuously at regular intervals. For ground truth labeling, the method used is Observation labeling, where experts analyze the video recordings after the fact (post-hoc) to determine the emotions of the children. This approach does not rely on real-time labeling or direct elicitation but rather on the expert analysis of observed behavior.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
58
Q

Which of the following statements accurately describes the IQR (Interquartile Range) method in outlier detection?

A. IQR measures the spread of data around the mean of the dataset.
B. IQR is calculated as the difference between the minimum and maximum values in the dataset.
C. First quartile Q1 = the value under which 25% of data points are found when they are arranged in decreasing order.
D. IQR is computed as the difference between the third quartile (Q3) and the first quartile (Q1) of the dataset.

A

Option A is incorrect because IQR doesn’t measure the spread of data around the mean. It measures the spread of data around the median.

Option B is incorrect because IQR is not calculated as the difference between the minimum and maximum values. It focuses on the quartiles.

Option C is incorrect because first quartile Q1 = the value under which 25% of data points are found when they are arranged in increasing order.

Option D is correct because the IQR of a set of values is calculated as the difference between the upper quartile (Q3) and the lower quartile (Q1).
First quartile Q1 = the value under which 25% of data points are found when they are arranged in increasing order.
Third quartile Q3 = the value under which 75% of data points are found when arranged in increasing order.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
59
Q

Question: Which of the following statements accurately describe wearable sensors?(Select all that apply.)
a) EDA: Shows greater responsiveness to thermal stimuli compared to psychological stimuli.
b) PPG: When measured simultaneously with ECG at the same time, PPG demonstrates a faster peak arrival speed than ECG.
c) EEG: Commonly utilized in fundamental research concerning neurological and psychiatric disorders.
d) EOG: Measures the electrical potential difference at various positions on the eye.

A

C,D

EDA is indeed more sensitive to psychological stimuli, making it a powerful method for emotion detection, contradicting statement a.
The claim in statement b is reversed; ECG signals precede those of PPG because the heart’s electrical activity happens before the blood volume changes it causes can be detected.
Statements c and d are accurate, with EEG being a cornerstone in neurological and psychiatric research due to its ability to capture brain electrical activity, and EOG being valuable for measuring eye movement through electrical potential differences.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
60
Q

Which situation is better suited for applying a distance-based outlier detection method?

A) When the data follows a single normal distribution.
B) When the data can be described using K normal distributions (mixture models).
C) When the data exhibits a uniform distribution.
D) When the data contains missing values.

A

B

Distance-based outlier detection methods, such as a simple k-nearest-neighbours distance rule, are particularly useful when the data cannot be captured well by a single distribution, for example when it is better described as a mixture of several normal distributions. Because these methods look only at how far each point lies from the other points, rather than fitting one global distribution, they can still flag points that deviate significantly from all of the mixture components.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
61
Q

Suppose we have a dataset with N data points. We apply a simple distance-based outlier detection method using parameters $f_{min}$ and $d_{min}$. If a fraction of $f_{min}$ of the points are found to be outside the distance threshold $d_{min}$, what can we infer about the remaining points?

A) At least $(1-f_{min})\times N$ points are close to each other.
B) All points are outliers.
C) The dataset contains no outliers.
D) The number of close points cannot be determined.

A

A

The condition states that if a fraction of $f_{min}$ of the points are outside the distance threshold, then at least $(1−f_{min}) \times N$ points must be close (i.e., within the distance threshold).
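A hedged Python sketch of such a simple distance-based rule. The parameter semantics follow this card's wording (a point counts as an outlier when more than a fraction f_min of the other points lie farther away than d_min); the example data are made up and may differ from the lecture's exact definition:

```python
import numpy as np

def distance_based_outliers(values, d_min, f_min):
    # Flag x_i as an outlier when more than a fraction f_min of the
    # other points lie farther away than d_min (1-D data for simplicity).
    x = np.asarray(values, dtype=float)
    n = len(x)
    flags = []
    for i in range(n):
        dists = np.abs(x - x[i])              # distance 0 to itself is never > d_min
        far_fraction = np.sum(dists > d_min) / (n - 1)
        flags.append(far_fraction > f_min)
    return np.array(flags)

data = [10, 11, 12, 13, 11, 12, 95]           # hypothetical sensor readings
print(distance_based_outliers(data, d_min=5, f_min=0.5))
# -> only the last reading (95) is flagged; every other point has at least
#    a fraction (1 - f_min) of the data within d_min of it.
```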

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
62
Q

How does the Kalman filter effectively handle noise and missing values in sensory data?

A) By applying a high-pass filter to remove noise and interpolate missing values based on the median value of adjacent data points.
B) By predicting the current state based on previous states and measurements, while minimizing the error covariance to handle noise and estimate missing values.
C) By compressing the data to reduce noise and using pattern recognition to fill in missing values.
D) By transforming sensory data into a frequency domain and filtering out frequencies that correspond to noise and missing data.

A

B

The Kalman filter predicts the current state of the system using a mathematical model of the system’s dynamics. This model accounts for the previous state and any control inputs that might affect the current state.
The filter estimates the uncertainty of its predictions and measurements (error covariance) and uses this estimation to weight its predictions and measurements. This process helps in effectively reducing the impact of noise in the data.
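A minimal 1-D Kalman filter sketch (a random-walk state model of our own choosing; the noise variances q and r are illustrative). A value of None marks a missing sample, in which case only the prediction step runs, which is how the filter can impute gaps:

```python
def kalman_1d(measurements, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    x, p = x0, p0                      # state estimate and its error covariance
    estimates = []
    for z in measurements:
        p = p + q                      # predict: state assumed constant, uncertainty grows
        if z is not None:
            k = p / (p + r)            # Kalman gain weighs prediction vs. measurement
            x = x + k * (z - x)        # update with the new measurement
            p = (1 - k) * p            # shrink the error covariance
        estimates.append(x)
    return estimates

noisy = [1.0, 1.2, None, 0.9, 5.0, 1.1, None, 1.0]   # a spike and two gaps
print([round(v, 2) for v in kalman_1d(noisy, x0=1.0)])
# the spike is damped rather than copied, and the gaps are filled by predictions
```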

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
63
Q

Which of the following is most commonly used to measure heart rate variability through sensor data collection?

A) EDA (Electrodermal Activity)
B) ECG (Electrocardiogram)
C) PPG (Photoplethysmogram)
D) EEG (Electroencephalogram)
E) EOG (Electrooculogram)

A

B

ECG is widely recognized for its ability to measure the electrical activity of the heart, making it especially useful for assessing heart rate variability (HRV), among other cardiac functions. PPG is also used to measure heart rate by detecting blood volume changes in the microvascular bed of tissue, but ECG is more directly associated with heart rate variability due to its detailed capture of the electrical signals that trigger heartbeats.
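An illustrative sketch (ours, not from the lecture) of turning beat-to-beat R-R intervals, which an ECG provides directly from its R peaks, into a common HRV summary statistic:

```python
import numpy as np

rr_ms = np.array([812, 798, 840, 825, 810, 845, 800])  # hypothetical R-R intervals (ms)
diffs = np.diff(rr_ms)                                  # successive differences
rmssd = np.sqrt(np.mean(diffs ** 2))                    # RMSSD, a standard HRV measure
print(f"mean HR ~ {60000 / rr_ms.mean():.1f} bpm, RMSSD = {rmssd:.1f} ms")
```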

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
64
Q

Which statement is correct about ground truth labeling?(choose 1)
a. The Experience Sampling Method is typically conducted in a natural setting
b. Elicitation means the observer will follow up and label the object in real-time.
c. In a natural setting, collectors ask users to follow predetermined scenarios to collect data.
d. The “Observation” refers to requesting individuals to label their current state.

A

a

“Elicitation” involves collectors asking users to adhere to predefined scenarios in order to gather data. “Natural setting” entails prompting individuals to label their current activity. “Observation” refers to directly observing and recording the characteristics or situations of a subject. The Experience Sampling Method (ESM) involves sending messages to users prompting them to input their current state (label), and it is typically conducted in a natural setting. Therefore, option (a) is correct.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
65
Q

Which statement is incorrect about outlier? (choose 2)
a. The average is more robust than the median, because outliers pull the median toward themselves.
b. An outlier refers to data that is significantly distant from other data points
c. A lower reliability of a sensor could cause outliers.
d. It’s always best to get rid of outliers.

A

a,d

The median is more robust than the mean in the presence of outliers. Outliers can significantly affect the mean, pulling it towards their extreme values. However, the median is less influenced by outliers since it depends only on the middle value of the dataset. Therefore, (a) is incorrect.

An outlier is a data point that significantly differs from other data points in a dataset. Low reliability of a sensor means that precise measurements cannot be guaranteed, thereby diminishing confidence in the measurement results. Therefore, (b) and (c) are correct statements.

It’s not always best to remove outliers. While outliers can distort the distribution of data, they can also provide valuable insights or indicate important anomalies in the data. Removing outliers without proper justification or understanding of their origins can lead to biased or inaccurate analyses. Therefore, (d) is incorrect.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
66
Q

Which of the following physiological responses is mainly used as an indicator of emotional arousal?

  1. Electro-Cardio-gram (ECG)
  2. Photo-Plethysmo-gram (PPG)
  3. Electro-Encephalo-gram (EEG)
  4. Electro-Dermal Activity (EDA)
A

4

Electro-Cardio-gram (ECG): ECG measures the electrical activity of the heart.
Photo-Plethysmo-gram (PPG): PPG measures blood volume changes in the microvascular bed of tissue, which can indirectly reflect changes in emotional states through variations in heart rate. However, like ECG, it is not specifically used to assess emotional arousal.
Electro-Encephalo-gram (EEG): EEG records electrical activity of the brain and is crucial in neurological research and diagnosis.
Electro-Dermal Activity (EDA) primarily measures the changes in the electrical conductance of the skin due to sweat gland activity. When a person experiences emotional arousal, particularly through the activation of the sympathetic branch of the autonomic nervous system, sweat gland activity increases, leading to a higher skin conductance. Therefore, EDA is used as an indicator of emotional arousal and responsiveness to psychologically significant stimuli. This physiological response is more sensitive to emotional changes rather than thermal stimuli, making it a valuable tool in assessing emotional states and reactions. Thus, the correct answer is 4.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
67
Q

Consider the data set consisting of:
{88, 23, 131, 36, 1001, 294, 391, 1, -2, 94, -99, 2, 82, 42, -11, 43} (N = 16, mean = 106)
The data below the 5th percentile lies between -99 and 9.75, while the data above the 95th percentile lies between 148.6 and 1001.
Which of the following is the correct result after 90% winsorization?

a) {88, 23, 131, 36, 9.75, 9.75, 9.75, 148.6, 148.6, 94, 148.6, 148.6, 82, 42, 148.6, 43}
b) {88, 23, 131, 36, 148.6, 294, 391, 1, -2, 94, 9.75, 2, 82, 42, -11, 43}
c) {88, 23, 131, 36, 148.6, 148.6, 148.6, 9.75, 9.75, 94, 9.75, 9.75, 82, 42, 9.75, 43}
d) {88, 23, 131, 36, 0, 294, 391, 1, -2, 94, 0, 2, 82, 42, -11, 43}

A

c

Winsorization is a method used to mitigate the impact of outliers by substituting extreme values with values closer to the rest of the data. Specifically, in a 90% winsorization, the lowest 5% of the data are replaced with the value at the 5th percentile, and the highest 5% of the data are replaced with the value at the 95th percentile.
Given that the data below the 5th percentile lies between -99 and 9.75, and the data above the 95th percentile lies between 148.6 and 1001, the correct result after 90% winsorization should replace the lowest 5% of the data with 9.75 and the highest 5% with 148.6.
Option c) accurately reflects this process by substituting the appropriate values with 9.75 and 148.6, resulting in the winsorized dataset. Hence, option c) is the correct answer.
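A small sketch reproducing the card's result by clipping to the stated percentile boundaries (9.75 and 148.6 are taken as given here; the exact boundary values depend on the percentile convention used):

```python
import numpy as np

data = np.array([88, 23, 131, 36, 1001, 294, 391, 1, -2, 94, -99, 2, 82, 42, -11, 43])
low, high = 9.75, 148.6                 # 5th and 95th percentile boundaries from the question
winsorized = np.clip(data, low, high)   # values below 9.75 -> 9.75, above 148.6 -> 148.6
print(winsorized.tolist())              # matches option c)
```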

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
68
Q

Which of the following best describes the Experience Sampling Method (ESM)?

A. ESM typically gives questionnaires to participants in controlled laboratory environments.
B. A method used in qualitative research for participant observation
C. A research approach involving random sampling of experiences in real-time
D. ESM primarily relies on retrospective self-reports to gather data about individuals’ experiences and behaviors.

A

C

Option A is incorrect because ESM is a research procedure for studying what people do, feel, and think during their daily lives, not one confined to controlled laboratory environments.

Option B is incorrect because in ESM participants report on their current experiences themselves rather than being observed by others.

Option C is correct because participants are prompted at random intervals to report on their current experiences in real time.

Option D is incorrect because ESM involves collecting data in real time, rather than relying on retrospective self-reports.

The Experience Sampling Method (ESM) involves collecting data on participants’ experiences, behaviors, or thoughts in real time, as they occur in their natural environment. Participants are typically prompted at random intervals to report on their current experiences, providing researchers with insights into everyday life. This allows researchers to capture momentary experiences and reduce the recall bias that may occur with traditional retrospective self-report measures.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
69
Q

Which two of the following are the most likely applications of the data gathered from EDA (Electrodermal Activity) and PPG (Photoplethysmogram) sensors in wearable devices?

A) Personalizing hydration reminders.
B) Detecting nightmares.
C) Tracking the user’s surrounding pressure levels.
D) Measuring the speed of the user’s movement.

A

A, B

Correct answers: A) and B)
Option A: EDA sensors can detect changes in sweat gland activity. An increase in sweating may indicate dehydration, which could be used as an indicator for an application to suggest hydration to the user.
Option B: EDA readings reflect changes in sweat gland activity, which can be correlated with stress during nightmares. With PPG, one of the metrics that can be observed is BPM, which is usually tied to emotional stress.

Explanations for the incorrect options:
Option C: This is typically measured with barometers.
Option D: These kinds of measurements typically rely on motion sensors such as accelerometers and gyroscopes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
70
Q

Which of the following scenarios involving sensor readings from wearable devices is most likely to produce actual outliers that require removal or further processing?

A) Spikes in EDA (measures skin conductivity) readings when a user is doing an intense physical exercise.
B) Low ECG (measures heart rate) readings while a user is asleep.
C) Sudden drop in PPG (measures heart rate) readings when the wearable device briefly loses contact with the user’s skin.
D) Consistently high SpO2 (measures oxygen level) readings of a user

A

C

Correct answer: C)
Since the sudden drop in the measurement value is caused by mechanical failure or interference while gathering the data, this can be considered as an outlier that does not reflect the wearer’s physiological state and should be removed.

Explanation for the incorrect options
Option A: Spikes in EDA readings can occur during intense physical activity due to increased sweat production, which is a normal physiological response and not necessarily an outlier.
Option B: Lower ECG readings during sleep are expected due to the decreased heart rate as the body enters a state of rest, which can’t be categorized as outliers
Option D: This can be categorized as a systematic error, which can be addressed by recalibrating the device or adjusting the values based on domain knowledge

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
71
Q

Which sensor technology is used to measure electrodermal activity (EDA)?

A) Gyroscope
B) Photo-plethysmo-gram (PPG)
C) Galvanic skin response (GSR)
D) Accelerometer

A

C

Electrodermal activity (EDA), also known as galvanic skin response (GSR), is a measure of the skin’s conductivity, which changes in response to sweat gland activity. This activity is primarily controlled by the sympathetic nervous system, making EDA a useful indicator of emotional arousal. When a person experiences emotional arousal, the sympathetic branch of the autonomic nervous system becomes more active, leading to increased sweat gland activity and, consequently, higher skin conductance. This physiological basis allows EDA measurements to serve as indicators of psychological or emotional states.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
72
Q

According to the document, which method is described as a way to handle outliers by substituting extreme values with less extreme ones, thereby reducing the influence of potentially spurious outliers?

A) Kalman Filtering
B) Chauvenet’s Criterion
C) Winsorizing
D) Local Outlier Factor

A

C

A) Kalman Filtering: Kalman Filtering is a recursive algorithm used for estimating the state of linear dynamic systems from a series of incomplete and noisy measurements. It’s not specifically designed for handling outliers in a statistical dataset.
B) Chauvenet’s Criterion: Chauvenet’s Criterion is a rule for identifying and removing outliers from a dataset. It determines whether a data point should be considered an outlier based on the probability of its deviation from the mean, which is not the method described in the question.
C) Winsorizing is a statistical transformation method used to reduce the effect of possibly spurious outliers by substituting extreme data points with less extreme ones. This could involve replacing values below the 5th percentile and above the 95th percentile with values closer to the median, thus mitigating the impact of outliers on the dataset.
D) Local Outlier Factor (LOF): LOF is an algorithm for identifying density-based local outliers, particularly in datasets with clusters. It measures the local deviation of a data point with respect to its neighbors, flagging points whose density differs markedly from that of their neighborhood.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
73
Q

What would be the result of applying a 80% Winsorization to the following dataset: {10, 15, 20, 25, 30, 35, 40, 45, 50, 100}?

A) {10, 15, 20, 25, 30, 35, 40, 45, 50, 50}
B) {15, 15, 20, 25, 30, 35, 40, 45, 50, 50}
C) {20, 20, 20, 25, 30, 35, 40, 45, 50, 50}
D) {20, 20, 25, 25, 30, 35, 40, 45, 50, 50}

A

C

An 80% Winsorization keeps the central 80% of the distribution and replaces the lowest 10% and the highest 10% of values with the lower and upper boundary values, here taken to be the 10th-percentile value 20 and the 90th-percentile value 50. Every value below 20 (the 10 and the 15) is raised to 20 and every value above 50 (the 100) is lowered to 50, resulting in {20, 20, 20, 25, 30, 35, 40, 45, 50, 50}.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
74
Q

Which ground truth labeling method would be most appropriate for accurately tracking the sleeping patterns of individuals using wearable devices?

A) Elicitation through predetermined scenarios
B) Natural observation through real-time sensors
C) Emotion observation through facial recognition
D) Experience sampling through random user prompts

A

B

Natural observation through real-time sensors would be the most suitable method for tracking sleeping patterns using wearable devices. This method involves directly monitoring physiological signals such as heart rate, movement, and sleep stages using sensors embedded in the wearable device. It allows for continuous and accurate tracking of sleeping patterns without relying on user input or predetermined scenarios.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
75
Q

Electrodermal Activity (EDA) is primarily used to measure:
A) Heart rate variability
B) Muscle tension
C) Skin conductivity changes due to sweat gland activity
D) Brain wave patterns

A

C

Correct answer: C.
Electrodermal Activity (EDA) is a physiological response that measures the electrical conductance of the skin, which varies with the activity of the sweat glands.
Heart rate variability is often measured using electrocardiography (ECG). Electromyography (EMG) sensors are used to measure muscle tension or activity. Brain wave patterns can be monitored using Electroencephalography (EEG).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
76
Q

How does the presence of outliers in a dataset affect the mean?
A) It has no effect on the mean, making it a reliable measure in all cases.
B) It causes the mean to shift towards the outliers, potentially misrepresenting the data’s central tendency.
C) It makes the mean calculation computationally easier.
D) It reduces the mean’s value, making lower values more prevalent.

A

B

Correct answer: B. The mean is calculated by summing all the values in a dataset and then dividing by the number of values. This calculation method means that every value, no matter how large or small, influences the result. When there are outliers they can significantly skew the mean. This skew can lead the mean to misrepresent the central tendency of the data, giving a distorted view of what’s typical or common within the dataset.
A is incorrect because outliers can significantly shift the mean away from the central mass of the data, making it a less reliable measure of central tendency in distributions with outliers.
C is incorrect because the process for calculating the mean (adding all the values together and then dividing by the number of values) remains the same regardless of whether outliers are present or not.
D is incorrect because outliers can either increase or decrease the mean, depending on whether the outliers are significantly higher or lower than the rest of the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
77
Q

Suppose you need to design an experiment to understand how users interact with a new software application under specific conditions. You plan to have users perform predetermined tasks with the software within a controlled environment to obtain their particular behaviours or responses.

Which of the following activities best represents the case study above?
A) Observation
B) Natural
C) Elicitation
D) Emotion

A

C

In the described case study, you want to set up a controlled environment with predefined scenarios for the users to perform tasks using the new software. This method is known as elicitation because it actively creates situations designed to draw out specific responses or interactions from the participants. Unlike naturalistic observation, where the researcher would observe and record behaviours without intervention, or natural labelling, where users report on their activities as they change, elicitation deliberately induces a certain environment or set of circumstances to gather data on how users behave under those particular conditions. Therefore the answer is C.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
78
Q

Arnold is participating in a study analyzing his movements during outdoor activities using a GPS device and a step counter. However, the data collected contains some irregularities and missing values. The research team decides to use a method that can detect outliers and simultaneously impute missing data points, taking advantage of their understanding of Arnold’s typical movement patterns and the reliability of the devices used. Given the scenario and the tools at hand, which method would be the most suitable for processing Arnold’s movement data?

A) Mean/Median/Mode imputation, because it is a straightforward technique that can replace missing data with the most frequently observed values.
B) Kalman filter, because it not only detects outliers in Arnold’s position and velocity but also imputes missing values using the GPS data and step counter measurements.
C) Interpolation-based imputation, because it can fill in missing values based on the data points immediately before and after the gaps.
D) Winsorizing, because it can adjust the extreme outliers to a specified percentile, thus limiting their impact on the analysis.

A

B

The correct answer is B. The Kalman filter is an advanced algorithm that excels in situations where there is a need to estimate the state of a dynamic system over time. In Arnold’s case, the dynamic system is his movement through space as tracked by GPS and a step counter. The Kalman filter provides a more precise and tailored approach to cleaning and imputing the data in this context, as opposed to more general methods like mean/median/mode imputation or interpolation, which do not utilize the additional information available about the system’s behavior over time. Winsorizing is not as suitable since it primarily addresses extreme values rather than missing data points and does not take advantage of the dynamic model of Arnold’s movements.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
79
Q

In the lecture, the 3MAD (Median Absolute Deviation) and 3sigma (standard deviation) methods for outlier detection were introduced. It is known that 3MAD is generally more robust than 3sigma. Which of the following statements is a correct reason for this?

A) 3sigma is appropriate even if the dataset does not follow a normal distribution.
B) 3sigma uses the mean to calculate the standard deviation, making it more susceptible to outliers.
C) 3MAD is less robust than 3sigma because MAD is more influenced by outliers in the data.
D) 3sigma is more robust to outliers because it squares the differences from the mean, reducing the impact of large deviations compared to MAD.

A

B

B is correct: 3sigma uses the mean to calculate the standard deviation, making it more susceptible to outliers. Since the mean is sensitive to outliers, and the standard deviation is derived from the mean, the 3sigma method is also sensitive to outliers. This sensitivity makes it less robust than 3MAD, which uses the median.

A is incorrect: 3sigma assumes the data follows a normal distribution. If the data is not normally distributed, using 3sigma for outlier detection may not be appropriate or effective.

C is incorrect because it states the opposite of the established fact: MAD is less influenced by outliers than the mean. MAD is built on the median, which is more robust to outliers, making 3MAD more robust for outlier detection than 3sigma.

D is misleading. While squaring the differences does penalize larger deviations more, it actually makes the 3sigma method more sensitive to outliers, not less. Outliers, which deviate strongly from the mean, have an even larger impact after being squared, making 3sigma less robust to outliers than 3MAD.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
80
Q

Which of the following is the cheapest option to get heart rate variability?
A) EOG (Electrooculogram)
B) ECG (Electrocardiogram)
C) EDA (Electrodermal Activity)
D) PPG (Photoplethysmogram)
E) EEG (Electroencephalogram)

A

D

The correct answer is PPG. Photoplethysmography (PPG) is a simple and low-cost optical technique that can be used to detect blood volume changes in the microvascular bed of tissue. It is widely used to measure heart rate and heart rate variability (HRV) by detecting the pulse wave that travels through the blood vessels each time the heart beats. PPG sensors are commonly found in many consumer-grade wearable devices, such as fitness trackers and smartwatches, because of their cost-effectiveness and ease of integration.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
81
Q

Alex is participating in an experiment on motion recognition using biosensors in smartwatches. He was told to download an app on his smartwatch, and the app routinely notifies him to select his current activity from options such as sitting still, running, etc. During the day, the app randomly notifies him 6 times, except at night. After a notification he has to select his current activity within 3 minutes. Alex was told that the experiment will end next Friday, so he has to wear the smartwatch for 5 days.

In the experience sampling method above, which parameter should the practitioners additionally consider?

A) notification schedule
B) notification expiry
C) inter-notification time
D) study duration

A

C

A) Notification schedule: This refers to the timing and frequency of notifications sent to participants. The description mentions that notifications are sent randomly six times a day, except at night, which indicates that the schedule is already a consideration
B) Notification expiry: This is the window of time within which participants must respond to a notification, set at 3 minutes in the study.
C) Inter-notification time: This refers to the time between consecutive notifications. Although notifications are said to be random, ensuring a minimum or maximum time between them (to avoid clustering or long gaps) is crucial for balancing data collection throughout the day and reducing participant burden.
D) Study duration: The overall length of the study is mentioned as 5 days, ending next Friday.
However, the practitioner didn’t consider inter-notification time, i.e. time gap between random notifications. Hence, the answer is C

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
82
Q

Among the outlier detection options below, which one does not require a distributional (normality) assumption?

A) 3 sigma rule
B) Inter Quantile Range (IQR) method
C) Local Outlier Factor
D) Chauvenet’s criterion

A

C

A) The 3 sigma rule comes from the fact that if X follows a normal distribution with mean μ and standard deviation σ, then X falls within the interval [μ − 3σ, μ + 3σ] with probability ≈ 99.7%, so the rule relies on normality. B) The IQR method’s conventional 1.5 × IQR cut-off is likewise calibrated with the normal distribution in mind. D) Chauvenet’s criterion explicitly assumes a normal distribution when computing its probability band around the mean. C) The Local Outlier Factor compares the local densities of neighbouring points and makes no distributional assumption, which is why it is the answer.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
83
Q

Which of the following statements best explains why 3MAD is considered more robust than 3σ?

A) MAD is calculated based on the mean, which is less sensitive to outliers than the median used in σ calculation.

B) MAD is more resistant to deviations from normality and works well with non-normal distributions, unlike σ, which assumes a normal distribution.

C) MAD provides consistent estimates of dispersion even in the presence of outliers, while σ tends to overestimate the spread of the data in the presence of extreme values.

D) MAD focuses on the median, which is less influenced by extreme values, resulting in a more robust estimate of dispersion compared to σ, which can be inflated by outliers.

A

D

A) MAD (Median Absolute Deviation) is calculated based on the median, not the mean. MAD measures the dispersion of a dataset by calculating the median of the absolute deviations from the median.

B) This statement is partially correct. MAD is indeed more robust against deviations from normality and can work well with non-normal distributions. However, standard deviation (σ) does not necessarily assume a normal distribution; it is commonly used with various distribution types.

C) While MAD does provide consistent estimates of dispersion even in the presence of outliers, this statement oversimplifies the comparison with σ. Standard deviation (σ) can also be robust to outliers under certain conditions, especially if the data follows a normal distribution or if robust methods like Winsorized standard deviation are used.

D) Correct. MAD focuses on the median, which is less influenced by extreme values, resulting in a more robust estimate of dispersion compared to σ, which can be inflated by outliers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
84
Q

Which of the following statements accurately describes the difference between PPG and ECG for heart rate monitoring?

A) ECG directly measures heart activity by detecting electrical signals produced by the heart muscle.

B) PPG relies on optical measurements, capturing blood flow changes via a small LED light.

C) ECG is more reliable than PPG in measuring heart rate.

D) PPG sensors are ideal for average or moving average measurements.

E) Both A and B are correct.

A

A, B, E

A) ECG (Electrocardiogram) directly measures heart activity by detecting electrical signals produced by the heart muscle. It provides precise information about the heart’s electrical conduction system and is commonly used in medical settings.

B) PPG (Photoplethysmography) relies on optical measurements, capturing blood flow changes via a small LED light. PPG sensors typically measure changes in light absorption caused by blood volume changes, providing an indirect measurement of heart rate and blood flow.

C) While ECG is often considered the standard for measuring heart rate due to its accuracy and direct measurement of cardiac electrical activity, PPG can also be reliable when used properly, especially in consumer-grade devices.

D) PPG sensors can provide real-time heart rate measurements, and while they can calculate moving averages, they are not limited to this type of measurement.

E) As A) and B) are correct, this option is also correct.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
85
Q

What does the Kalman filter technique primarily address?

A) Only outlier detection
B) Only missing value imputation
C) Both outlier detection and missing value imputation
D) Neither outlier detection nor missing value imputation

A

C

The Kalman filter is a sophisticated method that is used for both detecting outliers and imputing missing values, leveraging prior knowledge about the data’s process and measurement models.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
86
Q

Which of the following is NOT a method for detecting outliers?

A) Using Chauvenet’s criterion
B) Using Gaussian Mixture Models
C) Using the Least Squares Method
D) Using the Local Outlier Factor

A

C

The Least Squares Method is primarily used for modeling relationships in regression analysis, not for detecting outliers. Chauvenet’s criterion, Gaussian Mixture Models, and the Local Outlier Factor are techniques that can be used for outlier detection.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
87
Q

You are working on a project involving correct exercise form for weighted exercises (deadlift, bicep curl, etc.) and are tasked to collect data (sensor and video).
You are told that not all of the participants know/use the right form during exercise. Which is the best ground truth labelling method for labelling the collected data then?

A: Elicitation (using only participants that know and can execute the proper form)
B: Elicitation (using all participants)
C: Natural
D: Observation

A

D

In order to collect sufficient data, you need both positive and negative samples (people with good and bad form); thus A is not valid. Since human movement can vary based on a variety of factors, someone with good form might execute a movement incorrectly because they don’t feel well, thus making option B invalid. Using similar logic option C is also invalid; participants might think they have the correct form without knowing that they are doing an exercise incorrectly. Option D is correct since you can see if someone performed a movement correctly, making it the most reliable.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
88
Q

Assume you are processing some temperature data and discover that there are a couple of missing values. Which imputation strategy would work the best? Assume that the temperature data is sinusoidal.

A: Mean
B: Median
C: Mode
D: Interpolation

A

D

When given sinusoidal data, mean and median imputation will produce a straight line through the middle of the sinusoid; although this fills in the missing data, it does so poorly. A similar problem occurs with mode imputation, except the straight line can sit at any level. Interpolation is the best option since it preserves the sinusoidal shape.
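A short sketch contrasting the two strategies on synthetic sinusoidal data (the use of pandas and the made-up temperature series are our assumptions):

```python
import numpy as np
import pandas as pd

t = np.arange(24)
temps = pd.Series(10 + 5 * np.sin(2 * np.pi * t / 24))  # one daily temperature cycle
temps.iloc[[6, 7, 15]] = np.nan                         # a few missing readings

mean_filled = temps.fillna(temps.mean())                # flat line through every gap
interp_filled = temps.interpolate(method="linear")      # follows the rising/falling trend

print(interp_filled.iloc[[6, 7, 15]].round(2).tolist())
```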

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
89
Q

Which of the following statements is incorrect? (Select two)

  1. Pseudo sensors of smartphones can be used for activity recognition.
  2. Showing sad videos to obtain ground truth labeling should be avoided as it may affect users’ emotions.
  3. Users can be instructed to engage in activities such as running or walking to gather ground truth labels for activity.
  4. According to the Experience Sampling Method, individuals can be asked about their current emotional state to collect emotion ground truth labels.
  5. Self-labeling is always more accurate than labeling from observation.
A

2, 5

Software-based pseudo sensors on smartphones (for example, a detected activity type) can be used for activity recognition.
For ground truth labeling of emotion, showing a video that elicits a specific emotion and then asking the user to rate their current emotion is an accepted method, so showing sad videos does not need to be avoided.
Users may be required to follow predetermined scenarios for ground truth labeling of activities.
According to the Experience Sampling Method, users can be asked to randomly label a current emotion state.
It cannot be said which is always more accurate: self-labeling or labeling from observation.
Therefore, 2 and 5 are incorrect.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
90
Q

Consider the data set consisting of: {-63, -2, 14, 18, 25, 56, 75, 87, 92, 1028} (N=10)

The following are the ranges of inliers after applying outlier detection to the data:
- 3σ rule: [-771.9, 1037.9]
- 3MAD rule: [-130.7, 211.7]

With the 3σ rule, the number of inliers is (a)___. With the 3MAD rule, the number of inliers is (b)___. Based on this, 3(c)___ is more robust than 3(d)___.

Which of the following is correct for (a), (b), (c), and (d)?

  1. 10, 10, MAD, σ
  2. 10, 9, MAD, σ
  3. 10, 9, σ, MAD
  4. 9, 9, σ, MAD
A

2

To solve this problem, we need to know about distribution-based outlier detection. After applying the 3σ rule, the number of inliers is 10 because all values in the data lie within the range of inliers. On the other hand, after applying the 3MAD(Median Absolute Deviation) rule, ‘1028’ is outside the range of inliers, so the number of inliers is 9. Based on this, 3MAD is more robust than 3σ.
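A sketch checking both rules on this card's data. The MAD bound uses the usual 1.4826 consistency constant, which reproduces the stated range; that scaling is our assumption about the lecture's convention:

```python
import numpy as np

x = np.array([-63, -2, 14, 18, 25, 56, 75, 87, 92, 1028])

mu, sigma = x.mean(), x.std()                 # population standard deviation
lo_s, hi_s = mu - 3 * sigma, mu + 3 * sigma   # ~[-771.9, 1037.9]

med = np.median(x)
mad = 1.4826 * np.median(np.abs(x - med))
lo_m, hi_m = med - 3 * mad, med + 3 * mad     # ~[-130.7, 211.7]

print(((x >= lo_s) & (x <= hi_s)).sum())      # 10 inliers under the 3-sigma rule
print(((x >= lo_m) & (x <= hi_m)).sum())      # 9 inliers under 3MAD (1028 excluded)
```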

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
91
Q

In the context of ground truth labeling for sensor data collection, what does the “Natural” method of labeling refer to?

a) Predefining activities for users to perform and label accordingly.
b) Users annotating their current activity or emotion when a change is detected.
c) Observers recording and labeling a user’s activity or emotion from a distance.
d) Implementing automated algorithms to label sensor data outputs.

A

b

a) Incorrect. This describes the “Elicitation” method where users follow predetermined scenarios, not the “Natural” method which involves in-situ labeling by the users themselves.
b) Correct. The “Natural” method involves in-situ labeling, which means asking people to label their current activity or emotion whenever there’s a change of activity.
c) Incorrect. This option describes the “Observation” method where labeling is done by an observer or through video recording with post-hoc labeling, not by the users themselves in their natural setting.
d) Incorrect. The use of automated algorithms to label sensor data would not involve direct user input and hence does not align with the “Natural” method of ground truth labeling, which relies on user-generated labels.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
92
Q

In sensor data analysis, why might a researcher use the Kalman filter for outlier detection and imputation?

a) To substitute extreme values with more typical ones using Winsorization.
b) To assume a single distribution for attribute noise reduction.
c) To employ prior knowledge of process and measurement models for data calibration.
d) To automate the detection of outliers based on local density and distance factors.

A

c

a) Incorrect. Winsorizing modifies extreme data points to reduce the impact of outliers but does not incorporate prior knowledge about the data, which is the key aspect of the Kalman filter’s approach.
b) Incorrect. A single distribution assumption for attribute noise reduction is a characteristic of distribution-based outlier detection methods, not the Kalman filter which utilizes a model-based approach.
c) Correct. The Kalman filter uses models of the system’s process and measurement to predict and correct state estimates, which is particularly useful in sensor data calibration and addressing outliers and missing values.
d) Incorrect. Automating outlier detection based on local density and distance factors refers to other methods like the local outlier factor (LOF), which are different from the Kalman filter’s model-based prediction and correction approach.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
93
Q

Which one is not a method for ground truth labeling of sensor dataset?

a. Elicitation: Asking users to follow predetermined scenarios
b. Natural: in-situ labeling - asking people to label a current activity
c. Auto Inference: Using pretrained ML models to assign labels
d. Observation: real time following or video recording with post hoc labeling

A

c

Elicitation, Natural (in-situ labeling), and Observation are methods for ground truth labeling of a sensor dataset. Automatically inferring labels with a pretrained ML model produces predicted labels rather than ground truth, so (c) is not a ground truth labeling method.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
94
Q

Which outlier detection method assumes the data to be normally distributed?

a. Simple search based method
b. Local outlier factor
c. Chauvenet’s criterion
d. Isolation forest

A

c

Chauvenet’s criterion assumes the data follow a normal distribution and rejects points whose deviation from the mean is too improbable under that assumption. The simple search-based method, the Local Outlier Factor, and Isolation Forest do not assume a particular distribution.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
95
Q

Which of the following is NOT an example of Experience Sampling Method (ESM)?

A) Random notification schedule with a maximum of 10 times a day.
B) Hourly interval notification schedule.
C) Notifications triggered by incoming calls and app use.
D) Daily survey sent at a fixed time each day.

A

D

ESM typically employs random or interval notification schedules, triggering prompts at various times throughout the day to capture momentary experiences. These prompts can also be event-based, such as when specific actions occur (e.g., incoming calls, app use).

Option D describes a daily survey sent at a fixed time each day, which does not align with the principles of ESM. In ESM, the timing of prompts is variable and often unpredictable, aiming to capture experiences as they naturally occur rather than at pre-scheduled intervals. Therefore, a daily survey sent at a fixed time each day does not fit the criteria of ESM and is not considered an example of this method.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
96
Q

Which of the following scenarios is incorrect when handling missing data through imputation methods?

A) Filling missing values with the median of the observed data.
B) Estimating missing values based on the trend between neighboring data points.
C) Predicting missing values using a regression model trained on the available data.
D) Discarding observations with missing values to maintain data integrity.

A

D

Imputation methods are techniques used to handle missing data by replacing them with estimated values. Options A, B, and C describe typical scenarios of imputation. Option D is incorrect because it involves removing valuable data points rather than imputing missing values. This approach can lead to biased results and reduced sample sizes, potentially compromising the validity of the analysis. Therefore, discarding observations with missing values is not considered a proper imputation method.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
97
Q

What is the primary purpose of ground truth labelling in sensor data collection?
a) To increase the storage capacity of sensors
b) To calibrate sensor accuracy
c) To create a reference for data analysis
d) To reduce the cost of sensor manufacturing

A

C

Ground truth labelling involves assigning known labels to data collected from sensors, providing a reference against which the data can be analyzed and the performance of sensing systems can be evaluated.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
98
Q

Considering the use of Winsorizing in data preprocessing, what is the primary goal when applying this technique to a dataset with extreme outliers?
a) To increase the range of data by extending extreme values.
b) To replace all data points with the mean to simplify analysis.
c) To limit extreme values to reduce their influence on the analysis.
d) To evenly distribute all data points across the dataset.

A

C

Winsorizing is a method of limiting extreme values in the dataset to reduce the effect of potentially spurious outliers. By capping extreme values to a certain percentile at both ends of the data range (e.g., the 5th and 95th percentiles), Winsorizing reduces the influence of outliers on the analysis, leading to more robust statistical estimates.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
99
Q

What does EDA (Electro-Dermal Activity) measure and why is it significant? (select one)
A) The electrical conductivity of the skin to track device usage patterns.
B) The skin’s momentary electrical conductivity changes in response to stimuli, indicating emotional arousal.
C) The ambient temperature around the device to adjust screen brightness.
D) The battery life of wearable devices for efficient energy consumption.

A

B

EDA measures the skin’s electrical conductivity, which changes momentarily in response to various stimuli. These changes are primarily due to the activity of the sweat glands, controlled by the sympathetic nervous system, making EDA a valuable indicator of emotional arousal. This makes it a crucial measure in studies related to stress, excitement, or emotional states.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
100
Q

What is Chauvenet’s criterion used for in the context of outlier detection, and what does it entail? (select one)
A) Assuming data follows a single distribution to identify outliers.
B) Using the k-nearest neighbors to measure the local density around a point.
C) Finding a probability band centered on the mean to reasonably contain all samples.
D) Substituting extreme values with less extreme values to reduce outliers’ effects.

A

C

Chauvenet’s criterion involves determining a probability band centered on the mean of a normal distribution that should reasonably contain all samples in the dataset. It helps identify outliers by excluding data points that fall outside this band, assuming a normal distribution.
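A hedged Python sketch of the criterion (SciPy's normal tail function and the example readings are our assumptions): a point is rejected when the expected number of samples at least as far from the mean is below 0.5.

```python
import numpy as np
from scipy.stats import norm

def chauvenet(values):
    x = np.asarray(values, dtype=float)
    mu, sigma = x.mean(), x.std()
    # two-sided probability of deviating from the mean at least as much as x_i
    p_tail = 2 * norm.sf(np.abs(x - mu) / sigma)
    return p_tail * len(x) < 0.5        # True -> outside the probability band

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 25.0]
print(chauvenet(data))                  # only the 25.0 reading is flagged
```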

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
101
Q

What could be a potential challenge when measuring physiological signals in a real-world setting?
a) Limited access to advanced data analysis tools
b) Difficulty in securing participants for the study
c) Inability to accurately measure due to movement interference
d) Lack of trained personnel to operate the measurement devices

A

C

Explanation: Physiological signal measurements can be affected by movement interference, making it challenging to obtain accurate data in real-world settings where participants may be moving or active.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
102
Q

Which of the following statements about the k-nearest neighbors is the least accurate?
a) The value of k should be adjusted according to the distribution and characteristics of the data.
b) A small k value can increase accuracy by making it more sensitive to the distances between data points so that it is less likely to overfit.
c) A large k value can provide more generalized results, but the accuracy may decrease as the classification boundaries become smoother.
d) In dense regions of data, a small k value can be chosen to give more weight to the influence of nearby neighbors, and in sparse regions, a large k value can be chosen to consider a wider area.

A

B

Explanation: While a small k value can be more sensitive to the local data points and potentially improve accuracy, it also makes the model more susceptible to overfitting the training data. This means the model performs well on the training data but may not generalize well to unseen data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
103
Q

Which of the following statements is incorrect? (Select two)

  1. The elicit method for ground truth labeling involves recording sensor data while asking users to follow predetermined scenarios.
  2. Using recorded videos of the user’s facial expressions is the elicit method for labeling emotion data.
  3. Allowing users to watch ‘emotional’ videos constitutes the observation method for ground truth labeling.
  4. Randomly asking a user to label their current emotional state is a natural setting method for labeling ground truth.
A

2,3

Option 2 mislabels an observation method as the elicit method for emotion data, while Option 3 inaccurately assigns an elicit method for emotion data as an observation method for ground truth labeling.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
104
Q

Which of the following statements is incorrect? (Select one)

  1. An outlier is an observation point that is distant from other observations.
  2. Distribution-based outlier detection methods assume a certain distribution of the data.
  3. Distance-based outlier detection methods only consider the distance between data points.
  4. Chauvenet’s criterion is a distance-based outlier detection method.
A

4

Chauvenet’s criterion is a Distribution-based outlier detection method

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
105
Q

What can be measured using both PPG and ECG?

a) Heart rate
b) Blood pressure
c) Blood oxygen saturation
d) Electrical activity of the heart

A

a

PPG and ECG both measure heart rate. PPG measures the blood volume changes in the skin, while ECG records the electrical activity of the heart. Ultimately, both technologies are utilized to measure heart rate. Therefore, (a) is the correct answer.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
106
Q

Which of the following statements is incorrect regarding Chauvenet’s criterion?

a) Chauvenet’s criterion is one of the statistical methods used for outlier detection.
b) Chauvenet’s criterion evaluates the extent to which data points deviate from the range of standard deviation.
c) Chauvenet’s criterion is a method for excluding the largest or smallest values from a data set.
d) Chauvenet’s criterion can be applied to data following a normal distribution.

A

c

Chauvenet’s criterion employs statistical methods for outlier detection. It assesses the extent to which data deviate from the range of standard deviation and can be applied to data following a normal distribution. However, Chauvenet’s criterion does not involve excluding the largest or smallest values from a data set. Instead, it evaluates the degree to which data points deviate from a criterion to identify and potentially remove outliers. Therefore, (c) is the incorrect answer.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
107
Q

What is a primary advantage of using Electrodermal Activity (EDA) sensors in wearable technology for psychological research?

A) They can directly measure cognitive thoughts and processes.
B) They provide a direct measure of environmental temperature.
C) They can indicate emotional arousal by measuring changes in skin conductance.
D) They are primarily used to measure physical activities such as walking or running.

A

c

Electrodermal Activity (EDA), also known as Galvanic Skin Response (GSR), measures the electrical conductance of the skin, which varies with its moisture level. This method is particularly useful in psychological research because the sweat glands are controlled by the sympathetic nervous system, and thus, changes in skin conductance can be indicators of emotional arousal or stress. Unlike cognitive thoughts (A) or measuring environmental factors (B), EDA provides insights into the autonomic physiological responses to emotional stimuli, which are not directly observable. This makes it valuable for studying the subconscious aspects of human emotion and stress responses, far beyond what is possible with measures of physical activity (D)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
108
Q

What is the preferred method for imputing missing values in a time-series dataset where the order of data points is significant, and why?

A) Mean imputation, because it is the simplest method.
B) Mode imputation, because it uses the most frequent value.
C) Winsorizing, because it limits extreme values.
D) Interpolation, because it provides more natural values by considering the temporal order of the data.

A

d

In time-series datasets, where the temporal order and continuity of the data points are important, interpolation is a preferred method for imputing missing values. Unlike mean or mode imputation, which might not account for the time-dependent nature of the data, interpolation uses values from neighboring data points to estimate the missing values. This method ensures that the imputed values follow the dataset’s natural flow and variability over time, leading to more accurate and realistic data restoration. Winsorizing is more about limiting extreme values rather than imputing missing ones and might not be suitable for filling gaps in time-series data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
109
Q

Which of the following best describes the impact of using overlapped windowing as opposed to distinct windowing?

A) Overlapped windowing significantly reduces the computational complexity of feature extraction.
B) Overlapped windowing can lead to overfitting due to the high similarity between features generated from adjacent windows.
C) Overlapped windowing eliminates the need for selecting a window size parameter (λ).
D) Overlapped windowing is only useful for numerical data and cannot be applied to categorical data.

A

B

Overlapped windowing involves choosing how much consecutive windows should overlap, typically to ensure sufficient data coverage and to capture information that spans window boundaries. However, a very high overlap makes the features generated from adjacent windows highly similar, and this limited variation can cause overfitting. This concept is discussed in the context of handling numerical time-domain data, where overlapping windows are used to create features from sensory data.

So the answer is B.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
110
Q

Question: Fill in the blanks with the correct options:

Length is an example of (1)____ data because it can be (2)____ and measured. When variables like length are analyzed using Overlapped Windowing, applying too large a window size can lead to a (3)____ problem.

Options:
A. (1) Numerical, (2) Continuous, (3) Overfitting
B. (1) Categorical, (2) Discrete, (3) Underfitting
C. (1) Numerical, (2) Discrete, (3) Overfitting
D. (1) Categorical, (2) Continuous, (3) Overfitting
E. (1) Numerical, (2) Continuous, (3) Underfitting

A

A

Length is a numerical data type because it quantifies an amount. It’s continuous as it can represent any value within a range, not just integers. When analyzing such data with Overlapped Windowing, choosing too large a window size can lead to overfitting. Overfitting occurs since features become too similar (limited variation)

111
Q

What is the number of windows generated when applying a sliding window technique with a total time window of 12s, a window size of 4s, and an overlap of 50%?
A) 3 windows
B) 4 windows
C) 5 windows
D) 6 windows

A

C

First, calculate the number of windows without overlap: (total time window) / (window size) = 12s / 4s = 3 windows. With a 50% overlap, each window advances by only 2s (half of the window size), so 4s windows start at 0s, 2s, 4s, 6s, and 8s. In total, we get 5 windows: the 3 distinct windows plus 2 additional overlapping ones.
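
The same arithmetic can be checked with a small helper (a hypothetical function, not from the lecture), for any total duration, window size, and overlap ratio:

```python
def count_windows(total_s, window_s, overlap):
    """Count full windows when each window advances by window_s * (1 - overlap)."""
    step = window_s * (1 - overlap)
    n, start = 0, 0.0
    while start + window_s <= total_s:
        n += 1
        start += step
    return n

print(count_windows(12, 4, 0.5))  # 5 -> windows [0-4], [2-6], [4-8], [6-10], [8-12]
print(count_windows(12, 4, 0.0))  # 3 -> the distinct (non-overlapping) case
```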

112
Q
  1. The temperatures in degrees Celsius recorded in a city over a week.
  2. The brands of smartphones used by a group of students.
  3. The ratings of a new movie as ‘poor’, ‘average’, ‘good’, and ‘excellent’.
  4. The ages of participants in a survey.

Choose the correct type of data for each set:

A. 1 - Numerical, 2 - Categorical, 3 - Ordinal, 4 - Numerical
B. 1 - Ordinal, 2 - Numerical, 3 - Categorical, 4 - Categorical
C. 1 - Categorical, 2 - Ordinal, 3 - Numerical, 4 - Ordinal
D. 1 - Ordinal, 2 - Categorical, 3 - Numerical, 4 - Numerical

A

A

Temperatures are numerical as they are continuous measurements.
Smartphone brands are categorical as they represent different groups.
Movie ratings are ordinal as they convey an order of preference or quality.
Ages are numerical as they are discrete measurements.

113
Q

Which of the statements is incorrect about frequency domain features? (Select one)
A. We can use this technique to categorize patterns of machine operations.
B. We can identify walking patterns from accelerometer data.
C. The principle is to measure similarity between k-th sinusoidal basis functions and the original time series.
D. X(k) corresponds to frequency F(k) = fs / (k * N)

A

D

A and B are real-life situations where features can be extracted from the frequency content of time-series signals. C is correct. D is wrong because F(k) = fs * k / N, where k stands for how many periods fit within the N samples.
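
The relation F(k) = fs * k / N can be verified with NumPy's FFT helpers; the sampling rate, signal length, and 5 Hz test tone below are arbitrary illustrative choices:

```python
import numpy as np

fs = 100.0                      # sampling rate in Hz (assumed)
N = 200                         # number of samples (assumed)
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 5.0 * t)            # a 5 Hz sinusoid

X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(N, d=1.0 / fs)     # bin k maps to F(k) = fs * k / N

k = np.argmax(np.abs(X))
print(k, freqs[k])   # k = 10, F(k) = 100 * 10 / 200 = 5.0 Hz
```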

114
Q

What advantage does an overlapped time window approach offer over a distinct time window approach?
1. Reduced memory usage due to smaller data segments
2. Faster processing speed
3. Improved temporal resolution by capturing more frequent updates
4. Simpler implementation without the need for boundary management

A

3

Options 1 and 2 are advantages of a distinct time window over an overlapped time window.
Managing the overlapping boundaries and coordinating the processing of overlapping windows can add complexity to the implementation compared to a distinct time window, so option 4 is incorrect.
In an overlapped time window approach, consecutive windows typically overlap with each other, meaning that each data point is included in multiple windows. This overlapping allows for more frequent updates and a finer-grained analysis of the data. As a result, the temporal resolution, or the ability to capture changes in the data over time, is enhanced. Therefore, option 3 is correct

115
Q

What is a true statement regarding feature extraction?
A) Raw data that is directly used as features is immune to noise and outliers.
B) The primary goal of feature engineering is to increase the complexity of the data rather than simplifying it.
C) Features are exclusively numeric values and cannot encompass strings or categorical values.
D) Feature extraction entails selecting informative and discriminating features to effectively describe various datasets, patterns, or classes.
E) Feature extraction is dispensable when dealing with sensory data collected from sensors such as IMU or bio-physiological sensors.

A

D

Raw data often contains noise and outliers. The primary goal of feature engineering, including feature extraction, is to simplify the data representation while preserving relevant information. Features can include numeric values, strings, or categorical values. Feature extraction is essential when dealing with sensory data. Hence, the correct answer is D.

116
Q

Which of the following pairs does not correctly correspond data with its type?
a. Discrete data: The number of children
b. Continuous data: Weight
c. Nominal data: Hair color
d. Ordinal data: Gender

A

D

a. Discrete data consists of integer values and is measured in a distinct form. The number of children is an integer value and corresponds to discrete data.
b. Continuous data takes on values over an infinite range and is measured in decimal form. Weight corresponds to continuous data.
c. Nominal data represents categories without any order or ranking. Hair color corresponds to nominal data.
d. Ordinal data refers to data with categories that have a specific order or ranking. However, gender is treated as nominal data. Nominal data represents categories without any order or ranking

117
Q

In a survey conducted to understand people’s coffee drinking habits, respondents were asked to state their age, how many cups of coffee they drink per day, their favorite type of coffee, and to rank their preference for coffee on a scale from 1 to 5 (with 5 being the highest preference). Which of the following correctly classifies the types of data collected?
A) Age - Continuous Data, Cups of Coffee per Day - Discrete Data, Favorite Type of Coffee - Nominal Data, Preference for Coffee - Ordinal Data

B) Age - Nominal Data, Cups of Coffee per Day - Continuous Data, Favorite Type of Coffee - Ordinal Data, Preference for Coffee - Categorical Data

C) Age - Ordinal Data, Cups of Coffee per Day - Nominal Data, Favorite Type of Coffee - Discrete Data, Preference for Coffee - Continuous Data

D) Age - Categorical Data, Cups of Coffee per Day - Quantitative Data, Favorite Type of Coffee - Continuous Data, Preference for Coffee - Nominal Data

A

A

The correct answer is A

  • Age is a continuous variable because it can take on any value within a range.
  • The number of cups of coffee per day is discrete because it counts the number of cups, which are whole numbers.
  • The favorite type of coffee is nominal data because it categorizes the coffee types without any order.
  • The preference ranking for coffee is ordinal data because it expresses an order of preference without specifying the exact differences between ranks.

118
Q

In the context of time-domain feature engineering, choosing the overlap ratio impacts the model’s performance due to:

A) Enhanced feature detail but higher compute cost
B) Simplified model at the cost of missing detail
C) No impact on computational resources
D) Improved compute efficiency with potential underfitting

A

A

The choice of overlap ratio is crucial as it influences the level of detail captured in features and the computational resources required. A higher overlap can capture more nuanced patterns, beneficial for the model’s performance but at the expense of increased computational demands.

119
Q

Question: Regarding the window size parameter (λ) in feature engineering from sensory data, which of the following statements is correct? (select one)

A) A large window size invariably increases model variation and prevents overfitting.
B) It directly measures the noise level in sensor data.
C) It exclusively controls the amount of overlap between data windows in a time series.
D) It specifies the length of data segments for summarizing features.

A

D

The window size parameter in feature engineering, denoted by λ, determines the length of data segments over which features are summarized, capturing essential temporal characteristics within those intervals (D). It does not directly measure noise (B). It does not solely control window overlap—overlap is related to overlap ratio rather than size (C). Contrary to statement (A), an overly large window size can actually lead to features being too similar (limited variation) and may result in overfitting, as it might not capture the necessary detail needed for the model to generalize well to new, unseen data.

120
Q

What is the significance of overlapping windows in the context of feature engineering from sensory data?
A) Overlapping windows are primarily used to reduce the computational complexity of the feature engineering process.
B) Overlapping windows increase the size of the dataset by duplicating data points, thereby improving the accuracy of machine learning models.
C) Overlapping windows aim to capture more detailed information between data segments, potentially improving feature variability and model performance.
D) Overlapping windows are used to correct errors in the data collection process, ensuring that missing data points are accurately imputed.

A

C

Overlapping windows allow us to capture more nuanced patterns in the data by allowing partial overlap between consecutive windows, which can enhance the model's ability to detect subtle features in sensory data and ensures that information at the edges of a data segment is not lost.

121
Q

In the context of feature engineering from sensory data, why is it important to extract features like mean, standard deviation, and window size from time-domain data?

A) To increase the computational complexity of the data analysis process.
B) To directly use the raw sensory data for machine learning models without any preprocessing.
C) To capture the underlying patterns and characteristics of the data, facilitating more accurate and efficient predictions by machine learning models.
D) To ensure that all features are nominal, thus simplifying the machine learning model.

A

C

Extracting features such as mean, standard deviation, and specifying the window size from time-domain data is crucial for understanding the underlying patterns and variability in sensory data. These features help summarize the data’s central tendency and variability over specified intervals, providing meaningful insights that can significantly enhance the accuracy and efficiency of predictions made by machine learning models. This process is essential for transforming raw sensory data into a format that can be effectively utilized for analysis and decision-making
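
As a rough sketch of this workflow (the helper name, window size, and overlap are illustrative assumptions), time-domain features can be computed per window like this:

```python
import numpy as np

def time_domain_features(signal, window, overlap=0.5):
    """Summarize a 1-D signal with mean/std/min/max over (possibly overlapping) windows."""
    step = max(1, int(window * (1 - overlap)))
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        seg = signal[start:start + window]
        feats.append([seg.mean(), seg.std(), seg.min(), seg.max()])
    return np.array(feats)

acc = np.random.randn(1000)                        # stand-in for raw accelerometer samples
X = time_domain_features(acc, window=100, overlap=0.5)
print(X.shape)                                     # (19, 4): 19 windows x 4 features
```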

122
Q

What is the purpose of windowing in spectral analysis?

a) Windowing is a technique used to isolate specific frequency components of a signal for analysis.

b) Windowing involves dividing a signal into segments and applying a window function to each segment to reduce spectral leakage and improve frequency resolution.

c) Windowing refers to the process of removing noise from a signal by filtering out unwanted frequencies.

d) Windowing is the process of reducing the amplitude of a signal to improve its clarity.

A

b)

Windowing is a technique used in spectral analysis to reduce spectral leakage and improve frequency resolution. It involves dividing a signal into segments and applying a window function to each segment before performing the Fourier transform. This process helps minimize the distortion caused by spectral leakage, resulting in more accurate frequency representation in the frequency domain analysis.

123
Q

When conducting feature extraction on numerical data in the frequency domain, what effect does using a smaller window size have on the frequency spectrum?

A) Increases the precision of frequency identification
B) Reduces the computational requirements
C) Enhances the temporal resolution
D) Causes more leakage in the spectrum

A

D

Utilizing a smaller window size during the feature extraction process in the frequency domain tends to cause more leakage in the frequency spectrum. Leakage refers to the spreading of signal content across many frequencies rather than being confined to specific ones, which can complicate the interpretation and analysis of the frequency data. This effect is a consequence of using smaller window sizes, as it can distort the true frequency content of the signal being analyzed.

124
Q

Which of the following statements about data features is incorrect? (Select One)

  1. A feature is an individual measurable property or characteristic of a phenomenon being observed.
  2. The Fourier Transform summarizes the values in the frequency domain.
  3. The Fourier Transform decomposes signals using sinusoidal functions.
  4. Repetitive patterns are not observed in the frequency domain.

A

4

The Fourier Transform reveals frequency components of a signal, including repetitive patterns. Option 4 contradicts the fundamental principle of Fourier analysis, which identifies repetitive patterns as distinct frequencies in the frequency domain. Therefore, option 4 is incorrect concerning data features and Fourier Transform properties.

125
Q

Which of the following statements about overlapping windows is incorrect? (Select One)

A. Too high overlap ratios can cause overfitting issues.
B. In situations with limited data, considering a higher overlap percentage might be beneficial.
C. Using overlapping windows can avoid data duplication.
D. Compared to distinct windows, overlapping windows have the advantage of preserving data at the boundaries.

A

C

(A) points out that excessive overlap makes adjacent windows share most of their data points, which can create a tendency to over-adapt to certain patterns or noise during analysis or modeling. (B) suggests that in situations where data is limited, a higher overlap ratio can maximize data utilization and increase the accuracy of the analysis. (C) is incorrect: overlapping windows reuse (duplicate) data points across adjacent windows by design, so they do not avoid data duplication; that reuse is what maintains the continuity of information across window boundaries. (D) correctly notes that, compared to distinct windows, overlapping windows better preserve the data at the window boundaries.

126
Q

In the windowing process, why is it not recommended to set the window overlapping factor too high? (Select one)

A. Excessive overlap may reduce the impact of windowing, making the signals less continuous.
B. High overlap can lead to overfitting because the features become too similar.
C. A high overlapping factor can cause spectral leakage, which distorts the signal.
D. High overlap can make it harder to accurately locate transient events within the data.

A

B

Option B is correct because high overlap leads to redundant data, making features too similar and reducing variation, which leads to overfitting.

Explanations for the wrong choices:
Option A is incorrect because the windowing process aims to make the signals more continuous. Excessive overlap will enhance continuity, not reduce it.
Option C is incorrect because spectral leakage is related to the window size and the type of window function used, not the overlapping factor.
Option D is incorrect because increasing the overlap factor generally does not make it harder to locate transient events. In fact, a higher overlap can improve the detection of such events by ensuring they are captured in multiple windows.

127
Q

Question: Which of the following best classifies the types of data according to their categories and characteristics?
A) Temperature readings and color preference are both examples of nominal data because they can be categorized without a natural order.
B) Counting the number of students in a class represents discrete data, while measuring their average height represents continuous data.
C) Gender is an example of ordinal data because it can be naturally ranked or ordered.
D) The frequency of an activity, such as how often someone exercises, falls under continuous data because it is measurable and has infinite possibilities.

A

B

Correct Answer: B) Counting the number of students in a class represents discrete data while measuring their average height represents continuous data.

Option A: Temperature readings are continuous data because the temperature can vary in a continuous range and can be measured precisely. Colour preference, while it is nominal data because it categorizes colours without a natural order, cannot be paired with temperature readings as they are different types of data.
Option C: Gender is an example of nominal data, not ordinal data. Nominal data includes categories without a natural order or ranking, and gender categories typically do not have a rank order associated with them.
Option D: The frequency of an activity is not continuous data but rather ordinal data if the frequency is expressed in categories such as “Never, Rarely, Sometimes, Usually, Always.” If the frequency were measured in exact counts or measurements (e.g., number of times exercised per week), it could be considered discrete data. However it does not represent continuous data since frequency categories do not have an infinite range of values.

128
Q

To develop a stress detection model at home, some data were collected. Among the collected data, temperature is (a)____ data, income level (Low, Middle, Upper) is (b)_____ data, gender is (c)____ data, and the number of siblings is (d)____ data.

Which of the following is correct for (a), (b), (c), and (d)?

  1. discrete, continuous, nominal, ordinal
  2. nominal, discrete, ordinal, continuous
  3. ordinal, nominal, discrete, continuous
  4. continuous, nominal, ordinal, discrete
  5. continuous, ordinal, nominal, discrete
A

5

To solve this quiz, we need to understand the types of data. Temperature is continuous data because we can measure it. Income level is ordinal data since we can order it from low to high (Low < Middle < Upper). Gender is nominal data because it represents categories without inherent order or rank. The number of siblings is discrete data because we can count it as distinct, separate values. Therefore, the correct answer is option 5.

129
Q

Select the option which correctly sets the hierarchy within the types of data. Here, A: {B, C} implies that A includes B and C.

A) Numerical data: {Continuous data, Nominal data} and Quantitative data: {Discrete data, Ordinal data}
B) Qualitative data: {Categorical data, Discrete data} and Continuous data: {Numerical data, Ordinal data}
C) Quantitative data: {Continuous data, Discrete data} and Categorical data: {Nominal data, Ordinal data}
D) Nominal data: {Ordinal data, Qualitative data} and Numerical data: {Continuous data, Quantitative data}

A

C

Numerical data, which is equivalent to Quantitative data, includes Continuous data (which you can measure on a continuous scale) and Discrete data (which you can count). Categorical data, which is equivalent to Qualitative data, includes Nominal data (which you can only name or label) and Ordinal data (which you can rank). Hence, the correct answer is C.

130
Q

What is the result of too much overlapping in windowing?

A. Better accuracy
B. Overfitting
C. Might miss what happens in the window edges
D. Information loss at the edge

A

B

Overfitting. Too much overlap means the same data points are reused across many windows, so the resulting features are highly similar and show little variation, which causes the model to overfit. This is also mentioned in slide 11.

131
Q

Question: What is the primary purpose of using dummy coding, such as OneHotEncoding, when dealing with nominal values in data preprocessing for machine learning models?

A) To transform nominal values into a format that can be directly used in regression models, allowing the model to interpret each unique nominal value as a separate feature.
B) To increase the dimensionality of the dataset exponentially, thereby enhancing the computational complexity of model training.
C) To normalize the distribution of nominal variables so that they have a mean of 0 and a standard deviation of 1.
D) To reduce the importance of nominal variables in the dataset by converting them into binary variables that are less informative than the original values.

A

A

Explanation: Dummy coding, or OneHotEncoding, is a technique used to convert nominal (categorical) variables into a binary matrix representation. This process involves creating a new binary (dummy) variable for each level of the categorical variable. In the context of machine learning, this is particularly useful because it allows models to treat each category as a separate entity or feature. This transformation is crucial for models that can only interpret numerical input, such as linear regression, where the presence or absence of a category is indicated by 1 or 0, respectively. Thus, dummy coding enables the inclusion of nominal values as distinct features in the model, facilitating the analysis of data that includes categorical variables.
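
A minimal pandas illustration (the activity labels are made up): get_dummies produces one binary column per category, which is the one-hot representation described above.

```python
import pandas as pd

df = pd.DataFrame({"activity": ["walking", "sitting", "running", "sitting"]})

onehot = pd.get_dummies(df["activity"])   # one binary column per category
print(onehot.astype(int))
#    running  sitting  walking
# 0        0        0        1
# 1        0        1        0
# 2        1        0        0
# 3        0        1        0
```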

132
Q

Question:
Which one is NOT a valid example of feature extraction from the frequency domain of sensory data?
1) Highest-amplitude frequency
2) Frequency-weighted signal average
3) Power spectrum entropy
4) Counting the number of data points in the frequency domain

A

4)

Getting the highest amplitude frequency involves identifying the frequency component with the maximum amplitude in the frequency domain, which is a common approach in signal processing to find the dominant frequency.

Frequency-weighted signal average is a method where each frequency component is weighted by its amplitude before calculating the average, providing a feature that gives more importance to frequencies with higher amplitudes.

Power spectrum entropy involves calculating the entropy of the power spectrum, which is a measure of the distribution of power among different frequency components and can provide insights into the complexity or predictability of the signal.

However, 4) Counting the number of data points in the frequency domain is not a meaningful feature extraction method from the frequency domain. The number of data points is simply a function of the length of the time-domain signal and the parameters of the Fourier transform, and it does not provide insightful information about the characteristics of the signal in the frequency domain.
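
For illustration, a hypothetical helper along these lines computes a highest-amplitude (dominant) frequency and a power-spectrum entropy; the 2 Hz test signal and sampling rate are assumptions, not from the lecture:

```python
import numpy as np

def frequency_features(x, fs):
    """Highest-amplitude frequency and power-spectrum entropy of a 1-D signal (a sketch)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(X) ** 2
    dominant = freqs[np.argmax(power[1:]) + 1]     # skip the DC component
    p = power / power.sum()
    entropy = -np.sum(p * np.log2(p + 1e-12))      # power-spectrum (spectral) entropy
    return dominant, entropy

fs = 50.0
t = np.arange(500) / fs
x = np.sin(2 * np.pi * 2.0 * t) + 0.1 * np.random.randn(500)
print(frequency_features(x, fs))   # dominant frequency near 2 Hz
```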

133
Q

What is the primary goal of feature engineering in the context of machine learning with sensory data?

a) To increase the computational complexity of machine learning algorithms.
b) To transform raw data into a format that is more understandable to algorithms.
c) To reduce the accuracy of predictive models to prevent overfitting.
d) To enhance the visual appeal of data when plotted on graphs.

A

b

a) Incorrect. Increasing the computational complexity is not a goal of feature engineering; it aims to simplify the model’s understanding of data.
b) Correct.
c) Incorrect. Feature engineering’s purpose is not to reduce model accuracy; rather, it enhances model performance by providing relevant features.
d) Incorrect. The main objective of feature engineering is not to improve data’s visual appeal but to make it more actionable for predictive modeling.

134
Q

Why would we want to use a large window size as opposed to a small one when processing time-series data (select all that apply)?

A: There are a lot of datapoints.
B: Variation between datapoints over a long period is minimal.
C: Short term variation can be easily captured.
D: We want a rougher curve.

A

A,B

A larger window is used to drastically reduce the number of data points, and it is appropriate when values do not change much over a long period; hence A and B are valid answers. A smaller window is desired when there is a lot of variation over a short time and when we do not want as smooth a curve; hence C and D are invalid answers.

135
Q

What is the purpose of applying a Hamming window in the frequency domain analysis?
A) To enhance the resolution of the frequency spectrum.
B) To amplify the signal’s high-frequency components.
C) To smooth the ends of a signal, reducing spectral leakage.
D) To increase the computational speed of the Fourier Transform.

A

C

The purpose of applying a Hamming window (or other window functions like Hanning or Blackman) in frequency domain analysis is to smooth the abrupt ends of a signal. This technique helps in reducing spectral leakage, which occurs when energy from the main frequency component spreads to other frequencies, potentially distorting the signal’s frequency spectrum
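
A minimal NumPy sketch of this effect (the 7.3 Hz tone and signal length are arbitrary): because 7.3 Hz does not fit an integer number of periods into the segment, the plain FFT leaks energy into distant bins, while the Hamming-windowed version concentrates it near the true frequency.

```python
import numpy as np

fs = 100.0
t = np.arange(256) / fs
x = np.sin(2 * np.pi * 7.3 * t)        # 7.3 Hz: not an integer number of periods in 2.56 s

plain = np.abs(np.fft.rfft(x))
hammed = np.abs(np.fft.rfft(x * np.hamming(len(x))))

freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
far = np.abs(freqs - 7.3) > 2.0        # bins far away from the tone
print((plain[far] ** 2).sum() / (plain ** 2).sum())    # leakage fraction, no window
print((hammed[far] ** 2).sum() / (hammed ** 2).sum())  # typically much smaller with Hamming
```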

136
Q

Which of the following statements best describes a limitation of using DBSCAN for clustering GPS coordinates in mobility data processing?

A) DBSCAN clusters GPS coordinates based on temporal characteristics, which may lead to long traces instead of distinct visiting/staying points.
B) DBSCAN does not require defining a distance metric, which can result in inaccurate clustering of GPS coordinates.
C) DBSCAN ignores spatial characteristics, leading to clusters forming only in buildings rather than streets.
D) DBSCAN does not consider temporal characteristics, resulting in clusters resembling long traces instead of distinct visiting/staying points.

A

D

Answer: D) DBSCAN does not consider temporal characteristics, resulting in clusters resembling long traces instead of distinct visiting/staying points.

Explanation: The limitation described in the context is that DBSCAN ignores temporal characteristics, leading to clusters that resemble long traces rather than representing distinct visiting or staying points. This can affect the accuracy of identifying semantically meaningful places in mobility data processing.

137
Q

Which approach is suggested for approximately calculating the duration of a particular state within a time window, considering the complexity of accounting for various temporal overlap cases?
A) Analyzing all four cases of temporal overlap (inner, outer, left-overlap, right-overlap) within the time window to determine the exact duration.
B) Re-sampling the label state data at a small time interval and utilizing forward-filling to approximate the duration.
C) Estimating the duration by averaging the start and finish times of the state within the time window.
D) Applying a sliding window technique to capture the state transitions and calculate the exact duration within each window.

A

B

Answer: B) Re-sampling the label state data at a small time interval and utilizing forward-filling to approximate the duration.
Explanation: The suggested approach involves re-sampling the label state data at a small time interval and using forward-filling to approximate the duration, which simplifies the calculation process. This method provides an approximate duration by multiplying the number of items by the unit time, offering a less cumbersome alternative to considering all cases of temporal overlap.
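
A small pandas sketch of this approximation (the label timestamps and the 10-second unit are made up): re-sample the state series at a fine interval, forward-fill, then count the filled samples and multiply by the unit time.

```python
import pandas as pd

# Hypothetical label-state changes at irregular times.
states = pd.Series(
    ["sitting", "walking", "sitting"],
    index=pd.to_datetime(["09:00:00", "09:03:20", "09:07:50"]),
)

dense = states.resample("10s").ffill()       # one sample every 10 s, forward-filled

walking_seconds = (dense == "walking").sum() * 10
print(walking_seconds)                       # 270 s, matching 09:03:20 -> 09:07:50
```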

138
Q

Question: Which of the following reasons best explains why density-based clustering algorithms, such as DBSCAN, might not perform well in certain situations?

A) They require a fixed number of clusters to be specified in advance.
B) They perform exceptionally well in identifying clusters of non-uniform shapes and sizes, making them too flexible for most practical applications.
C) They cannot effectively handle noise and outliers in the dataset, leading to many points being misclassified.
D) They struggle with datasets where clusters have significantly varying densities, as the algorithm uses a global set of parameters (epsilon and minPts) that may not be suitable for all density variations.

A

D

Explanation: Density-based clustering algorithms, like DBSCAN, define clusters based on areas of high density separated by areas of low density. While they are advantageous for their ability to find clusters of arbitrary shapes and for their robustness to outliers, they can struggle in datasets where clusters have significantly varying densities. This is because the algorithm uses a single set of parameters (epsilon and minPts) to determine the neighborhood size and the minimum number of points required to form a dense region, respectively. These parameters are global and may not fit well with all clusters if their density varies widely across the dataset, leading to potential issues in accurately identifying all clusters.

139
Q

How are nominal values typically handled in machine learning algorithms?
1. They are ignored, as machine learning algorithms only work with numerical data.
2. They are converted into ordinal values to facilitate feature selection.
3. They are encoded using dummy variables to represent categorical information.
4. They are transformed into continuous variables for easier mathematical manipulation.

A

3

Numeric values can be readily used for machine learning, but nominal values cannot be used directly. A dummy variable is a variable that can assume either one of two values (usually 1 and 0), where 1 represents the existence of a certain condition and 0 indicates that the condition does not hold. By this method, nominal values can be used in machine learning algorithms. Therefore, the correct answer is 3.

140
Q

How can you make your model less overfit to the training data?
1. Decrease the amount of training data to reduce model complexity.
2. Increase the number of hidden layers in the neural network architecture.
3. Apply regularization to selecting only useful features.
4. Decrease the learning rate to slow down the training process.

A

3

Options 1 and 2 are incorrect because reducing the amount of training data or increasing the model complexity typically exacerbates overfitting rather than reducing it.
Option 4 is incorrect because decreasing the learning rate may slow down the training process but does not directly address overfitting.
Regularization tries to eliminate factors that do not impact the prediction outcomes by grading features based on importance; for example, a penalty value is applied to features with minimal impact. Therefore, option 3 is the correct answer.

141
Q

What is the most suitable distance metric to use for a dataset consisting of GPS coordinates?
A. Euclidean Distance
B. Manhattan Distance
C. Haversine Distance
D. Cosine Similarity

A

C

The Haversine Distance is the most appropriate for measuring distances between points on the Earth’s surface, represented by GPS coordinates (latitude and longitude). It accounts for the spherical shape of the Earth and provides the shortest distance between two points on the globe. Euclidean and Manhattan distances are more suited for flat Cartesian coordinates, and Cosine Similarity is typically used for measuring similarity in orientations or in text data, not for physical distances on a sphere.
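
For reference, a plain-Python haversine implementation might look as follows (the two coordinate pairs are purely illustrative):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Two made-up points a few kilometers apart:
print(haversine_km(36.3725, 127.3628, 36.3504, 127.3845))   # roughly 3 km
```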

142
Q

In adapting k-fold cross-validation for time-series datasets, how should the dataset be partitioned and the validation folds be selected to accurately reflect temporal dependencies?

A. Shuffle the data and select random folds for validation to ensure randomness in data selection.
B. Partition the dataset into k consecutive folds without shuffling and select initial folds for validation, disregarding the time order.
C. Partition the dataset into k consecutive folds according to time order and select folds with later timestamps for validation.
D. Use stratified sampling for creating folds, ensuring a uniform distribution of the output variable across all folds, irrespective of the time order.

A

C

For time-series data, it’s essential to respect the sequential nature and temporal dependencies within the data. Therefore, the dataset should be divided into k consecutive folds following the time order. Importantly, the folds selected for validation should have later timestamps than those used for training, ensuring that the validation process mimics the real-world scenario where predictions are made for future events, based on past data.
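
scikit-learn's TimeSeriesSplit implements exactly this behavior; a tiny sketch with 12 dummy, time-ordered samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)          # 12 samples, assumed to be in time order

for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "-> validate:", val_idx)
# Every validation fold lies strictly after its training data:
# train [0..2] -> validate [3..5], train [0..5] -> validate [6..8], train [0..8] -> validate [9..11]
```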

143
Q

Question: How does Fourier Transform (FT) contribute to feature engineering in the frequency domain of sensory data? (select one)
A) FT simplifies data by converting all sensory information into a single constant value.
B) FT decomposes signals into sinusoid functions to identify patterns, such as walking patterns from accelerometer data, based on the similarity between sinusoidal basis functions and the original time series.
C) FT primarily enhances the temporal resolution of sensory data without altering its frequency content.
D) The primary use of FT is to increase the battery life of devices collecting sensory data.

A

B

The Fourier Transform (FT) is a powerful tool in frequency domain feature engineering, as it decomposes time series data into its constituent sinusoidal components. This decomposition allows for the identification of repeating patterns within the data, such as the detection of walking patterns in accelerometer data. By analyzing the amplitude and phase of these sinusoidal components (using Fast Fourier Transform, FFT), it is possible to extract meaningful features that describe the underlying phenomena or activities captured by the sensory data.

144
Q

Question: In machine learning, what comprehensive strategy effectively addresses imbalanced datasets to improve model performance across classes? (select one)
A) Increasing training data volume without specific focus on class distribution.
B) Integrating data-level adjustments, algorithmic enhancements, and tailored evaluation metrics, such as rebalancing techniques, cost-sensitive learning, and precision-recall analysis.
C) Limiting the model to linear algorithms to simplify the decision boundary in imbalanced scenarios.
D) Applying standard normalization techniques to all features as a primary strategy for imbalance correction.

A

B

Addressing imbalanced datasets effectively necessitates a holistic strategy encompassing data preparation, algorithm optimization, and appropriate evaluation methods. This includes rebalancing the dataset through techniques like oversampling or undersampling, incorporating algorithmic modifications such as cost-sensitive learning to weigh classes differently, and employing evaluation metrics like AUC-ROC or precision-recall curves that provide a nuanced understanding of model performance across different classes. Such a comprehensive approach ensures that the model is robust, fair, and performs well across both majority and minority classes.

145
Q

Which of the following descriptions is incorrect?
a. When converting nominal data with k classes into dummy variables, the minimum degrees of freedom is 2k.
b. Numerical values can be converted into categorical data for feature extraction.
c. GPS coordinates in mobility data can be used for feature extraction through various clustering techniques.
d. The DBSCAN algorithm is an example of density-based clustering.

A

A

a. When converting nominal data with k levels into dummy variables, one level is used as the reference category, allowing for k - 1 degrees of freedom.
b. Numerical values can also be categorized. For example, we can categorize a temperature as low, medium, and high.
c. Clustering of GPS coordinate data involves identifying clusters using various methods based on defined distance metrics, which can be used for feature extraction.
d. The DBSCAN algorithm is a density-based clustering technique that considers epsilon and min_points.

146
Q

To reduce overfitting, which of the following is incorrect?
a. Early stopping: Stopping training before the model learns too much from the training data.
b. Pruning in decision tree: Reducing the complexity of the model.
c. Repeated reuse of the same training data.
d. Regularization: Selecting only useful features.

A

C

a. Stopping training before the model learns too much from the training data is an effective method to prevent overfitting.
b. Pruning in decision trees is a technique used to reduce the complexity of the model by reducing the depth of the tree or removing unnecessary branches, thereby improving the model’s generalization performance.
c. Using the same data repeatedly for training can lead to overfitting. In such cases, the model may perform well on the training data but struggle to generalize to new data.
d. Regularization is a technique to simplify the model by selecting only useful features, thereby preventing overfitting.

147
Q

Question: Choose the correct word to fill in the blanks.
If class information is used for selection, feature selection is a part of ___ methods.
Dimension reduction techniques like PCA and Auto-encoders are categorized under ___ methods.
Dropouts and early stopping in neural networks are examples of ___ methods.

A. supervised - unsupervised - regularization
B. unsupervised - unsupervised - regularization
C. unsupervised - supervised - dimension reduction
D. supervised - unsupervised - dimension reduction

A

A

The correct answer is A. supervised - unsupervised - regularization. This is because feature selection often involves supervised methods where the relevance of features is determined based on their relationship with the output variable. Dimension reduction techniques like PCA (Principal Component Analysis) and auto-encoders are typically unsupervised, as they reduce data dimensions without relying on output labels. Finally, techniques like dropouts and early stopping are regularization methods used to prevent neural networks from overfitting to the training data.

148
Q

Question: Select the option where the term and its description are correctly matched
A. Early stopping: by reducing the complexity of the classifiers to avoid learning from noisy data (e.g., pruning)
B. Network reduction: by selecting only useful (generalizable) features
C. Expansion of the training data: by stopping training before it’s optimized too much to the training data
D. Feature selection and regularization: by increasing training dataset size, including data augmentation (e.g., random noise addition)
E. K-fold cross-validation: The data is divided randomly into k parts, and each part is held out in turn and the learning scheme trained on the remaining k-1 parts

A

E

The correct answer is E: K-fold cross-validation. This technique involves dividing the data randomly into k segments. In each round of cross-validation, one segment is reserved for testing the model, while the remaining k-1 segments are used for training. This process is repeated k times, with each segment serving as the test set once. This method is highly effective for assessing the model’s performance and its ability to generalize to unseen data. It ensures that every data point is used for both training and testing, providing a comprehensive evaluation of the model’s performance.

149
Q

Which of the following statements is correct about dummy coding for handling categorical data in feature engineering?

A. Dummy coding allows for k degrees of freedom, where k is the number of categories.
B. Dummy coding uses k features in the representation, where k is the number of categories.
C. Dummy coding removes the extra degree of freedom by using only k-1 features in the representation, where k is the number of categories.
D. Dummy coding is the same as one-hot encoding and uses k features for k categories.

A

C

Dummy coding removes the extra degree of freedom by using only k–1 features in the representation. This is in contrast to one-hot encoding, which allows for k degrees of freedom, where k is the number of categories.

150
Q

What is the correct way to perform cross-validation when feature selection is involved?

A. Perform feature selection on the entire dataset, then apply cross-validation.
B. Within each cross-validation fold, perform feature selection using only the training data for that fold, excluding the test data.
C. Perform feature selection on the test data of each cross-validation fold.
D. Perform feature selection on the training data of one fold and apply it to all other folds.

A

B

The correct approach is to perform feature selection within each cross-validation fold, using only the training data for that fold and excluding the test data. This ensures that the test data remains unseen during the feature selection step, correctly mimicking the application of the classifier to an independent test set. Choice A involves performing feature selection on the entire dataset, including the test data, which violates the principle of cross-validation. Choices C and D are incorrect because feature selection should not be performed on the test data or using data from other folds.
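
One common way to enforce this in practice (a sketch, assuming scikit-learn) is to put the feature selector inside a Pipeline, so that cross-validation re-fits the selection on each fold's training portion only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),   # re-fit on each fold's training data only
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```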

151
Q

When processing mobility data, such as GPS coordinates from a smartphone, which method is identified as an effective approach for generating meaningful features for machine learning models?

A) Calculating the straight-line distance between each pair of consecutive GPS coordinates to determine speed.
B) Using the latitude and longitude as direct inputs to the model without any preprocessing.
C) Clustering GPS coordinates to identify semantically meaningful places, such as home or work, using algorithms like DBSCAN.
D) Applying a Fourier transform to the time series of GPS coordinates to extract frequency-based features.

A

C

The lecture highlights the importance of identifying semantically meaningful places from GPS data for feature generation in mobility data processing. This is achieved by clustering GPS coordinates, which helps in discovering frequently visited or stayed locations. The use of a clustering algorithm, such as DBSCAN, is specifically mentioned as an approach to handle this task. Clustering allows for the transformation of raw GPS coordinates into meaningful categorical data, such as the identification of places like “home” or “work,” which can significantly enhance the interpretability and effectiveness of machine learning models. This approach goes beyond simply using raw coordinates or attempting to directly calculate features such as speed or distance traveled, instead providing a layer of semantic understanding to the mobility patterns of individuals.
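
A minimal sketch of this idea with scikit-learn's DBSCAN (the coordinates, eps, and min_samples are illustrative assumptions): clustering GPS fixes with a haversine metric yields candidate places, with isolated fixes labeled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical GPS fixes (lat, lon) in degrees: two frequently visited spots plus one stray fix.
coords = np.array([
    [36.3725, 127.3628], [36.3726, 127.3629], [36.3724, 127.3627],   # place 1 ("home"?)
    [36.3504, 127.3845], [36.3505, 127.3846],                        # place 2 ("work"?)
    [36.4000, 127.4200],                                             # isolated fix
])

earth_radius_km = 6371.0
eps_km = 0.2   # group fixes within roughly 200 m of each other

db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=2, metric="haversine")
labels = db.fit_predict(np.radians(coords))
print(labels)   # e.g. [0 0 0 1 1 -1]: two places, one noise point
```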

152
Q

Why is stratified K-fold cross-validation preferred over standard K-fold for uneven class distributions?

A) It reduces computation by needing fewer folds.
B) It keeps the class proportions consistent across folds, matching the overall dataset.
C) It focuses only on the minority class to estimate model sensitivity better.
D) It boosts model accuracy by balancing classes during evaluation.

A

B

Stratified K-fold cross-validation is chosen for datasets with uneven class distributions because it ensures each fold has the same class ratio as the entire dataset, preserving class balance across all folds. This approach provides a more accurate performance assessment by avoiding bias that might occur if some folds had significantly different class proportions than others.
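
A quick scikit-learn sketch (the 90/10 class split is made up) shows that every validation fold preserves the overall class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.randn(100, 3)
y = np.array([0] * 90 + [1] * 10)            # 90/10 imbalance

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    print(np.bincount(y[val_idx]))            # [18 2] in every fold: same ratio as the full data
```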

153
Q

How to dummy code a record with the category “Standing” given the four categories “On Table”, “Washing Hands”, “Standing”, “Driving”, assuming that the first category is dropped?

a) (0, 0, 1, 0)
b) (0, 0, 0)
c) (0, 1, 0, 0)
d) (0, 1, 0)

A

d)

Dummy coding uses k - 1 variables and usually treats the first category as the baseline (the intercept in a linear model). Therefore, the mapping is as follows:
On Table = (0, 0, 0) -> intercept
Washing Hands = (1, 0, 0)
Standing = (0, 1, 0)
Driving = (0, 0, 1)
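
The same mapping can be reproduced with pandas (a sketch; the explicit category ordering is an assumption made so that "On Table" stays the baseline dropped by drop_first=True):

```python
import pandas as pd

categories = ["On Table", "Washing Hands", "Standing", "Driving"]
s = pd.Series(categories, dtype=pd.CategoricalDtype(categories))

dummies = pd.get_dummies(s, drop_first=True).astype(int)
print(dummies)
#    Washing Hands  Standing  Driving
# 0              0         0        0    <- On Table (baseline)
# 1              1         0        0
# 2              0         1        0    <- so "Standing" is coded (0, 1, 0)
# 3              0         0        1
```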

154
Q

What is the primary purpose of using stratified k-fold cross-validation?

a) To ensure that each fold has approximately the same proportion of different categories or classes.
b) To randomly partition the dataset into k equal-sized folds.
c) To minimize the computational complexity of the cross-validation process.
d) To validate the data without considering its distribution.

A

a)

The correct answer is a) To ensure that each fold has approximately the same proportion of different categories or classes. Stratified k-fold cross-validation is specifically designed to address the issue of class imbalance in the dataset. By ensuring that each fold maintains a similar distribution of classes as the original dataset, this method provides more accurate and representative estimates of model performance, particularly in scenarios where certain classes are underrepresented. This approach is crucial for ensuring that the model is trained and evaluated on a balanced representation of the data, leading to more reliable generalization to unseen samples. Options b), c), and d) are incorrect because they do not capture the primary purpose of stratified k-fold cross-validation, which is to address class imbalances and ensure fair evaluation of the model across different classes or categories.

155
Q

Which statements about dummy coding are true?
a) Dummy coding is a method used to represent categorical variables in numerical form, commonly employed in regression analysis for efficient inclusion of categorical variables.

b) Dummy coding assigns arbitrary numerical values to categorical variables, potentially introducing bias and inaccuracies in statistical analysis.

c) In dummy coding, binary variables known as dummy variables are created for each category within a categorical variable, preserving categorical information for regression modeling.

d) Dummy coding is primarily utilized for imputing missing values in datasets, ensuring data integrity and completeness.

A

a), c)

Dummy coding is a method utilized in regression analysis to transform categorical variables into numerical form. It involves creating binary variables, known as dummy variables, for each category within the categorical variable. These dummy variables retain the categorical information while allowing for efficient inclusion in regression models. Thus, dummy coding enables the representation of categorical variables in regression analysis, facilitating the examination of their effects on the outcome variable.

156
Q

Which statement accurately describes overfitting in machine learning?

a) Overfitting arises when a model performs exceptionally well on the training dataset but fails to generalize effectively to new, unseen data, leading to poor performance in real-world scenarios due to an overly complex model.

b) Overfitting occurs when a model is too simplistic and cannot capture the complexity of the underlying patterns present in the data, resulting in underperformance across both training and testing datasets.

c) Overfitting is characterized by a model fitting the noise or random fluctuations in the training data rather than the true underlying relationship, causing it to make erroneous predictions when presented with new data.

d) Overfitting denotes a scenario where a model achieves consistent and equally high performance on both the training and testing datasets, demonstrating robustness and reliability in its predictions across different datasets and scenarios.

A

a), c)

Overfitting occurs when a model fits the training data too closely, capturing noise instead of the underlying patterns. This leads to excellent performance on the training data but poor generalization to new data. Essentially, the model memorizes the training data instead of learning meaningful relationships, resulting in inaccurate predictions on unseen data. To prevent overfitting, techniques like regularization and cross-validation are used to ensure the model generalizes well.

157
Q

When engineering features from sensory data in the time domain, which of the following is NOT a commonly used method for summarizing values within a specified window?

A) Calculating the mean value
B) Computing the standard deviation
C) Determining the maximum amplitude frequency
D) Identifying the maximum and minimum values

A

C

While calculating the mean value, computing the standard deviation, and identifying the maximum and minimum values within a specified window are common methods for summarizing time-domain sensory data, determining the maximum amplitude frequency is typically a method used in the frequency domain, not the time domain. This distinction is crucial for correctly preprocessing and extracting features from sensory data, which can significantly impact the performance of machine learning models by providing them with relevant and insightful features derived from raw data.

158
Q

In the context of machine learning model evaluation, why is it important to consider the class distribution of the dataset, and how can imbalanced datasets impact the performance metrics?

A) Because imbalanced datasets can lead to overfitting, making the model less generalizable to unseen data.
B) Because imbalanced class distribution can bias performance metrics, leading to misleadingly high or low scores for certain classes.
C) Because balanced datasets require less computational resources for model training and evaluation.
D) Because the class distribution does not affect model performance or the interpretation of performance metrics.

A

B

Imbalanced class distribution in a dataset can significantly bias the performance metrics of a machine learning model. For instance, in a dataset with a high imbalance ratio (where one class significantly outnumbers the other), conventional metrics such as accuracy can become misleading. A model might predict only the majority class and still achieve high accuracy, overlooking the minority class’s predictive performance. This issue necessitates the use of techniques like SMOTE for rebalancing the dataset or alternative metrics such as AUC-ROC or AUC-PR that are less sensitive to class imbalance. These approaches help ensure a more accurate evaluation of the model’s ability to generalize and perform across all classes, making them critical considerations in model evaluation processes.

159
Q

Which of the following statements about the feature selection process in machine learning models is NOT correct?

A) In filter methods, important features are selected based on the general characteristics of the training data.
B) Wrapper methods involve a process of selecting subsets of features and evaluating iteratively the performance of each model.
C) Correlation-based feature selection (CFS) selects important features solely based on the inter-correlation among features.
D) Information gain measures the mutual information between individual features and the class to assess the importance of features.

A

C

A) describes the filter method for feature selection. Filter methods use the general characteristics of the training data, independently of the learning algorithm, to evaluate the importance of features. B) describes the wrapper method for feature selection. It selects a subset of features and then iteratively tests the performance of each model. D) correctly explains the use of information gain for evaluating the importance of features. C) provides an incorrect description of Correlation-based Feature Selection (CFS). CFS considers not only the level of inter-correlation among them but also the usefulness of individual features for predicting the class.

160
Q

Choose the data splitting method that is NOT appropriate for model evaluation.

A) Using 80% of the dataset for training and the remaining 20% for validation.
B) Randomly dividing the data into k parts and training the learning algorithm on the remaining parts after sequentially excluding each part.
C) Sequentially excluding each instance in the dataset and training the learning algorithm on the remaining instances.
D) Performing simple random sampling across the entire dataset for model evaluation.

A

D

A) describes Hold-out validation, a traditional method widely used to assess a model’s generalization performance. B) refers to K-fold cross-validation, where the dataset is randomly divided into k parts. For each part, the model is trained on k-1 parts and validated on the remaining one. C) points to Leave-One-Out Cross-Validation, an extreme form of K-fold cross-validation where k equals the number of instances in the dataset. D) suggests Simple Random Sampling. In this method, since the training and validation data are not separated, the validation results can be skewed.

161
Q

Please select the wrong statement below (only one):
A. The difference between one-hot coding and dummy coding is that dummy coding removes one feature compared to one-hot coding.
B. For mixed data in the time domain, we can make categories from the numerical values.
C. For GPS coordinates, we should use Euclidean distance instead of Haversine distance.
D. For the DBSCAN algorithm, epsilon is the radius of a point's neighborhood area, and MinPts is the minimum number of points (including the point itself) required in a core point's neighborhood.

A

C

For GPS coordinates, we should use Haversine distance because it takes the Earth's spherical shape (curvature) into consideration.

162
Q

Which of the following statements regarding regularization/shrinkage is incorrect?
A) Regularization/shrinkage involves fitting a model involving all p predictors, but the estimated coefficients are shrunken toward zero.
B) The shrinkage effect, also known as regularization, helps reduce variance in the model.
C) Regularization methods always estimate all coefficients to be exactly zero for some predictors.
D) Shrinkage methods can also perform variable selection by setting some coefficients to zero.

A

C

The correct answer is C) Regularization methods always estimate all coefficients to be exactly zero for some predictors. This statement is incorrect as written: ridge regression, for example, shrinks coefficients toward zero without making any of them exactly zero, whereas lasso-type penalties can set some coefficients exactly to zero and thereby also perform variable selection (as statement D notes).
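
To make the distinction concrete, a small scikit-learn sketch (synthetic data, arbitrary penalty strengths): a lasso penalty typically sets some coefficients exactly to zero, while ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100)   # only 2 of the 10 predictors matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print(np.sum(lasso.coef_ == 0))   # several coefficients exactly zero -> variable selection
print(np.sum(ridge.coef_ == 0))   # typically 0 -> ridge shrinks but rarely zeroes out
```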

163
Q

When clustering with GPS coordinates, what is the reason we use Haversine distance instead of commonly used Euclidean distance?

A. GPS coordinates have categorical information which can not be measured with Euclidean distance.
B. It is common for GPS coordinates to have missing values which makes it hard to use Euclidean distance.
C. Haversine distance can make use of temporal information while Euclidean distance can not.
D. GPS coordinates include longitude and latitude, which give the coordinates of a point on the surface of a sphere.

A

D

The reason we use Haversine distance when measuring the distance between two GPS coordinates is that longitude and latitude give the coordinates of a point on the surface of a sphere. Hence, we should use Haversine distance, which measures the distance between two points on the surface of a sphere. If we use Euclidean distance, which measures the distance between points in Euclidean space, the result can be very misleading. The statements in the rest of the options are not true.

164
Q

Choose the statement which is incorrect regarding the dimension reduction technique.

A. Principal Component Analysis is one of the commonly used dimension reduction techniques.
B. Dimension reduction refers to method of projecting the p predictors to M-dimensional subspace, where M < p.
C. Dimension reduction techniques give enhanced interpretability of features.
D. Autoencoder can perform non-linear dimension reduction.

A

C

One of the drawbacks of dimension reduction techniques is that it is hard to interpret the features after dimension reduction, since they are linear or even nonlinear combinations of the original features. Hence, the answer is C. The other statements are all correct.

165
Q

When handling categorical data, we should consider applying the right algorithm or strategies to properly process the data. Which of the following methods is the correct strategy? (Choose 1)

A. Fill in missing data with the interpolation method.
B. Directly apply a low pass filter so that we eliminate high frequency jitters.
C. Apply one-hot encoding so we can perform linear regression on the data.
D. Normalize the data using the MinMaxScaler function.

A

C

The correct answer is C. By applying one-hot encoding, we essentially convert the categorical data into numerical data, which can then be processed similarly to typical numerical data (for instance, linear regression).

Explanation for the wrong options:
Option A is incorrect because interpolating categorical data is not possible. However, filling in missing data with forward or backward fill methods is still possible, albeit with careful consideration.
Option B is incorrect because filtering can only be applied to numerical data. However, if we map the labels to arbitrary numerical values on a one-to-one basis, performing LPF should be possible.
Option D is incorrect for a similar reason to Option B. However, proceeding with normalization after performing a one-to-one mapping is still problematic because, after scaling, the numerical data will not correspond to the initial mapping.

166
Q

When training ML models, it is critical to avoid overfitting so the model can generalize better to new data. Among the strategies listed below, please select the most effective one to prevent overfitting.

A. Performing hold-out cross validation instead of K-fold cross validation
B. Reducing model complexity
C. Increasing the number of features in the dataset
D. Increasing the training epoch

A

B

The correct option is B. Overfitting occurs when a model is too complex, capturing noise in the training data as if it were a real pattern. Simplifying the model (by reducing parameters, pruning a few layers or neural connections, or reducing decision tree depth) will help the model generalize better to unseen data by focusing on the main trends instead of the noises.

Explanation for the wrong options:
Option A: Hold-out cross-validation simply splits the data once into training and validation sets (and is sometimes preferred for time-series data, where temporal order must be preserved). It is a way to estimate generalization performance, not a way to reduce overfitting itself.
Option C: This can lead to more overfitting because by adding more features, we increase the dimensionality of the problem. This can make the model more complex and more likely to fit the noise in the training data.
Option D: If a model is trained for too many epochs, it can start to learn the noise in the training data, leading to overfitting.
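
To make option B concrete, a small scikit-learn sketch, assuming a synthetic dataset, where limiting decision tree depth (one way of reducing model complexity) is compared against an unconstrained tree under cross-validation.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

deep = DecisionTreeClassifier(random_state=0)                  # unconstrained, prone to overfitting
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)  # reduced complexity

print("deep   :", cross_val_score(deep, X, y, cv=5).mean())
print("shallow:", cross_val_score(shallow, X, y, cv=5).mean())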

167
Q

Which of the following is NOT correct regarding Overlapped Windowing?

a) Overlapped windowing is one of the techniques used for processing time-series or sequence data.
b) Increasing the ratio of overlapping can prevent overfitting by increasing the amount of data.
c) It helps maintain the continuity of data in time-series data.
d) The size and ratio of overlap of overlapping windows can be adjusted according to the characteristics of the data being analyzed and the analysis objectives.

A

b

Answer: b
a) Overlapped Windowing is used in signal processing, data analysis, etc., and it involves dividing time-series data into multiple small windows. This allows for analyzing and processing various parts of the data.
b) Increasing the amount of overlapping does increase the amount of data, however, too much overlapping can increase the risk of the model overfitting to the training data.
c) Overlapped Windowing helps maintain the continuity of time-series data, enabling the model to recognize temporal patterns in the data and make predictions based on them.
d) The size and degree of overlap of overlapping windows can indeed be adjusted based on the characteristics of the data being analyzed and the analysis objectives.
Therefore, the incorrect statement is b) Increasing the ratio of overlapping can prevent overfitting by increasing the amount of data.
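
For illustration, a small numpy sketch that slices a signal into overlapping windows; the window size and overlap ratio are arbitrary choices, not values from the card.

import numpy as np

def overlapped_windows(x, window_size, overlap_ratio):
    # Step between window starts shrinks as the overlap ratio grows
    step = max(1, int(window_size * (1 - overlap_ratio)))
    return np.array([x[i:i + window_size]
                     for i in range(0, len(x) - window_size + 1, step)])

signal = np.arange(10)
print(overlapped_windows(signal, window_size=4, overlap_ratio=0.5))
# With 50% overlap, consecutive windows share half of their samples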

168
Q
Which one is NOT correct as a way to solve overfitting?

a) Early stopping
b) Increasing network complexity
c) Feature selection and regularization
d) Expansion of training data

A

b

Answer: b
Overfitting occurs when a model becomes too specific to the training data and fails to generalize well to new data. Different techniques are used to tackle this problem.
a) Early stopping: This method involves stopping the training process before the model overfits the training data too much, thus preventing overfitting effectively.
b) Increasing network complexity: Making the model more complex often worsens overfitting rather than alleviating it. Therefore, it’s not an effective method for dealing with overfitting.
c) Feature selection and regularization: These techniques help reduce model complexity and prevent overfitting by focusing on important features and constraining weights.
d) Expansion of training data: Providing the model with more diverse examples can help improve its ability to generalize to new data and mitigate overfitting.
Hence, increasing network complexity (b) is not an effective method for dealing with overfitting.

169
Q

Question: In the context of correlation-based feature selection (CFS) for model prediction, which type of feature is most desirable for effective model performance? (select one)

A) A feature with high correlation with other features but low correlation with the class label.

B) A feature with a low correlation with both other features and the class label.

C) A feature with a high correlation with both other features and the class label.

D) A feature with low correlation with other features but high correlation with the class label.

A

D

Explanation: Correlation-based Feature Selection (CFS) focuses on selecting feature subsets that are highly correlated with the class label while maintaining low inter-correlation among themselves. This approach aims to reduce redundancy and enhance the predictive power of the model. A feature that exhibits low correlation with other features avoids redundancy, and a feature that exhibits high correlation with the class label ensures its relevance and contribution to accurate model prediction. Therefore, option D is the correct choice.
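
To illustrate the intuition (not the full CFS merit formula), a rough Python sketch on synthetic data: relevance is read from the feature-to-label correlations and redundancy from the feature-to-feature correlations.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly redundant with x1
x3 = rng.normal(size=n)                    # independent feature
y = (x1 + x3 > 0).astype(int)              # label driven by x1 and x3

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})
print(df.corr()["y"])                      # relevance: correlation with the class label
print(df[["x1", "x2", "x3"]].corr())       # redundancy: inter-feature correlation
# A CFS-style selection would keep x1 and x3 (relevant, not redundant) and drop x2.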

170
Q

Question: Regarding the use of categorical features in data analysis tasks, which of the following statements is not accurate? (select one)

A) Nominal data cannot be readily used for machine learning.

B) To represent a nominal feature with k labels in numeric form, at least k one-hot vectors are required, with each vector indicating the presence of a class with a 1 and the absence with a 0.

C) Nominal features can be converted into numeric values through OneHotEncoding, where each category is represented by a unique dummy variable.

D) Numeric values can be readily used for machine learning.

A

B

  • Option A acknowledges the need to convert nominal data for machine learning, typically achieved through OneHotEncoding. Option C elaborates on this conversion, where each nominal category is represented uniquely. Option D underscores the direct applicability of numeric values in machine learning.
  • Option B is incorrect because it misstates how many indicator values are needed. For a nominal feature with k labels, k-1 dummy variables suffice (there are only k-1 degrees of freedom): with labels A, B, and C, two dummy variables are enough, e.g. [1, 0] for A, [0, 1] for B, and [0, 0] for C.

171
Q

What is the primary purpose of feature selection in machine learning models?
A) To increase the number of features for better model accuracy
B) To identify a subset of the most relevant features to improve model performance
C) To transform the features into a higher dimensional space
D) To ensure all features have equal variance

A

B

Feature selection reduces the complexity of the model by eliminating irrelevant or less significant features. This enhances the model’s performance by focusing on features that have more predictive power.

172
Q

What does cross-validation help achieve?
A) It increases the speed of training the model
B) It ensures that every feature is used for training and testing
C) It helps in assessing how the model will generalize to an independent dataset
D) It creates more data points for training the model

A

C

Cross-validation involves partitioning the data into subsets, training the model on some subsets (the training folds) and testing it on the remaining subset (the validation fold), thereby providing a reliable estimate of the model’s performance on unseen data.
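
A brief scikit-learn sketch of K-fold cross-validation; the model and dataset are arbitrary stand-ins.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())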

173
Q

What is the primary focus of data-centric AI? (select one)
1. Model architecture
2. Data collection and preprocessing
3. Algorithm optimization
4. Interpretability of AI models

A

2

Data-centric AI emphasizes the importance of data collection, cleaning, and preprocessing in the AI development process. While model architecture and algorithm optimization are essential components of AI development, data-centric AI recognizes that the quality and relevance of the data used to train AI models significantly impact their performance. By prioritizing data collection and preprocessing, data-centric AI aims to ensure that AI models are trained on high-quality data, leading to more accurate and reliable predictions. Therefore, the correct answer is 2.

174
Q

Which of the following is a common application of the Discrete Fourier Transform (DFT)? (select one)
1. Image compression
2. Speech recognition
3. Data encryption
4. Audio equalization

A

4

The Discrete Fourier Transform (DFT) is a mathematical technique used to transform signals from the time-domain into the frequency-domain. Audio equalization involves adjusting the amplitude of different frequency components in an audio signal. The DFT is commonly used in audio processing applications such as equalization to analyze and manipulate the frequency content of audio signals. Therefore, the correct answer is audio equalization.

175
Q

Which method can be employed to enhance the resolution of the Discrete Fourier Transform (DFT) for a given sampling frequency fs ?

1) Decreasing the number of samples
2) Maintaining a constant number of samples
3) Increasing the number of samples
4) Reducing the sampling frequency

A

3

Increasing samples in the Discrete Fourier Transform enhances resolution by providing more data points for better frequency distinction, crucial in fields like signal processing. Reducing samples or sampling frequency decreases resolution, while keeping samples constant doesn’t improve it.

176
Q

What is a key feature of Data-Centric AI regarding the relationship between AI models and data?
1) Isolation of model and data improvement processes
2) Model-centric approach prioritizing algorithm sophistication
3) Mutual improvement of both model and data through iteration
4) Static data annotation without considering model performance

A

3

3) Mutual improvement of both model and data through iteration
Explanation: Data-Centric AI emphasizes the iterative improvement of both models and data, recognizing that the quality of data directly impacts model performance and vice versa. This iterative process allows for continual refinement and optimization.

177
Q

What does the term “data smells” usually refer to in the context of data analysis?
A. Errors in data that depend on the context in which it is used.
B. Obvious errors in data that do not depend on context.
C. Latent errors in data that are independent of the context.
D. Errors in data that are obvious and depend on the context.

A

C

“Data smells” typically refer to subtle, latent errors or issues within a dataset that are not immediately obvious. These errors are often independent of the specific context. Unlike more overt or context-dependent errors, data smells are more insidious and may only become apparent upon deeper analysis. Identifying and addressing these latent issues is crucial for ensuring the quality and reliability of data analysis.

178
Q

Given this information about a sensor signal that is sampled at its Nyquist frequency:
Duration of the signal: 2 seconds
Bandwidth of the signal: 10Hz
What would be the frequency range after performing a Discrete Fourier Transform (DFT) on this signal, assuming no frequency shift operation is applied?
A. [0, 9.5] Hz
B. [0, 19.5] Hz
C. [-10, 10] Hz
D. [-9.5, 9.5] Hz

A

B

With a bandwidth of 10 Hz, the Nyquist rate is 20 Hz. Sampling at the Nyquist rate means the sampling frequency is 20 Hz. The duration of the signal is 2 seconds, so the total number of samples is 20 × 2 = 40 samples.

The frequency bins are equally spaced from 0 up to (but not including) the sampling frequency. The spacing between each bin is
Sampling Frequency / Number of Samples = 20 Hz / 40 = 0.5Hz

Therefore, the frequency bins will range from 0 Hz in increments of 0.5 Hz up to 19.5 Hz.
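
The bin arithmetic can be checked with a short numpy computation; the values follow the card (fs = 20 Hz, 2 s, 40 samples).

import numpy as np

fs = 20.0                       # sampling rate = Nyquist rate for a 10 Hz bandwidth
duration = 2.0                  # seconds
n = int(fs * duration)          # 40 samples

bins = np.arange(n) * fs / n    # DFT bin frequencies, no frequency shift applied
print(bins[1] - bins[0])        # 0.5 Hz spacing
print(bins[0], bins[-1])        # 0.0 ... 19.5 Hz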

179
Q

In the context of IoT data science, why is addressing data quality issues crucial for developing reliable AI models?

A) To ensure the physical durability of IoT devices.
B) To guarantee the security of data transmission over networks.
C) To enhance the visual appeal of data presentations.
D) To improve the reliability and validity of the insights derived from data analytics.

A

D

Addressing data quality issues, like label errors and biases, directly impacts AI model accuracy, ensuring reliable predictions and insights. High-quality data underpins trustworthy IoT applications, making addressing these issues crucial for dependable outcomes, especially in high-stakes scenarios.

180
Q

Why is it important to choose a sampling rate at least twice the highest frequency of the signal, according to the Nyquist theorem?

A) To double the data storage requirements.
B) To ensure signal processing uses the maximum amount of computational resources.
C) To prevent aliasing, ensuring the sampled signal accurately represents the original signal without frequency distortion. (Correct Answer)
D) To increase the complexity of signal processing.

A

C

Choosing a sampling rate at least twice the highest frequency of the signal, as per the Nyquist theorem, is critical to prevent aliasing. Aliasing can cause different signals to become indistinguishable when sampled, leading to frequency distortion and inaccurate signal representation. This principle ensures the integrity and reliability of signal processing in digital systems.

181
Q

Which of the following statements is incorrect?

a. “Data-centric AI” assumes the exclusion of data cascade problems.
b. “Model-centric AI” focuses on enhancing models based on provided data.
c. “Data-centric AI” primarily focuses on improving data quality, with the indirect goal of also enhancing model performance through better data.
d. “Data-centric AI” includes error analysis, data modification, and quality assessment.

A

A

The data cascade problem refers to the phenomenon where incorrect data can have a cascading effect on the model, leading to inaccuracies in the model’s performance. Data-centric AI has emerged to address this issue, which is often ignored in model-centric approaches.
Therefore, A is the incorrect statement.

182
Q

Which statements about DFT are incorrect?
a. When we use zero padding in the time domain, resolution in the frequency domain becomes lower
b. DFT changes a signal from the time domain to the frequency domain.
c. As the sample size increases, delta_k becomes denser.
d. If the sampling rate is doubled while keeping the time fixed, f_delta(resolution in frequency domain) remains the same.

A

A

Zero-padding in the time domain increases the length of the signal, resulting in a larger number of samples and more frequency bins in the frequency domain. The greater number of bins gives a finer frequency spacing and better frequency localization, i.e., the resolution in the frequency domain does not become lower. Statement a claims the opposite, so it is the incorrect statement and therefore the answer.

183
Q

Which statement about Data-Centric AI is incorrect?
a. Compared to Model-Centric AI, Data-Centric AI uses different data processing methods between iterations.
b. It aims at systematic improvement of data consistency (annotation/labels).
c. AI works as a sociotechnical system.
d. AI centeredness of ‘data work’.

A

d

For Data-Centric AI, we consider human centeredness of ‘data work’ rather than AI centeredness, so d is incorrect.

184
Q

Which distinguishing feature sets Data-Centric AI apart from Model-Centric AI?

a) Data-Centric AI primarily focuses on model architecture modifications, while Model-Centric AI deals with hyperparameter optimization.

b) Data-Centric AI analyzes data errors, modifies data through techniques like data augmentation, and assesses data quality, whereas Model-Centric AI solely focuses on model improvement using the same training data.

c) Data-Centric AI and Model-Centric AI are interchangeable terms referring to different aspects of the same AI methodology.

d) Data-Centric AI is solely concerned with hyperparameter optimization, while Model-Centric AI addresses data errors and inconsistencies.

A

b

Data-Centric AI differs from Model-Centric AI in that it not only focuses on model improvement but also analyzes data errors, modifies data to address issues such as misfit and inconsistency, and evaluates data quality through benchmarking. In contrast, Model-Centric AI primarily concentrates on hyperparameter optimization and model refinement using the same training data without directly addressing data-related issues.

185
Q

Which statement about the spectral resolution of the Discrete Fourier Transform (DFT) is correct?

a) Increasing the sampling frequency improves the spectral resolution.
b) Zero padding increases the spectral resolution.
c) Spectral resolution remains unchanged regardless of the number of samples.
d) Spectral resolution depends only on the original signal’s frequency content.

A

b

Zero padding, which involves adding zeros to the existing samples before computing the Discrete Fourier Transform (DFT), enhances resolution by interpolating between existing data points. It allows us to achieve better frequency resolution without altering the original signal.

186
Q

Which of the following is NOT one of the categories of data smells?
a) Understandability Smells
b) Completeness Smells
c) Believability Smells
d) Consistency Smells

A

b

Data smells are latent by definition, i.e., they represent issues that are present in the dataset and can cause problems later, but are not immediately obvious. Data completeness, by contrast, should be caught at the data preparation stage, early in the data analysis pipeline, so Completeness Smells is not one of the data smell categories.

187
Q

What term describes the loss of high-frequency information due to inadequate sampling?
a) Oversampling
b) Nyquist theorem
c) Frequency folding
d) Aliasing

A

d

a) Oversampling: Oversampling involves using a higher sampling rate than necessary to capture more signal detail, which isn’t the case when the sampling rate is too low.
b) Nyquist theorem: The Nyquist-Shannon theorem states the minimum sampling rate required to avoid aliasing but doesn’t directly explain the loss of high-frequency information due to low sampling rates.
c) Frequency folding: This occurs during aliasing when higher frequencies are incorrectly represented as lower frequencies, but it doesn’t directly explain the loss of high-frequency information due to low sampling rates.
d) Aliasing: Aliasing happens when higher frequencies are misrepresented as lower frequencies due to low sampling rates, directly explaining the loss of high-frequency information.

188
Q

What are the key layers in a modern data platform architecture for ensuring data quality?
a). Data collection, data cleaning, data labeling, model training, and model deployment.
b). Data preprocessing, feature engineering, model training, and model evaluation.
c). Data ingestion, data storage, data transformation, business intelligence, and data governance.
d). Data ingestion, data wrangling, data visualization, model interpretation, and model monitoring.

A

c

A modern data platform architecture for ensuring data quality typically consists of the following key layers: data ingestion, data storage and processing, data transformation and modeling, business intelligence and analytics, data observability (monitoring, tracking, triaging incidents), and data discovery and governance (documenting critical data assets). These layers collectively enable the management and maintenance of data quality throughout the data lifecycle.

189
Q

Given the following DFT coefficients of 8 samples:

Xk = [2, 0, 1, 0, 0, 0, -1, 0]

After applying the FFTShift function to the DFT coefficients, what would be the resulting order of the coefficients?
a). [0, 0, -1, 0, 2, 0, 1, 0]
b). [0, -1, 0, 0, 2, 0, 1, 0]
c). [2, 0, 1, 0, 0, -1, 0, 0]
d). [0, -1, 0, 0, 0, 1, 0, 2]

A

a

The FFTShift function rearranges the DFT coefficients by shifting the zero-frequency (DC) component to the center of the array. For a real-valued signal, the DFT coefficients are symmetric around the DC component. In this case, the DC component is the first element (X[0] = 2), and the remaining coefficients are arranged in alternating positive and negative frequency components.

After applying FFTShift, the DC component is moved to the center, and the positive and negative frequency components are arranged in a contiguous manner on either side of the DC component. The resulting order is: [negative freq. components, DC component, positive freq. components].

Therefore, the correct answer is option a.
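
The reordering can be verified directly with numpy:

import numpy as np

Xk = np.array([2, 0, 1, 0, 0, 0, -1, 0])
print(np.fft.fftshift(Xk))      # [ 0  0 -1  0  2  0  1  0]  -> matches option a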

190
Q

Which of the following concepts is crucial for preventing aliasing in digital signal processing?

A) Nyquist sampling rate
B) Discrete Fourier Transform (DFT)
C) Symmetry property of DFT
D) Resolution property of DFT

A

a

The Nyquist sampling rate is crucial in digital signal processing to prevent aliasing. It mandates that the sampling rate must be at least twice the signal’s highest frequency (Nyquist frequency) to accurately represent it digitally. Aliasing occurs with insufficient sampling rates, misinterpreting high frequencies as lower ones, causing distortion. Adhering to Nyquist’s rate ensures faithful representation of analog signals without artifacts, vital for digital signal processing fidelity. Understanding and applying Nyquist’s principle is essential for maintaining accuracy and fidelity in digital systems.

191
Q

Which of the following is not a methodology for data quality assessment and improvement, nor a representative quality metric according to Carlo Batini et al. (2009)?

A) Consistency
B) Timeliness
C) Sampling
D) Validity

A

c

According to Carlo Batini et al.(2009), the listed quality metrics (Consistency, Completeness, Timeliness, Validity, Uniqueness, Accuracy) represent characteristics or dimensions used to assess the quality of data. These metrics help in understanding the strengths and weaknesses of the data for different aspects such as format consistency, absence of missing values, up-to-dateness, representation of the real world, uniqueness of values, and accuracy in reflecting true values.

Option C) Sampling is not a methodology for data quality assessment or improvement, nor is it a representative quality metric as outlined by Batini et al. Sampling typically refers to the process of selecting a subset of data from a larger population for analysis or evaluation. While sampling can be a technique used within some methodologies, it is not inherently a methodology itself nor a quality metric. Therefore, it is the correct choice for this question.

192
Q

You are working with high-resolution digital signals and have decided to perform undersampling to improve computational efficiency. To prevent aliasing in this scenario, which of the following steps must be taken? (select one)

A) Add zero padding to the data after undersampling.

B) Filter out the frequencies that are higher than half of the undersampling frequency.

C) Remove data points that are considered outliers.

D) Eliminate all frequencies that are higher than the undersampling frequency

A

B

Undersampling is applied to high-resolution signals to improve computational efficiency. However, it risks aliasing: frequency components above the new Nyquist limit distort the signal once it is sampled below their Nyquist rate. To avoid this, frequencies above half of the new, lower sampling frequency must be filtered out before downsampling.
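
A hedged scipy sketch of that idea: low-pass filter at half of the new sampling rate, then keep every M-th sample. The rates, filter order, and test tones are illustrative assumptions (scipy.signal.decimate bundles the same filter-then-downsample step).

import numpy as np
from scipy import signal

fs = 1000.0                       # original sampling rate (Hz), assumed
factor = 4                        # undersample by 4x -> new rate 250 Hz
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 300 * t)   # 300 Hz would alias at 250 Hz

# Anti-aliasing low-pass at half of the new rate (125 Hz), then downsample
sos = signal.butter(8, (fs / factor) / 2, btype="low", fs=fs, output="sos")
x_down = signal.sosfiltfilt(sos, x)[::factor]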

193
Q

Which one of the following is a key characteristic that distinguishes Data-Centric AI from Model-Centric AI? (select one)

A) Hyperparameter optimization of the model.
B) Initial preprocessing of data.
C) Iterative training to improve model performance on test data.
D) Assessing and improving data quality after the initial model training

A

D

Data-centric AI emphasizes the continuous enhancement of data quality after the initial training of the model. This includes tasks such as systematic data fit improvement, ensuring data consistency, and fostering mutual growth through model-data iteration. In contrast, Model-Centric AI often focuses on refining the model through hyperparameter tuning using the same initial dataset.

194
Q

In the context of IoT and data science, what does a “Data Cascade” refer to?
A) A waterfall model of data processing
B) A rapid increase in data volume
C) Compounding events causing negative downstream effects from data issues
D) The sequential flow of data through sensors

A

C

Data Cascades involve compounding events that cause negative downstream effects due to data issues, leading to technical debt over time. This highlights the importance of addressing data quality early in the data lifecycle.

195
Q

In the context of DFT (Discrete Fourier Transform), what does zero padding accomplish?
A) It changes the fundamental frequency of the signal
B) It allows the DFT to handle signals of any length
C) It increases the resolution of the DFT
D) It reduces the computational complexity of the DFT

A

C

Zero padding involves adding zeros to the end of a signal to increase the number of samples, which can improve the resolution of the DFT by making the frequency bins closer together. This technique does not change the fundamental frequency of the signal but rather helps in providing a more detailed frequency analysis by interpolating between the original DFT bins.
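
A small numpy sketch of zero padding, using an arbitrary test tone: the padded transform has more, more closely spaced frequency bins, while the underlying signal is unchanged.

import numpy as np

fs, n = 100.0, 64
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 12.5 * t)

spec_plain = np.fft.rfft(x)               # 64-point DFT: bin spacing fs/64  ~ 1.56 Hz
spec_padded = np.fft.rfft(x, n=4 * n)     # zero-padded to 256 points: fs/256 ~ 0.39 Hz
print(len(spec_plain), len(spec_padded))  # 33 vs 129 frequency bins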

196
Q

Choose the correct order of Digital Signal Processing (DSP) systems.
(Select one)

A) Analog Filter -> Digital Processing -> ADC -> DAC -> Analog Filter
B) Digital Processing -> ADC -> DAC -> Analog Filter -> Analog Filter
C) ADC -> Analog Filter -> Digital Processing -> Analog Filter -> DAC
D) Analog Filter -> ADC -> Digital Processing -> DAC -> Analog Filter

A

D

The sequence in Digital signal processing (DSP) systems is as follows: First, raw analog data is passed through an input analog filter to produce a filtered input. This filtered signal is then converted from analog to digital format using ADC. After this conversion, the digital signal undergoes digital processing. Once the processing is complete, the signal is converted back to analog format using DAC. The final step is to pass the reconstructed signal through an output analog filter for output.

So the sequence is input analog filter -> ADC -> Digital Processing -> DAC -> output analog filter.

197
Q

Which of the following is not a component of the Dataset Development Lifecycle?
(Select one)

A) Maintenance
B) Design
C) Implementation
D) Using
E) Requirements Analysis

A

D

The components of the Dataset Development Lifecycle include Requirements Analysis, Design, Implementation, Testing, and Maintenance. “Using” refers to the stage where the developed dataset is utilized for various tasks or analyses. While it is an essential part of the overall data lifecycle, it is not specifically a component of the dataset development process. The primary focus of the Dataset Development Lifecycle is on creating, improving, and maintaining datasets, rather than on their utilization in practical applications or analyses. Therefore, “Using” is not considered a component of the Dataset Development Lifecycle.

198
Q

Among the following representative quality metrics for data, which one does not
match with its original explanation? (Choose one)

A) Consistency - The extent to which data correctly reflects true value
B) Timeliness - The extent to which data is up-to-date
C) Uniqueness - The extent to which data contains unique value
D) Completeness - The extent to which data is not missing

A

A

Representative quality metrics for data consist of 6 metrics which are Consistency, Completeness, Timeliness, Validity, Uniqueness and Accuracy.
Among them, “Consistency” is the extent to which data follows the same format. Hence, the answer is A. The extent to which data correctly reflects the true value is measured as “Accuracy”. The rest of the options match their original explanations. Although not among the options, “Validity” is the extent to which data represents the real world.

199
Q

Among the following, what is the root cause for aliasing problem?

A) Zero padding
B) Low Resolution
C) Low sampling rate
D) Noisy data

A

C

Aliasing refers to the problem where the sampling rate is too low, so that the sampled signal does not properly represent the true signal. It is recommended to sample at at least twice the original signal’s bandwidth (the Nyquist sampling rate). Zero padding and resolution are concepts related to the DFT, and noisy data is not a main cause of aliasing.

200
Q

What is different between Model-Centric AI and Data-Centric AI?
A) Model-Centric AI focuses on refining algorithms, while Data-Centric AI emphasizes improving data quality.

B) Model-Centric AI aims to increase data volume, whereas Data-Centric AI prioritizes enhancing model complexity.

C) Model-Centric AI relies solely on human-generated data, while Data-Centric AI utilizes only machine-generated data.

D) Model-Centric AI is about automating data cleaning processes, and Data-Centric AI is about manual data analysis.

A

A

The lecture contrasts Model-Centric AI and Data-Centric AI by highlighting that Model-Centric AI focuses on refining the algorithms or models themselves, whereas Data-Centric AI emphasizes the importance of improving the quality of the data used for training these models. This distinction underlines the shift from solely enhancing algorithmic approaches to recognizing the critical role of high-quality, well-labeled, and consistent data in achieving effective AI outcomes. Therefore, option A is the correct answer.

201
Q

What is the Nyquist sampling rate in relation to a signal’s bandwidth?

A) Twice the signal’s bandwidth for accurate signal representation.
B) Equal to the signal’s bandwidth to minimize data size.
C) Half the signal’s bandwidth to reduce processing load.
D) Four times the signal’s bandwidth for enhanced clarity.

A

A

The Nyquist sampling rate should be twice the signal’s bandwidth to accurately represent the signal without aliasing.

202
Q

Which of the following is not a correct description of Data-Centric AI?

a) Data-Centric AI is an approach that emphasizes the quality and quantity of data to enhance the performance of models.
b) Data-Centric AI focuses on the characteristics and quality of data rather than the model’s structure or algorithms.
c) Data-Centric AI aims to improve the entire AI process by focusing on data collection, preprocessing, quality management, and model evaluation.
d) Data-Centric AI utilizes the same data to improve model performance.

A

d

Data-Centric AI aims to enhance model performance by prioritizing the quality and quantity of data. To improve model performance, it is essential to collect, manage, preprocess, and appropriately evaluate the data. Option d), which describes reusing the same data while only improving the model, is therefore a description of Model-Centric AI rather than Data-Centric AI.

203
Q

When the bandwidth of a certain signal is fixed at 20 kHz and the signal duration is increased from 1s to 2s, which of the following statements is correct?

a) Nyquist sampling rate increases.
b) Nyquist sampling rate decreases.
c) Frequency resolution increases.
d) The amount of sampled data remains unchanged.

A

c

(a), (b): Since the bandwidth of the signal is fixed, increasing the duration of the signal does not change the Nyquist sampling rate. Thus, option (a) and (b) are incorrect.
(c): Since the bandwidth of the signal is fixed, increasing the duration of the signal enhances the frequency resolution. Frequency resolution is inversely proportional to the time duration; hence, increasing the duration enhances the frequency resolution. Therefore, option c) is correct.
(d): The amount of sampled data doubles as the time increases from 1 second to 2 seconds. Thus, option d) is incorrect.

204
Q

How can you increase the frequency resolution when applying the DFT on a signal?

A: Increase signal duration.
B: Increase sampling frequency.
C: Decrease sampling frequency.
D: Decrease signal duration.

A

A

The frequency spacing is given by f/N (a lower value means a higher resolution), where f is the sampling rate and N is the number of samples; the number of samples is itself f·s, where s is the signal duration in seconds, so the spacing simplifies to 1/s. The only way to improve the resolution is therefore to increase the signal duration, hence the answer is A.

205
Q

What is the unique characteristic of Data-Centric AI compared to Model-Centric AI?
A: Data Preprocessing
B: Hyperparameter Optimization
C: Data Optimization
D: Iterative Optimization

A

C

The answer is C; all of the other steps exist (and are necessary) for both Data-Centric and Model-Centric AI. In either case, data preprocessing must be done for the model to train properly, and hyperparameter and iterative optimization are needed to train the model. However, in Model-Centric AI, you only update the model, not the data.

206
Q

What is the significance of Data-Centric AI in improving the quality of IoT data science projects?
A) It focuses on the development of advanced machine learning models.
B) It emphasizes the iterative improvement of data quality and model performance through continuous interaction with domain experts and AI experts.
C) It relies exclusively on the use of DataOps to automate the data analytics process without human intervention.
D) It prioritizes the development of new AI data platforms over improving existing data quality and reliability.

A

B

The principles of Data-Centric AI focus on systematically improving the data fit and consistency of data, emphasizing the importance of data quality over the intricacies of model design. This approach advocates for continuous and substantive interactions between AI and domain experts, aiming to enhance both the reliability of AI predictions and the overall utility of AI in real-world applications.

207
Q

In the context of sensor data quality for IoT applications, which of the following is NOT a common type of error?
A) Outliers, which are anomalies or spikes in the data.
B) Bias, described as a constant offset affecting the data.
C) Encryption errors, where data is incorrectly secured or decrypted.
D) Drift, where the properties of the data gradually deviates over time.

A

C

Encryption errors are a problem for data storage and transmission and raise privacy concerns, but they are not a common type of sensor measurement error; outliers, bias, and drift are. Hence, C is the answer.

208
Q

What is a not valid type of measurement error?
A. Outlier
B. Data cascades
C. Drift
D. Stuck at zero

A

B

Data cascades are not a kind of measurement error; rather, they are the downstream result of accumulated errors.

209
Q

Which one is the right Nyquist sampling rate?
A. Sampling rate should be at least twice the highest frequency
B. Sampling rate should be at least three times the highest frequency
C. Sampling rate should be at least equal to the highest frequency
D. Sampling rate can be anything, it doesn’t matter

A

A

This follows from the definition of the Nyquist sampling rate: to avoid aliasing, sample at a rate at least twice the highest frequency present in the signal.

210
Q

Which documentation assesses whether a dataset meets its requirements and is safe to use in the data development life cycle?
a) Dataset Requirements Specification
b) Dataset Design Document
c) Dataset Testing Report
d) Dataset Maintenance Plan
e) Dataset Implementation Diary

A

c

Dataset testing report is specifically designed to assess the necessity of the dataset and whether the dataset meets its requirements and is safe to use. It typically includes information about the testing procedures, test results, and any issues or concerns identified during the testing phase. By evaluating the dataset against predefined requirements and safety criteria, the testing report ensures the quality and reliability of the dataset before it is utilized in the data development life cycle. Therefore, the answer is c.

211
Q

What are the correct approaches to improve frequency resolution? (Select all that apply)
a) Increasing overlap between segments
b) Reducing the window length
c) Increasing the number of samples
d) Employing shorter time intervals
e) Implementing zero-padding

A

a, c, e

To enhance frequency resolution, increasing overlap between segments (a), increasing the number of samples (c), and implementing zero-padding (e) are effective strategies. Increasing segment overlap reduces spectral leakage and effectively lengthens the usable analysis window, which is particularly beneficial in methods like the short-time Fourier transform (STFT). Increasing the number of samples extends the signal’s duration, allowing for finer frequency distinctions in spectral analysis. Zero-padding adds zeros to the end of a signal before performing a Fourier transform, increasing the number of data points and improving the apparent frequency resolution by interpolating between existing points in the frequency domain. Reducing the window length (b) or employing shorter time intervals (d) may improve time resolution but not frequency resolution.

212
Q

In the context of IoT data science, what best describes the concept of “Data Cascades”?

a) A method of incrementally improving data quality through iterative cycles of data cleansing.
b) A sequence of compounding events that lead to significant negative downstream effects from initial data issues.
c) The process of systematically increasing the volume of data for better machine learning model training.
d) The technique of layering data from various sources to enhance data dimensionality and richness.

A

b

a) Incorrect. While iterative cycles of data cleansing are important in data management, they do not specifically refer to the compounding negative effects caused by initial data issues, which is the hallmark of Data Cascades.
b) Correct.
c) Incorrect. Systematically increasing the volume of data for machine learning models is a practice in data collection and augmentation, but it does not address the compounding negative effects caused by initial data problems as described by Data Cascades.
d) Incorrect. Layering data from various sources can enrich the data set but does not specifically refer to the negative downstream effects that result from initial data issues, which is what Data Cascades describe.

213
Q

In Digital Signal Processing (DSP), why is the Nyquist sampling rate considered critical for accurately capturing the essence of a continuous signal?

a) It is the minimum rate at which the signal must be sampled to avoid loss of information.
b) It ensures the highest possible resolution in the digital representation of the signal.
c) It corresponds to the maximum frequency component of the signal.
d) It is the rate that guarantees the elimination of all noise from the signal.

A

a

a) Correct
b) Incorrect. While higher sampling rates can lead to greater resolution in the digital representation of the signal, the Nyquist rate specifically addresses the minimum requirement to avoid aliasing, rather than the resolution itself.
c) Incorrect. The Nyquist rate is actually twice the maximum frequency component of the signal, not equal to it. This distinction is crucial to prevent aliasing and accurately capture the signal in its digital form.
d) Incorrect. The Nyquist rate is concerned with avoiding aliasing and accurately capturing the signal, not with noise elimination. Noise reduction involves different techniques and considerations beyond just the sampling rate.

214
Q

Given the significance of Data Cascades in affecting the quality of AI systems, which preventative measure is most effective in mitigating their negative impact early in the data lifecycle?
A) Increasing the computational power allocated to data processing tasks.
B) Implementing comprehensive data quality checks and validation at each stage of data collection and processing.
C) Focusing solely on the enhancement of AI algorithms without considering data quality.
D) Prioritizing the speed of data ingestion over the accuracy and completeness of the data.

A

B

To mitigate the negative impact of Data Cascades, it’s crucial to implement comprehensive data quality checks and validations at each stage of the data lifecycle. This approach helps identify and address data issues early, preventing them from compounding and leading to significant technical debt and negative downstream effects on AI systems.

215
Q

In the context of Discrete Fourier Transform (DFT), why is zero padding used?
A) To increase the total energy of the signal.
B) To reduce the computation time for the Fourier transform.
C) To improve the frequency resolution of the DFT by simulating a longer data acquisition time.
D) To decrease the bandwidth of the signal for faster data transmission.

A

C

Zero padding is a technique used in DFT to improve the frequency resolution by simulating a longer data acquisition time. It involves adding zeros to the end of the signal, which does not alter the signal’s content but increases the number of points in the DFT, thereby improving the frequency resolution.

216
Q

In Data-Centric AI, various strategies can be implemented to address data issues at each iteration. Select the appropriate strategy that Data-Centric AI can employ from the following options:

(a) Verifying the accuracy and consistency of data annotation.
(b) Augmenting data to improve data fitness.
(c) Assessing data quality.
(d) Evaluating the effectiveness of the data in training the model.
(e) All of the above.

A

e

To address this question, it is essential to comprehend the principles of Data-Centric AI. Data-Centric AI utilizes strategies to enhance the quality of data with each iteration. Option (a) is employed to verify errors within the data. Option (b) serves to enhance and refine the dataset. Option (c) functions as a strategy for evaluating data quality. Lastly, option (d) acts as a strategy based on data fit. Therefore, the correct answer is (e).

217
Q

Consider a signal with a duration of 1 second and a fixed bandwidth f_b=10 kHz. If the duration of this signal is increased to 2 seconds, which of the following statements are correct? (Select two)

  1. The total number of samples increases.
  2. The Nyquist sampling rate increases.
  3. Δf is 1 Hz.
  4. The resolution increases.

A

1, 4

When calculating the Nyquist sampling rate for a signal with a duration of 1 second and a fixed bandwidth of 10 kHz, it is 2*f_b, which equals 20 kHz. Also, the total number of samples is 20 kHz * 1 s = 20k, and Δf is 20 kHz / 20k = 1 Hz. Now, if the duration changes to 2 seconds, the Nyquist sampling rate remains at 20 kHz, and the total number of samples becomes 20 kHz * 2 s = 40k. Additionally, Δf becomes 20 kHz / 40k = 1/2 Hz. As the duration changes to 2 seconds, the total number of samples increases, and the resolution increases as well. On the other hand, the Nyquist sampling rate remains the same, and Δf becomes 1/2 Hz. Therefore, statements 1 and 4 are correct.

218
Q

Which of the following is NOT a reason why data scientists might choose to use their own methods for data quality inspection?

(A) Utilize independent understanding of data characteristics
(B) Develop inspection procedures tailored to project specifics
(C) Flexibly handle various data formats and structures
(D) Data quality tools always provide worse results

A

D

(A), (B), and (C) are all valid reasons why data scientists might choose to use their own methods for data quality inspection. (D) is an overgeneralization, since dedicated data quality tools do not always give worse results, so it is the answer.

219
Q

When calculating correlation using only cos(), which information can be obtained?

(A) Magnitude of correlation
(B) Sign of correlation
(C) Phase difference
(D) (A) and (B)

A

D

If only cos() is used to calculate the correlation, we can obtain only the magnitude and sign of the correlation, and we cannot obtain phase information.

220
Q

Which of the following sampling rates would result in aliasing given the Nyquist sampling criterion?

  1. A 50Hz signal sampled at 150Hz.
  2. A 10Hz signal sampled at 15Hz.
  3. A 1Hz signal sampled at 2Hz.
  4. An 8Hz signal sampled at 64Hz

A

2

Aliasing occurs when the sampling rate is too low to accurately represent the frequency content of the original signal. According to the Nyquist sampling theorem, the sampling rate must be at least twice the maximum frequency present in the signal to avoid aliasing. Therefore, option 2 (a 10 Hz signal sampled at 15 Hz, which is below the required 20 Hz) is the correct answer, as it violates the Nyquist criterion and would result in aliasing.
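
Option 2 can be checked numerically: a 10 Hz sine sampled at 15 Hz produces exactly the same samples as its folded 5 Hz alias.

import numpy as np

fs = 15.0                                     # below the required 20 Hz Nyquist rate
n = np.arange(30)
t = n / fs

x_true = np.sin(2 * np.pi * 10 * t)           # 10 Hz signal
x_alias = np.sin(2 * np.pi * (10 - fs) * t)   # folded -5 Hz alias
print(np.allclose(x_true, x_alias))           # True: indistinguishable after sampling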

221
Q

Which of the following best describes Data-centric AI?

  1. A machine-centric approach focusing on ‘data work’.
  2. The mutual improvement of models and data through a one-shot stage.
  3. The systematic enhancement of data consistency.
  4. The systematic increase in data quantity.
A

3

Option 3 accurately depicts Data-centric AI by emphasizing the systematic enhancement of data consistency. Let’s examine the other options. Firstly, Data-Centric AI prioritizes human-centered handling of data, making option 1 incorrect. Additionally, it entails iterative refinement of both models and data, making option 2 inaccurate. Moreover, Data-Centric AI focuses on ensuring data suitability rather than merely increasing quantity, rendering option 4 less pertinent. Therefore, option 3 remains the most fitting description of Data-centric AI.

222
Q

Which statement accurately defines the Nyquist rate?
A) Nyquist rate is the minimum frequency accurately represented in a digital signal; aliasing occurs with frequencies below it, causing false higher-frequency components.

B) Nyquist rate is the maximum frequency accurately represented in a digital signal; aliasing occurs with frequencies above it, causing false lower-frequency components.

C) Nyquist rate is the average frequency of a signal; aliasing happens when the sampling rate exceeds it, leading to an incomplete signal representation.

D) Nyquist rate is the frequency where aliasing occurs; it always results in signal information loss, regardless of the sampling rate.

A

B

This statement accurately captures the essence of the Nyquist theorem, which states that in order to accurately represent a signal digitally, the sampling rate must be at least twice the frequency of the highest frequency component in the signal. If frequencies above the Nyquist rate are present in the signal and are not properly filtered out before sampling, aliasing occurs, leading to false lower-frequency components in the digitized signal.

223
Q

Which of the following statements accurately distinguishes between Model-Centric AI and Data-Centric AI?

A) Model-Centric AI primarily focuses on optimizing algorithms and models for specific tasks, whereas Data-Centric AI emphasizes the importance of high-quality, diverse datasets in driving model performance.

B) Model-Centric AI prioritizes the collection and labeling of massive amounts of data, while Data-Centric AI concentrates on fine-tuning model architectures and hyperparameters to achieve optimal performance.

C) Model-Centric AI relies on a single dominant algorithm or model architecture across various applications, whereas Data-Centric AI tailors its approach based on the specific characteristics and requirements of each dataset.

D) Model-Centric AI emphasizes the interpretation and explanation of model predictions, while Data-Centric AI focuses on the efficient storage and retrieval of vast amounts of training data.

A

A

Only A correctly highlights the fundamental difference between the two AI approaches. Model-Centric AI is about refining and optimizing the models and algorithms themselves to enhance performance, which often involves tweaking model architectures or algorithm parameters specific to a task. On the other hand, Data-Centric AI shifts the focus towards the data used to train these models. It posits that by ensuring the data is of high quality and diversity, you can significantly improve the model’s performance. This approach involves careful curation, cleaning, and possibly augmenting data to better train models, rather than primarily focusing on the model’s internal adjustments.

224
Q

Considering the concept of “Data Cascades”, what are they, and how do they impact AI projects? (select one)

A) Data Cascades refer to the rapid increase in data volume that overwhelms storage systems, leading to data loss.
B) They are the sequential processing steps in data pipelines, ensuring smooth flow and transformation of data from raw to analytics-ready forms.
C) Data Cascades are compounding events causing negative downstream effects from initial data issues, resulting in technical debt over time.
D) They describe a phenomenon where data quality improves exponentially as it passes through multiple stages of cleaning and preprocessing.

A

C

Data Cascades in the context of AI and data science refer to a series of compounding negative effects triggered by initial data quality issues. These cascading events can lead to a deterioration in model performance, increased costs, and delays in project timelines. Significantly, they contribute to the accumulation of technical debt, as initial shortcuts or oversights in data handling and preprocessing become more challenging to address over time. Recognizing and mitigating data cascades early in the development cycle is crucial to maintaining the integrity and reliability of AI systems.

225
Q

What is the Nyquist rate, and why is it important in the context of digital signal processing (DSP)? (select one)

A) The Nyquist rate is the minimum sampling rate at which a signal can be accurately recorded, twice the highest frequency present in the signal.
B) The Nyquist rate refers to the minimum bit rate required for compressing digital signals without losing quality.
C) It is the rate at which power consumption of digital devices doubles with each additional processing unit.
D) The Nyquist rate is the minimum sampling interval needed for digital filters to operate correctly without any input signal.

A

A

The Nyquist rate is fundamental in digital signal processing as it represents the minimum sampling rate required to accurately capture a signal without aliasing. Sampling at least at the Nyquist rate, which is twice the signal’s highest frequency, ensures that the original signal can be fully reconstructed from its samples, preventing the loss of information and avoiding the misinterpretation of signal frequencies (aliasing).

226
Q

What does adhering to the Nyquist theorem ensure when choosing a sampling rate for signal processing?
A) Maximizes computational resource usage
B) Prevents aliasing and accurately represents the signal
C) Increases data storage requirements
D) Simplifies the signal processing complexity

A

B

The Nyquist theorem states that to accurately sample a continuous signal without information loss, the sampling rate must be at least twice the highest frequency present in the signal. This prevents aliasing, ensuring that the sampled signal can be accurately reconstructed from its samples.

227
Q

Considering the Discrete Fourier Transform (DFT) and Data-Centric AI, which of the following statements is incorrect?

A) Adding more data points (zero padding) in the time domain of a DFT increases the frequency resolution.
B) In Data-Centric AI, systematic error analysis and data refinement aim to improve model performance by enhancing data quality rather than adjusting the model itself.
C) Doubling the sampling frequency of a signal increases the Nyquist frequency, thereby expanding the bandwidth of the signal that can be accurately represented.
D) Data-Centric AI prioritizes adjusting model parameters over addressing data quality issues to achieve optimal performance.

A

D

Explanation:
This statement is incorrect because Data-Centric AI actually emphasizes improving data quality (through error analysis, data cleaning, and enhancement techniques) as a primary approach to improving AI model performance, rather than focusing on adjusting the model’s parameters or architecture. On the other hand, statements A and C are correct in their respective contexts, and B accurately describes a core principle of Data-Centric AI, making D the incorrect choice.

228
Q

Questions:What is the correct order of words to fill in the blank spaces?

Windowing can introduce abrupt transitions at the window edges, causing a discrepancy between the ___ and ___ of the window. This mismatch results in the ___ of energy across different frequency components, characteristic of spectral leakage.

A) start - end - spreading

B) middle - end - enhancement

C) start - middle - reduction

D) start - end - concentration

A

A

Windowing in signal processing multiplies a signal with a window function, which can cause abrupt transitions at the window’s start and end. This discrepancy leads to spectral leakage, where energy spreads across different frequency components rather than remaining concentrated. This spreading effect is due to the abrupt changes, causing the energy to disperse into frequencies outside the main component, altering the signal’s spectrum. Thus, “start” and “end” relate to window edges, and “spreading” describes the resulting dispersion of energy.

229
Q

Why is a low-pass filter in the time domain preferred for real-time applications over a frequency domain filter?
a) Time domain filters are simpler to implement
b) Frequency domain filters introduce significant delay
c) Time domain filters offer better noise reduction
d) All of the above

A

d

Time domain filters are generally simpler to implement and require less computational resources compared to frequency domain filters.
Frequency domain filters often introduce significant processing delays due to the need for Fourier transforms and inverse transforms.
Time domain filters are effective in reducing noise and unwanted signal components in real-time data streams (e.g., time domain filters can adapt their parameters in real-time based on the characteristics of the input signal).

230
Q

In the context of digital filters, what does the term “roll-off” refer to?
A) The cutoff frequency where the filter starts attenuating frequencies.
B) The rate of attenuation in the stopband region.
C) The amount of ripple allowed in the stopband region.
D) The slope of the filter’s frequency response in the transition band.

A

D

Roll-off in digital filters refers to how quickly the filter attenuates frequencies beyond the passband, specifically, the steepness or slope of the filter’s attenuation in the transition band between the passband and stopband. A steep roll-off means the filter more sharply reduces the amplitude of frequencies outside its passband, which is desirable in many applications for a clear distinction between filtered and attenuated frequencies.

So the clear answer is D)

231
Q

Which type of digital filter permits frequencies below a specific cutoff frequency to pass through unchanged while suppressing frequencies above this cutoff?
a) Low-pass filter
b) High-pass filter
c) Band-pass filter
d) Band-reject filter

A

a

A low-pass filter is designed to allow frequencies below a certain cutoff frequency to pass through unaffected, while attenuating frequencies above the cutoff. This means that signals with frequencies lower than the cutoff frequency will be transmitted with little or no attenuation, while signals with frequencies higher than the cutoff will be reduced in magnitude. Therefore, the correct choice is a) Low-pass filter.

232
Q

How can spectral leakage be reduced when processing a signal with a digital filter?

A) By increasing the signal’s frequency beyond the Nyquist rate.
B) By using window functions to smooth the abrupt ends of a signal before applying the Discrete Fourier Transform (DFT).
C) By converting the digital signal to analog before filtering.
D) By decreasing the sampling rate below the signal’s highest frequency.

A

B

Spectral leakage, an artifact that occurs when a signal is not periodic within its sample window, can distort the signal’s frequency spectrum. This distortion can be mitigated by applying window functions, such as the Hamming or Blackman window, to smooth the signal’s abrupt ends before performing the DFT. This process reduces the leakage by minimizing the energy spread caused by the abrupt signal cutoff, enhancing the filter’s effectiveness and the accuracy of the spectral analysis.
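
For illustration, a minimal numpy sketch comparing a rectangular (no) window with a Hamming window; the 10.3 Hz tone is chosen deliberately so it does not fall on a DFT bin and leakage is visible.

import numpy as np

fs, n = 100.0, 128
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 10.3 * t)              # tone between DFT bins -> leakage

spec_rect = np.abs(np.fft.rfft(x))                   # abrupt ends (rectangular window)
spec_hamm = np.abs(np.fft.rfft(x * np.hamming(n)))   # smoothed ends

# Far from 10.3 Hz the windowed spectrum carries much less leaked energy
print(spec_rect[40:45].round(3))
print(spec_hamm[40:45].round(3))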

233
Q

Which of the following scenarios is most likely to approximate the impulse response of a system? Hint (Consider the Convolution Reverb experiment discussed in class)

A. The system’s response to a loud pop
B. The system’s response to a clap
C. The system’s response to a song
D. Both A and B

A

D

Option A - The System’s Response to a Loud Pop: A loud pop is a short, sharp sound that can somewhat mimic an impulse.
Option B - The System’s Response to a Clap: Like a loud pop, a clap is also a brief and high-energy sound. It approximates an impulse.
Option C - The System’s Response to a Song: A song is a complex signal with varying frequencies, rhythms, and intensities. It bears little resemblance to an impulse, which is a singular, instantaneous event.

In the context of the convolution reverb experiment, we see that a loud pop and a clap were used to characterize the building (system).

234
Q

What is the main advantage of using a low-pass filter in audio applications?
1. Enhancing high-frequency components
2. Removing low-frequency noise
3. Preserving low-frequency signals while attenuating high-frequency noise
4. Attenuating low-frequency components to improve signal clarity

A

3

A low-pass filter is designed to attenuate or block high-frequency components of a signal while allowing low-frequency components to pass through relatively unaffected. This means that high-frequency components, such as noise or interference, are suppressed by the filter, resulting in a smoother output signal with reduced high-frequency content.
Therefore, the answer is 3.
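
A hedged Python sketch of this use case (sampling rate, cutoff, and test frequencies are illustrative assumptions, not from the question): a Butterworth low-pass keeps a 200 Hz component and attenuates 6 kHz noise.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 16000                                     # sampling rate (Hz), illustrative
t = np.arange(0, 0.1, 1 / fs)
wanted = np.sin(2 * np.pi * 200 * t)           # low-frequency content to preserve
noise = 0.3 * np.sin(2 * np.pi * 6000 * t)     # high-frequency noise to remove

b, a = butter(4, 1000, btype="low", fs=fs)     # 4th-order low-pass, 1 kHz cutoff
cleaned = filtfilt(b, a, wanted + noise)       # zero-phase filtering of the noisy mix
```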

235
Q

Which statements about ‘Spectral leakage’ are incorrect?(Choose 2)
a. Abrupt changes at the boundaries of the window in the time domain could cause spectral leakage
b. Applying hamming window can reduce spectral leakage.
c. Spectral leakage never occurs when the window size is sufficiently small.
d. Spectral leakage is primarily caused by a low sampling frequency.

A

C,D

Abrupt changes (discontinuities) at the window boundaries can cause spectral leakage.
Since the Hamming window smooths the truncation and reduces its side effects, it can reduce spectral leakage. So (a) and (b) are correct.
(c) Even with a small window, truncation can occur and lead to spectral leakage, so a small window does not rule it out.
(d) Spectral leakage is not directly caused by a low sampling frequency. Even at high sampling frequencies, if the signal contains frequency components that do not align with the window boundaries, spectral leakage can still occur. Therefore, (c) and (d) are the incorrect statements.

236
Q

What is the recommended approach to minimize spectral leakage when processing a signal?
A) Increasing the window size to infinity to capture all frequencies accurately.
B) Applying a rectangular window to make the signal more abrupt at the ends.
C) Utilizing windowing functions like Hamming or Blackman to smooth the abrupt ends of the signal.
D) Increasing the sampling rate to the maximum possible to avoid any leakage.

A

C

Spectral leakage occurs when a signal is truncated, leading to the spreading of its energy across other frequencies. To mitigate this effect, the slide recommends smoothing the abrupt ends of the signal by applying windowing functions such as Hamming or Blackman. These windows reduce the leakage by tapering the signal at its ends, rather than abruptly cutting it off, which is characteristic of a rectangular window.

237
Q

The following contains concepts related to signal processing. Which of the following is incorrect? (select one)
A) Window functions such as Hamming and Hanning can be utilized to reduce spectral leakage.
B) In the frequency domain, the input signal and filter can create an output signal through convolution.
C) Insufficient windowing size to contain the original signal can lead to spectral leakage.
D) Spectral leakage can occur due to abrupt changes in the signal.

A

B

Window functions like Hamming and Hanning are commonly used in signal processing to reduce spectral leakage by smoothing abrupt ends of the signal.
In the frequency domain, the input signal and filter can create an output signal through multiplication, not convolution.
If the windowing size is not sufficient to contain the original signal, spectral leakage can occur. Abrupt changes or ends in the signal can result in spectral leakage.
Therefore option B is incorrect.
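
The multiplication/convolution relationship behind option B can be checked numerically; the sketch below (with arbitrary random signals, an illustrative choice) verifies that the DFT of a time-domain convolution equals the element-wise product of the individual DFTs.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(64)                    # arbitrary input signal
h = rng.standard_normal(16)                    # arbitrary filter impulse response

N = len(x) + len(h) - 1                        # length of the linear convolution
lhs = np.fft.fft(np.convolve(x, h), N)         # spectrum of the time-domain convolution
rhs = np.fft.fft(x, N) * np.fft.fft(h, N)      # product of the individual spectra
print(np.allclose(lhs, rhs))                   # True: time-domain convolution <-> frequency-domain multiplication
```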

238
Q

Suppose you are analyzing a noisy time-series ECG dataset. Upon transforming the data to the frequency domain, you notice several peaks located at 10 Hz, 40 Hz, 2 Hz, and 0.2 Hz. If your main goal is to isolate the heart rate signal (assuming that the typical human heart rate lies between 60 and 180 bpm), which filter would be the optimum choice, and what should the associated cutoff frequency(ies) be?
A. Band Stop Filter, 8 Hz to 12 Hz
B. High Pass Filter, 35Hz
C. Band Pass Filter, 0.5Hz to 4Hz
D. Low Pass Filter, 4Hz

A

C

The correct answer is C. This choice is optimal because it covers the frequency range of the typical human heart rate: 60–180 bpm corresponds to 1–3 Hz. A band-pass filter from 0.5 Hz to 4 Hz therefore isolates the heart-rate peak at 2 Hz from the noise peaks at 0.2 Hz, 10 Hz, and 40 Hz, while also leaving a margin for heart rates slightly outside the typical range.
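
A rough Python sketch of this filtering choice (the 250 Hz sampling rate, filter order, and synthetic test signal are assumptions for illustration only):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250                                            # assumed ECG sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
ecg_like = (np.sin(2 * np.pi * 2 * t)               # ~120 bpm heart-rate component
            + 0.5 * np.sin(2 * np.pi * 0.2 * t)     # baseline wander
            + 0.5 * np.sin(2 * np.pi * 10 * t)      # lower-frequency noise peak
            + 0.5 * np.sin(2 * np.pi * 40 * t))     # higher-frequency noise peak

b, a = butter(3, [0.5, 4], btype="bandpass", fs=fs) # choice C: 0.5-4 Hz band-pass
heart_band = filtfilt(b, a, ecg_like)               # mostly the 2 Hz component remains
```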

239
Q

Why does a Butterworth filter introduce a slope (or roll-off) in the frequency response near its cutoff frequency?

A) Because it is designed to eliminate all frequencies above the cutoff frequency abruptly, creating a vertical drop in the frequency response.

B) Because it aims to provide a maximally flat frequency response in the passband, leading to a gradual transition to the stopband.

C) Because the physical constraints of analog components used in its design inherently produce a slope in the frequency response.

D) Because it uses a high-order filter design to compensate for phase shifts, inadvertently creating a slope in the frequency response.

A

B

The Butterworth filter is known for its maximally flat frequency response in the passband, meaning it has no ripples. This design choice results in a gradual transition or slope as it approaches the cutoff frequency, moving towards the stopband. The slope, or roll-off, is a byproduct of attempting to maintain this flatness as closely as possible up to the cutoff frequency, beyond which attenuation begins. This gradual transition helps to reduce distortions within the passband while effectively attenuating unwanted frequencies in the stopband.
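
The Butterworth magnitude response is |H(f)|^2 = 1 / (1 + (f/fc)^(2N)), which is what produces the gradual slope: flat near 0 Hz, about -3 dB at the cutoff, and falling steadily beyond it. A small numeric check (fc = 100 Hz is an illustrative value; it also shows that a higher order N steepens the roll-off):

```python
import numpy as np

fc = 100.0                                   # cutoff frequency (Hz), illustrative
f = np.array([50.0, 100.0, 200.0, 400.0])    # test frequencies (Hz)
for N in (2, 8):
    mag_db = 10 * np.log10(1.0 / (1.0 + (f / fc) ** (2 * N)))
    print(f"N={N}:", np.round(mag_db, 1), "dB at", f, "Hz")
# Both orders sit near -3 dB at the cutoff; past it, the N=8 response drops far
# faster, but neither falls off vertically -- hence the slope near the cutoff.
```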

240
Q

What is the most appropriate method to decrease passband ripple in an FIR filter?

A. Apply zero-padding to enhance frequency resolution.
B. Decrease the filter order to reduce the kernel length.
C. Apply a windowing technique to smooth out the impulse response.
D. Narrow the frequency band to widen the filter’s stopband.

A

C

To reduce passband ripple in an FIR filter, the most effective strategy is utilizing a windowing technique (C). Windowing smooths out the filter’s impulse response, reducing side lobes and effectively diminishing ripple in both the passband and stopband. On the other hand, zero-padding (A) mainly enhances frequency resolution in spectral analysis without directly affecting the ripple. Decreasing the filter order (B) might simplify the filter but can increase ripple due to less precise control over the filter’s frequency response. Narrowing the frequency band (D) does not directly influence passband ripple.
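
A small Python sketch of why windowing helps (tap count, cutoff, and sampling rate are illustrative): the same truncated ideal low-pass taps are compared with and without a Hamming taper, and the rectangular version shows the larger passband ripple.

```python
import numpy as np
from scipy.signal import freqz

fs, fc, M = 1000.0, 100.0, 101                      # sampling rate, cutoff, tap count
n = np.arange(M) - (M - 1) / 2
ideal = 2 * fc / fs * np.sinc(2 * fc / fs * n)      # truncated ideal low-pass (rectangular window)
tapered = ideal * np.hamming(M)                     # same taps, smoothed by a Hamming window

for name, taps in (("rectangular", ideal), ("hamming", tapered)):
    w, h = freqz(taps, worN=2048, fs=fs)
    passband = np.abs(h[w < 0.7 * fc])              # gain well inside the passband
    print(name, "passband ripple:", round(float(passband.max() - passband.min()), 4))
```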

241
Q

Which statement accurately describes the distinction between FIR and IIR filters in the context of digital signal processing?

A) FIR filters are always unstable, while IIR filters are guaranteed to be stable due to their inherent feedback mechanism.

B) FIR filters inherently have a linear phase response, making them ideal for phase-sensitive applications, whereas IIR filters may introduce phase distortions due to their recursive nature.

C) IIR filters cannot achieve a high degree of selectivity in filtering specific frequency components, unlike FIR filters which can be designed for very narrow bandwidths.

D) Spectral leakage is a phenomenon only associated with FIR filters when applied to spectral analysis of signals, while IIR filters are immune to spectral leakage due to their feedback structure.

A

B

FIR (Finite Impulse Response) filters can maintain a linear phase response, which is crucial for applications where preserving the wave shape of the original signal is important, as it ensures all frequency components of the signal are delayed by the same amount when passing through the filter. This property makes FIR filters particularly suitable for phase-sensitive applications. On the other hand, IIR (Infinite Impulse Response) filters, due to their recursive nature, can introduce phase distortion, as they may not delay all frequency components uniformly. This difference highlights the importance of choosing the right filter type based on the specific requirements of a signal processing task.
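
A quick numerical check of the linear-phase claim (the specific filter designs below are illustrative, not from the slides): a symmetric FIR filter has a constant group delay across its passband, while a same-cutoff recursive Butterworth filter does not.

```python
import numpy as np
from scipy.signal import firwin, butter, group_delay

fir = firwin(51, 0.2)                        # symmetric 51-tap FIR, cutoff 0.2 x Nyquist
b, a = butter(4, 0.2)                        # 4th-order IIR Butterworth, same cutoff

w_pass = np.linspace(0.01 * np.pi, 0.15 * np.pi, 50)   # frequencies inside the passband
_, gd_fir = group_delay((fir, [1.0]), w=w_pass)
_, gd_iir = group_delay((b, a), w=w_pass)

print("FIR group-delay spread (samples):", round(float(gd_fir.max() - gd_fir.min()), 6))
print("IIR group-delay spread (samples):", round(float(gd_iir.max() - gd_iir.min()), 3))
# The FIR delay is a constant 25 samples (linear phase); the IIR delay varies with frequency.
```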

242
Q

Spectral leakage occurs during the spectral analysis of signals when:

A) The signal is perfectly windowed with no abrupt changes at the edges, resulting in a highly accurate spectral representation.

B) A signal is windowed, causing energy to spread out from the main frequency component to the sidelobes, thereby reducing spectral resolution.

C) The frequency components of a signal are below 10 Hz, which is considered as “curiosity” 1/f noise and does not contribute to spectral leakage.

D) The signal undergoes a filtering process with an ideal low-pass filter, which eliminates all frequency components above a certain cutoff frequency without affecting the spectral resolution.

A

B

Spectral leakage is a phenomenon that occurs when a signal is windowed, leading to the spreading of energy from the mainlobe to the sidelobes. This effect reduces spectral resolution because energy that should be concentrated in the main frequency band leaks into other frequencies, which is why various windowing techniques are used to minimize the leakage and preserve as much spectral accuracy as possible.

243
Q

Select the statement that accurately reflects the interpretation of a signal’s frequency spectrum:

A) White noise within the range of 10-70 Hz is known for containing valuable information about the signal’s origin.

B) A real signal with strong peaks at 13 Hz, 26 Hz, and 39 Hz signifies fundamental and harmonic frequencies, likely indicating mechanical sources such as a submarine’s propeller.

C) AC noise is indicated by a frequency component at 60 Hz and does not include smaller peaks at its multiples.

D) An ideal antialias filter targets frequencies below 80 Hz to eliminate them from the signal.

A

B

Option B) is the correct answer because it accurately interprets the frequency spectrum of a real signal: strong peaks at 13 Hz, 26 Hz, and 39 Hz indicate a fundamental frequency and its harmonics. Such peaks are characteristic of periodic mechanical sources, such as the blades of a submarine’s propeller, whose rotation generates distinct peaks at these frequencies.

244
Q

Increasing the order (N) of a Butterworth filter results in what?

A) Reduced selectivity
B) Decreased cutoff frequency
C) Steeper transition band
D) Simplified design

A

C

Increasing the order of a Butterworth filter makes its transition from passband to stopband steeper, allowing for a clearer distinction between allowed and blocked frequencies. This improves the filter’s precision in separating desired signals from noise.

245
Q

Choose one explanation regarding spectral leakage which is incorrect:

A) Spectral leakage occurs due to inappropriate window size when sampling from the original signal.
B) Spectral leakage can happen when the correlation between a signal and itself with time lag is too high.
C) Spectral leakage can be reduced by smoothing the abrupt ends of windowed signal.
D) Spectral leakage refers to the problem where energy leaks out from the mainlobe to the sidelobes after DFT.

A

B

A, C, and D are all correct explanations of spectral leakage. The main cause of spectral leakage is an inappropriate window size during sampling, which leads to energy leaking from the mainlobe into the sidelobes after the DFT; one way to reduce it is to smooth the abrupt ends of the windowed signal. B claims that high autocorrelation between a signal and its time-lagged copy causes spectral leakage, which is not the case, so B is the incorrect explanation.

246
Q

Which one is a property of Butterworth filter?
A. Passband is designed to be maximally high
B. Designed to mimic low pass filter for large N
C. Is a high pass filter
D. Works by moving average

A

B

B is the correct answer according to the slides: for large N, the Butterworth response behaves like an ideal low-pass filter. The other options are wrong: the passband is designed to be maximally flat, not “maximally high”; the design discussed here is a low-pass, not a high-pass, filter; and it is a recursive design, not a moving average.

247
Q

Which of the following statements about spectral leakage is incorrect? (Select One)

1. Spectral leakage always occurs when using a window on sensor data.
2. Spectral leakage is caused by any abrupt change at the edge of a window of sensor data.
3. Spectral leakage reduces spectral resolution.
4. The Hamming window helps reduce spectral leakage by smoothing abrupt truncation.

A

1

Spectral leakage occurs when the analysis window truncates the signal so that it does not align with the assumed periodicity. This mismatch causes discontinuities at the window edges, spreading signal energy into neighboring frequency bins and distorting the spectrum. Leakage is therefore not inherent to windowing; it arises only under these conditions and does not always occur when a window is used. Option 1 is the incorrect statement, since spectral leakage is contingent on how the signal aligns with the analysis window.

248
Q

Which of the following statements best describes the relationship between Chebyshev filters, Butterworth filters, and the impact of passband ripple on filter characteristics?

A) Increasing the ripple in the passband of a Chebyshev filter decreases the roll-off rate, making it similar to a Butterworth filter.
B) A Chebyshev filter with 0% ripple in the passband becomes a Butterworth filter, which is known for its maximally flat frequency response.
C) Butterworth filters achieve a faster roll-off than Chebyshev filters by introducing ripple in the passband.
D) Chebyshev filters can only achieve a faster roll-off than Butterworth filters when the passband ripple exceeds 1%.

A

B

A) Increasing the ripple in the passband of a Chebyshev filter decreases the roll-off rate, making it similar to a Butterworth filter: This statement is incorrect because increasing the ripple in a Chebyshev filter’s passband actually increases the roll-off rate, not decreases it.

B) A Chebyshev filter with 0% ripple in the passband becomes a Butterworth filter, which is known for its maximally flat frequency response: This is correct. Butterworth filters are characterized by a maximally flat response in the passband, meaning there is no ripple. When a Chebyshev filter is designed with 0% ripple, it essentially adopts the characteristics of a Butterworth filter, emphasizing a smooth passband without fluctuations.

C) Butterworth filters achieve a faster roll-off than Chebyshev filters by introducing ripple in the passband: This statement is incorrect. It is the Chebyshev filter that trades passband ripple for a faster roll-off; the Butterworth filter has no passband ripple and, at the same order, rolls off more gradually.

D) Chebyshev filters can only achieve a faster roll-off than Butterworth filters when the passband ripple exceeds 1%: This statement is misleading. Any nonzero amount of allowed ripple lets a same-order Chebyshev filter roll off faster than a Butterworth filter; there is no 1% threshold.
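
A short Python comparison illustrating the ripple/roll-off trade-off (order, cutoff, sampling rate, and ripple values are illustrative assumptions): at the same order, allowing more passband ripple in a Chebyshev Type I design buys more attenuation just past the cutoff, while the Butterworth design keeps the passband monotonic and ripple-free.

```python
import numpy as np
from scipy.signal import butter, cheby1, freqz

N, fc, fs = 4, 100.0, 1000.0
designs = {
    "Butterworth (no ripple)":    butter(N, fc, fs=fs),
    "Chebyshev I, 0.1 dB ripple": cheby1(N, 0.1, fc, fs=fs),
    "Chebyshev I, 3 dB ripple":   cheby1(N, 3.0, fc, fs=fs),
}
for name, (b, a) in designs.items():
    w, h = freqz(b, a, worN=[2 * fc], fs=fs)          # gain one octave above the cutoff
    print(name, "->", round(float(20 * np.log10(abs(h[0]))), 1), "dB at 200 Hz")
# Same order, same cutoff: the more passband ripple is allowed, the stronger the
# attenuation just past the cutoff (i.e. the faster the roll-off).
```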

249
Q

Question: For the discrete signal x[n] with values {2,1,3}, compute the autocorrelation of the signal rₓ[1], which represents the autocorrelation at a lag (or shift) of 1.

A) 14
B) 8
C) 5
D) 6

A

C

Autocorrelation quantifies the similarity between a signal and its own shifted version at various lags, shedding light on patterns inherent to the signal, such as periodicity. For a discrete signal x[n], the autocorrelation at a specific lag k, symbolized as rₓ[k], is calculated by summing the product of the signal values with their values at the k-lagged positions. For a lag of 1, this calculation is as follows:

rₓ[1] = x[0]x[1] + x[1]x[2]

Applying this to x[n] = {2,1,3}:

rₓ[1] = (2 * 1) + (1 * 3) = 2 + 3 = 5
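
The same result with NumPy, where np.correlate in “full” mode returns the autocorrelation at every lag:

```python
import numpy as np

x = np.array([2, 1, 3])
r = np.correlate(x, x, mode="full")    # autocorrelation at lags -2, -1, 0, 1, 2
zero_lag = len(x) - 1                  # index of lag 0 in the "full" output
print(r)                               # [ 6  5 14  5  6]
print("r_x[1] =", r[zero_lag + 1])     # 2*1 + 1*3 = 5
```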

250
Q

In digital filter design, what describes the impulse response function h(n) in the context of linear digital filters?

a) The function that represents the filter’s frequency response to an input signal.
b) The sequence of numbers generated by the filter as an output when the input is an impulse signal.
c) The process of converting the time-domain signal into its frequency-domain representation.
d) The formula used to calculate the filter’s cut-off frequency in low-pass filters.

A

b

a) Incorrect. The impulse response function describes the time-domain response of a filter to an impulse input, not the frequency response.
b) Correct.
c) Incorrect. The process of converting a time-domain signal into its frequency-domain representation is known as Fourier Transform, not the impulse response function.
d) Incorrect. The cut-off frequency in low-pass filters is related to the filter’s design and specifications, not directly to the impulse response function h(n).
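
A minimal check of option (b) in Python (the 3-tap moving-average filter is an illustrative choice): feeding a unit impulse through a linear digital filter returns its impulse response h(n) as the output sequence.

```python
import numpy as np
from scipy.signal import lfilter

b = np.array([1 / 3, 1 / 3, 1 / 3])    # 3-point moving-average (FIR) filter
impulse = np.zeros(8)
impulse[0] = 1.0                       # unit impulse delta[n]

h = lfilter(b, [1.0], impulse)         # the output sequence is the impulse response h(n)
print(h)                               # [0.333 0.333 0.333 0. 0. 0. 0. 0.]
```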

251
Q

What is spectral leakage, and how can it be reduced in spectral analysis?

a) Spectral leakage refers to the leakage of signal energy into adjacent frequency bins, causing distortion in spectral analysis. It can be reduced by applying window functions to taper the signal, such as the Hamming or Hanning window.

b) Spectral leakage refers to the loss of signal power during spectral analysis. It can be reduced by increasing the length of the signal to improve frequency resolution.

c) Spectral leakage is the occurrence of aliasing in spectral analysis. It can be reduced by decreasing the sampling rate of the signal.

d) Spectral leakage is the distortion of signal shape due to impedance mismatch. It can be reduced by increasing the number of frequency bins used in spectral analysis.

A

a

a) is correct because it accurately defines spectral leakage as the leakage of signal energy into adjacent frequency bins, which distorts the spectral analysis, and it correctly identifies window functions such as the Hamming or Hanning window as the standard way to reduce it.

252
Q

What does the term ‘ripple’ refer to in the context of filter design?
A) A constant high-frequency oscillation that filters aim to remove.
B) The variation in amplitude within the passband or stopband of a filter.
C) The desired frequency range that a filter allows to pass through.
D) The length of the filter’s impulse response.

A

B

Ripple in filter design refers to the variations or fluctuations in amplitude observed within the passband or stopband of a filter. It is an important factor to consider in the filter design process, as excessive ripple may adversely affect the filter’s performance in certain applications.

253
Q

Question: What is the correct order of words to fill in the blank spaces?

Windowing can introduce abrupt transitions at the window edges, causing a discrepancy between the ___ and ___ of the window. This mismatch results in the ___ of energy across different frequency components, characteristic of spectral leakage.

A) start - end - spreading

B) middle - end - enhancement

C) start - middle - reduction

D) start - end - concentration

A

A

Windowing in signal processing multiplies a signal with a window function, which can cause abrupt transitions at the window’s start and end. This discrepancy leads to spectral leakage, where energy spreads across different frequency components rather than remaining concentrated. This spreading effect is due to the abrupt changes, causing the energy to disperse into frequencies outside the main component, altering the signal’s spectrum. Thus, “start” and “end” relate to window edges, and “spreading” describes the resulting dispersion of energy.

254
Q

Why is a low-pass filter in the time domain preferred for real-time applications over a frequency domain filter?
a) Time domain filters are simpler to implement
b) Frequency domain filters introduce significant delay
c) Time domain filters offer better noise reduction
d) All of the above

A

d

Time domain filters are generally simpler to implement and require fewer computational resources than frequency domain filters.
Frequency domain filters often introduce significant processing delay because they must buffer a block of samples and compute Fourier transforms and inverse transforms.
Time domain filters are effective at reducing noise and unwanted signal components in real-time data streams (e.g., they can adapt their parameters in real time based on the characteristics of the input signal), so all three statements hold and d) is the answer.
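
As a concrete (illustrative) sketch of the real-time argument, a one-pole time-domain low-pass such as an exponential moving average produces each output sample as soon as the corresponding input arrives, with one multiply-add per sample, whereas a frequency-domain filter must first buffer a whole block before the FFT can run.

```python
def ema_lowpass(samples, alpha=0.1):
    """One-pole low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]), one update per sample."""
    y = 0.0
    for x in samples:          # could just as well be a live sensor stream
        y += alpha * (x - y)
        yield y                # each output is available as soon as its input arrives

smoothed = list(ema_lowpass([5, 5, 50, 5, 5]))   # the brief spike to 50 is heavily damped
print(smoothed)
```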
