M06 - Multimodal Learning, Interaction and Communication Flashcards
What is machine learning?
“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”
What is the definition of machine learning in terms of E, T and P?
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Define E, T and P for a robot learning project.
Task: object recognition with color and depth data
Experience: the iCub multisensor dataset
Performance measure: accuracy, precision, recall, F1 score
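A minimal sketch of how those performance measures could be computed, assuming scikit-learn and made-up label arrays as placeholders:

```python
# Hypothetical predictions vs. ground truth for an object-recognition task (T),
# evaluated with the performance measures (P) listed above.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 1, 0, 2]   # placeholder ground-truth object classes
y_pred = [0, 1, 1, 1, 0, 2]   # placeholder model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))
```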
What is a modality?
Sensory data that are associated with different aspects of the observed phenomena
Why do we need multimodal learning/integration?
- To form a robust sensory representation
- To leverage complementary characteristics of modalities
How do you count modalities in a robot?
The number of types of data = The number of modalities
What are the 5 challenges in Multimodal Machine Learning?
- Representation: how to represent multimodal data [pixels, signals, symbols, etc.]
- Translation: how to map data from one modality to another
- Alignment: how to identify direct relations between modalities
- Fusion: how to join information from two or more modalities [data level, decision level, intermediate]
- Co-learning: how to transfer knowledge between modalities
What are the steps in a machine learning pipeline?
- Preprocessing (dimensionality reduction, feature extraction, selection, scaling, sampling, denoising)
- Learning (Initializing, Optimizing, Cross-Validation)
- Evaluation (of the new model; see the pipeline sketch below)
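One way such a pipeline might look in code, sketched with scikit-learn on synthetic placeholder data (the dataset, the PCA size and the SVM model are illustrative assumptions, not part of the course material):

```python
# A minimal sketch of the preprocessing -> learning -> evaluation pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report

X = np.random.rand(200, 64)           # placeholder feature matrix
y = np.random.randint(0, 3, 200)      # placeholder object labels

pipeline = Pipeline([
    ("scale", StandardScaler()),      # preprocessing: feature scaling
    ("reduce", PCA(n_components=16)), # preprocessing: dimensionality reduction
    ("clf", SVC()),                   # learning: the model to optimize
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print("CV accuracy:", cross_val_score(pipeline, X_train, y_train, cv=5).mean())  # cross-validation
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))                    # evaluation
```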
What is the problem between color and depth data?
There is a huge semantic gap between the raw color and depth data matrices and the semantic concepts they represent
How can we extract representations?
- hand-crafted features
- automatic feature learning
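For illustration, a hand-crafted feature could be as simple as a color histogram; the sketch below (NumPy, with a random placeholder image) shows this, while automatic feature learning would instead use the activations of a trained network as the representation:

```python
# Hand-crafted feature example: a per-channel color histogram of a (hypothetical) RGB image.
import numpy as np

def color_histogram(rgb_image, bins=8):
    """Concatenate per-channel intensity histograms into one feature vector."""
    feats = [np.histogram(rgb_image[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    return np.concatenate(feats).astype(float)

image = np.random.randint(0, 256, (120, 160, 3))  # placeholder image
print(color_histogram(image).shape)               # (24,)
```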
Which feature extraction approach usually gives better results?
Automatic feature learning usually finds better solutions than hand-designed ones
What are desired specifications of the representations?
- Similar representations should indicate similar concepts [if you visualize the representation space, different carrots should be close to each other but far from cars]
- Representations should be robust [the extracted representations should be robust to noise]
- We should know how to handle missing data in one modality
What is unimodal learning?
Learning from a single modality, e.g., predicting a discrete probability distribution over objects from one type of sensory data
What are the problems with unimodal learning?
External reasons:
- noise in the environment
- miscalibrated sensors
Model-related reasons:
- wrong model selection
- non-regularized weights
- using raw data as input
What are the goals of multimodal learning/integration?
- To form a robust sensory representation
- To leverage complementary characteristics of modalities
What are the main characteristics of deep multimodal learning?
- both modality-wise representations (features) and shared (fused) representations are learned from data
- requires little or no preprocessing of input data (end-to-end training)
- deeper, complex networks typically require large amounts of training data (if trained from scratch)
What are the main characteristics of conventional multimodal learning?
- features are manually designed and require prior knowledge about the underlying problem and data
- some techniques, like early fusion, may be sensitive to data preprocessing
- may not require as much training data
What are the levels of multimodal fusion techniques?
- data level
- decision level
- intermediate fusion
What happens on the data level of multimodal techniques?
Fuse the inputs before performing machine learning
What happens on the decision level of multimodal techniques?
Fuse the decisions of each model, i.e. the outputs of the machine learning algorithms
What happens on the intermediate fusion level of multimodal techniques?
Fuse the representations at different levels of the model (you need to understand which part of network encodes what parts of your data) i.e. intermediate layers of the convolutional neural networks
What are the assumptions of data level fusion?
Conditional independence among the modalities (e.g., depth and color)
What does data level fusion do?
- concatenate raw inputs
- reduce dimensions of input
- hand-crafted features
- the sensors observe the same phenomenon, but each captures a different type of data
What is the output of data level fusion?
Decide on the output by taking the majority vote or the mean, or by using another algorithm (see the data-level fusion sketch below)
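A minimal sketch of data-level fusion, assuming precomputed color and depth feature vectors (placeholders here) that are concatenated before a single scikit-learn classifier is trained:

```python
# Data-level (early) fusion: concatenate the inputs of both modalities
# and train a single model on the fused representation.
import numpy as np
from sklearn.linear_model import LogisticRegression

color_features = np.random.rand(100, 32)   # placeholder color features
depth_features = np.random.rand(100, 16)   # placeholder depth features
labels = np.random.randint(0, 3, 100)      # placeholder object labels

fused = np.concatenate([color_features, depth_features], axis=1)  # fuse before learning
model = LogisticRegression(max_iter=1000).fit(fused, labels)
print(model.predict(fused[:5]))
```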
What does decision level fusion do?
- employs different or the same model for different modalities
- collect decisions from separate models trained on different modalities
- fuse them by averaging, summing, taking the maximum, or meta-learning
When do you use decision level fusion?
- the modalities are uncorrelated
- the modalities have different dimensions
- exploit a different machine learning model for each modality
(CNNs for image, SVM for depth, MLP for touch, etc.; see the sketch below)
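A minimal sketch of decision-level fusion on the same kind of placeholder data, with one model per modality and averaged class probabilities (any of the fusion rules above could be substituted):

```python
# Decision-level (late) fusion: train a separate model per modality and
# fuse their output probabilities, here by averaging.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

color_features = np.random.rand(100, 32)   # placeholder color features
depth_features = np.random.rand(100, 16)   # placeholder depth features
labels = np.random.randint(0, 3, 100)      # placeholder object labels

color_model = SVC(probability=True).fit(color_features, labels)
depth_model = MLPClassifier(max_iter=500).fit(depth_features, labels)

# Average the per-modality class probabilities, then pick the most likely class.
probs = (color_model.predict_proba(color_features) +
         depth_model.predict_proba(depth_features)) / 2
print(probs.argmax(axis=1)[:5])
```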
What are examples of fusion methods?
- multimodal deep learning for robust object recognition
- deep learning-based image segmentation on multimodal medical imaging
- multimodal representation models for prediction and control from information
What is intermediate fusion?
- non-hand-crafted features
- fuse similar modalities together
- multi-modal architecture
When is it best to use a CNN?
A CNN is best when you plan to fuse the outputs of its feature layers (see the intermediate-fusion sketch below)
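A minimal intermediate-fusion sketch in PyTorch (the architecture and layer sizes are illustrative assumptions): each modality gets its own convolutional branch, and the intermediate representations are concatenated inside the network before classification:

```python
# Intermediate fusion: fuse modality-wise representations at an inner layer of the network.
import torch
import torch.nn as nn

class IntermediateFusionNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # one small convolutional branch per modality
        self.color_branch = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # classifier operating on the fused intermediate representation
        self.classifier = nn.Sequential(
            nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, num_classes))

    def forward(self, color, depth):
        fused = torch.cat([self.color_branch(color), self.depth_branch(depth)], dim=1)
        return self.classifier(fused)

net = IntermediateFusionNet()
logits = net(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64))  # placeholder batches
print(logits.shape)  # torch.Size([4, 3])
```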
What are the implications of multimodal learning results?
- We form a robust sensory representation
- We leverage complementary characteristics of modalities