Jan 19, 2023

Multimodal Machine Learning

Multimodal enrichment and fusion for ML models

This is an excellent paper on visual synthesis and an important contribution to multimodal ML.

Multimodal data, combining images, audio, video, and various text structures, captures contextual relationships and can (potentially) enrich machine learning models. The most common approach to representing multiple types of semantics is a vector in a shared latent space. Such embeddings are standard in #transformers like #BERT, where self-attention in particular produced radical improvements on NLP tasks. In machine vision, the meshed-memory #transformer (M^2) has been applied to image captioning. In the generative direction, variational #autoencoders (#VAE) and #GAN architectures work well for producing imagery from text inputs. Sequential embedding methods are a foundation for combining language models with image-derived information.
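As a rough illustration of the shared latent-space idea, the sketch below projects a text embedding (e.g., a BERT [CLS] vector) and a pooled image feature into a common latent space and fuses them. All dimensions, layer choices, and names here are illustrative assumptions, not the method of any specific paper.

```python
# Hypothetical late-fusion sketch (assumed dimensions and layers, not from the cited paper).
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, latent_dim=512):
        super().__init__()
        # Modality-specific projections into a shared latent space
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.image_proj = nn.Linear(image_dim, latent_dim)
        # Simple fusion head over the concatenated latent vectors
        self.fusion = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.ReLU(),
        )

    def forward(self, text_emb, image_emb):
        t = self.text_proj(text_emb)    # (batch, latent_dim)
        v = self.image_proj(image_emb)  # (batch, latent_dim)
        return self.fusion(torch.cat([t, v], dim=-1))

# Usage: text_emb might be a BERT [CLS] vector, image_emb a pooled CNN/ViT feature.
fused = LateFusion()(torch.randn(4, 768), torch.randn(4, 2048))
print(fused.shape)  # torch.Size([4, 512])
```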

A promising generalization is the #Data2Vec process, which uses the same learning algorithm for text, speech, and images. Co-attention has been applied in the #VilBert and #Flamingo models with promising results. Handling further modalities, such as video combined with semi-structured or structured data lakes, could spark another wave of progress. A critical question is how to coordinate the various modalities most efficiently; in ML research this is called data fusion, and it has been on the drawing board for some time. Open issues include the multimodal sequence-to-sequence mapping process. More recently, contextualized latent representations paired with a masking process proved to be a robust solution in the #MoCo-v3 system.
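To make the co-attention idea concrete, here is a minimal sketch in which each modality's tokens attend over the other's, loosely in the spirit of ViLBERT-style co-attention. The class, layer sizes, and token counts are assumptions for illustration, not the architecture of any cited model.

```python
# Hypothetical co-attention block: text attends to image tokens and vice versa.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # Text queries attend over image keys/values, and vice versa
        t_attended, _ = self.text_to_image(text_tokens, image_tokens, image_tokens)
        v_attended, _ = self.image_to_text(image_tokens, text_tokens, text_tokens)
        # Residual connections followed by layer normalization
        return self.norm_t(text_tokens + t_attended), self.norm_v(image_tokens + v_attended)

# Usage with dummy sequences: 16 text tokens and 49 image patches per example.
text, image = torch.randn(2, 16, 512), torch.randn(2, 49, 512)
t_out, v_out = CoAttentionBlock()(text, image)
print(t_out.shape, v_out.shape)  # torch.Size([2, 16, 512]) torch.Size([2, 49, 512])
```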

#ai #ML #selfattention #meshedmemory #MOCO #coattention #datalake #sequencetosequence #machinelearning #algorithms

https://deepai.org/publication/nuwa-visual-synthesis-pre-training-for-neural-visual-world-creation

