With the wide deployments of heterogeneous networks, huge amounts of data with characteristics of high volume, high variety, high velocity, and high veracity are generated. These data, referred to multimodal big data, contain abundant intermodality and cross-modality information and pose vast challenges on traditional data fusion methods. In this review, we present some pioneering deep learning models to fuse these multimodal big data. With the increasing exploration of the multimodal big data, there are still some challenges to be addressed. Thus, this review presents a survey on deep learning for multimodal data fusion to provide readers, regardless of their original community, with the fundamentals of multimodal deep learning fusion method and to motivate new multimodal data fusion techniques of deep learning. Specifically, representative architectures that are widely used are summarized as fundamental to the understanding of multimodal deep learning. Then the current pioneering multimodal data fusion deep learning models are summarized. Finally, some challenges and future topics of multimodal data fusion deep learning models are described.
The success of CNNs is accompanied by deep models and heavy storage costs. For compressing CNNs, we propose an efficient and robust pruning approach, cross-entropy pruning (CEP). Given a trained CNN model, connections were divided into groups in a group-wise way according to their corresponding output neurons. All connections with their cross-entropy errors below a grouping threshold were then removed. A sparse model was obtained and the number of parameters in the baseline model significantly reduced. This letter also presents a highest cross-entropy pruning (HCEP) method that keeps a small portion of weights with the highest CEP. This method further improves the accuracy of CEP. To validate CEP, we conducted the experiments on low redundant networks that are hard to compress. For the MNIST data set, CEP achieves an 0.08% accuracy drop required by LeNet-5 benchmark with only 16% of original parameters. Our proposed CEP also reduces approximately 75% of the storage cost of AlexNet on the ILSVRC 2012 data set, increasing the top-1 errorby only 0.4% and top-5 error by only 0.2%. Compared with three existing methods on LeNet-5, our proposed CEP and HCEP perform significantly better than the existing methods in terms of the accuracy and stability. Some computer vision tasks on CNNs such as object detection and style transfer can be computed in a high-performance way using our CEP and HCEP strategies.