State Key Laboratory of Media Convergence and Communication
Facility · Beijing, China
Research output, citation impact, and the most-cited recent papers from State Key Laboratory of Media Convergence and Communication. Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from State Key Laboratory of Media Convergence and Communication
Digital platforms such as search engines and social media have become major gateways to news. Algorithms are used to deliver news that is consistent with consumers’ preferences, and individuals share news through their online social networks. This networked environment has resulted in growing uncertainty about online information, which has had an impact on news industries globally. While it is well established that trust in news found on social media or via search engines is lower than trust in traditional news media, there has been less discussion about the impact of social media use on perceptions of trust in the news media more broadly. This study fills that gap by examining the influence of social media as news sources and pathways to news on perceptions of the level of news trust at a country level. A secondary data analysis of a 26-country survey in 2016 and 2019 was conducted. The analysis revealed that an increase in social media use for accessing news resulted in a decline in trust in news media generally across the globe. Higher levels of general mistrust in news were related to increased sharing of news. This paper argues that the use of social media for news is closely linked to the increase in news mistrust, which is likely to continue to rise as the number of people using social media to access news continues to grow.
Curriculum learning has begun to thrive in the speech enhancement area; it decouples the original spectrum estimation task into multiple easier sub-tasks to achieve better performance. Motivated by that, we propose a dual-branch attention-in-attention transformer dubbed DB-AIAT to handle both coarse- and fine-grained regions of the spectrum in parallel. From a complementary perspective, a magnitude masking branch is proposed to coarsely estimate the overall magnitude spectrum, and simultaneously a complex refining branch is designed to compensate for the missing spectral details and implicitly derive phase information. Within each branch, we propose a novel attention-in-attention transformer-based module to replace the conventional RNNs and temporal convolutional networks for temporal sequence modeling. Specifically, the proposed attention-in-attention transformer consists of adaptive temporal-frequency attention transformer blocks and an adaptive hierarchical attention module, aiming to capture long-term temporal-frequency dependencies and further aggregate global hierarchical contextual information. Experimental results on Voice Bank + DEMAND demonstrate that DB-AIAT yields state-of-the-art performance (e.g., 3.31 PESQ, 95.6% STOI and 10.79 dB SSNR) over previous advanced systems with a relatively small model size (2.81M).
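The dual-branch idea can be illustrated for a single time-frequency bin. This is a minimal sketch, not the DB-AIAT implementation: it assumes the coarse estimate is the masked noisy magnitude paired with the noisy phase, and that the complex branch contributes an additive residual. The function name `combine_branches` and the numeric values are hypothetical.

```python
import cmath

def combine_branches(noisy_bin, mask, residual):
    """Combine a magnitude mask and a complex residual for one T-F bin.

    Assumption: the coarse estimate keeps the noisy phase and scales the
    noisy magnitude by the mask; the complex residual is then added.
    """
    magnitude = abs(noisy_bin)
    phase = cmath.phase(noisy_bin)
    coarse = cmath.rect(mask * magnitude, phase)  # masked magnitude, noisy phase
    return coarse + residual                      # residual restores detail and adjusts phase

# Hypothetical values for one bin of a noisy spectrum.
noisy = complex(3.0, 4.0)
enhanced = combine_branches(noisy, mask=0.5, residual=complex(0.1, -0.2))
print(enhanced)
```

Because the residual is complex, the phase of the final estimate can differ from the noisy phase even though no phase is predicted explicitly.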
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio is currently widespread, highly deceptive, and versatile in type, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source, large-scale collection comprising over 1 million audio samples in both English and Chinese, focused on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original sharpness-aware minimization (SAM), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that training ADD models with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average equal error rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.
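For readers unfamiliar with SAM, one update step can be sketched as follows. This shows the original SAM rule only (perturb the weights along the normalized gradient, then descend using the gradient at the perturbed point); it does not implement the CSAM strategy, whose details are not given here. The toy objective, `sam_step`, `quad_grad`, and the hyperparameter values are assumptions for illustration.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One step of the original SAM rule on a list of weights."""
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0   # avoid division by zero
    # Ascent step: move to the approximate worst case within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def quad_grad(w):
    """Gradient of the toy objective f(w) = sum(w_i^2)."""
    return [2.0 * x for x in w]

w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w, quad_grad)
print(w)
```

With a fixed `rho`, the iterates settle into a small neighborhood of the minimum rather than converging exactly, which is expected behavior for this rule.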
Detecting violence in video is a challenging task due to its complex scenarios and great intra-class variability. Most previous works specialize in the analysis of appearance or motion information, ignoring the co-occurrence of some audio and visual events. Physical conflicts such as abuse and fighting are usually accompanied by screaming, while crowd violence such as riots and wars is generally related to gunshots and explosions. Therefore, we propose a novel audio-guided multimodal violence detection framework. First, deep neural networks are used to extract visual and audio features, respectively. Then, a Cross-Modal Awareness Local-Arousal (CMA-LA) network is proposed for cross-modal interaction, which implements audio-to-visual feature enhancement over the temporal dimension. The enhanced features are then fed into a multilayer perceptron (MLP) to capture high-level semantics, followed by a temporal convolution layer to obtain high-confidence violence scores. To verify the effectiveness of the proposed method, we conduct experiments on a large public violent video dataset, i.e., XD-Violence. Experimental results demonstrate that our model outperforms several methods and achieves new state-of-the-art performance.
In this paper, we undertake a stakeholder analysis of the Australian Competition and Consumer Commission’s Digital Platforms Inquiry to understand the nature and influence of different forms of public input. Our findings show that nation-state regulation of digital platforms is now very much on the policy agenda worldwide, with a focus upon the competition policy dimensions of platform regulation. The second key finding is that the regulatory activism of the ACCC has ensured that the Inquiry and its findings have had maximum public impact. Finally, we argue that the key dynamic shaping the Inquiry was the competing demands of the traditional news media publishers and digital platforms, and that civil society input was relatively limited and secondary to the final recommendations.
The large-scale dynamic vehicle routing problem (LSDVRP) shows broad application prospects with the rapid growth of online logistics, yet few approaches have been developed to address LSDVRPs. The difficulty in solving LSDVRPs lies in the need for quick response and high adaptability to numerous newly appearing customers. To overcome this difficulty, in this paper, we propose a responsive ant colony optimization algorithm, termed RACO, for efficiently addressing LSDVRPs. In the proposed RACO, a pheromone diversity enhancing method is suggested to generate diverse pheromone matrices for quickly responding to newly appearing customer requests. A pheromone ensemble technique is further designed to produce a high-quality initial population that adapts well to the new customer requests by making use of the diverse pheromone matrices. Empirical results on a set of 12 LSDVRP test instances demonstrate the effectiveness of the suggested pheromone diversity enhancing method in quickly responding to newly appearing customer requests. Moreover, we investigate the computational cost and the traveling cost obtained by the proposed RACO to evaluate its responsiveness and adaptability, respectively. Comparison with four state-of-the-art approaches to DVRPs validates the superiority of the proposed RACO in addressing LSDVRPs in terms of responsiveness and adaptability.
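As background for the pheromone matrices mentioned above, the standard ant colony optimization rules can be sketched as follows. This is textbook ACO, not RACO: the diversity enhancing method and the ensemble technique are not reproduced. The function names, parameter values, and the three-node instance are hypothetical.

```python
def transition_probs(current, unvisited, pheromone, dist, alpha=1.0, beta=2.0):
    """Probability of moving from `current` to each unvisited customer.

    Standard ACO rule: weight = pheromone^alpha * (1 / distance)^beta.
    """
    weights = {
        j: (pheromone[current][j] ** alpha) * ((1.0 / dist[current][j]) ** beta)
        for j in unvisited
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def evaporate_and_deposit(pheromone, tour, tour_length, rho=0.1, q=1.0):
    """Evaporate all trails, then reinforce the edges of one tour in place."""
    n = len(pheromone)
    for i in range(n):
        for j in range(n):
            pheromone[i][j] *= (1.0 - rho)
    for a, b in zip(tour, tour[1:]):
        pheromone[a][b] += q / tour_length
        pheromone[b][a] += q / tour_length

# Hypothetical instance: node 0 is the depot, nodes 1 and 2 are customers.
dist = [[0.0, 1.0, 2.0], [1.0, 0.0, 2.0], [2.0, 2.0, 0.0]]
pheromone = [[1.0] * 3 for _ in range(3)]
probs = transition_probs(0, {1, 2}, pheromone, dist)
evaporate_and_deposit(pheromone, tour=[0, 1, 2, 0], tour_length=5.0)
print(probs, pheromone[0][1])
```

In a dynamic setting, a newly arrived customer adds a row and a column to both matrices, which is why the quality of the initial pheromone values matters for response time.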
Partially spoofed audio detection is a challenging task because the authenticity of audio must be determined accurately at the frame level. To address this issue, we propose a fine-grained partially spoofed audio detection method, namely Temporal Deepfake Location (TDL), which can effectively capture both feature and location information. Specifically, our approach involves two novel parts: an embedding similarity module and a temporal convolution operation. To better discriminate between real and fake features, the embedding similarity module is designed to generate an embedding space that separates real frames from fake frames. To concentrate effectively on position information, the temporal convolution operation is proposed to calculate frame-specific similarities among neighboring frames and dynamically select informative neighbors to convolve. Extensive experiments show that our method outperforms baseline models on the ASVspoof2019 Partial Spoof dataset and demonstrates superior performance even in the cross-dataset scenario.
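The notion of frame-specific similarity among neighboring frames can be sketched as below. This is an illustration only: cosine similarity and a fixed neighborhood radius are assumptions, and the helper names are hypothetical; the selection and convolution steps of TDL are not reproduced.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def neighbor_similarities(frames, radius=1):
    """For each frame, map neighbor index -> similarity within `radius`."""
    result = []
    for i, frame in enumerate(frames):
        lo, hi = max(0, i - radius), min(len(frames), i + radius + 1)
        result.append({j: cosine(frame, frames[j]) for j in range(lo, hi) if j != i})
    return result

# Hypothetical 2-D frame embeddings: two similar frames, then a different one.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sims = neighbor_similarities(frames)
print(sims[1])
```

A sharp drop in similarity between adjacent frames, as between the second and third frame here, is the kind of cue that can mark a boundary between real and spoofed segments.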
In the media industry, the demand for SDR-to-HDRTV up-conversion arises when users possess HDR-WCG (high dynamic range-wide color gamut) TVs while most off-the-shelf footage is still in SDR (standard dynamic range). The research community has started tackling this low-level vision task with learning-based approaches. When applied to real SDR, however, current methods tend to produce dim and desaturated results, making almost no improvement to the viewing experience. Unlike other network-oriented methods, we attribute this deficiency to the training set (HDR-SDR pairs). Consequently, we propose a new HDRTV dataset (dubbed HDRTV4K) and new HDR-to-SDR degradation models. These are then used to train a luminance-segmented network (LSN) consisting of a global mapping trunk and two Transformer branches for the bright and dark luminance ranges. We also update the assessment criteria with tailored metrics and a subjective experiment. Finally, ablation studies are conducted to prove the effectiveness. Our work is available at: https://github.com/AndreGuo/HDRTVDM
Objectives: Previous studies revealed a positive association between neuroticism and depression. This study further extended the previous findings by exploring the psychological processes underlying this association among Chinese postgraduates. Guided by theoretical models and empirical research, we proposed a multiple mediation and moderated mediation model to investigate the roles of dispositional mindfulness and cognitive reappraisal in the relationship between neuroticism and depression. Methods: Using the NEO Five-Factor Inventory, Beck Depression Inventory, Mindfulness Attention Awareness Scale, and Emotion Regulation Questionnaire, 1103 first-year postgraduates at a comprehensive university in China were surveyed. Path analysis was adopted to test the models. Results: The results showed that dispositional mindfulness mediated the association between neuroticism and depression. Further, this mediating effect was moderated by cognitive reappraisal, with this effect being stronger in individuals with low engagement in cognitive reappraisal. Conclusion: The results support interrelations among neuroticism, depression, dispositional mindfulness, and cognitive reappraisal as moderated mediation rather than multiple mediation. The results enhance our understanding of psychological mechanisms between neuroticism and depression and provide suggestions for interventions to prevent or reduce depression in highly neurotic postgraduates.
Singing voice synthesis and singing voice conversion have significantly advanced, revolutionizing musical experiences. However, the rise of "Deepfake Songs" generated by these technologies raises concerns about authenticity. Unlike Audio DeepFake Detection (ADD), the field of song deepfake detection lacks specialized datasets or methods for song authenticity verification. In this paper, we first construct a Chinese Fake Song Detection (FSD) dataset to investigate the field of song deepfake detection. The fake songs in the FSD dataset are generated by five state-of-the-art singing voice synthesis and singing voice conversion methods. Our initial experiments on FSD revealed the ineffectiveness of existing speech-trained ADD models for the task of song deepfake detection. Thus, we employ the FSD dataset for the training of ADD models. We subsequently evaluate these models under two scenarios: one with the original songs and another with separated vocal tracks. Experimental results show that song-trained ADD models exhibit a 38.58% reduction in average equal error rate compared to speech-trained ADD models on the FSD test set.
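The equal error rate used above is the operating point where the false acceptance rate (fake accepted as real) equals the false rejection rate (real rejected as fake). A minimal sketch of one way to estimate it from detector scores follows; the function name and the score values are hypothetical, and higher scores are assumed to mean "more likely genuine".

```python
def equal_error_rate(genuine_scores, fake_scores):
    """Estimate EER by sweeping every observed score as a threshold."""
    thresholds = sorted(set(genuine_scores) | set(fake_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # A sample is accepted as genuine when its score is >= t.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Hypothetical detector scores.
genuine = [0.9, 0.8, 0.7, 0.3]
fake = [0.1, 0.2, 0.4, 0.6]
print(equal_error_rate(genuine, fake))
```

On small score sets the two error rates may never cross exactly, so the sketch reports their average at the threshold where they are closest.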
Owing to the lack of adequate paired noisy-clean speech corpora in many real scenarios, non-parallel training is a promising task for DNN-based speech enhancement methods. However, because of the severe mismatch between input and target speech, many previous studies focus only on magnitude spectrum estimation and leave the phase unaltered, resulting in degraded speech quality under low signal-to-noise ratio conditions. To tackle this problem, we decouple the difficult target w.r.t. original spectrum optimization into spectral magnitude and phase, and a novel Cycle-in-Cycle generative adversarial network (dubbed CinCGAN) is proposed to jointly estimate the spectral magnitude and phase information stage by stage under unpaired data. In the first stage, we pretrain a magnitude CycleGAN to coarsely estimate the spectral magnitude of clean speech. In the second stage, we combine the pretrained CycleGAN with a complex-valued CycleGAN in a cycle-in-cycle structure to simultaneously recover phase information and refine the overall spectrum. Experimental results demonstrate that the proposed approach significantly outperforms previous baselines under non-parallel training. The evaluation on training the models with standard paired data also shows that CinCGAN achieves remarkable performance, especially in reducing background noise and speech distortion.
Deep learning has significantly advanced the object detection field. However, tiny object detection (TOD) remains a challenging problem. We provide a new analysis method to examine the TOD challenge through occlusion-based attribution analysis in the frequency domain. We observe that tiny objects become less distinct after feature encoding and can benefit from the removal of high-frequency information. In this paper, we propose a novel approach named Spectral Enhancement for Tiny object detection (SET), which amplifies the frequency signatures of tiny objects in a heterogeneous architecture. SET includes two modules. The Hierarchical Background Smoothing (HBS) module suppresses high-frequency noise in the background through adaptive smoothing operations. The Adversarial Perturbation Injection (API) module leverages adversarial perturbations to increase feature saliency in critical regions and prompt the refinement of object features during training. Extensive experiments on four datasets demonstrate the effectiveness of our method. In particular, SET improves on the prior art RFLA by 3.2% AP on the AI-TOD dataset.
High dynamic range (HDR) imaging is the task of recovering an HDR image from one or multiple input low dynamic range (LDR) images. In this paper, we present the Gamma-enhanced Spatial Attention Network (GSANet), a novel framework for reconstructing HDR images. The problem poses two difficult challenges: how to handle overexposed and underexposed regions, and how to balance the trade-off between performance and complexity. To address the former, after applying gamma correction to the LDR images, we adopt a spatial attention module to adaptively select the most appropriate regions of the differently exposed LDR images for fusion. For the latter, we propose an efficient channel attention module, which involves only a handful of parameters while bringing a clear performance gain. Experimental results show that the proposed method achieves better visual quality on the HDR dataset. The code will be available at: https://github.com/fancyicookie/GSANet
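The gamma correction step can be sketched for a single pixel. This follows a convention common in multi-exposure HDR work, mapping an LDR value to the linear domain as `pixel ** gamma / exposure_time`; the value `gamma = 2.2`, the use of exposure time, and the function name are assumptions and are not confirmed for GSANet.

```python
def ldr_to_linear(pixel, exposure_time, gamma=2.2):
    """Map a normalized LDR pixel in [0, 1] to the linear HDR domain.

    Assumed convention: undo display gamma, then divide by exposure time
    so that differently exposed inputs become comparable.
    """
    if not 0.0 <= pixel <= 1.0:
        raise ValueError("pixel must be normalized to [0, 1]")
    if exposure_time <= 0:
        raise ValueError("exposure_time must be positive")
    return (pixel ** gamma) / exposure_time

# The same scene point seen at two hypothetical exposure times.
short_exposure = ldr_to_linear(0.5, exposure_time=1.0)
long_exposure = ldr_to_linear(1.0, exposure_time=2.0)
print(short_exposure, long_exposure)
```

A saturated pixel in the long exposure (value 1.0) carries no reliable radiance information, which is why an attention module that selects well-exposed regions before fusion is useful.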
Alzheimer's disease (AD) is regarded as the most common form of dementia and affects millions of people around the world. While there is no specific remedy for AD, a precise prediction at an early stage could effectively delay the onset of AD. Structural Magnetic Resonance Imaging (sMRI) can detect brain abnormalities in AD patients and has been widely used for AD diagnosis. Thanks to the rapid development of deep learning techniques, a great number of deep learning methods have been adopted to obtain task-oriented features from sMRI and have achieved satisfactory performance. In this paper, we first systematically review these applications of deep learning models to AD detection using sMRI. Specifically, we divide them into four main categories according to their input types and discuss their advantages and limitations. Then, we identify two challenges for current studies: incomparable performance and the difficulty of efficiently modeling the relationship between spatially distant regions. Finally, we offer two possible future research directions for building better deep learning-based AD detection models.
To prevent IoT devices from being exploited, it is particularly important to detect as many vulnerabilities as possible during the device development process. Black-box fuzz testing is widely used for vulnerability detection in IoT devices for several reasons. First, the source code of the firmware is rarely made public, so device response messages are a valuable source of device status. Legacy black-box fuzz tests lacked checks on network protocols, message formats and encodings. Byte-to-byte mutation without these checks produced a large amount of garbage input data that could not reach deep-level function code, which hurt the efficiency and accuracy of fuzz testing. Second, the communication protocol specification of the firmware is also rarely made public, and it is difficult for existing grammar-based fuzzing strategies to distinguish the meaning of each field of the message. To solve these issues, this paper proposes a response-based black-box fuzzing method named FIoTFuzzer. We set up a message adapter to identify the protocol, format, encoding and other information of the original communication packets. To improve the syntax inference capability, FIoTFuzzer divides the message into segments based on the response, avoiding blind mutation of the content. Applying a segment-based mutation strategy while respecting the format specification allows the fuzzer to reach deep functional components of smart devices. The method has lightweight dependencies and does not require reverse engineering. We evaluated it on 12 IoT devices, including routers, smart bulbs and IP cameras. The results show that: (1) FIoTFuzzer is able to detect real-world vulnerabilities in IoT devices; (2) in our benchmark comparison tests with Boofuzz and Sulley, FIoTFuzzer detected 9 vulnerabilities, while Boofuzz detected only 5 and Sulley only 4 of these 9 vulnerabilities.
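The contrast between blind byte mutation and segment-based mutation can be sketched as follows. This is not FIoTFuzzer: the segmentation here uses fixed `&` and `=` delimiters as a stand-in for the response-based segmentation described above, and the message, payload list, and function names are hypothetical.

```python
import random

def split_segments(message, sep="&"):
    """Split a key=value message into (key, value) segments."""
    segments = []
    for part in message.split(sep):
        key, _, value = part.partition("=")
        segments.append((key, value))
    return segments

def mutate_segment(segments, index, rng):
    """Return a copy with one segment's value replaced by a test payload."""
    key, value = segments[index]
    payloads = [value * 8, "A" * 256, "-1", "%s%n", ""]  # illustrative payloads only
    mutated = list(segments)
    mutated[index] = (key, rng.choice(payloads))
    return mutated

def join_segments(segments, sep="&"):
    """Reassemble segments so the overall message format is preserved."""
    return sep.join(f"{k}={v}" for k, v in segments)

rng = random.Random(0)  # seeded for reproducibility
original = "user=admin&ip=192.168.0.1&port=8080"
fuzzed = join_segments(mutate_segment(split_segments(original), 2, rng))
print(fuzzed)
```

Because only the value of one field changes, the mutated message still parses as the same format and is more likely to pass early input validation than a randomly corrupted byte string.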
Colorization has attracted increasing interest in recent years. However, image colorization is an ill-posed problem with multi-modal correct solutions, and existing methods still suffer from context confusion and object-edge color bleeding. In this paper, we propose Color-GAN, a novel automatic adversarial-learning colorization method that couples channel and spatial attention with a residual structure enhanced by a feature extractor and skip connections. Our network learns colorization by combining perceptual and semantic understanding of color with class distributions. Experimental results show that our network outperforms existing methods on different quality metrics and achieves state-of-the-art performance on automatic image colorization.
Recent years have witnessed the great promise of deep neural video compression codecs. However, there are still unprecedented challenges ahead when videos are expected to be encoded at extremely low bitrates. Motivated by recent attempts at layered conceptual image compression, we make the first attempt to leverage disentangled visual representations for extreme human body video compression. More specifically, to capture the main structure, we adopt the inferred human pose keypoints as the structure code of each frame, thereby deriving the motion information from the structure codes of adjacent frames for further compression. To better exploit the texture redundancy, all frames share the same texture codes by incorporating the proposed texture contrastive learning to ensure texture consistency within a video. Two branches are consequently transmitted in a separable manner, and the generator synthesizes the reconstructed video from the combination of all decoded representations at the decoder side. Both qualitative and quantitative experimental results demonstrate that the proposed scheme can produce perceptually pleasing reconstruction results at ultra-low bitrates far below those that can be reached by other video codecs.
Gender has been demonstrated to have a significant impact on emotional response. However, most previous research on the effects of gender on emotion has focused on the fields of psychology and neuroscience. This paper employs electroencephalography signals to investigate gender differences in emotional expressivity, emotional experience and emotion recognition from the perspective of affective computing. Because they are concise, comprehensible and emotionally impactful, short videos are chosen to elicit positive, neutral and negative emotional categories. The analysis of emotional responses shows that gender differences do exist in the emotions evoked while watching short videos. Additionally, we use differential entropy as the feature and conduct a cross-subject emotion recognition experiment with a support vector machine. The experimental results show that the same-gender training strategy outperforms the different-gender training strategy, which suggests a gender in-group advantage in electroencephalogram patterns when processing emotions. These findings provide valuable information for a deeper understanding of the role of gender differences in emotion processing and emphasize the importance of integrating neuroscience and computational approaches in emotion recognition tasks.
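Differential entropy is commonly computed for EEG under a Gaussian assumption, where it reduces to `0.5 * ln(2 * pi * e * variance)`. The sketch below assumes that closed form and a signal segment that has already been band-pass filtered; neither detail is stated in the abstract, and the sample values are hypothetical.

```python
import math

def differential_entropy(samples):
    """Differential entropy of a signal segment under a Gaussian assumption."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    if variance <= 0.0:
        raise ValueError("segment has zero variance")
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

# Hypothetical segment with unit variance.
segment = [-1.0, 1.0, -1.0, 1.0]
de = differential_entropy(segment)
print(de)
```

In a typical pipeline one such value is computed per channel and per frequency band, and the resulting vector is passed to the classifier.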
In this letter, two compact, colocated, low-coupled, quad-polarized (QP) antennas are proposed based on two dual-mode radiation bodies to achieve compactness and high isolation. The first antenna obtains four polarizations by orthogonally placing two dual-mode loops; the second antenna includes a dual-mode loop and a miniaturized dual-polarized patch based on fractal structures inside the loop. To investigate the performance of QP multiple-input–multiple-output (MIMO) systems, channel parameters including correlation coefficients, degree of power balance, and channel capacities are analyzed based on the measured data. Results show that high channel capacities, very close to that of the 4 × 4 independent and identically distributed (i.i.d.) Rayleigh channel, can be achieved both in a realistic office room and in a rich-scattering reverberation chamber. It is revealed that both the correlation coefficient and the degree of received power balance have a strong influence on the performance of a MIMO system; both are related to the radiation properties of the antenna and the polarimetric characteristics of the propagation environment.
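For reference, the channel capacity of a MIMO link is usually evaluated with the standard expression below. It assumes equal power allocation across the transmit ports and no channel knowledge at the transmitter; whether the letter uses exactly this normalization is an assumption.

```latex
C = \log_2 \det\!\left( \mathbf{I}_{N_r} + \frac{\rho}{N_t}\, \mathbf{H}\mathbf{H}^{H} \right) \quad \text{bit/s/Hz}
```

Here $\mathbf{H}$ is the $N_r \times N_t$ normalized channel matrix ($N_r = N_t = 4$ for a quad-polarized pair), $\rho$ is the average signal-to-noise ratio at the receiver, and $(\cdot)^{H}$ denotes the conjugate transpose. High correlation between ports or unbalanced received power reduces the smaller eigenvalues of $\mathbf{H}\mathbf{H}^{H}$ and therefore the capacity.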
The diffusion of all-media content plays a vital role in guiding public opinion and ideology. However, at present, most media content is spread across all kinds of mainstream media platforms, which poses great challenges to effective supervision by the relevant departments and society. This has led to arbitrary charges, chaotic media content, difficulties in supervision and evidence collection, and infringements of the rights and interests of original content creators. To address these problems, this paper constructs a trustworthy propagation architecture that supports multi-platform media content sharing. This architecture collaboratively builds an audio-visual blockchain through public and consortium blockchains, coupled with an improved ChinaDRM to provide digital rights management and content encryption. Simultaneously, we employ an enhanced Diffie–Hellman key agreement protocol to offer distributed encryption and decryption for media content. Within this model, various media platforms and national regulatory authorities are responsible for content storage and distribution as consortium nodes and public blockchain nodes, respectively. At the same time, users, as light nodes of the public chain or service consumers of the consortium blockchain, can consume and comment on content. Analysis shows that the trusted communication framework for media content based on the audio-visual blockchain offers a degree of extensibility and practicality. It can facilitate the supervision of mainstream media platforms by national authorities and society through inter-blockchain technology, offering a novel solution for multi-platform trustworthy cooperative information sharing.
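The key agreement idea can be illustrated with textbook Diffie–Hellman. This sketch is not the enhanced protocol mentioned above, whose modifications are not described here, and the toy parameters (`p = 23`, `g = 5`) are far too small for real use; the party names and function names are hypothetical.

```python
def dh_public(g, private_key, p):
    """Public value sent over the open channel: g^private mod p."""
    return pow(g, private_key, p)

def dh_shared(other_public, private_key, p):
    """Shared secret derived from the other party's public value."""
    return pow(other_public, private_key, p)

# Toy parameters for illustration only; real deployments use large safe primes
# or elliptic-curve groups.
p, g = 23, 5
platform_private, user_private = 4, 3

platform_public = dh_public(g, platform_private, p)
user_public = dh_public(g, user_private, p)

secret_at_platform = dh_shared(user_public, platform_private, p)
secret_at_user = dh_shared(platform_public, user_private, p)
print(secret_at_platform, secret_at_user)
```

Both sides arrive at the same secret without ever transmitting it, and that secret can then seed a symmetric key for encrypting and decrypting media content.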