International Audio Laboratories Erlangen

Facility · Erlangen, Germany

Research output, citation impact, and the most-cited recent papers from International Audio Laboratories Erlangen (Germany). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works: 918
Citations: 17.7K
h-index: 59
i10-index: 437
Also known as: AudioLabs, International Audio Laboratories Erlangen
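
For readers unfamiliar with the two index figures in the profile, the sketch below shows how an h-index and an i10-index are conventionally derived from per-work citation counts. The function names and the sample counts are hypothetical; the index's actual per-work data and computation are not shown on this page.

```python
def h_index(citations):
    """Largest h such that at least h works have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of works with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts, not the institution's real data.
example_counts = [50, 40, 30, 12, 10, 9, 3, 0]
print(h_index(example_counts), i10_index(example_counts))
```

Both indices depend only on the sorted citation counts, so they can be recomputed whenever the underlying counts change.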

Top-cited papers from International Audio Laboratories Erlangen

The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech
Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani +4 more
2013 · 377 citations · doi:10.1109/waspaa.2013.6701894

Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques, and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge, and provides a detailed description of the evaluation framework and benchmark results.

A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research
Keisuke Kinoshita, Marc Delcroix, Sharon Gannot, Emanuël A. P. Habets +4 more
2016 · EURASIP Journal on Advances in Signal Processing · 364 citations · doi:10.1186/s13634-016-0306-6

In recent years, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques that are robust to reverberation. In this paper, we describe the REVERB challenge, which is an evaluation campaign that was designed to evaluate such speech enhancement (SE) and ASR techniques to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions. Even though most existing benchmark tasks and challenges for distant speech processing focus on the noise robustness issue and sometimes only on a single-channel scenario, a particular novelty of the REVERB challenge is that it is carefully designed to test robustness against reverberation, based on both real, single-channel, and multichannel recordings. This challenge attracted 27 papers, which represent 25 systems specifically designed for SE purposes and 49 systems specifically designed for ASR purposes. This paper describes the problems dealt with in the challenge, provides an overview of the submitted systems, and scrutinizes them to clarify what current processing strategies appear effective in reverberant speech processing.

Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals
Soumitro Chakrabarty, Emanuël A. P. Habets
2019 · IEEE Journal of Selected Topics in Signal Processing · 318 citations · doi:10.1109/jstsp.2019.2901664

Supervised learning-based methods for source localization, being data driven, can be adapted to different acoustic conditions via training and have been shown to be robust to adverse acoustic environments. In this paper, a convolutional neural network (CNN) based supervised learning method for estimating the direction of arrival (DOA) of multiple speakers is proposed. Multi-speaker DOA estimation is formulated as a multi-class multi-label classification problem, where the assignment of each DOA label to the input feature is treated as a separate binary classification problem. The phase component of the short-time Fourier transform (STFT) coefficients of the received microphone signals is directly fed into the CNN, and the features for DOA estimation are learnt during training. Utilizing the assumption of disjoint speaker activity in the STFT domain, a novel method is proposed to train the CNN with synthesized noise signals. Through experimental evaluation with both simulated and measured acoustic impulse responses, the ability of the proposed DOA estimation approach to adapt to unseen acoustic conditions and its robustness to unseen noise types are demonstrated. Through additional empirical investigation, it is also shown that with an array of M microphones our proposed framework yields the best localization performance with M-1 convolution layers. The ability of the proposed method to accurately localize speakers in a dynamic acoustic scenario with a varying number of sources is also shown.
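
The multi-label formulation in this abstract means the network emits one probability per DOA class. A minimal sketch of how such per-class probabilities might be decoded into angle estimates is given below, by picking the strongest local peaks. The 5-degree grid, the peak-picking rule, and the name `decode_doas` are assumptions for illustration and are not taken from the paper.

```python
def decode_doas(posteriors, num_sources, resolution_deg=5.0):
    """Return the angles (degrees) of the strongest local peaks.

    posteriors: one probability per DOA class, ordered by angle.
    num_sources: how many speakers to report (assumed known here).
    """
    n = len(posteriors)
    peaks = []
    for i, p in enumerate(posteriors):
        left = posteriors[i - 1] if i > 0 else float("-inf")
        right = posteriors[i + 1] if i < n - 1 else float("-inf")
        # Keep a class only if it is a local maximum of the posterior curve.
        if p > left and p >= right:
            peaks.append((p, i))
    peaks.sort(reverse=True)
    chosen = sorted(index for _, index in peaks[:num_sources])
    return [index * resolution_deg for index in chosen]


# Hypothetical posterior over 37 classes covering 0 to 180 degrees.
posterior = [0.01] * 37
posterior[5], posterior[6], posterior[7] = 0.3, 0.9, 0.4
posterior[23], posterior[24] = 0.2, 0.8
print(decode_doas(posterior, num_sources=2))
```

Treating each class as its own binary decision is what allows more than one angle to be active at once; a softmax over classes would force a single winner.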

Broadband DOA estimation using convolutional neural networks trained with noise signals
Soumitro Chakrabarty, Emanuël A. P. Habets
2017 · 278 citations · doi:10.1109/waspaa.2017.8170010

A convolutional neural network (CNN) based classification method for broadband DOA estimation is proposed, where the phase component of the short-time Fourier transform coefficients of the received microphone signals is directly fed into the CNN and the features required for DOA estimation are learned during training. Since only the phase component of the input is used, the CNN can be trained with synthesized noise signals, thereby making the preparation of the training data set easier compared to using speech signals. Through experimental evaluation, the ability of the proposed noise-trained CNN framework to generalize to speech sources is demonstrated. In addition, the robustness of the system to noise, small perturbations in microphone positions, as well as its ability to adapt to different acoustic conditions is investigated using experiments with simulated and real data.
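
The phase-only input described in this abstract can be illustrated with a small sketch that builds an M x (N/2 + 1) phase matrix from one time frame per microphone. It uses a plain DFT with no analysis window, and the names `stft_frame_phase` and `phase_map` are hypothetical; the paper's actual frame length, window, and preprocessing are not given here.

```python
import cmath
import math


def stft_frame_phase(frame):
    """Phase in radians of the one-sided DFT of a single time frame."""
    n = len(frame)
    phases = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        phases.append(cmath.phase(coeff))
    return phases


def phase_map(mic_frames):
    """Stack per-microphone phase vectors into an M x (N/2 + 1) matrix."""
    return [stft_frame_phase(frame) for frame in mic_frames]


# Two microphones observing the same tone, the second delayed by one sample.
n = 8
mic0 = [math.cos(2 * math.pi * t / n) for t in range(n)]
mic1 = [math.cos(2 * math.pi * (t - 1) / n) for t in range(n)]
features = phase_map([mic0, mic1])
print(features[1][1] - features[0][1])  # inter-microphone phase difference at bin 1
```

The delay between microphones shows up purely as a phase difference, which is why magnitude can be discarded and why spectrally flat noise can stand in for speech during training.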

webMUSHRA — A Comprehensive Framework for Web-based Listening Tests
Michael Schoeffler, Sarah Bartoschek, Fabian-Robert Stöter, Marlene Roess +3 more
2018 · Journal of Open Research Software · 200 citations · doi:10.5334/jors.187

For a long time, many popular listening test methods, such as ITU-R BS.1534 (MUSHRA), could not be carried out as web-based listening tests, since established web standards did not support all required audio processing features. With the standardization of the Web Audio API, the required features became available and, therefore, also the possibility to implement a wide range of established methods as web-based listening tests. In order to simplify the implementation of MUSHRA listening tests, the development of webMUSHRA was started. By utilizing webMUSHRA, experimenters can configure web-based MUSHRA listening tests without the need for web programming expertise. Today, webMUSHRA supports many more listening test methods, such as ITU-R BS.1116 and forced-choice procedures. Moreover, webMUSHRA is highly customizable and has been used in many auditory studies for different purposes.

Inference of Room Geometry From Acoustic Impulse Responses
Fabio Antonacci, Jason Filos, Mark R. Thomas, Emanuël A. P. Habets +3 more
2012 · IEEE Transactions on Audio Speech and Language Processing · 139 citations · doi:10.1109/tasl.2012.2210877

Acoustic scene reconstruction is a process that aims to infer characteristics of the environment from acoustic measurements. We investigate the problem of locating planar reflectors in rooms, such as walls and furniture, from signals obtained using distributed microphones. Specifically, localization of multiple two-dimensional (2-D) reflectors is achieved by estimation of the time of arrival (TOA) of reflected signals by analysis of acoustic impulse responses (AIRs). The estimated TOAs are converted into elliptical constraints about the location of the line reflector, which is then localized by combining multiple constraints. When multiple walls are present in the acoustic scene, an ambiguity problem arises, which we show can be addressed using the Hough transform. Additionally, the Hough transform significantly improves the robustness of the estimation for noisy measurements. The proposed approach is evaluated using simulated rooms under a variety of different controlled conditions where the floor and ceiling are perfectly absorbing. Results using AIRs measured in a real environment are also given. Additionally, results showing the robustness to additive noise in the TOA information are presented, with particular reference to the improvement achieved through the use of the Hough transform.
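
The elliptical constraint mentioned in this abstract follows from simple geometry: a first-order reflection with path length c·TOA must bounce at a point whose distances to source and microphone sum to that length, so the reflector is tangent to an ellipse with the source and microphone as foci. The sketch below computes that ellipse in 2-D. The function name, the dictionary layout, and the speed of sound of 343 m/s are assumptions; the Hough-transform step that combines several constraints is not shown.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def toa_ellipse(source, mic, toa, c=SPEED_OF_SOUND):
    """Ellipse of candidate reflection points for one first-order echo (2-D)."""
    path_length = c * toa
    dx, dy = mic[0] - source[0], mic[1] - source[1]
    focal = math.hypot(dx, dy) / 2.0  # half the source-microphone distance
    a = path_length / 2.0  # semi-major axis
    if a <= focal:
        raise ValueError("reflected path must be longer than the direct path")
    b = math.sqrt(a * a - focal * focal)  # semi-minor axis
    center = ((source[0] + mic[0]) / 2.0, (source[1] + mic[1]) / 2.0)
    return {"center": center, "a": a, "b": b, "angle": math.atan2(dy, dx)}


# Source at the origin, microphone at (2, 0), wall along y = 1.
# The image source is at (0, 2), so the echo travels sqrt(8) metres.
ellipse = toa_ellipse((0.0, 0.0), (2.0, 0.0), math.sqrt(8.0) / SPEED_OF_SOUND)
print(ellipse)
```

In this example the semi-minor axis comes out as 1 m with the centre at (1, 0), so the wall y = 1 touches the ellipse at exactly one point, as the tangency argument requires.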

Multicenter evaluation of signal enhancement algorithms for hearing aids
Heleen Luts, Koen Eneman, Jan Wouters, Michael Schulte +4 more
2010 · The Journal of the Acoustical Society of America · 129 citations · doi:10.1121/1.3299168

In the framework of the European HearCom project, promising signal enhancement algorithms were developed and evaluated for future use in hearing instruments. To assess the algorithms' performance, five of the algorithms were selected and implemented on a common real-time hardware/software platform. Four test centers in Belgium, The Netherlands, Germany, and Switzerland perceptually evaluated the algorithms. Listening tests were performed with large numbers of normal-hearing and hearing-impaired subjects. Three perceptual measures were used: speech reception threshold (SRT), listening effort scaling, and preference rating. Tests were carried out in two types of rooms. Speech was presented in multitalker babble arriving from one or three loudspeakers. In a pseudo-diffuse noise scenario, only one algorithm, the spatially preprocessed speech-distortion-weighted multi-channel Wiener filtering, provided an SRT improvement relative to the unprocessed condition. Despite the general lack of improvement in SRT, some algorithms were preferred over the unprocessed condition at all tested signal-to-noise ratios (SNRs). These effects were found across different subject groups and test sites. The listening effort scores were less consistent over test sites. For the algorithms that did not affect speech intelligibility, a reduction in listening effort was observed at 0 dB SNR.

Irregular vocal-fold vibration—High-speed observation and modeling
Patrick Mergell, Hanspeter Herzel, Ingo R. Titze
2000 · The Journal of the Acoustical Society of America · 124 citations · doi:10.1121/1.1314398

Direct observations of nonstationary asymmetric vocal-fold oscillations are reported. Complex time series of the left and the right vocal-fold vibrations are extracted from digital high-speed image sequences separately. The dynamics of the corresponding high-speed glottograms reveals transitions between low-dimensional attractors such as subharmonic and quasiperiodic oscillations. The spectral components of either oscillation are given by positive linear combinations of two fundamental frequencies. Their ratio is determined from the high-speed sequences and is used as a parameter of laryngeal asymmetry in model calculations. The parameters of a simplified asymmetric two-mass model of the larynx are preset by using experimental data. Its bifurcation structure is explored in order to fit simulations to the observed time series. Appropriate parameter settings allow the reproduction of time series and differentiated amplitude contours with quantitative agreement. In particular, several phase-locked episodes ranging from 4:5 to 2:3 rhythms are generated realistically with the model.

Rigid sphere room impulse response simulation: Algorithm and applications
Daniel Jarrett, Emanuël A. P. Habets, Mark R. Thomas, Patrick A. Naylor
2012 · The Journal of the Acoustical Society of America · 122 citations · doi:10.1121/1.4740497

Simulated room impulse responses have been proven to be both useful and indispensable for comprehensive testing of acoustic signal processing algorithms while controlling parameters such as the reverberation time, room dimensions, and source-array distance. In this work, a method is proposed for simulating the room impulse responses between a sound source and the microphones positioned on a spherical array. The method takes into account specular reflections of the source by employing the well-known image method, and scattering from the rigid sphere by employing spherical harmonic decomposition. Pseudocode for the proposed method is provided, taking into account various optimizations to reduce the computational complexity. The magnitude and phase errors that result from the finite order spherical harmonic decomposition are analyzed and general guidelines for the order selection are provided. Three examples are presented: an analysis of a diffuse reverberant sound field, a study of binaural cues in the presence of reverberation, and an illustration of the algorithm's use as a mouth simulator.
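
As a rough illustration of the image method named in this abstract, the sketch below mirrors a source across the six walls of a shoebox room and converts distances to arrival times. It covers first-order images only and omits wall reflection coefficients, higher-order images, and the rigid-sphere scattering via spherical harmonics that the paper adds. The function names, the room layout with one corner at the origin, and the speed of sound are assumptions.

```python
import math


def first_order_images(source, room):
    """Mirror the source across each wall of a room spanning [0, L] per axis."""
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            image[axis] = 2.0 * wall - source[axis]  # reflect across the wall plane
            images.append(tuple(image))
    return images


def arrival_time(point, mic, c=343.0):
    """Propagation delay in seconds from a (possibly image) source to a microphone."""
    return math.dist(point, mic) / c


source = (1.0, 2.0, 1.5)
room = (4.0, 5.0, 3.0)  # hypothetical room dimensions in metres
mic = (3.0, 3.0, 1.5)
for image in first_order_images(source, room):
    print(image, round(arrival_time(image, mic) * 1000.0, 2), "ms")
```

Each image source stands in for one specular wall reflection, so the direct path plus these six delays give the earliest part of the impulse response for an omnidirectional microphone in free space.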

In Vivo Determination of Hepatic Stiffness Using Steady-State Free Precession Magnetic Resonance Elastography
Dieter Klatt, Patrick Asbach, Jens Rump, Sebastian Papazoglou +4 more
2006 · Investigative Radiology · 116 citations · doi:10.1097/01.rli.0000244341.16372.08

OBJECTIVE: The objective of this study was to introduce a magnetic resonance elastography (MRE) protocol based on fractional motion encoding and planar wave acquisition for rapid measurements of in vivo human liver stiffness. MATERIALS AND METHODS: Vibrations of a remote actuator membrane were fed by a rigid rod to the patient's surface beneath the right costal arch resulting in axial shear deflections of the liver. Data acquisition was performed using a balanced steady-state free precession (bSSFP) sequence incorporating oscillating gradients for motion sensitization. Tissue vibrations of frequency fv = 51 Hz were tuned by twice the sequence repetition time (1/fv = 2TR). Twenty axial images acquired by time-resolved through-plane wave encoding were used for planar elasticity reconstruction. The MRE data acquisition was achieved within 4 breathholds of 17 seconds each. The method was applied to 12 healthy volunteers and 2 patients with diffuse liver disease (fibrosis grade 3). RESULTS: MRE data acquisition was successful in all volunteers and patients. The elastic moduli were measured with values between 1.99 +/- 0.16 and 5.77 +/- 0.88 kPa. Follow-up studies demonstrated the reproducibility of the method and revealed a difference of 0.74 +/- 0.47 kPa (P < 0.05) between the hepatic stiffness of 2 healthy male volunteers. CONCLUSION: bSSFP combined with fractional MRE enables rapid measurement of liver stiffness in vivo. The used actuation principle supports a 2-dimensional analysis of the strain wave field captured by axial wave images. The measured data indicate individual variations of hepatic stiffness in healthy volunteers.
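
The timing relation 1/fv = 2TR in this abstract fixes the sequence repetition time once the vibration frequency is chosen. The short calculation below works it through for the stated 51 Hz and also totals the breath-hold time. The function name is hypothetical and the script is only a worked arithmetic check of figures quoted in the abstract.

```python
def repetition_time_ms(vibration_hz):
    """Repetition time in milliseconds implied by 1/fv = 2*TR."""
    return 1000.0 / (2.0 * vibration_hz)


tr_ms = repetition_time_ms(51.0)
total_breathhold_s = 4 * 17  # four breath-holds of 17 seconds each

print(f"TR = {tr_ms:.2f} ms")
print(f"Total breath-hold time = {total_breathhold_s} s")
```

One vibration period therefore spans exactly two repetition intervals, which is the synchronisation the abstract describes as tuning the vibration by twice the repetition time.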

MPEG-H 3D Audio—The New Standard for Coding of Immersive Spatial Audio
Jürgen Herre, Johannes Hilpert, Achim Kuntz, Jan Plogsties
2015 · IEEE Journal of Selected Topics in Signal Processing · 114 citations · doi:10.1109/jstsp.2015.2411578

The science and art of Spatial Audio is concerned with the capture, production, transmission, and reproduction of an immersive sound experience. Recently, a new generation of spatial audio technology has been introduced that employs elevated and lowered loudspeakers and thus surpasses previous 'surround sound' technology without such speakers in terms of listener immersion and potential for spatial realism. In this context, the ISO/MPEG standardization group has started the MPEG-H 3D Audio development effort to facilitate high-quality bitrate-efficient production, transmission and reproduction of such immersive audio material. The underlying format is designed to provide universal means for carriage of channel-based, object-based and Higher Order Ambisonics based input. High quality reproduction is provided for many output formats from 22.2 and beyond down to 5.1, stereo and binaural reproduction, independently of the original encoding format, thus overcoming the incompatibility between various 3D formats. This paper provides an overview of the MPEG-H 3D Audio project and technology and an assessment of the system capabilities and performance.

DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio Based On Deep Filtering
Hendrik Schröter, Alberto N. Escalante-B., Tobias Rosenkranz, Andreas Maier
2022 · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 109 citations · doi:10.1109/icassp43922.2022.9747055

Complex-valued processing has brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram, while complex masks (CM) are usually preferred over real-valued masks due to their ability to modify the phase. Recent work proposed to use a complex filter instead of a point-wise multiplication with a mask. This makes it possible to incorporate information from previous and future time steps, exploiting local correlations within each frequency band. In this work, we propose DeepFilterNet, a two-stage speech enhancement framework utilizing deep filtering. First, we enhance the spectral envelope using ERB-scaled gains modeling the human frequency perception. The second stage employs deep filtering to enhance the periodic components of speech. In addition to taking advantage of perceptual properties of speech, we enforce network sparsity via separable convolutions and extensive grouping in linear and recurrent layers to design a low complexity architecture. We further show that our two-stage deep filtering approach outperforms complex masks over a variety of frequency resolutions and latencies and demonstrate convincing performance compared to other state-of-the-art models.
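
The difference between a point-wise complex mask and a complex filter over neighbouring time steps, as contrasted in this abstract, can be sketched for a single frequency band as follows. The indexing convention, the `lookahead` parameter, the zero-padding at the edges, and the name `deep_filter_band` are assumptions for illustration; in the paper the filter taps are predicted by a network, whereas here they are supplied by hand.

```python
def deep_filter_band(spec, coefs, lookahead=0):
    """Apply a per-frame complex FIR filter along time in one frequency band.

    spec:  complex STFT values X[t] for one band.
    coefs: coefs[t][i] is the complex tap applied to X[t - i + lookahead].
    Frames outside the signal are treated as zero.
    """
    out = []
    for t, taps in enumerate(coefs):
        acc = 0j
        for i, tap in enumerate(taps):
            idx = t - i + lookahead
            if 0 <= idx < len(spec):
                acc += tap * spec[idx]
        out.append(acc)
    return out


band = [1 + 1j, 2 + 0j, 0 - 1j]

# A single tap per frame reduces to an ordinary complex mask.
masked = deep_filter_band(band, [[0.5 + 0j]] * 3)

# Two taps combine the current and the previous frame.
filtered = deep_filter_band(band, [[1 + 0j, 1 + 0j]] * 3)
print(masked, filtered)
```

With one tap the operation is exactly a mask, so the filter form strictly generalises masking; extra taps let each output bin draw on adjacent frames in the same band.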

Real-Time Body Tracking with One Depth Camera and Inertial Sensors
Thomas Helten, Meinard Müller, Hans‐Peter Seidel, Christian Theobalt
2013 · 107 citations · doi:10.1109/iccv.2013.141

In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted the research in monocular full body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. In this paper, we present a novel sensor fusion approach for real-time full body tracking that succeeds in such difficult situations. It takes inspiration from previous tracking solutions, and combines a generative tracker and a discriminative tracker retrieving closest poses in a database. In contrast to previous work, both trackers employ data from a low number of inexpensive body-worn inertial sensors. These sensors provide reliable and complementary information when the monocular depth information alone is not sufficient. We also contribute new algorithmic solutions to best fuse depth and inertial data in both trackers. One is a new visibility model to determine global body pose, occlusions and usable depth correspondences and to decide what data modality to use for discriminative tracking. We also contribute a new inertial-based pose retrieval, and an adapted late fusion step to calculate the final body pose.

Mandatory Human Rights Due Diligence in Germany and Norway: Stepping, or Striding, in the Same Direction?
Markus Krajewski, Kristel Manal Tonstad, Franziska Wohltmann
2021 · Business and Human Rights Journal · 103 citations · doi:10.1017/bhj.2021.43

Germany and Norway are the two latest states to adopt laws mandating human rights due diligence by companies. Germany adopted a Law on Supply Chain Due Diligence (German Law) on 10 June 2021. The same day, the Norwegian parliament passed a Transparency Act (Norwegian Act) requiring human rights and decent work due diligence. Like the French Loi de Vigilance and the Dutch Child Labour Due Diligence Law, these laws provide further momentum for mandatory measures to promote corporate respect for human rights, including future regulations in the European Union (EU). While the aims are similar, the German and Norwegian laws contain certain important differences when it comes to the substance and scope of the due diligence requirement. In this context, adherence to international standards remains the way forward to ensure compliance with divergent requirements in different jurisdictions.

Synthese substituierter Hydrazine aus Aminen mit Hydroxylamin‐O‐sulfonsäure [Synthesis of substituted hydrazines from amines with hydroxylamine-O-sulfonic acid]
R. Gösl, Alwin Meuwsen
1959 · Chemische Berichte · 100 citations · doi:10.1002/cber.19590921020

The reaction of amines with hydroxylamine-O-sulfonic acid is a general and convenient method for preparing mono- and 1,1-dialkylhydrazines as well as 1,1,1-trisubstituted hydrazinium compounds. With pyridines, amination of the heterocyclic aromatic nitrogen is achieved.

Personalization and Evaluation of a Real-Time Depth-Based Full Body Tracker
Thomas Helten, Andreas Baak, Gaurav Bharaj, Meinard Müller +2 more
2013 · 95 citations · doi:10.1109/3dv.2013.44

Reconstructing a three-dimensional representation of human motion in real-time constitutes an important research topic with applications in sports sciences, human-computer-interaction, and the movie industry. In this paper, we contribute a robust algorithm for estimating a personalized human body model from just two sequentially captured depth images that is more accurate and runs an order of magnitude faster than the current state-of-the-art procedure. Then, we employ the estimated body model to track the pose in real-time from a stream of depth images using a tracking algorithm that combines local pose optimization and a stabilizing database look-up. Together, this enables pose tracking that is more accurate than previous approaches. As a further contribution, we evaluate and compare our algorithm to previous work on a comprehensive benchmark dataset containing more than 15 minutes of challenging motions. This dataset comprises calibrated marker-based motion capture data, depth data, as well as ground truth tracking results and is publicly available for research purposes.

Score-Informed Source Separation for Musical Audio Recordings: An overview
Sebastian Ewert, Bryan Pardo, Meinard Müller, Mark D. Plumbley
2014 · IEEE Signal Processing Magazine · 93 citations · doi:10.1109/msp.2013.2296076

In recent years, source separation has been a central research topic in music signal processing, with applications in stereo-to-surround up-mixing, remixing tools for disc jockeys or producers, instrument-wise equalizing, karaoke systems, and preprocessing in music analysis tasks. Musical sound sources, however, are often strongly correlated in time and frequency, and without additional knowledge about the sources, a decomposition of a musical recording is often infeasible. To simplify this complex task, various methods have recently been proposed that exploit the availability of a musical score. The additional instrumentation and note information provided by the score guides the separation process, leading to significant improvements in terms of separation quality and robustness. A major challenge in utilizing this rich source of information is to bridge the gap between high-level musical events specified by the score and their corresponding acoustic realizations in an audio recording. In this article, we review recent developments in score-informed source separation and discuss various strategies for integrating the prior knowledge encoded by the score.

Music, Computing, and Health: A Roadmap for the Current and Future Roles of Music Technology for Health Care and Well-Being
Kat Agres, Rebecca Schaefer, Anja Volk, Susan van Hooren +4 more
2021 · Music & Science · 90 citations · doi:10.1177/2059204321997709

The fields of music, health, and technology have seen significant interactions in recent years in developing music technology for health care and well-being. In an effort to strengthen the collaboration between the involved disciplines, the workshop “Music, Computing, and Health” was held to discuss best practices and state-of-the-art at the intersection of these areas with researchers from music psychology and neuroscience, music therapy, music information retrieval, music technology, medical technology (medtech), and robotics. Following the discussions at the workshop, this article provides an overview of the different methods of the involved disciplines and their potential contributions to developing music technology for health and well-being. Furthermore, the article summarizes the state of the art in music technology that can be applied in various health scenarios and provides a perspective on challenges and opportunities for developing music technology that (1) supports person-centered care and evidence-based treatments, and (2) contributes to developing standardized, large-scale research on music-based interventions in an interdisciplinary manner. The article provides a resource for those seeking to engage in interdisciplinary research using music-based computational methods to develop technology for health care, and aims to inspire future research directions by evaluating the state of the art with respect to the challenges facing each field.

Theme Transformer: Symbolic Music Generation With Theme-Conditioned Transformer
Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller +1 more
2022 · IEEE Transactions on Multimedia · 86 citations · doi:10.1109/tmm.2022.3161851

Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.

Time–Frequency Masking Based Online Multi-Channel Speech Enhancement With Convolutional Recurrent Neural Networks
Soumitro Chakrabarty, Emanuël A. P. Habets
2019 · IEEE Journal of Selected Topics in Signal Processing · 83 citations · doi:10.1109/jstsp.2019.2911401

This paper presents a time-frequency masking based online multi-channel speech enhancement approach that uses a convolutional recurrent neural network to estimate the mask. The magnitude and phase components of the short-time Fourier transform coefficients for multiple time frames are provided as an input such that the network is able to discriminate between the directional speech and the noise components based on the spatial characteristics of the individual signals as well as their spectro-temporal structure. The estimation of two different masks, namely, ideal ratio mask (IRM) and ideal binary mask (IBM), along with two different approaches for incorporating the mask to obtain the desired signal are discussed. In the first approach, the mask is directly applied as a real-valued gain to a reference microphone signal, whereas in the second approach, the masks are used as an activity indicator for the recursive update of power spectral density (PSD) matrices to be used within a beamformer. The performance of the proposed system with the two different estimated masks utilized within the two different enhancement approaches is evaluated with both simulated as well as measured room impulse responses, where it is shown that the IBM is better suited as an indicator for the PSD updates while direct application of IRM as a real-valued gain leads to a better improvement in terms of short-term objective intelligibility. Analysis of the performance of the proposed system also demonstrates the robustness of the system to different angular positions of the speech source.
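
The two masks named in this abstract can be written down directly when the clean speech and noise powers are known, which is how training targets are usually obtained. The sketch below uses one common definition of each mask and applies the result as a real-valued gain, corresponding to the first of the two approaches described. The exact mask formulas, the 0 dB threshold, and the function names are assumptions and may differ from the paper; the mask-driven PSD update for beamforming is not shown.

```python
import math


def ideal_ratio_mask(speech_pow, noise_pow):
    """IRM per time-frequency bin, here sqrt(S / (S + N)) (one common definition)."""
    return [
        math.sqrt(s / (s + n)) if s + n > 0 else 0.0
        for s, n in zip(speech_pow, noise_pow)
    ]


def ideal_binary_mask(speech_pow, noise_pow, threshold_db=0.0):
    """IBM per bin: 1 where the local SNR exceeds the threshold, else 0."""
    threshold = 10.0 ** (threshold_db / 10.0)
    return [1.0 if s > threshold * n else 0.0 for s, n in zip(speech_pow, noise_pow)]


def apply_gain(mask, noisy_bins):
    """Apply the mask as a real-valued gain to reference-microphone STFT bins."""
    return [g * y for g, y in zip(mask, noisy_bins)]


# Hypothetical speech and noise powers for three time-frequency bins.
speech = [9.0, 4.0, 0.0]
noise = [16.0, 1.0, 4.0]
noisy = [1 + 1j, 2 - 1j, 0 + 3j]
print(ideal_ratio_mask(speech, noise))
print(apply_gain(ideal_binary_mask(speech, noise), noisy))
```

The ratio mask attenuates bins smoothly while the binary mask makes a hard speech-or-noise decision per bin, which is consistent with the abstract's finding that the binary mask suits activity indication and the ratio mask suits direct gain application.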