Full Programme
Time | Session | Speakers | Venue |
---|---|---|---|
8:30 | Registration (all day) | - | Foyer |
9:00 | Tutorial 1: Design strategies and techniques to better support collaborative, egalitarian and sustainable musical interfaces. Abstract: A common challenge for the community designing audio effects and algorithms to synthesise musical instruments is identifying the design considerations that accommodate interaction experiences relevant to musicians, particularly across a diverse community of practitioners. This hands-on tutorial will cover some theoretical and practical foundations for designing interfaces for digital sound instruments and effects, looking at how best to support collaborative, egalitarian and sustainable spaces. | Dr. Anna Xambó Sedó (Queen Mary University of London) | 03MS01 |
10:30 | Refreshments | - | Lakeside |
11:00 | Tutorial 2: Room acoustics rendering for immersive audio applications. Abstract: This tutorial will focus on the fundamentals of room acoustics rendering for immersive audio applications. Rendering the acoustics of a space is known to increase the sense of immersion and envelopment, and to enhance user experience in mixed reality applications. We will briefly discuss some fundamental models before delving into delay-line-based parametric reverberators that are ideal for real-time applications. The tutorial will also cover aspects of spatial reverb for multichannel reproduction and touch on binauralisation for headphone reproduction. Code examples and an introduction to open-source toolboxes in Python will be provided. (An illustrative delay-network sketch in Python follows this day's table.) | Dr. Orchisama Das (SONOS) | 03MS01 |
12:30 | Lunch | - | Hillside |
14:00 | Tutorial 3: Accessibility-conscious design for audio hardware and software. Abstract: This tutorial will start with an introduction to the world of accessible music technology, in which Jason will share his perspective and work as a visually impaired music producer and audio engineer. We will discuss the work being done with music equipment manufacturers to make their products accessible, as well as with companies and educational institutions interested in accessibility. Following this introduction, there will be a live and interactive demonstration covering all aspects of the creation of a music production, from recording to mastering, using accessible hardware and software. | Jason Dasent (Kingston University) | 03MS01 |
15:30 | Refreshments | - | Lakeside |
16:00 | AI for Multitrack Music Mixing: Hands-On Workshop. Abstract: Music mixing is essential in post-production, demanding both technical expertise and creativity to achieve professional results. This workshop explores recent advancements in automatic mixing, particularly deep learning-based approaches utilizing large datasets, black-box processing and innovative techniques like differentiable mixing consoles. Through a hands-on session with code examples, participants will learn about building, training, and evaluating these systems. Topics include intelligent music production, contextual importance in mixing, and system design challenges. Aimed at researchers and professionals in digital audio processing, this workshop serves as an entry point for those new to deep learning methods for music post-production. The participation of the DAFx community is crucial as we collectively shape the future of AI-driven music mixing. (An illustrative differentiable-mixing sketch in Python follows this day's table.) | Soumya Sai Vanka (Queen Mary University of London) and Dr. Marco A. Martínez-Ramírez | 03MS01 |
18:00 | End of Sessions | - | - |
18:30 | Welcome Reception | - | Lake Bar |
22:00 | End of Day | - | - |
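For readers who want to experiment ahead of Tutorial 2, the sketch below is a minimal, illustrative delay-network reverberator in plain NumPy: parallel feedback comb filters followed by series allpass filters, in the spirit of the delay-line-based parametric reverberators the tutorial abstract mentions. It is not tutorial material; the function names, delay lengths and gains are arbitrary example choices.

```python
# Minimal illustrative sketch (not tutorial material): a Schroeder-style
# reverberator built from delay lines, i.e. parallel feedback comb filters
# followed by series allpass filters. Delay lengths and gains are arbitrary
# example values chosen for a 48 kHz sample rate.
import numpy as np

def feedback_comb(x, delay, g):
    """y[n] = x[n] + g * y[n - delay]"""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, g):
    """y[n] = -g * x[n] + x[n - delay] + g * y[n - delay]"""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def simple_reverb(x):
    """Four parallel combs (mutually detuned delays) summed, then two allpasses."""
    combs = [(1557, 0.84), (1617, 0.83), (1491, 0.82), (1422, 0.81)]
    wet = sum(feedback_comb(x, d, g) for d, g in combs) / len(combs)
    for d, g in [(225, 0.7), (556, 0.7)]:
        wet = allpass(wet, d, g)
    return wet

# Example: synthesise a one-second impulse response to inspect the decay.
impulse = np.zeros(48000)
impulse[0] = 1.0
ir = simple_reverb(impulse)
```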
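Similarly, ahead of the AI mixing workshop, the following toy sketch (assuming PyTorch is available; it is not workshop material) shows the core idea behind a differentiable mixing console in its simplest form: per-track gains optimised by gradient descent so that the summed mix matches a reference. All variable names and constants are illustrative.

```python
# Toy illustration (not workshop material): the simplest possible
# "differentiable mixing console", i.e. per-track gains learned by gradient
# descent so that the summed mix matches a reference mix. Signals are random
# stand-ins for dry stems; a real system would use audio and richer processors.
import torch

n_tracks, n_samples = 4, 44100
tracks = torch.randn(n_tracks, n_samples)             # stand-in dry stems
true_gains = torch.rand(n_tracks, 1)                   # unknown gains behind the reference
reference_mix = (true_gains * tracks).sum(dim=0)

gains = torch.zeros(n_tracks, 1, requires_grad=True)   # learnable console gains
optimizer = torch.optim.Adam([gains], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    mix = (gains * tracks).sum(dim=0)                   # differentiable mix bus
    loss = torch.nn.functional.mse_loss(mix, reference_mix)
    loss.backward()
    optimizer.step()

print("recovered gains:", gains.detach().squeeze().tolist())
```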
Time | Session | Speakers | Venue |
---|---|---|---|
8:30 | Registration (all day) | - | Foyer |
9:00 | Welcome Remarks | Enzo De Sena | 03MS01 |
Oral Session 1: Deep Learning I
Audio Effect Chain Estimation and Dry Signal Recovery From Multi-Effect-Processed Musical Signals (Osamu Take, Kento Watanabe, Takayuki Nakatsuka, Tian Cheng, Tomoyasu Nakano, Masataka Goto, Shinnosuke Takamichi and Hiroshi Saruwatari). Abstract: In this paper we propose a method that can address a novel task, audio effect (AFX) chain estimation and dry signal recovery. AFXs are indispensable in modern sound design workflows. Sound engineers often cascade different AFXs (as an AFX chain) to achieve their desired soundscapes. Given a multi-AFX-applied solo instrument performance (wet signal), our method can automatically estimate the applied AFX chain and recover its unprocessed dry signal, whereas previous research addresses only one of these tasks. The estimated chain is useful for novice engineers learning practical usage of AFXs, and the recovered signal can be reused with a different AFX chain. To solve this task, we first develop a deep neural network model that estimates the last-applied AFX and undoes it, one AFX at a time. We then iteratively apply the same model to estimate the AFX chain and eventually recover the dry signal from the wet signal. Our experiments on guitar phrase recordings with various AFX chains demonstrate the validity of our method for both AFX-chain estimation and dry signal recovery. We also confirm that the input wet signal can be reproduced by applying the estimated AFX chain to the recovered dry signal.
CONMOD: Controllable Neural Frame-Based Modulation Effects (Gyubin Lee, Hounsu Kim, Junwon Lee and Juhan Nam). Abstract: Deep learning models have seen widespread use in modelling LFO-driven audio effects, such as phaser and flanger. Although existing neural architectures exhibit high-quality emulation of individual effects, they do not possess the capability to manipulate the output via control parameters. To address this issue, we introduce Controllable Neural Frame-based Modulation Effects (CONMOD), a single black-box model which emulates various LFO-driven effects in a frame-wise manner, offering control over LFO frequency and feedback parameters. Additionally, the model is capable of learning the continuous embedding space of two distinct phaser effects, enabling us to steer between effects and achieve creative outputs. Our model outperforms previous work while possessing both controllability and universality, presenting opportunities to enhance creativity in modern LFO-driven audio effects. An additional demo of our model is available on the accompanying website.
Sample Rate Independent Recurrent Neural Networks for Audio Effects Processing (Alistair Carson, Alec Wright, Jatin Chowdhury, Vesa Välimäki and Stefan Bilbao). Abstract: In recent years, machine learning approaches to modelling guitar amplifiers and effects pedals have been widely investigated and have become standard practice in some consumer products. In particular, recurrent neural networks (RNNs) are a popular choice for modelling non-linear devices such as vacuum tube amplifiers and distortion circuitry. One limitation of such models is that they are trained on audio at a specific sample rate and therefore give unreliable results when operating at another rate. Here, we investigate several methods of modifying RNN structures to make them approximately sample rate independent, with a focus on oversampling. In the case of integer oversampling, we demonstrate that a previously proposed delay-based approach provides high-fidelity sample rate conversion whilst additionally reducing aliasing. For non-integer sample rate adjustment, we propose two novel methods and show that one of these, based on cubic Lagrange interpolation of a delay line, provides a significant improvement over existing methods. To our knowledge, this work provides the first in-depth study of this problem. (An illustrative fractional-delay sketch in Python follows this day's table.)
A Diffusion-Based Generative Equalizer for Music Restoration (Eloi Moliner, Maija Turunen, Filip Elvander and Vesa Välimäki). Abstract: This paper presents a novel approach to audio restoration, focusing on the enhancement of low-quality music recordings, and in particular historical ones. Building upon a previous algorithm called BABE, or Blind Audio Bandwidth Extension, we introduce BABE-2, which presents a series of improvements. This research broadens the concept of bandwidth extension to generative equalization, a task that, to the best of our knowledge, has not been previously addressed for music restoration. BABE-2 is built around an optimization algorithm utilizing priors from diffusion models, which are trained or fine-tuned using a curated set of high-quality music tracks. The algorithm simultaneously performs two critical tasks: estimation of the filter degradation magnitude response and hallucination of the restored audio. The proposed method is objectively evaluated on historical piano recordings, showing an enhancement over the prior version. The method yields similarly impressive results in rejuvenating the works of renowned vocalists Enrico Caruso and Nellie Melba. This research represents an advancement in the practical restoration of historical music. Historical music restoration examples are available at: research.spa.aalto.fi/publications/papers/dafx-babe2/.
Neural Audio Processing on Android Phones (Jason Hoopes, Brooke Chalmers and Victor Zappi). Abstract: This study investigates the potential of real-time inference of neural audio effects on Android smartphones, marking an initial step towards bridging the gap in neural audio processing for mobile devices. Focusing exclusively on processing rather than synthesis, we explore the performance of three open-source neural models across five Android phones released between 2014 and 2022, showcasing varied capabilities due to their generational differences. Through comparative analysis utilizing two C++ inference engines (ONNX Runtime and RTNeural), we aim to evaluate the computational efficiency and timing performance of these models, considering the varying computational loads and the hardware specifics of each device. Our work contributes insights into the feasibility of implementing neural audio processing in real-time on mobile platforms, highlighting challenges and opportunities for future advancements in this rapidly evolving field. |
Various | 03MS01 |
10:30 | Show & Tell (oral session 1 presenters) | - | Foyer |
Poster Session 1: Deep Learning II
RAVE for Speech: Efficient Voice Conversion at High Sampling Rates (Anders R. Bargum and Cumhur Erkut). Abstract: Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the input identity to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments or streamability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate the realm of voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of naturalness, quality, and intelligibility comparable to those of a state-of-the-art solution for seen speakers, while significantly decreasing inference time. However, despite the presence of target speaker characteristics in the converted output, the actual similarity to unseen speakers remains a challenge.
Network Bending of Diffusion Models for Audio-Visual Generation (Luke Dzwonczyk, Carmine Emanuele Cella and David Ban). Abstract: In this paper we present the first steps towards the creation of a tool which enables artists to create music visualizations using pretrained, generative, machine learning models. First, we investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. We identify a number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools. We find that this process allows for continuous, fine-grained control of image generation which can be helpful for creative applications. Next, we generate music-reactive videos using Stable Diffusion by passing audio features as parameters to network bending operators. Finally, we comment on certain transforms which radically shift the image and the possibilities of learning more about the latent space of Stable Diffusion based on these transforms.
Automatic Equalization for Individual Instrument Tracks Using Convolutional Neural Networks (Florian Mockenhaupt, Joscha Simon Rieber and Shahan Nercessian). Abstract: We propose a novel approach for the automatic equalization of individual musical instrument tracks. Our method begins by identifying the instrument present within a source recording in order to choose its corresponding ideal spectrum as a target. Next, the spectral difference between the recording and the target is calculated, and accordingly, an equalizer matching model is used to predict settings for a parametric equalizer. To this end, we build upon a differentiable parametric equalizer matching neural network, demonstrating improvements relative to the previously established state of the art. Unlike past approaches, we show how our system naturally allows real-world audio data to be leveraged during the training of our matching model, effectively generating suitably produced training targets in an automated manner mirroring conditions at inference time. Consequently, we illustrate how fine-tuning our matching model on such examples considerably improves parametric equalizer matching performance in real-world scenarios, decreasing mean absolute error by 24% relative to methods relying solely on random parameter sampling techniques as a self-supervised learning strategy. We perform listening tests, and demonstrate that our proposed automatic equalization solution subjectively enhances the tonal characteristics for recordings of common instrument types.
Balancing Error and Latency of Black-Box Models for Audio Effects Using Hardware-Aware Neural Architecture Search (Christopher Ringhofer, Alexa Gnoss and Gregor Schiele). Abstract: In this paper, we address automating and systematizing the process of finding black-box models for virtual analogue audio effects with an optimal balance between error and latency. We introduce a multi-objective optimization approach based on hardware-aware neural architecture search which allows specifying the optimization balance of model error and latency according to the requirements of the application. By using a regularized evolutionary algorithm, it is able to navigate a huge search space systematically. Additionally, we propose a search space for modelling non-linear dynamic audio effects consisting of over 41 trillion different WaveNet-style architectures. We evaluate its performance and usefulness by yielding highly effective architectures, either up to 18× faster or with a test loss up to 56% lower than the best-performing models of the related work, while still showing a favourable trade-off. We conclude that hardware-aware neural architecture search is a valuable tool that can help researchers and engineers developing virtual analogue models by automating the architecture design and saving time by avoiding manual search and evaluation through trial and error.
ICGAN: An Implicit Conditioning Method for Interpretable Feature Control of Neural Audio Synthesis (Yunyi Liu and Craig Jin). Abstract: Neural audio synthesis methods can achieve high-fidelity and realistic sound generation by utilizing deep generative models. Such models typically rely on external labels, which are often discrete, as conditioning information to achieve guided sound generation. However, it remains difficult to control the subtle changes in sounds without appropriate and descriptive labels, especially given a limited dataset. This paper proposes an implicit conditioning method for neural audio synthesis using generative adversarial networks that allows for interpretable control of the acoustic features of synthesized sounds. Our technique creates a continuous conditioning space that enables timbre manipulation without relying on explicit labels. We further introduce an evaluation metric to explore controllability and demonstrate that our approach is effective in enabling a degree of controlled variation of different synthesized sound effects for in-domain and cross-domain sounds.
A Hierarchical Deep Learning Approach for Minority Instrument Detection (Dylan Sechet, Francesca Bugiotti, Matthieu Kowalski, Edouard D'Hérouville and Filip Langiewicz). Abstract: Identifying instrument activities within audio excerpts is vital in music information retrieval, with significant implications for music cataloging and discovery. Prior deep learning endeavors in musical instrument recognition have predominantly emphasized instrument classes with ample data availability. Recent studies have demonstrated the applicability of hierarchical classification in detecting instrument activities in orchestral music, even with limited fine-grained annotations at the instrument level. Based on the Hornbostel-Sachs classification, such a hierarchical classification system is evaluated using the MedleyDB dataset, renowned for its diversity and richness concerning various instruments and music genres. This work presents various strategies to integrate hierarchical structures into models and tests a new class of models for hierarchical music prediction. This study showcases more reliable coarse-level instrument detection by bridging the gap between detailed instrument identification and group-level recognition, paving the way for further advancements in this domain. |
Various | Lakeside |
L-ISA Studio Demo (Sponsor demo): Introduction to L-Acoustics spatial audio workflows, with plenty of time for listening and Q&A. Booking required. | - | TB07 |
AR Demo: Headphone-based audio augmented reality demonstrator showcasing the effects of manipulated late reverberation in rendering virtual sound sources. The setup is based on a dataset of binaural room impulse responses measured along a 2 m long line, which is used to imitate the reproduction of a pair of loudspeakers. Listeners can therefore explore the virtual sources by moving back and forth and rotating arbitrarily on this line. The demo allows the user to adjust the late reverberation tail of the auralizations interactively, from shorter to longer decay times relative to the baseline decay behavior. Modification of the decay times is based on resynthesizing the late reverberation using frequency-dependent shaping of binaural white noise and modal reconstruction. Booking required. | - | LTC |
12:00 | Keynote 1: Conceptualizing the Chorus. Abstract: A few years ago I began playing with techniques involving making dozens or hundreds of copies of the same sound and treating each one slightly differently, creating a sort of infinite chorus of sources or effects. This work caused me to realize how many practices and assumptions in our field are still concerned with minimizing computational resources, originating from a time when computers could barely compute a single stream of audio that you could control interactively. Now that hundreds of audio channels are easily possible with today’s computers, we have the opportunity to explore conceptual models that treat these multiple channels as creative spaces for spontaneous interaction. In this talk I’ll share some initial experiments and findings in the space of the infinite chorus. | David Zicarelli (Cycling ’74) | 03MS01 |
13:00 | Lunch | - | Hillside |
14:30 | Oral Session 2: Virtual Analogue
Wave Digital Model of the MXR Phase 90 Based on a Time-Varying Resistor Approximation of JFET Elements (Riccardo Giampiccolo, Samuele Del Moro, Claudio Eutizi, Mattia Massimi, Oliviero Massi and Alberto Bernardini). Abstract: Virtual Analog (VA) modeling is the practice of digitally emulating analog audio gear. Over the past few years, with the purpose of recreating the alleged distinctive sound of audio equipment and musicians, many different guitar pedals have been emulated by means of the VA paradigm, but little attention has been given to phasers. Phasers process the spectrum of the input signal with time-varying notches by means of shifting stages typically realized with a network of transistors, whose nonlinear equations are, in general, computationally demanding to solve. In this paper, we take as a reference the famous MXR Phase 90 guitar pedal, and we propose an efficient time-varying model of its Junction Field-Effect Transistors (JFETs) based on a channel resistance approximation. We then employ such a model in the Wave Digital domain to emulate the guitar pedal in real time, obtaining an implementation characterized by low computational cost and good accuracy.
Hyper Recurrent Neural Network: Condition Mechanisms for Black-Box Audio Effect Modeling (Yen-Tung Yeh, Wen-Yi Hsiao and Yi-Hsuan Yang). Abstract: Recurrent neural networks (RNNs) have demonstrated impressive results for virtual analog modeling of audio effects. These networks process time-domain audio signals using a series of matrix multiplications and nonlinear activation functions to emulate the behavior of the target device accurately. To additionally model the effect of the knobs for an RNN-based model, existing approaches integrate control parameters by concatenating them channel-wise with some intermediate representation of the input signal. While this method is parameter-efficient, there is room to further improve the quality of generated audio because the concatenation-based conditioning method has limited capacity in modulating signals. In this paper, we propose three novel conditioning mechanisms for RNNs, tailored for black-box virtual analog modeling. These advanced conditioning mechanisms modulate the model based on control parameters, yielding superior results to existing RNN- and CNN-based architectures across various evaluation metrics.
Revisiting the Second-Order Accurate Non-Iterative Discretization Scheme (Martin Holters). Abstract: In the field of virtual analog modeling, a variety of methods have been proposed to systematically derive simulation models from circuit schematics. However, they typically rely on implicit numerical methods to transform the differential equations governing the circuit into difference equations suitable for simulation. For circuits with non-linear elements, this usually means that a non-linear equation has to be solved at run time at high computational cost. As an alternative to fully implicit numerical methods, a family of non-iterative discretization schemes has recently been proposed, allowing a significant reduction of the computational load. However, in the original presentation, several assumptions are made regarding the structure of the ODE, limiting the generality of these schemes. Here, we show that for the second-order accurate variant in particular, the method is applicable to general ODEs. Furthermore, we point out an interesting connection to the implicit midpoint method.
Digitizing the Schumann PLL Analog Harmonizer (Isaiah Farrell and Stefan Bilbao). Abstract: The Schumann Electronics PLL is a guitar effect that uses hardware-based processing of one-bit digital signals, with op-amp saturation and CMOS control systems used to generate multiple square waves derived from the frequency of the input signal. The effect may be simulated in the digital domain by cascading stages of state-space virtual analog modeling and algorithmic approximations of CMOS integrated circuits. Phase-locked loops, decade counters, and Schmitt trigger inverters are modeled using logic algorithms, allowing for a comparable digital implementation of the Schumann PLL. Simulation results are presented.
Wave Digital Modeling of Circuits with Multiple One-Port Nonlinearities Based on Lipschitz-Bounded Neural Networks (Oliviero Massi, Edoardo Manino and Alberto Bernardini). Abstract: Neural networks have found application within the Wave Digital Filters (WDFs) framework as data-driven input-output blocks for modeling single one-port or multi-port nonlinear devices in circuit systems. However, traditional neural networks lack predictable bounds on their output derivatives, essential to ensure convergence when simulating circuits with multiple nonlinear elements using fixed-point iterative methods, e.g., the Scattering Iterative Method (SIM). In this study, we address this issue by employing Lipschitz-bounded neural networks for regressing nonlinear WD scattering relations of one-port nonlinearities.
Graphic Equalizers Based on Limited Action Networks (Kurt Werner). Abstract: Several classic graphic equalizers, such as the Altec 9062A and the “Motown EQ,” have stepped gain controls and “proportional bandwidth,” and used passive, constant-resistance RLC circuit designs based on “limited-action networks.” These are related to bridged-T-network EQs, with several differences that cause important practical improvements, also affecting their sound. We study these networks, giving their circuit topologies, design principles, and design equations, which appear not to have been published before. We make a Wave Digital Filter which can model either device or an idealized “Exact” version, to which we can add various new extensions and features. |
Various | 03MS01 |
16:00 | Show & Tell (oral session 2 presenters) | - | Foyer |
Poster Session 2: Late Breaking Results
Synthesizer Sound Matching Using Audio Spectrogram Transformers (Fred Bruford, Frederik Blang and Shahan Nercessian). Abstract: Systems for synthesizer sound matching, which automatically set the parameters of a synthesizer to emulate an input sound, have the potential to make the process of synthesizer programming faster and easier for novice and experienced musicians alike, whilst also affording new means of interaction with synthesizers. Considering the enormous variety of synthesizers in the marketplace, and the complexity of many of them, general-purpose sound matching systems that function with minimal knowledge or prior assumptions about the underlying synthesis architecture are particularly desirable. With this in mind, we introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer. We demonstrate the viability of this model by training on a large synthetic dataset of randomly generated samples from the popular Massive synthesizer. We show that this model can reconstruct parameters of samples generated from a set of 16 parameters, highlighting its improved fidelity relative to multi-layer perceptron and convolutional neural network baselines. We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations, and sounds from other synthesizers and musical instruments.
Spectral Analysis of Stochastic Wavetable Synthesis (Nicholas Boyko and Elliot Canfield-Dafilou). Abstract: Dynamic Stochastic Wavetable Synthesis (DSWS) is a sound synthesis and processing technique that uses probabilistic waveform synthesis techniques invented by Iannis Xenakis as a modulation/distortion effect applied to a wavetable oscillator. The stochastic manipulation of the wavetable provides a means of creating signals with rich, dynamic spectra. In the present work, the DSWS technique is compared to other fundamental sound synthesis techniques such as frequency modulation synthesis. Additionally, several extensions of the DSWS technique are proposed. (A toy stochastic-wavetable sketch in Python follows this day's table.)
Leveraging Electric Guitar Tones and Effects to Improve Robustness in Guitar Tablature Transcription Modeling (Hegel Pedroza, Wallace Abreu, Ryan M. Corey and Iran R. Roman). Abstract: Guitar tablature transcription (GTT) aims at automatically generating symbolic representations from real solo guitar performances. Due to its applications in education and musicology, GTT has gained traction in recent years. However, GTT robustness has been limited due to the small size of available datasets. Researchers have recently used synthetic data that simulates guitar performances using pre-recorded or computer-generated tones, allowing for scalable and automatic data generation. The present study complements these efforts by demonstrating that GTT robustness can be improved by including synthetic training data created using recordings of real guitar tones played with different audio effects. We evaluate our approach on a new evaluation dataset with professional solo guitar performances that we composed and collected, featuring a wide array of tones, chords, and scales. |
Various | Lakeside |
IoSR Facilities Tour: Tour of the studios and audio labs at the University of Surrey. Booking required. | - | PATs Building |
AR Demo: Headphone-based audio augmented reality demonstrator showcasing the effects of manipulated late reverberation in rendering virtual sound sources. The setup is based on a dataset of binaural room impulse responses measured along a 2 m long line, which is used to imitate the reproduction of a pair of loudspeakers. Listeners can therefore explore the virtual sources by moving back and forth and rotating arbitrarily on this line. The demo allows the user to adjust the late reverberation tail of the auralizations interactively, from shorter to longer decay times relative to the baseline decay behavior. Modification of the decay times is based on resynthesizing the late reverberation using frequency-dependent shaping of binaural white noise and modal reconstruction. Booking required. | - | LTC |
17:30 | End of Sessions | - | - |
18:00 | Drinks Reception | - | Wates House |
19:00 | Performance | - | Ivy Centre |
19:30 | Reception | - | Wates House |
22:00 | End of Day | - | - |
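As referenced in the Oral Session 1 listing above, here is a small illustrative sketch (not the authors' implementation) of cubic Lagrange interpolation of a delay line, the kind of fractional-delay read discussed in the sample-rate-independent RNN paper when a delay length designed at one sample rate becomes non-integer at another. The function name and all constants are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's implementation): reading a signal
# at a non-integer delay using third-order (cubic) Lagrange interpolation over
# the four surrounding samples.
import numpy as np

def read_fractional(x, pos):
    """Value of x at non-integer index `pos`, via cubic Lagrange interpolation."""
    i = int(np.floor(pos)) - 1            # first of the four taps used
    d = pos - i                           # interpolation point, in [1, 2) for interior reads
    taps = x[i:i + 4]
    h = np.ones(4)
    for k in range(4):
        for m in range(4):
            if m != k:
                h[k] *= (d - m) / (k - m)
    return float(np.dot(h, taps))

# Example: a one-sample delay designed at 44.1 kHz becomes a fractional delay at 48 kHz.
fs_train, fs_run = 44100, 48000
delay = 1.0 * fs_run / fs_train           # about 1.088 samples at the new rate
x = np.sin(2 * np.pi * 440 * np.arange(256) / fs_run)
y = np.array([read_fractional(x, n - delay) if n - delay >= 1.0 else 0.0
              for n in range(len(x))])
```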
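And, as referenced in the Poster Session 2 listing, a toy sketch of the general idea behind stochastic wavetable synthesis: a wavetable oscillator whose table values take a small, bounded random walk between cycles, producing a continuously evolving spectrum. This is not the authors' DSWS algorithm; the function name and every constant here are arbitrary assumptions for illustration.

```python
# Toy sketch only (not the authors' DSWS algorithm): a wavetable oscillator
# whose table values take a small, clipped random walk after every cycle,
# giving a slowly evolving, noisy spectrum.
import numpy as np

def stochastic_wavetable(f0=110.0, fs=48000, duration=2.0, table_size=256, step=0.01):
    table = np.sin(2 * np.pi * np.arange(table_size) / table_size)  # start from a sine
    phase, out = 0.0, []
    inc = f0 * table_size / fs                                      # phase increment per sample
    for _ in range(int(duration * fs)):
        i = int(phase)
        frac = phase - i
        nxt = (i + 1) % table_size
        out.append((1.0 - frac) * table[i] + frac * table[nxt])     # linear table lookup
        phase += inc
        if phase >= table_size:                                     # once per cycle:
            phase -= table_size
            table = np.clip(table + np.random.uniform(-step, step, table_size), -1.0, 1.0)
    return np.array(out)

y = stochastic_wavetable()
```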
Time | Session | Speakers | Venue |
---|---|---|---|
8:30 | Registration (all day) | - | Foyer |
9:00 | Oral Session 3: Instrument Modelling and Effects I
Searching for Music Mixing Graphs: A Pruning Approach (Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee and Yuki Mitsufuji). Abstract: Music mixing is compositional: experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available processors to every chain. Then, after the initial console parameter optimization, we alternate between removing redundant processors and fine-tuning. We achieve this through a differentiable implementation of both the processors and the pruning. Consequently, we find a sparse mixing graph that achieves nearly identical matching quality to the full mixing console. We apply this procedure to dry-mix pairs from various datasets and collect graphs that can also be used to train neural networks for music mixing applications.
Quadratic Spline Approximation of the Contact Potential for Real-Time Simulation of Lumped Collisions in Musical Instruments (Abhiram Bhanuprakash, Maarten Van Walstijn and Vasileios Chatziioannou). Abstract: Collisions are an integral part of the sound production mechanism in a wide variety of musical instruments. In physics-based real-time simulation of such nonlinear phenomena, challenges centred around efficient and accurate root-finding arise. Nonlinearly implicit schemes are normally ill-suited for real-time simulation as they rely on iterative solvers for root-finding. Explicit schemes overcome this issue at the cost of a slightly larger error for a given sample rate. In this paper, for the case of lumped collisions, an alternative approach is proposed by approximating the contact potential curve. The approximation is described, and is shown to lead to a non-iterative update for an energy-stable nonlinearly implicit scheme. The method is first tested on single mass-barrier collision simulations, and then employed in conjunction with a modal string model to simulate hammer-string and slide-string interaction. Results are discussed in comparison with existing approaches, and real-time feasibility is demonstrated.
Real-Time Guitar Synthesis (Stefan Bilbao, Riccardo Russo, Craig Webb and Michele Ducceschi). Abstract: The synthesis of guitar tones was one of the first uses of physical modeling synthesis, and many approaches (notably digital waveguides) have been employed. The dynamics of the string under playing conditions is complex, and includes nonlinearities, both inherent to the string itself and due to various collisions with the fretboard, frets and a stopping finger. All lead to important perceptual effects, including pitch glides, rattling against frets, and the ability to play on the harmonics. Numerical simulation of these simultaneous strong nonlinearities is challenging, but recent advances in algorithm design due to invariant energy quadratisation and scalar auxiliary variable methods allow for very efficient and provably numerically stable simulation. A new design is presented here that does not employ costly iterative methods such as the Newton-Raphson method, and for which the required linear system solutions are small. As such, this method is suitable for real-time implementation. Simulation and timing results are presented.
Guitar Tone Stack Modeling with a Neural State-Space Filter (Tantep Sinjanakhom, Eero-Pekka Damskägg, Stylianos Mimilakis, Athanasios Gotsopoulos and Vesa Välimäki). Abstract: In this work, we present a data-driven approach to modeling tone stack circuits in guitar amplifiers and distortion pedals. To this aim, the proposed modeling approach uses a feedforward fully connected neural network to predict the parameters of a coupled-form state-space filter, ensuring the numerical stability of the resulting time-varying system. The neural network is conditioned on the tone controls of the target tone stack and is optimized jointly with the coupled-form state-space filter to match the target frequency response. To assess the proposed approach, we model three popular tone stack schematics with both matched-order and over-parameterized filters and conduct an objective comparison with well-established approaches that use cascaded biquad filters. Results from the conducted experiments demonstrate improved accuracy of the proposed modeling approach, especially in the case of over-parameterized state-space filters, while guaranteeing numerical stability. Our method can be deployed, after training, in real-time audio processors.
Distortion Recovery: A Two-Stage Method for Guitar Effect Removal (Ying-Shuo Lee, Yueh-Po Peng, Jui-Te Wu, Ming Cheng, Li Su and Yi-Hsuan Yang). Abstract: Removing audio effects from electric guitar recordings makes post-production and sound editing easier. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to accurately capture the complexities seen in real-world recordings. In this paper, we tackle the task by using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery. The idea is to process the audio signal in the Mel-spectrogram domain in the first stage, and then use a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram in the second stage. We report a set of experiments demonstrating the effectiveness of our approach over existing methods, through both subjective and objective evaluation metrics.
Sound Matching Using Synthesizer Ensembles (Gerard Roma). Abstract: Sound matching allows users to automatically approximate existing sounds using a synthesizer. Previous work has mostly focused on algorithms for automatically programming an existing synthesizer. This paper proposes a system for selecting between different synthesizer designs, each one with a corresponding automatic programmer. An implementation that allows designing ensembles based on a template is demonstrated. Several experiments are presented using a simple subtractive synthesis design. Using an ensemble of synthesizer-programmer pairs is shown to provide better matching than a single programmer trained for an equivalent integrated synthesizer. Scaling to hundreds of synthesizers is shown to improve match quality. |
Various | 03MS01 |
10:30 | Show & Tell (oral session 3 presenters) | - | Foyer |
Poster Session 3: Instrument Modelling and Effects II
Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data (Yu-Hua Chen, Woosung Choi, Wei-Hsiang Liao, Marco Martínez-Ramírez, Kin Wai Cheuk, Yuki Mitsufuji, Jyh-Shing Roger Jang and Yi-Hsuan Yang). Abstract: Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based framework. This paper extends their work by using more advanced discriminators in the GAN, and by using more unpaired data for training. Specifically, drawing inspiration from recent advancements in neural vocoders, we employ two sets of discriminators in our GAN-based model for guitar amplifier modeling, one based on the multi-scale discriminator (MSD) and the other on the multi-period discriminator (MPD). Moreover, we experiment with adding unprocessed audio signals that do not have corresponding rendered audio of a target tone to the training data, to see how much the GAN model benefits from the unpaired data. Our experiments show that the proposed two extensions contribute to the modeling of both low-gain and high-gain guitar amplifiers.
Towards Efficient Modelling of String Dynamics: A Comparison of State Space and Koopman Based Deep Learning Methods (Rodrigo Diaz, Carlos De La Vega Martin and Mark Sandler). Abstract: This paper presents an examination of State Space Models (SSM) and Koopman-based deep learning methods for modelling the dynamics of both linear and non-linear stiff strings. Through experiments with datasets generated under different initial conditions and sample rates, we assess the capacity of these models to accurately model the complex behaviours observed in string dynamics. Our findings indicate that our proposed Koopman-based model performs as well as or better than other existing approaches in non-linear cases for long-sequence modelling. We inform the design of these architectures with the structure of the problems at hand. Although challenges remain in extending model predictions beyond the training horizon (i.e., extrapolation), the focus of our investigation lies in the models’ ability to generalise across different initial conditions within the training time interval. This research contributes insights into the physical modelling of dynamical systems (in particular those addressing musical acoustics) by offering a comparative overview of these and previous methods and introducing innovative strategies for model improvement. Our results highlight the efficacy of these models in simulating non-linear dynamics and emphasise their wide-ranging applicability in accurately modelling dynamical systems over extended sequences.
DDSP-Based Neural Waveform Synthesis of Polyphonic Guitar Performance From String-Wise MIDI Input (Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm and Junichi Yamagishi). Abstract: We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating the control feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input, performs the best out of the four proposed systems. Audio examples and code are available.
DDSP-SFX: Acoustically-Guided Sound Effects Generation with Differentiable Digital Signal Processing (Yunyi Liu, Craig Jin and David Gunawan). Abstract: Controlling the variations of sound effects using neural audio synthesis models has been a challenging task. Differentiable digital signal processing (DDSP) provides a lightweight solution that achieves high-quality sound synthesis while enabling deterministic acoustic attribute control by incorporating pre-processed audio features and digital synthesizers. In this research, we introduce DDSP-SFX, a model based on the DDSP architecture capable of synthesizing high-quality sound effects while enabling users to control the timbre variations easily. We integrate a transient modelling algorithm in DDSP that achieves higher objective evaluation scores and subjective ratings over impulsive signals (footsteps, gunshots). We propose a novel method that achieves frame-level timbre variation control while also allowing deterministic attribute control. We further qualitatively show the timbre transfer performance using voice as the guiding sound.
On Vibrato and Frequency (De)Modulation in Musical Sounds (Jeremy Hyrkas and Tamara Smyth). Abstract: Vibrato is an important characteristic in human musical performance and is often uniquely characteristic of a player and/or a particular instrument. This work is motivated by the assumption (often made in the source separation literature) that vibrato aids in the identification of multiple sound sources playing in unison. It follows that its removal, the focus herein, may contribute to a more blended combination. In signals, vibrato is often modeled as an oscillatory deviation from a center pitch/frequency that presents in the sound as phase/frequency modulation. While vibrato implementation using a time-varying delay line is well known, using a delay line for its removal is less so. In this work we focus on (de)modulation of vibrato in a signal by first showing the relationship between modulation and corresponding demodulation delay functions, and then suggest a solution for increased vibrato removal in the latter by ensuring sideband attenuation below the threshold of audibility. Two known methods for estimating the instantaneous frequency/phase are used to construct delay functions from both contrived and musical examples so that vibrato removal may be evaluated. |
Various | Lakeside |
L-ISA Studio Demo (Sponsor demo): Introduction to L-Acoustics spatial audio workflows, with plenty of time for listening and Q&A. Booking required. | - | TB07 |
AR Demo: Headphone-based audio augmented reality demonstrator showcasing the effects of manipulated late reverberation in rendering virtual sound sources. The setup is based on a dataset of binaural room impulse responses measured along a 2 m long line, which is used to imitate the reproduction of a pair of loudspeakers. Listeners can therefore explore the virtual sources by moving back and forth and rotating arbitrarily on this line. The demo allows the user to adjust the late reverberation tail of the auralizations interactively, from shorter to longer decay times relative to the baseline decay behavior. Modification of the decay times is based on resynthesizing the late reverberation using frequency-dependent shaping of binaural white noise and modal reconstruction. Booking required. | - | LTC |
12:00 | Keynote 2: Understanding Machine Learning as a Tool for Supporting Human Creators in Music and Beyond. Abstract: When technical researchers, creators, and the general public discuss the future of AI in music and art, the focus is usually on a few types of questions, including: How can we make content generation and processing algorithms better and faster? Will contemporary AI systems put human creators out of a job? Are algorithms really capable of being “creative”? In this talk, I propose that we should be asking a different set of questions, beginning with the question of how we can use machine learning to better support fundamentally human creative activities in music and art. I’ll show examples from my research of how prioritising human creators (professionals, amateurs, and students) can lead to a new understanding of what machine learning is good for, and who can benefit from it. For instance, machine learning can aid human creators engaged in rapid prototyping of new interactions with sound and media. Machine learning can support greater embodied engagement in design, and it can enable more people to participate in the creation and customisation of new technologies. Furthermore, machine learning is leading to new types of human creative practices with computationally-infused mediums, in which a broad range of people can act not only as designers and implementors, but also as explorers, curators, and co-creators. | Rebecca Fiebrink (University of the Arts London) | 03MS01 |
13:00 | Lunch | - | Hillside |
14:30 | Oral Session 4: Reverberation I
RIR2FDN: An Improved Room Impulse Response Analysis and Synthesis (Gloria Dal Santo, Benoit Alary, Karolina Prawda, Sebastian Schlecht and Vesa Välimäki). Abstract: This paper seeks to improve the state of the art in delay-network-based analysis-synthesis of measured room impulse responses (RIRs). We propose an informed method incorporating improved energy decay estimation and synthesis with an optimized feedback delay network. The performance of the presented method is compared against an end-to-end deep-learning approach. A formal listening test was conducted where participants assessed the similarity of reverberated material across seven distinct RIRs and three different sound sources. The results reveal that the performance of these methods is influenced by both the excitation sounds and the reverberation conditions. Nonetheless, the proposed method consistently demonstrates higher similarity ratings compared to the end-to-end approach across most conditions. However, achieving an indistinguishable synthesis of measured RIRs remains a persistent challenge, underscoring the complexity of this problem. Overall, this work helps improve the sound quality of analysis-based artificial reverberation.
Modeling the Frequency-Dependent Sound Energy Decay of Acoustic Environments with Differentiable Feedback Delay Networks (Alessandro Ilic Mezza, Riccardo Giampiccolo and Alberto Bernardini). Abstract: Differentiable machine learning techniques have recently proved effective for finding the parameters of Feedback Delay Networks (FDNs) so that their output matches desired perceptual qualities of target room impulse responses. However, we show that existing methods tend to fail at modeling the frequency-dependent behavior of sound energy decay that characterizes real-world environments unless properly trained. In this paper, we introduce a novel perceptual loss function based on the mel-scale energy decay relief, which generalizes the well-known time-domain energy decay curve to multiple frequency bands. We also augment the prototype FDN by incorporating differentiable wideband attenuation and output filters, and train them via backpropagation along with the other model parameters. The proposed approach improves upon existing strategies for designing and training differentiable FDNs, making it more suitable for audio processing applications where realistic and controllable artificial reverberation is desirable, such as gaming, music production, and virtual reality.
Binaural Dark-Velvet-Noise Reverberator (Jon Fagerström, Nils Meyer-Kahlen, Sebastian J. Schlecht and Vesa Välimäki). Abstract: Binaural late-reverberation modeling necessitates the synthesis of frequency-dependent inter-aural coherence, a crucial aspect of spatial auditory perception. Prior studies have explored methodologies such as filtering and cross-mixing two incoherent late reverberation impulse responses to emulate the coherence observed in measured binaural late reverberation. In this study, we introduce two variants of the binaural dark-velvet-noise reverberator. The first one uses cross-mixing of two incoherent dark-velvet-noise sequences that can be generated efficiently. The second variant is a novel time-domain jitter-based approach. The methods’ accuracies are assessed through objective and subjective evaluations, revealing that both methods yield comparable performance and clear improvements over using incoherent sequences. Moreover, the advantages of the jitter-based approach over cross-mixing are highlighted by introducing a parametric width control, based on the jitter-distribution width, into the binaural dark-velvet-noise reverberator. The jitter-based approach can also introduce time-dependent coherence modifications without additional computational cost.
Differentiable Active Acoustics - Optimizing Stability via Gradient Descent (Gian Marco De Bortoli, Gloria Dal Santo, Karolina Prawda, Tapio Lokki, Vesa Välimäki and Sebastian J. Schlecht). Abstract: Active acoustics (AA) refers to an electroacoustic system that actively modifies the acoustics of a room. For common use cases, the number of transducers (loudspeakers and microphones) involved in the system is large, resulting in a large number of system parameters. To optimally blend the response of the system into the natural acoustics of the room, the parameters require careful tuning, which is a time-consuming process performed by an expert. In this paper, we present a differentiable AA framework, which allows multi-objective optimization without impairing architecture flexibility. The system is implemented in PyTorch to be easily translated into a machine-learning pipeline, thus automating the tuning process. The objective of the pipeline is to optimize the digital signal processor (DSP) component to evenly distribute the energy in the feedback loop across frequencies. We investigate the effectiveness of DSPs composed of finite impulse response filters, which are unconstrained during the optimization. We study the effect of multiple filter orders, numbers of transducers, and loss functions on the performance. Different loss functions behave similarly for systems with few transducers and low-order filters. Increasing the number of transducers and the order of the filters improves results and accentuates the difference in the performance of the loss functions.
Naturalness of Double-Slope Decay in Generalised Active Acoustic Enhancement Systems (Will Cassidy, Phil Coleman, Russell Mason and Enzo De Sena). Abstract: Active acoustic enhancement systems (AAESs) alter the perceived acoustics of a space by using microphones and loudspeakers to introduce sound energy into the room. Double-sloped energy decay may be observed in these systems. However, it is unclear which conditions lead to this effect, and to what extent double sloping reduces the perceived naturalness of the reverberation compared to Sabine decay. This paper uses simulated combinations of AAES parameters to identify which cases affect the objective curvature of the energy decay. A subjective test with trained listeners assessed the naturalness of these conditions. Using an AAES model, room impulse responses were generated for varying room dimensions, absorption coefficients, channel counts, system loop gains and reverberation times (RTs) of the artificial reverberator. The objective double sloping was strongly correlated with the ratio between the reverberator and passive room RTs, but parameters such as absorption and room size did not have a profound effect on curvature. It was found that double sloping significantly reduced the perceived naturalness of the reverberation, especially when the reverberator RT was greater than twice that of the passive room. Double sloping had more effect on the naturalness ratings when subjects listened to a more absorptive passive room, and also when using speech rather than transient stimuli. Lowering the loop gain by 9 dB increased the naturalness of the double-sloped stimuli, where some were rated as significantly more natural than the Sabine decay stimuli from the passive room.
A Common-Slopes Late Reverberation Model Based on Acoustic Radiance Transfer (Matteo Scerbo, Sebastian J. Schlecht, Randall Ali, Lauri Savioja and Enzo De Sena). Abstract: In rooms with complex geometry and uneven distribution of energy losses, late reverberation depends on the positions of sound sources and listeners. More precisely, the decay of energy is characterised by a sum of exponential curves with position-dependent amplitudes and position-independent decay rates (hence the name common slopes). The amplitude of the different energy decay components is a particularly important perceptual aspect that requires efficient modeling in applications such as virtual reality and video games. Acoustic Radiance Transfer (ART) is a room acoustics model focused on late reverberation, which uses a pre-computed acoustic transfer matrix based on the room geometry and materials, and allows interactive changes to source and listener positions. In this work, we present an efficient common-slopes approximation of the ART model. Our technique extracts common slopes from ART using modal decomposition, retaining only the non-oscillating energy modes. Leveraging the structure of ART, changes to the positions of sound sources and listeners require only minimal processing. Experimental results show that even very few slopes are sufficient to capture the positional dependency of late reverberation, reducing model complexity substantially. |
Various | 03MS01 |
16:00 | Show & Tell (oral session 4 presenters) | - | Foyer |
Poster Session 4: Reverberation II Differentiable MIMO Feedback Delay Networks for Multichannel Room Impulse Response Modeling Differentiable MIMO Feedback Delay Networks for Multichannel Room Impulse Response Modeling Riccardo Giampiccolo, Alessandro Ilic Mezza and Alberto Bernardini Abstract: Recently, with the advent of new high-performance headsets and goggles, the demand for Virtual and Augmented Reality applications has experienced a steep increase. In order to coherently navigate the virtual rooms, the acoustics of the scene must be emulated in the most accurate and efficient way possible. Amongst others, Feedback Delay Networks (FDNs) have proved to be valuable tools for tackling such a task. In this article, we expand and adapt a method recently proposed for the data-driven optimization of single-input-single-output FDNs to the multiple-input-multiple-output (MIMO) case for addressing spatial/space-time processing applications. By testing our methodology on items taken from two different datasets, we show that the parameters of MIMO FDNs can be jointly optimized to match some perceptual characteristics of given multichannel room impulse responses, outperforming approaches available in the literature, and paving the way toward increasingly efficient and accurate real-time virtual room acoustics rendering. A Highly Parametrized Scattering Delay Network Implementation for Interactive Room Auralization A Highly Parametrized Scattering Delay Network Implementation for Interactive Room Auralization Marco Fontana, Giorgio Presti, Davide Fantini, Federico Avanzini and Arcadio Reyes-Lecuona Abstract: Scattering Delay Networks (SDNs) are an interesting approach to artificial reverberation, with parameters tied to the room’s physical properties and the computational efficiency of delay networks. This paper presents a highly parametrized, real-time SDN plugin. The SDN plugin allows for interactive room auralization, enabling users to modify the parameters affecting the reverberation in real time. These parameters include source and receiver positions, room shape and size, and wall absorption properties. This makes our plugin suitable for applications that require real-time and interactive spatial audio rendering, such as virtual or augmented reality frameworks and video games. Additionally, the main contributions of this work include a filter design method for wall sound absorption, as well as plugin features such as air absorption modeling, various output formats (mono, stereo, binaural, and first to fifth order Ambisonics), open sound control (OSC) for controlling source and receiver parameters, and a graphical user interface (GUI). Evaluation tests showed that the reverberation time and the filter design approach are consistent with both theoretical references and real-world measurements. Finally, performance analysis indicated that the SDN plugin requires minimal computational resources. Equalizing Loudspeakers in Reverberant Environments Using Deep Convolutive Dereverberation Equalizing Loudspeakers in Reverberant Environments Using Deep Convolutive Dereverberation Silvio Osimi, Leonardo Gabrielli, Samuele Cornell and Stefano Squartini Abstract: Loudspeaker equalization is an established topic in the literature, and currently many techniques are available to address most practical use cases. However, most of these rely on accurate measurements of the loudspeaker in an anechoic environment, which in some cases is not feasible. This is the case, for example, 
of custom digital organs, which have a set of loudspeakers that are built into a large and geometrically complex piece of furniture, which may be too heavy and large to be transported to a measurement room, or may require an unusually large one, making traditional impulse response measurements impractical for most users. In this work we propose a method to find the inverse of the sound emission system in a reverberant environment, based on a Deep Learning dereverberation algorithm. The method is agnostic of the room characteristics and can thus be conducted in an automated fashion in any environment. A real use case is discussed and results are provided, showing the effectiveness of the approach in designing filters that closely match the magnitude response of the ideal inverting filters. Evaluating Neural Networks Architectures for Spring Reverb Modelling Evaluating Neural Networks Architectures for Spring Reverb Modelling Francesco Papaleo, Xavier Lizarraga-Seijas and Frederic Font Abstract: Reverberation is a key element in spatial audio perception, historically achieved with the use of analogue devices, such as plate and spring reverb, and in the last decades with digital signal processing techniques that have allowed different approaches for Virtual Analogue Modelling (VAM). The electromechanical functioning of the spring reverb makes it a nonlinear system that is difficult to fully emulate in the digital domain with white-box modelling techniques. In this study, we compare five different neural network architectures, including convolutional and recurrent models, to assess their effectiveness in replicating the characteristics of this audio effect. The evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz. This paper specifically focuses on neural audio architectures that offer parametric control, aiming to advance the boundaries of current black-box modelling techniques in the domain of spring reverberation. Modified Late Reverberation in an Audio Augmented Reality Scenario Modified Late Reverberation in an Audio Augmented Reality Scenario Christian Schneiderwind and Annika Neidhardt Abstract: This paper presents a headphone-based audio augmented reality demonstrator showcasing the effects of manipulated late reverberation in rendering virtual sound sources. The setup is based on a dataset of binaural room impulse responses measured along a 2 m long line, which is used to imitate the reproduction of a pair of loudspeakers. In this way, listeners can explore the virtual sources by moving back and forth and rotating arbitrarily on this line. The demo allows the user to adjust the late reverberation tail of the auralizations interactively from shorter to longer decay times relative to the baseline decay behavior. Modification of the decay times is based on resynthesizing the late reverberation using frequency-dependent shaping of binaural white noise and modal reconstruction. The paper includes descriptions of the frameworks used for this demo and an overview of the required data and processing steps. |
Various | Lakeside | |
IoSR Facilities Tour
IoSR Facilities Tour
Tour of the studios and audio labs at the University of Surrey. Book here |
- | PATs Building | |
17:30 | End of Sessions | - | - |
18:00 | Banquet | - | Loseley Park (off campus) |
22:00 | End of Day | - | - |
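The differentiable active acoustics paper above describes tuning the DSP by gradient descent so that the energy in the feedback loop is spread evenly across frequencies. The snippet below is a minimal, illustrative sketch of that idea only: it optimises a single unconstrained FIR filter against one fixed, randomly generated loop response. The filter order, FFT size, flatness loss and `room_mag` placeholder are assumptions chosen for illustration; this is not the authors' multichannel system or code.

```python
# Illustrative sketch only: gradient-descent tuning of an unconstrained FIR
# "DSP" filter so that the loop magnitude response becomes flat (evenly
# distributed energy across frequencies). The fixed `room_mag` stands in
# for the room/transducer path of a real active acoustics system.
import torch

torch.manual_seed(0)

n_fft = 1024
fir_order = 64

# Hypothetical fixed loop response the DSP must flatten (assumption).
room_mag = 1.0 + 0.5 * torch.rand(n_fft // 2 + 1)

# Trainable FIR coefficients, started near a unit impulse.
fir = torch.nn.Parameter(torch.cat([torch.ones(1), torch.zeros(fir_order - 1)]))
optimiser = torch.optim.Adam([fir], lr=1e-2)

for step in range(2000):
    optimiser.zero_grad()
    fir_mag = torch.abs(torch.fft.rfft(fir, n=n_fft))  # FIR magnitude response
    loop_mag = fir_mag * room_mag                       # open-loop magnitude
    # Flatness loss: penalise deviation of the loop energy from its mean.
    loss = torch.mean((loop_mag - loop_mag.mean()) ** 2)
    loss.backward()
    optimiser.step()

print(f"final flatness loss: {loss.item():.6f}")
```

In the paper, the same principle is scaled to many microphone-loudspeaker paths and compared across several loss functions; this sketch collapses everything to a single loop purely to show the optimisation mechanics.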
Time | Session | Speakers | Venue |
---|---|---|---|
8:30 | Registration (all day) | - | Foyer |
9:00 | Oral Session 5: Audio Signal Processing I A Real-Time Approach for Estimating Pulse Tracking Parameters for Beat-Synchronous Audio Effects A Real-Time Approach for Estimating Pulse Tracking Parameters for Beat-Synchronous Audio Effects Peter Meier, Simon Schwär and Meinard Müller Abstract: Predominant Local Pulse (PLP) estimation, an established method for extracting beat positions and other periodic pulse information from audio signals, has recently been extended with an online variant tailored for real-time applications. In this paper, we introduce a novel approach to generating various real-time control signals from the original online PLP output. While the PLP activation function encodes both predominant pulse information and pulse stability, we propose several normalization procedures to discern local pulse oscillation from stability, utilizing the PLP activation envelope. Through this, we generate pulse-synchronous Low Frequency Oscillators (LFOs) and supplementary confidence-based control signals, enabling dynamic control over audio effect parameters in real-time. Additionally, our approach enables beat position prediction, providing a look-ahead capability, for example, to compensate for system latency. To showcase the effectiveness of our control signals, we introduce an audio plugin prototype designed for integration within a Digital Audio Workstation (DAW), facilitating real-time applications of beat-synchronous effects during live mixing and performances. Moreover, this plugin serves as an educational tool, providing insights into PLP principles and the tempo structure of analyzed music signals. Parameter Estimation of Frequency-Modulated Sinusoids with the Distribution Derivative Method Parameter Estimation of Frequency-Modulated Sinusoids with the Distribution Derivative Method Marcelo Caetano Abstract: Frequency-modulated (FM) sinusoids are commonly used to model signals in several engineering applications, such as radar, sonar, communications, acoustics, and optics. The estimation of the parameters of FM sinusoids is a challenging problem with a long history in the literature. In this article, we use the distribution derivative method (DDM) to estimate the parameters of FM sinusoids in additive white Gaussian noise. Firstly, we derive the estimation of parameters of the model with DDM. Then, we compare the results of Monte-Carlo simulations (MCS) of DDM estimation of FM signals in additive white Gaussian noise against the state of the art (SOTA) and the Cramér-Rao lower bound (CRLB). DDM estimation of FM sinusoids showed performance comparable to the SOTA with less estimation bias. Additionally, DDM estimation of FM sinusoids is simple and straightforward to implement with the fast Fourier transform (FFT) relative to other approaches in the literature. Finally, DDM estimation has effectively the same computational complexity as the FFT. Topology-Preserving Deformations of Digital Audio Topology-Preserving Deformations of Digital Audio Georg Essl Abstract: Topology provides global invariants for data as well as spaces of deformation. In this paper we discuss the deformations of audio signals which preserve topological information specified by sublevel set persistent homology. It is well known that the topological information only changes at extrema. We introduce box snakes as a data structure that captures permissible editing and deformation of signals and preserves the extremal properties of the signal while allowing for monotone deformations between them. 
The resulting algorithm works on any ordered discrete data and hence can be applied to time- and frequency-domain finite-length audio signals. Characterisation and Excursion Modelling of Audio Haptic Transducers Characterisation and Excursion Modelling of Audio Haptic Transducers Stephen Oxnard, Ethan Stanhope, Laurence J. Hobden and Mahmoud Masri Abstract: Statement and calculation of objective audio haptic transducer performance metrics facilitate optimisation of multi-sensory sound reproduction systems. Measurements of existing haptic transducers are applied to the calculation of a series of performance metrics to demonstrate a means of comparative objective analysis. The frequency response, transient response and moving mass excursion characteristics of each measured transducer are quantified using novel and previously defined metrics. Objective data drawn from a series of practical measurements shows that the proposed metrics and means of excursion modelling applied herein are appropriate for haptic transducer evaluation and protection against over-excursion, respectively. Differentiable All-Pole Filters for Time-Varying Audio Systems Differentiable All-Pole Filters for Time-Varying Audio Systems Chin-Yun Yu, Christopher Mitcheltree, Alistair Carson, Stefan Bilbao, Joshua Reiss and György Fazekas Abstract: Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, time-varying subtractive synthesiser, and feed-forward compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin. Band-Limited Impulse Invariance Method Using Lagrange Kernels Band-Limited Impulse Invariance Method Using Lagrange Kernels Nara Hahn, Frank Schultz and Sascha Spors Abstract: The band-limited impulse invariance method is a recently proposed approach for the discrete-time modeling of an LTI continuous-time system. Both the magnitude and phase responses are accurately modeled by means of discrete-time filters. It is an extension of the conventional impulse invariance method, which is based on the time-domain sampling of the continuous-time response. The resulting IIR filter typically exhibits spectral aliasing artifacts. In the band-limited impulse invariance method, an FIR filter is combined in parallel with the IIR filter, in such a way that the frequency response of the FIR part reduces the aliasing contributions. This method was shown to improve the frequency-domain accuracy while maintaining the compact temporal structure of the discrete-time model. 
In this paper, a new version of the band-limited impulse invariance method is introduced, where the FIR coefficients are derived in closed form by examining the discontinuities that occur in the continuous-time domain. Analytical anti-aliasing filtering is performed by replacing the discontinuities with band-limited transients. The band-limited discontinuities are designed using the anti-derivatives of the Lagrange interpolation kernel. The proposed method is demonstrated by a wave scattering example, where the acoustical impulse responses on a rigid spherical scatterer are simulated. |
Various | 03MS01 |
10:30 | Show & Tell (oral session 5 presenters) | - | Foyer |
Poster Session 5: Audio Signal Processing II Interpolation Filters for Antiderivative Antialiasing Interpolation Filters for Antiderivative Antialiasing Victor Zheleznov and Stefan Bilbao Abstract: Aliasing is an inherent problem in nonlinear digital audio processing which results in undesirable audible artefacts. Antiderivative antialiasing has proved to be an effective approach to mitigate aliasing distortion, and is based on continuous-time convolution of a linearly interpolated distorted signal with antialiasing filter kernels. However, the performance of this method is determined by the properties of the interpolation filter. In this work, cubic interpolation kernels for antiderivative antialiasing are considered. For memoryless nonlinearities, aliasing reduction is improved by employing cubic interpolation. For stateful systems, numerical simulation and stability analysis with respect to different interpolation kernels remain in favour of linear interpolation. (A sketch of the baseline first-order method follows this table.) PIPES: A Networked Rapid Development Protocol for Sound Applications PIPES: A Networked Rapid Development Protocol for Sound Applications Paolo Marrone, Stefano D'Angelo and Federico Fontana Abstract: The development of audio Digital Signal Processing (DSP) algorithms typically requires iterative design, analysis, and testing, possibly on different target platforms, and often calls for resets or restarts of execution environments between iterations. Manually performing deployment, setup, and output data collection can quickly become intolerably time-consuming. Therefore, we propose a new, experimental, open-ended, and automatable protocol to separate the coding, building, and deployment tasks onto different network nodes. The proposed protocol is mostly based on widespread technology and designed to be easy to implement and integrate with existing software infrastructure. Its flexibility has been validated through a proof-of-concept implementation. Despite still being in its infancy, it already shows potential for enabling faster and more comfortable development workflows. Audio Visualization via Delay Embedding and Subspace Learning Audio Visualization via Delay Embedding and Subspace Learning Alois Cerbu and Carmine Cella Abstract: We describe a sequence of methods for producing videos from audio signals. Our visualizations capture perceptual features like harmonicity and brightness: they produce stable images from periodic sounds and slowly-evolving images from inharmonic ones; they associate jagged shapes to brighter sounds and rounded shapes to darker ones. We interpret our methods as adaptive FIR filterbanks and show how, for larger values of the complexity parameters, we can perform accurate frequency detection without the Fourier transform. Attached to the paper is a code repository containing the Jupyter notebook used to generate the images and videos cited. We also provide code for a real-time C++ implementation of the simplest visualization method. We discuss the mathematical theory of our methods in the two appendices. Real-Time System for Sound Enhancement in Noisy Environment Real-Time System for Sound Enhancement in Noisy Environment Stefania Cecchi, Valeria Bruschi, Paolo Peretti and Ferruccio Bettarelli Abstract: Noise can affect the listening experience in many real-life situations involving loudspeakers as the playback device. A solution to reduce the effect of the noise is to employ headphones, but they can be annoying and are not allowed on some occasions. 
In this context, a system for improving the audio perception and the intelligibility of sounds in a domestic noisy environment is introduced and a real-time implementation is proposed. The system comprises three main blocks: a noise estimation procedure based on an adaptive algorithm, an auditory spectral masking algorithm that estimates the music threshold capable of masking the noise source, and an FFT equalizer that is used to apply the estimated level. It has been developed on an embedded DSP board considering one microphone for the ambient noise analysis and two vibrating sound transducers for sound reproduction. Several experiments on simulated and real-world scenarios have been carried out to demonstrate the effectiveness of the proposed approach. Hybrid Audio Inpainting Approach with Structured Sparse Decomposition and Sinusoidal Modeling Hybrid Audio Inpainting Approach with Structured Sparse Decomposition and Sinusoidal Modeling Eto Sun and Philippe Depalle Abstract: This research presents a novel hybrid audio inpainting approach that considers the diversity of signals and enhances the reconstruction quality. Existing inpainting approaches have limitations, such as energy drop and poor reconstruction quality for non-stationary signals. Based on the fact that an audio signal can be considered a mixture of three components (tonal, transient, and noise), the proposed approach divides the left and right reliable neighborhoods around the gap into these components using a structured sparse decomposition technique. The gap is reconstructed by extrapolating parameters estimated from the reliable neighborhoods of each component. Component-targeted methods are refined and employed to extrapolate the parameters based on their own acoustic characteristics. Experiments were conducted to evaluate the performance of the hybrid approach and compare it with other state-of-the-art inpainting approaches. The results show that the hybrid approach achieves high-quality reconstruction and low computational complexity across various gap lengths and signal types, particularly for longer gaps and non-stationary signals. |
Various | Lakeside | |
L-ISA Studio Demo (Sponsor demo)
L-ISA Studio Demo (Sponsor demo)
Introduction to L-Acoustics spatial audio workflows, with plenty of time for listening and Q&A. Book here |
- | TB07 | |
AR Demo
AR Demo
Headphone-based audio augmented reality demonstrator showcasing the effects of manipulated late reverberation in rendering virtual sound sources. The setup is based on a dataset of binaural room impulse responses measured along a 2 m long line, which is used to imitate the reproduction of a pair of loudspeakers. In this way, listeners can explore the virtual sources by moving back and forth and rotating arbitrarily on this line. The demo allows the user to adjust the late reverberation tail of the auralizations interactively from shorter to longer decay times relative to the baseline decay behavior. Modification of the decay times is based on resynthesizing the late reverberation using frequency-dependent shaping of binaural white noise and modal reconstruction. Book here |
- | LTC | |
12:00 | Keynote 3: Machine Learning and Digital Audio Effects
Keynote 3: Machine Learning and Digital Audio Effects Vesa Välimäki (Aalto University) Abstract: Many papers presented at the DAFx conference currently use machine learning techniques, but this was not the case just a few years ago. This talk will discuss the developments that led to the paradigm shift in our research field, which followed a few years behind some closely related fields, such as speech recognition and synthesis. It has been a nice surprise that machine learning can solve many audio processing problems better than our previous signal-processing methods. However, there are also counterexamples in which we have not found a perfect machine-learning-based solution. Audio time-scale modification is a problem for which ideal training data is unavailable, and the current best method is based on traditional signal processing. Generative machine learning, such as diffusion models, can provide excellent solutions to problems that seemed almost impossible earlier, such as the reconstruction of long gaps, also known as audio inpainting. |
Vesa Välimäki | 03MS01 |
13:00 | Lunch | - | Hillside |
14:30 | Oral Session 6: Spatial Audio HRTF Spatial Upsampling in the Spherical Harmonics Domain Employing a Generative Adversarial Network HRTF Spatial Upsampling in the Spherical Harmonics Domain Employing a Generative Adversarial Network Xuyi Hu, Jian Li, Lorenzo Picinali and Aidan Hogg Abstract: A Head-Related Transfer Function (HRTF) is able to capture alterations a sound wave undergoes from its source before it reaches the entrances of a listener’s left and right ear canals, and is imperative for creating immersive experiences in virtual and augmented reality (VR/AR). Nevertheless, creating personalized HRTFs demands sophisticated equipment and is hindered by time-consuming data acquisition processes. To counteract these challenges, various techniques for HRTF interpolation and up-sampling have been proposed. This paper illustrates how Generative Adversarial Networks (GANs) can be applied to HRTF data upsampling in the spherical harmonics domain. We propose using Autoencoding Generative Adversarial Networks (AE-GAN) to upsample low-degree spherical harmonics coefficients and get a more accurate representation of the full HRTF set. The proposed method is benchmarked against two baselines: barycentric interpolation and HRTF selection. Results from log-spectral distortion (LSD) evaluation suggest that the proposed AE-GAN has significant potential for upsampling very sparse HRTFs, achieving a 17% improvement over baseline methods. NBU: Neural Binaural Upmixing of Stereo Content NBU: Neural Binaural Upmixing of Stereo Content Philipp Grundhuber, Michael Lovedee-Turner and Emanuël Habets Abstract: While immersive music productions have become popular in recent years, music content produced during the last decades has been predominantly mixed for stereo. This paper presents a data-driven approach to automatic binaural upmixing of stereo music. The network architecture HDemucs, previously utilized for both source separation and binauralization, is leveraged for an end-to-end approach to binaural upmixing. We employ two distinct datasets, demonstrating that while custom-designed training data enhances the accuracy of spatial positioning, the use of professionally mixed music yields superior spatialization. The trained networks show a capacity to process multiple simultaneous sources individually and add valid binaural cues, effectively positioning sources with an average azimuthal error of less than Frequency-Dependent Characteristics and Perceptual Validation of the Interaural Thresholded Level Distribution Frequency-Dependent Characteristics and Perceptual Validation of the Interaural Thresholded Level Distribution Christian S. E. Cotton, Stephen G. Oxnard, Laurence J. Hobden and Ethan Stanhope Abstract: The interaural thresholded level distribution (ITLD) is a novel metric of auditory source width (ASW), derived from the psychophysical processes and structures of the inner ear. While several of the ITLD’s objective properties have been presented in previous work, its frequency-dependent characteristics and perceptual relationship with ASW have not been previously explored. This paper presents an investigation into these properties of the ITLD, which exhibits pronounced variation in band-limited behaviour as octave-band centre frequency is increased. Additionally, a very strong correlation was found between [1 – ITLD] and normalised values of ASW, collected from a semantic differential listening test based on the Multiple Stimulus with Hidden Reference and Anchor (MUSHRA) framework. 
Perceptual relationships between various ITLD-derived quantities were also investigated, showing that the low-pass filter intrinsic to ITLD calculation strengthened the relationship between [1 – ITLD] and ASW. A subsequent test using transient stimuli, as well as investigations into other psychoacoustic properties of the metric such as its just-noticeable difference, were outlined as subjects for future research, to gain a deeper understanding of the subjective properties of the ITLD. Decoding Sound Source Location From EEG: Preliminary Comparisons of Spatial Rendering and Location Decoding Sound Source Location From EEG: Preliminary Comparisons of Spatial Rendering and Location Nils Marggraf-Turley, Lorenzo Picinali, Niels Pontoppidan, Martha Shiell and Drew Cappotto Abstract: Spatial auditory acuity is contingent on the quality of spatial cues presented during listening. Electroencephalography (EEG) shows promise for finding neural markers of such acuity present in recorded neural activity, potentially mitigating common challenges with behavioural assessment (e.g., sound source localisation tasks). This study presents findings from three preliminary experiments which investigated neural response variations to auditory stimuli under different spatial listening conditions: free-field (loudspeaker-based), individual Head-Related Transfer Functions (HRTFs), and non-individual HRTFs. Three participants, each participating in one experiment, were exposed to auditory stimuli from various spatial locations while neural activity was recorded via EEG. The resultant neural responses underwent a decoding protocol to assess how decoding accuracy varied between stimulus locations over time. Decoding accuracy was highest for free-field auditory stimuli, with significant but lower decoding accuracy between left and right hemisphere locations for individual and non-individual HRTF stimuli. A latency in significant decoding accuracy was observed between listening conditions for locations dominated by spectral cues. Furthermore, findings suggest that decoding accuracy between free-field and non-individual HRTF stimuli may reflect behavioural front-back confusion rates. A Deep Learning Approach to the Prediction of Time-Frequency Spatial Parameters for Use in Stereo Upmixing A Deep Learning Approach to the Prediction of Time-Frequency Spatial Parameters for Use in Stereo Upmixing Daniel Turner and Damian Murphy Abstract: This paper presents a deep learning approach to parametric time-frequency parameter prediction for use within stereo upmixing algorithms. The approach presented uses a Multi-Channel U-Net with Residual connections (MuCh-Res-U-Net) trained on a novel dataset of stereo and parametric time-frequency spatial audio data to predict time-frequency spatial parameters from a stereo input signal for positions on a 50-point Lebedev quadrature sampled sphere. An example upmix pipeline is then proposed which utilises the predicted time-frequency spatial parameters to both extract and remap stereo signal components to target spherical harmonic components to facilitate the generation of a full spherical representation of the upmixed sound field. Audio-Visual Talker Localization in Video for Spatial Sound Reproduction Audio-Visual Talker Localization in Video for Spatial Sound Reproduction Davide Berghi and Philip Jackson Abstract: Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. 
In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera’s reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a novel audio-visual approach integrating vision and multichannel audio. We found that the two modalities complement each other. Multichannel audio, overcoming the problem of visual occlusions, provides a double-digit reduction in detection error compared to audio-visual methods with single-channel audio. The combination of multichannel audio and vision further enhances spatial accuracy, leading to a four-percentage-point increase in F1 score on the Tragic Talkers dataset. Future investigations will assess the robustness of the model in noisy and highly reverberant environments, as well as tackle the problem of off-screen speakers. |
Various | 03MS01 |
16:00 | Show & Tell (oral session 6 presenters) | - | Foyer |
Poster Session 6: Demos and Software QUBX: Rust Library for Queue-Based Multithreaded Real-Time Parallel Audio Streams Processing and Management QUBX: Rust Library for Queue-Based Multithreaded Real-Time Parallel Audio Streams Processing and Management Pasquale Mainolfi Abstract: The concurrent management of real-time audio streams poses an increasingly complex technical challenge within the realm of digital audio signal processing, necessitating efficient and intuitive solutions. Qubx endeavors to lead in tackling this obstacle with an architecture grounded in dynamic circular queues, tailored to optimize and synchronize the processing of parallel audio streams. It is a library written in Rust, a modern and powerful ecosystem with a still limited range of tools for digital signal processing and management. Additionally, Rust’s inherent safety features and expressive type system bolster the resilience and stability of the proposed tool. GPGPU Audio Benchmark Framework GPGPU Audio Benchmark Framework Travis Skare Abstract: Acceleration of audio workloads on general-purpose GPU (GPGPU) hardware offers potentially high speedup factors, but also presents challenges in terms of development and deployment. We can increasingly depend on such hardware being available in users’ systems, yet few real-time audio products use this resource. We propose a suite of benchmarks to qualify a GPU as suitable for batch or real-time audio processing. This includes both microbenchmarks and higher-level audio domain benchmarks. We choose metrics based on application, paying particularly close attention to latency tail distribution. We propose an extension to the benchmark framework to more accurately simulate the real-world request pattern and performance requirements when running in a digital audio workstation. We run these benchmarks on two common consumer-level platforms: a PC desktop with a recent midrange discrete GPU and a Macintosh desktop with unified CPU-GPU memory architecture. Real-Time Implementation of a Linear-Phase Octave Graphic Equalizer Real-Time Implementation of a Linear-Phase Octave Graphic Equalizer Valeria Bruschi, Stefania Cecchi, Alessandro Nicolini and Vesa Välimäki Abstract: This paper proposes a real-time implementation of a linear-phase octave graphic equalizer (GEQ), previously introduced by the same authors. The structure of the GEQ is based on interpolated finite impulse response (IFIR) filters and is derived from a single prototype FIR filter. The low computational cost and small latency make the presented GEQ suitable for real-time applications. In this work, the GEQ has been implemented as a software plugin, which was used for real-time tests. The performance of the equalizer has been evaluated through subjective tests, comparing it with a filterbank equalizer. For the tests, four standard equalization curves have been chosen. The experimental results show promising outcomes: an accurate, real-time-capable linear-phase GEQ with reasonable latency. An Open Source Stereo Widening Plugin An Open Source Stereo Widening Plugin Orchisama Das Abstract: Stereo widening algorithms aim to extend the stereo image width and thereby increase the perceived spaciousness of a mix. Here, we present the design and implementation of a stereo widening plugin that is computationally efficient. First, a stereo signal is decorrelated by convolving with a velvet noise sequence, or, alternatively, by passing it through a cascade of allpass filters with randomised phase. 
Both the original and decorrelated signals are passed through perfect reconstruction filterbanks to get a set of low-passed and high-passed signals. Then, the original and decorrelated filtered signals are combined through a mixer and summed to produce the final stereo output. Two separate parameters control the perceived width of the lower frequencies and higher frequencies, respectively. A transient detection block prevents the smearing of percussive signals caused by the decorrelation filters. The stereo widener has been released as an open-source plugin (a simplified decorrelation sketch follows this table). GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee and Yuki Mitsufuji Abstract: We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters on the GPU. Then, we show an example of its use in a music mixing scenario, where parameters of every differentiable processor in a large graph are optimized via gradient descent. The code is available at https://github.com/sh-lee97/grafx. LTFATPY: Towards Making a Wide Range of Time-Frequency Representations Available in Python LTFATPY: Towards Making a Wide Range of Time-Frequency Representations Available in Python Clara Hollomey and Nicki Holighaus Abstract: LTFATPY is a software package for accessing the Large Time Frequency Analysis Toolbox (LTFAT) from Python. Dedicated to time-frequency analysis, LTFAT comprises a large number of linear transforms for Fourier, Gabor, and wavelet analysis along with their associated operators. Its filter bank module is a collection of computational routines for finite impulse response and band-limited filters, allowing for the specification of constant-Q and auditory-inspired transforms. While LTFAT was originally written in MATLAB/GNU Octave, the recent popularity of the Python programming language in related fields, such as signal processing and machine learning, makes it desirable to have LTFAT available in Python as well. We introduce LTFATPY, describe its main features, and outline further developments. |
Various | Lakeside | |
IoSR Facilities Tour
IoSR Facilities Tour
Tour of the studios and audio labs at the University of Surrey. Book here |
- | PATs Building | |
AR Demo
AR Demo
Headphone-based audio augmented reality demonstrator showcasing the effects of manipulated late reverberation in rendering virtual sound sources. The setup is based on a dataset of binaural room impulse responses measured along a 2 m long line, which is used to imitate the reproduction of a pair of loudspeakers. In this way, listeners can explore the virtual sources by moving back and forth and rotating arbitrarily on this line. The demo allows the user to adjust the late reverberation tail of the auralizations interactively from shorter to longer decay times relative to the baseline decay behavior. Modification of the decay times is based on resynthesizing the late reverberation using frequency-dependent shaping of binaural white noise and modal reconstruction. Book here |
- | LTC | |
17:30 | Awards | - | 03MS01 |
18:00 | DAFx Board Meeting | - | 03MS01 |
19:00 | End of Day | - | - |
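The interpolation-filters poster above builds on antiderivative antialiasing (ADAA), where the distorted output is computed from the antiderivative of the nonlinearity evaluated over consecutive input samples rather than from the nonlinearity sample by sample. Below is a minimal sketch of the standard first-order ADAA baseline for a tanh waveshaper, not the cubic interpolation kernels studied in the paper; the function name, fallback threshold and test signal are illustrative assumptions.

```python
# Illustrative baseline: first-order antiderivative antialiasing (ADAA) of a
# tanh nonlinearity, y[n] = (F1(x[n]) - F1(x[n-1])) / (x[n] - x[n-1]),
# where F1(x) = log(cosh(x)) is the antiderivative of tanh.
import numpy as np

def tanh_adaa1(x, eps=1e-9):
    # log(cosh(x)) computed stably via logaddexp: log((e^x + e^-x) / 2).
    F1 = np.logaddexp(x, -x) - np.log(2.0)
    x_prev = np.concatenate(([x[0]], x[:-1]))
    F1_prev = np.concatenate(([F1[0]], F1[:-1]))
    dx = x - x_prev
    # Fall back to the plain nonlinearity at the midpoint when dx is tiny.
    y = np.tanh(0.5 * (x + x_prev))
    safe = np.abs(dx) > eps
    y[safe] = (F1[safe] - F1_prev[safe]) / dx[safe]
    return y

# Example: a heavily driven sine, where plain tanh waveshaping aliases strongly.
fs = 48000
t = np.arange(fs) / fs
x = 10.0 * np.sin(2 * np.pi * 1244.5 * t)
y = tanh_adaa1(x)
```

This baseline corresponds to linear interpolation of the input signal; the interpolation kernel applied at this stage is exactly what the paper replaces with cubic alternatives.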
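The stereo widening poster above decorrelates the input by convolution with a velvet-noise sequence before mixing the decorrelated and dry signals. The sketch below illustrates only that decorrelate-and-mix step under simplifying assumptions: a single shared velvet-noise sequence, assumed density and length values, and a plain wet/dry mixer in place of the plugin's perfect-reconstruction filterbanks and transient detector. It is not the released plugin's code.

```python
# Illustrative sketch: velvet-noise decorrelation for stereo widening.
# A velvet-noise sequence is a sparse series of randomly signed, randomly
# placed unit impulses; convolving with it decorrelates a signal cheaply.
import numpy as np
from scipy.signal import fftconvolve

def velvet_noise(fs, duration_s=0.03, density_hz=1000, rng=None):
    """One randomly signed impulse per grid period of fs/density_hz samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = int(fs * duration_s)
    grid = fs / density_hz
    vn = np.zeros(n)
    for m in range(int(n / grid)):
        pos = int(m * grid + rng.uniform(0.0, grid))
        if pos < n:
            vn[pos] = rng.choice([-1.0, 1.0])
    return vn / np.sqrt(np.count_nonzero(vn))   # roughly unit energy

def widen(left, right, fs, width=0.5):
    """Blend each channel with a decorrelated copy; width in [0, 1]."""
    vn = velvet_noise(fs)
    wet_l = fftconvolve(left, vn)[: len(left)]
    wet_r = fftconvolve(right, vn)[: len(right)]
    out_l = (1.0 - width) * left + width * wet_l
    out_r = (1.0 - width) * right - width * wet_r  # opposite wet sign pushes the image outwards
    return out_l, out_r

# Example: widen a mono-compatible test tone.
fs = 48000
t = np.arange(fs) / fs
sig = 0.5 * np.sin(2 * np.pi * 220.0 * t)
out_l, out_r = widen(sig, sig, fs, width=0.6)
```

Because the velvet-noise sequence is sparse, the convolution can also be realised as a handful of signed delays, which is what keeps this family of decorrelators computationally cheap; the plugin described in the abstract additionally applies separate width controls per frequency band.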
Time | Session | Venue |
---|---|---|
8:50 | Transport to Central London | Outside Rik Medlik building |
10:30 | Luggage drop-off / meet guides | - |
11:00 | The Great British Rock and Roll Walking Tour
The Great British Rock & Roll Walking Tour The Rolling Stones. The Clash. The Kinks. David Bowie. The Who. Led Zeppelin. Many of rock’s greatest bands have London roots, and you’ll explore their haunts on foot during this deep-dive walking tour. Led by a local musician, you’ll wander through Soho, Mayfair, and beyond, getting a look at the land that created and shaped music’s greatest sound. |
Tottenham Court Road St |
13:00 | Pub Lunch (not covered by DAFx) | O'Neill's Wardour Street |
14:30 | End of Conference | - |