T2A-Feedback

Text-to-audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve model performance in these demanding applications, we propose enhancing the basic capabilities of the model through AI feedback learning. First, we introduce fine-grained AI audio scoring pipelines to: 1) verify whether each event in the text prompt is present in the audio (Event Occurrence Score), 2) detect deviations in event sequences from the language description (Event Sequence Score), and 3) assess the overall acoustic and harmonic quality of the generated audio (Acoustic & Harmonic Quality). We evaluate these three automatic scoring pipelines and find that they correlate significantly better with human preferences than other evaluation metrics, which highlights their value as both feedback signals and evaluation metrics. Using these scoring pipelines, we construct a large audio preference dataset, T2A-FeedBack, which contains 41k prompts and 249k audios, each accompanied by detailed scores. Moreover, we introduce T2A-EpicBench, a benchmark that focuses on long captions, multi-event audio, and story-telling scenarios, aiming to evaluate the advanced capabilities of T2A models. Finally, we demonstrate how T2A-FeedBack can enhance a current state-of-the-art audio model. With simple preference tuning, the audio generation model exhibits significant improvements in both simple (AudioCaps test set) and complex (T2A-EpicBench) scenarios.
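As a rough illustration of how per-audio scores could be turned into preference data, the sketch below picks a preferred and a rejected candidate for one prompt. The `ScoredAudio` fields, the `build_preference_pair` helper, the ranking rule passed as `key`, and the example scores are all assumptions made for illustration; this is not necessarily the procedure used to build T2A-FeedBack.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class ScoredAudio:
    """One generated audio with its three scores (field names are hypothetical)."""
    path: str
    occurrence: float  # Event Occurrence Score
    sequence: float    # Event Sequence Score
    quality: float     # Acoustic & Harmonic Quality


def build_preference_pair(
    candidates: Sequence[ScoredAudio],
    key: Callable[[ScoredAudio], float],
) -> Optional[Tuple[ScoredAudio, ScoredAudio]]:
    """Return (preferred, rejected) for one prompt, or None if no preference exists."""
    if len(candidates) < 2:
        return None
    preferred = max(candidates, key=key)
    rejected = min(candidates, key=key)
    if key(preferred) == key(rejected):
        return None  # all candidates tie, so there is no preference to record
    return preferred, rejected


# Hypothetical candidates for a single prompt; the scores are made up.
candidates = [
    ScoredAudio("a.wav", occurrence=70.0, sequence=1.0, quality=3.0),
    ScoredAudio("b.wav", occurrence=10.0, sequence=-1.0, quality=2.0),
    ScoredAudio("c.wav", occurrence=45.0, sequence=1.0, quality=4.0),
]
pair = build_preference_pair(candidates, key=lambda a: a.occurrence)
```

Passing the ranking rule as `key` keeps the sketch neutral about how the three scores are weighted, since they live on different scales.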

Prompt	Good	Not Good
Pots and pans rattle in the background	Score: 73.21	Score: 8.46
A bell is ringing and a train blows its horn twice long and hard	Score: 69.64	Score: 33.78
A flushing of water and people talking	Score: 44.57	Score: -8.60
A man speaks and bees buzz	Score: 56.53	Score: 20.99

Prompt	Good	Not Good
Child's clear voice carries as they begin speaking, followed by the rhythmic clapping of audience members' hands once they finish a point	Score: 1.00	Score: -1.00
The sizzling sound of oil in the frying pan begins, followed by the woman's voice carrying a conversation	Score: 1.00	Score: -1.00
Adult female's clear voice echoes, followed by quick tapping sounds. Subsequently, a dog barks sharply	Score: 1.00	Score: -0.33
Man's voice carries through the room as he speaks, followed by the sound of a clock ticking in the background, then the distant hum of a car engine	Score: 1.00	Score: 0.33
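
The sequence scores above take only the values 1.00, 0.33, -0.33, and -1.00: the two-event prompts score ±1.00, and the three-event prompts also reach ±0.33. This is the pattern a Kendall's tau rank correlation between the described and the detected event order would produce. Assuming that interpretation (it is inferred from the values, not stated on this page), a minimal sketch follows. The function name `sequence_score` and the use of detected onset times as input are illustrative assumptions.

```python
from itertools import combinations
from typing import Sequence


def sequence_score(onsets: Sequence[float]) -> float:
    """Kendall's tau between the prompt order and the detected order.

    onsets[i] is the (hypothetical) detected onset time of the i-th event
    as listed in the prompt. Returns a value in [-1, 1].
    """
    pairs = list(combinations(range(len(onsets)), 2))
    if not pairs:
        raise ValueError("at least two events are needed to compare order")
    # A pair is concordant when the earlier-described event also starts earlier.
    concordant = sum(1 for i, j in pairs if onsets[i] < onsets[j])
    discordant = sum(1 for i, j in pairs if onsets[i] > onsets[j])
    return (concordant - discordant) / len(pairs)


# Three events described as A, B, C but heard as B, A, C:
# two of the three pairs keep their order, one is swapped.
swapped_first_two = sequence_score([1.0, 0.0, 2.0])
```

With two events the only possible outcomes are 1 and -1; with three events a single swapped pair gives (2 - 1) / 3, and two swapped pairs give (1 - 2) / 3.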

Score	Prompt	Prompt
4	An adult male is speaking, and bees are buzzing	The wind blows and birds are singing
3	A pig is making oinking noises	Banging then a meow followed by speech
2	Humming of passing traffic followed by a musical horn	Waves crash against the beach with just a little wind going by
1	A baby is crying and a person sneezes then another person speaks	A woman speaking continuously

Prompt	Before tuning	After tuning
A car screeches loudly as a man speaks over an intercom
A family is having fun honking a vehicle horn
A bird chirping and then a man talking
Footsteps shuffling followed by a cat meowing and then a toilet flushing

Prompt	Before tuning	After tuning
In a serene garden, the gentle rustle of leaves dances in the breeze. Suddenly, a bird chirps cheerfully from a nearby branch, filling the air with music. A child's giggle rings out as they run through the flowers, brightening the moment. Just then, a soft bell tolls in the distance, reminding everyone of the passing time.

In a vibrant downtown area, the honking of cars creates a chaotic symphony. Suddenly, a street vendor shouts out their specials, trying to attract customers. The laughter of people enjoying a nearby café adds warmth to the urban sounds. Just then, a bus rumbles past, its engine growling as it continues on its route.

In a vibrant marketplace, vendors shout their prices, adding to the lively atmosphere. Suddenly, a bell rings as a customer makes a purchase, drawing attention to the stall. Nearby, a musician strums a guitar, his melody weaving through the conversations. Just then, the aroma of spices is interrupted by a loud laughter from a group of friends enjoying their snacks.

In an open field, the buzz of insects hums steadily, creating a constant backdrop. Suddenly, a hawk screeches overhead, searching for its next meal. The distant sound of a bubbling brook can be heard, providing a soothing contrast. Just then, a child's laughter rings out as they chase butterflies, their joy echoing across the landscape.

T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback

Abstract

T2A-FeedBack

A. AI Audio Scoring Pipelines.

A.1. Events Occurrence Score

A.2. Events Sequence Score

A.3. Acoustic & Harmonic Quality

B. Samples on AudioCaps Test Set.

C. Samples on EpicBench.
