
Loudness normalization with FFmpeg

Created: 1725908832 (2024-09-09T19:07:12Z), Updated: 1729543407 (2024-10-21T20:43:27Z), 3458 words, ~15 minutes


AKA 2024 updates

I was writing a blogpost on NFS2 (to be published soon™), when I made the mistake of opening one of the video files in Audacity with show clipping on—and oh my gosh, it was all red. Even the original recording was heavily clipped, and this prompted me to dig around—and so a multi-day journey began.

First thing I did was switch obs-studio to record float samples instead of S16, so at least in my input videos samples won't be clipped. I didn't redo the old recordings, they're still broken and there's no fixing them, but at least new ones won't have this problem. Well, as long as the game can output float samples; it won't help if the game uses S16 internally. Or if I run it in DOSBox, which can only record what a 90s soundcard could do. And of course, this is only half the problem: when I encode the videos for the blog, I have to keep samples under 1, because most lossy codecs will clip the samples to [-1, 1]. And checking some of the existing videos on the blog, I quickly realized that their loudness was all over the place. They should be normalized. So my journey with FFmpeg began.

First I found the replaygain filter, which calculates a ReplayGain tag and adds it to the file, but it's more for tagging than doing the actual normalization. And while any not seriously broken music player understands ReplayGain tags, the same can't be said of video players. So I must normalize the audio streams, which shouldn't be a big deal, as I have to re-encode the audio tracks anyway. Some searching later I landed upon the loudnorm filter, which, unlike the replaygain filter, calculates loudness according to EBU R 128 instead of simply finding peak loudness, so it should be way better, and it also supports normalizing in one filter. Cool! ... That's what I thought. But of course, this is FFmpeg. It wouldn't be FFmpeg if something simply worked as advertised, and if it weren't filled with fucking landmines everywhere.

Loudness normalization in a nutshell, and terminology#

But before I start rambling about FFmpeg, what is loudness normalization anyway, and what do we want to achieve? EBU R 128 includes a method to calculate the loudness of an audio stream that attempts to take human perception into account. The result of this is usually reported in LUFS, where LU stands for Loudness Unit (and has the same magnitude as a dB), and FS for Full Scale. 0 LUFS corresponds to 0 dB on the absolute digital scale (i.e. ±1).

Another common element is TP, short for True Peak, which is the peak of the signal. Here true refers to the fact that you have to take the real, reconstructed signal, not just the sampled points. In practice, this is calculated by upsampling the signal to a high frequency (192 kHz in FFmpeg's case) and finding the maximum of that signal. It's measured in dBTP, and don't ask how that's different from dB...
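A toy illustration of why sample peak and true peak differ (plain Ruby, not FFmpeg's actual resampler): sample a full-scale sine so that no sample lands on a crest, and the per-sample peak undershoots the true peak by about 3 dB, the worst case:

```ruby
include Math

# A full-scale sine at a quarter of the sample rate, phase-shifted so that
# every sample falls exactly between crests: every sampled value has
# magnitude sin(pi/4) ~ 0.707, although the underlying signal reaches 1.0.
rate = 44_100.0
freq = rate / 4
samples = (0...16).map { |n| sin(2 * PI * freq * n / rate + PI / 4) }
sample_peak = samples.map(&:abs).max   # ~0.707, i.e. about -3 dB
true_peak   = 1.0                      # what reconstructing the signal recovers
```

An oversampling true-peak meter reconstructs an approximation of the continuous signal first, which is why FFmpeg resamples to 192 kHz before taking the maximum.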

There's also LRA, short for Loudness RAnge, which describes the loudness variation of the input, but I didn't find any good info on what this is exactly. (And I couldn't be assed to read the specification.) Probably not that important.

So what's the goal here? To make sure all audio tracks have roughly the same loudness. So if you read one post on my site, see a video, then switch to a different post, the video there won't suddenly be too quiet or too loud. And of course we have to do it in a way that doesn't overshoot the maximum value our digital equipment can produce (i.e. TP must stay below some limit).

There are two big competing standards to note here. EBU R 128 specifies -23 LUFS loudness and a -1 dBTP peak (less if you apply a lossy codec). This is supposedly mandatory if you want to run a TV channel in the EU, but there's no such requirement for internet sites, and frankly, normalizing to -23 LUFS results in pretty quiet audio in today's loudness-war world. The second standard is AES's streaming recommendations, which suggest -16 LUFS loudness and a -1 dBTP peak—a much more reasonable choice these days. Some sites even go to higher loudness values, but -16 LUFS should be OK for me. I'm not trying to deafen you.

Update 2024-10-21: that's what I thought. Then I realized I made a few mistakes. First, don't use -ac 1 to make mono sound; ffmpeg will convert stereo to mono at the end of the filtergraph, but you want to do it before the loudnorm filter (solution: use aresample=ochl=mono to get mono audio). Also, loudnorm has a dual_mono flag which probably should be set (of course, ffmpeg being ffmpeg, every flag has a wrong default value), so the loudness is correct when you play the mono sound on stereo speakers. Then, while making videos for my upcoming NFS3 article, after I figured out how to make NFS3 not output ridiculously compressed audio, the dynamic range at -16 LUFS became too small (and also, with the fixed loudness calculation, mono audio files started hitting TP). So right now I'm re-encoding videos to -18 LUFS, which is a less established standard... but -23 LUFS is way too quiet. End update.

Two pass loudnorm#

If you read the man page, you quickly figure out that loudnorm actually has two modes: dynamic and linear normalization. Dynamic means it dynamically changes the amplification, while linear means it'll just apply a constant amplification to the whole track. Now, the dynamic mode has a tendency to destroy songs. It only needs a single pass, so if you have a live stream where you don't care too much about audio quality (like a podcast), it might be a good choice; otherwise forget about it. So linear mode is what I need. Unfortunately it's a two-pass algorithm: the first pass calculates the loudness of the audio, and the second pass uses the calculated loudness from the first pass to adjust the volume. Sounds simple? WRONG.
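In principle, the two invocations look like this (a sketch with placeholder file names and targets; the loudnorm option names are from the man page, and stats is the parsed JSON from the first pass):

```ruby
# First pass: measure only; decode to a null output.
first_pass = %w[ffmpeg -i in.mkv -af loudnorm=print_format=json -f null -]

# Second pass: feed the measured values back via the measured_* options.
# Note that the values arrive as strings in the JSON.
def second_pass(stats, target_i: -16, target_tp: -1, target_lra: 20)
  af = "loudnorm=linear=true:I=#{target_i}:TP=#{target_tp}:LRA=#{target_lra}" \
       ":measured_I=#{stats['input_i']}:measured_TP=#{stats['input_tp']}" \
       ":measured_LRA=#{stats['input_lra']}:measured_thresh=#{stats['input_thresh']}"
  ['ffmpeg', '-i', 'in.mkv', '-af', af, 'out.mkv']
end
```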

You have the print_format option, which you should set to summary or json to receive the measurements somehow, then you have to feed them into the measured_* params on the second invocation. Where does this filter print the result? To stderr, intermingled with all the other log messages and ffmpeg status messages. Where's the option to redirect it to a file or any other less retarded place? Nowhere. And what does this output look like, by the way? Something like this:

[Parsed_loudnorm_0 @ 0x5601a72fd3c0]
{
        "input_i" : "-4.66",
        "input_tp" : "0.63",
        "input_lra" : "3.40",
        "input_thresh" : "-14.81",
        "output_i" : "-23.87",
        "output_tp" : "-13.35",
        "output_lra" : "3.40",
        "output_thresh" : "-33.93",
        "normalization_type" : "dynamic",
        "target_offset" : "-0.13"
}

(Here I refers to integrated LU, momentary LU integrated over the input.) Yes, pretty-printed JSON, so I can't process the output line by line, and the numbers are stored as strings, too. That green color on the first line is not syntax highlighting: the line starts with the ANSI escape sequences "\e[48;5;0m\e[38;5;155m[" and ends with "\e[0m". Yes, 256 colors! So I have to parse this fragile mess? I tried to mess around with ffmpeg's -loglevel parameter—maybe I could set it to something where it won't print other unnecessary garbage—but the JSON is printed at info severity, so you need at least -loglevel info, and that already prints a lot of crap. This can't be this complicated, so I went searching the net, and I found the author's blog, where he has the audacity to say:

Of course, dual-pass normalization can be easily scripted.

Easily scripted? What is this madman talking about? /me proceeds to check the code. The gist of the parsing is this:

  stats = JSON.parse(stderr.read.lines[-12, 12].join)

(For people not speaking Ruby: it just takes the last 12 lines of stderr output and parses them as JSON.) What!? Are you fucking stupid or what? Where in the goddamn documentation does it say that the JSON will be exactly 12 lines long?! What if a new version adds a new field? Your fucking script will break. But wait, we don't have to go that far. Let's suppose something is acting up, so you add -loglevel debug to the parameters to debug... and BAM! It will print some crap after the JSON, so your easily scripted script breaks. Yes, it's easily scriptable if you don't care about your script breaking from the slightest change in FFmpeg. If you want something robust, I guess you have to train some LLM on random outputs of a wide variety of programs, trying to find a JSON corresponding to a loudnorm filter, so it has some chance of working even if the ffmpeg output changes a bit. Yeah, spending months researching LLMs and training a multi-gigabyte neural network just to extract something from the output of a program in a way that has a chance of working after an update (but also a chance of breaking randomly) is considered easily scriptable. Why don't you do the world a favor and impale yourself on a stick, you fucking idiot?!

...

Some ranting later, I bit the bullet and wrote some code that I considered less fragile than the brainlet author's retarded [-12, 12]... just for it to also break when I ran ffmpeg with -loglevel debug. Anyway, after I finally parsed the output, forwarded the parameters to the second invocation as in the example, and ran it on a test file, everything seemed to be in working order. Oh boy, how wrong I was.
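For the record, what I ended up with is something along these lines: a sketch that anchors on the [Parsed_loudnorm...] header and brace-matches the JSON, instead of assuming a fixed line count. Still fragile—this output is not a stable API—but at least it survives extra log lines before and after:

```ruby
require 'json'

# Extract the loudnorm stats from ffmpeg's stderr: strip ANSI colors, find
# the "[Parsed_loudnorm...]" header, then brace-match the JSON object that
# follows it. Depth counting is safe here because the values are plain
# numeric strings, so no braces can appear inside them.
def parse_loudnorm(stderr)
  clean = stderr.gsub(/\e\[[0-9;]*m/, '')   # drop color escape sequences
  idx = clean.index(/^\[Parsed_loudnorm_\d+ @ 0x\h+\]$/)
  raise 'no loudnorm output found' unless idx
  json_start = clean.index('{', idx)
  depth = 0
  json_end = nil
  (json_start...clean.length).each do |i|
    depth += 1 if clean[i] == '{'
    depth -= 1 if clean[i] == '}'
    if depth.zero?
      json_end = i
      break
    end
  end
  raise 'unterminated JSON' unless json_end
  JSON.parse(clean[json_start..json_end])
end
```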

While searching, I stumbled upon this blogpost(?), which mentioned needing a filter to convert the audio back to 48 kHz, even in the second pass. But according to the man page, upsampling to 192 kHz only happens in dynamic mode. Was that a quirk of an older version? Later, finding a not-stackoverflow post enlightened me. Let's re-read the documentation of the linear parameter (emphasis mine):

Normalize by linearly scaling the source audio. measured_I, measured_LRA, measured_TP, and measured_thresh must all be specified. Target LRA shouldn't be lower than source LRA and the change in integrated loudness shouldn't result in a true peak which exceeds the target TP. If any of these conditions aren't met, normalization mode will revert to dynamic. Options are true or false. Default is true.

Let's re-re-read the bold sentence, and let that sink in. If any of those loosely defined conditions don't hold, loudnorm will switch back to dynamic mode, destroying your soundtrack. But... you'll at least get a warning in this case, right?—you may ask. NO! NO FUCKING WAY! NOT EVEN WITH -loglevel debug! It just silently destroys your audio, showing you a middle finger behind the scenes, laughing its ass off. This is FFmpeg, what did you expect? To not be completely useless for once? Hahaha! The only way to detect this case is to have print_format in the second pass too and check whether it says normalization_type: linear. If not, it fucked up your file. An alternative way to detect it: if the filter suddenly upsamples the audio to 192 kHz in the second pass, it fucked up the audio. So the poor guy I linked above fucked up his audio files.

At this point I was already looking at the source code (and let me interject here for a moment: this is why I like open-source software—at least I can view the code and see what's going on, because documentation is always lacking. And before you say that's what I get for using free software: proprietary, expensive software isn't better either, and there I don't even have the source code, so figuring out issues is even harder. Stay tuned for an Adobe Premiere rant later) to find out what exact conditions are needed to trigger linear mode. Here's the relevant part from af_loudnorm.c:

    if (s->linear) {
        double offset, offset_tp;
        offset    = s->target_i - s->measured_i;
        offset_tp = s->measured_tp + offset;

        if (s->measured_tp != 99 && s->measured_thresh != -70 && s->measured_lra != 0 && s->measured_i != 0) {
            if ((offset_tp <= s->target_tp) && (s->measured_lra <= s->target_lra)) {
                s->frame_type = LINEAR_MODE;
                s->offset = offset;
            }
        }
    }

Let's mull over the last requirement. What this filter does in linear mode is adjust the volume of all samples by offset = target_i - measured_i decibels. (Everything in the above code is in dB.) So measured_tp + target_i - measured_i tells us the new TP in the output if we apply that volume adjustment. And of course, that can't go over our target.

Armed with this knowledge, how can we prevent the catastrophe? One way is to keep print_format on the second pass, parse the output, and if it says dynamic mode, print an error message along the lines of, "Sorry, I fucked up your output, do something with the soundtrack and try again fuckface, hahaha!" The second option is to implement all these checks in the script calling ffmpeg, and do something when they fail. Which brings me to the second topic, something I already hinted at: what is a linear loudnorm? The answer is: volume=(target_i - measured_i)dB.

Yes. You only need the measured_i parameter. All those other measured_* parameters are just there to fuck with you, to destroy your audio silently. First I wanted to impale the author on a stick, but now I think he deserves the most painful way to die. AAAAARGGH!!!

loudnorm in practice#

So, now what to do? Since a linear loudnorm is just a fancy way to call a volume filter, I decided to do the very complicated mathematical computation of subtracting two numbers in my video generator script, and just call FFmpeg's volume filter directly. At least that won't quietly change to dynamic normalization, even if a later FFmpeg version changes the logic in loudnorm, or if, due to floating-point rounding errors, my script says the numbers are just barely valid but the C code rounds a bit differently and says they're invalid.
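The relevant part of my script boils down to something like this—a sketch mirroring the eligibility check from af_loudnorm.c quoted above, so the fallback can't happen silently (the function name and error messages are mine, not FFmpeg's):

```ruby
# Compute the constant gain a linear loudnorm would apply, and fail loudly
# in exactly the cases where loudnorm would silently fall back to dynamic
# mode. All values are in dB / LUFS, as measured by the first pass.
def linear_gain_db(target_i:, target_tp:, target_lra:,
                   measured_i:, measured_tp:, measured_lra:)
  offset    = target_i - measured_i   # the entire "normalization"
  offset_tp = measured_tp + offset    # true peak after applying that gain
  raise "would clip: #{offset_tp.round(2)} dBTP > #{target_tp} dBTP" if offset_tp > target_tp
  raise "source LRA #{measured_lra} > target LRA #{target_lra}" if measured_lra > target_lra
  offset
end
```

The second pass is then simply -af "volume=#{gain}dB".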

Update 2024-10-21: actually, I no longer use the loudnorm filter. There's an ebur128 filter which is faster, has output that at least looks machine-readable (but have fun figuring it out from the documentation), and will never change your audio stream (I just pipe the output to /dev/null anyway, because normalization is two-pass). It also calculates a few more interesting values, like momentary I values, but I'm not using them. What you want is something like this:

-af ebur128=metadata=1:framelog=quiet:peak=true:dualmono=true,ametadata=mode=print:file=$FILENAME -f null -

and replace $FILENAME with the name of the file where you want the output. (Or you can simply use peak=sample if you upsample the audio to 192 kHz beforehand... It also has parameters like integrated, which according to the documentation is a "Read-only exported value for measured integrated loudness, in LUFS.", which sounds like exactly what I need, except I couldn't find any trace of how to read a parameter of a filter, neither in the man page nor on the internet...) framelog=quiet makes the filter shut up (by default it prints a line with momentary loudness and other values for every 0.1 s of audio processed), metadata=1 makes the measured values available as metadata, and the ametadata filter writes them to a file. The file will have a bunch of blocks like this:

frame:144  pts:2764800 pts_time:14.4
lavfi.r128.M=-19.189
lavfi.r128.S=-18.453
lavfi.r128.I=-18.656
lavfi.r128.LRA=3.540
lavfi.r128.LRA.low=-24.420
lavfi.r128.LRA.high=-20.880
lavfi.r128.true_peaks_ch0=0.674
lavfi.r128.true_peak=0.674

You need to find the last block (I don't know if you can do that from ffmpeg; I gave up making sense of it and just parse the output from a Ruby script), and get the value of lavfi.r128.I (expressed in dB) and lavfi.r128.true_peak (expressed as a sample value). Yes, they have different units. 20 * log10(peak) will get you decibels for the peak. It's ffmpeg, don't try to make sense of it.
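That Ruby parsing is essentially this (a sketch: it greps for the last occurrence of each key in the ametadata output file and converts the peak to dB):

```ruby
# Pull the final integrated loudness (LUFS) and true peak (dBTP) out of the
# file written by ametadata. The measurements are running values, so only
# the last occurrence of each key matters.
def parse_ebur128(text)
  i_lufs = text.scan(/^lavfi\.r128\.I=(-?[\d.]+)/).last&.first&.to_f
  peak   = text.scan(/^lavfi\.r128\.true_peak=(-?[\d.]+)/).last&.first&.to_f
  raise 'measurements not found' if i_lufs.nil? || peak.nil?
  [i_lufs, 20 * Math.log10(peak)]   # convert the raw sample value to dBTP
end
```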

End update.

What to do with the two other parameters loudnorm has? LRA is the easier one: neither R 128 nor the AES standard has any recommendation on maximum LRA values, so I just ignored it. (It's not like I can target a specific LRA without completely destroying the dynamics of the input anyway.) I have no idea where the 7 LU default value and the 11 LU in the author's blogpost came from. But the peak, that's more problematic.

First I tried to use asoftclip, which I figured would be OK if the sound samples go just over the limit here and there. (By default it clips to 0 dB instead of -1 dB, but that's the lesser problem.) I implemented it, and the first few videos needed no clipping, so everything seemed alright—until I got to Didnapper. And there everything fell apart. Didnapper is so badly mixed that the random sound effects are WAY louder than the background music. Actually, saying that it is mixed is probably an overstatement; just look at the waveform of the sidemission video:

Original waveform (JXL / PNG)

Can you tell me where the background music changes? Also those spikes, yikes! I ended up with 5-6 dBTP values after normalizing. That will sound like shit no matter what kind of soft clipping you use. Well, back to the drawing board.

So, second attempt: normalize to -16 LUFS, but decrease the volume if we're over -1 dBTP (and complain loudly). This solved the clipping problem, but now the Didnapper videos became really quiet. Maybe I should fall back to normalizing at -23 LUFS? But that was really quiet, and even with a -16 LUFS target, Didnapper was the only game that required me to increase the volume. Every other game is so fucking compressed at maximum volume that it's ridiculous. After mulling over my options for a long time, in the end I decided to compress the two Didnapper recordings with acompressor to get rid of the ridiculous spikes. Here's some sound processing horror, before and after. Sorry about that.

Original waveform (JXL / PNG) — Compressed waveform (JXL / PNG)

(Update 2024-10-21: using -18 LUFS didn't remove the need to compress Didnapper's sound, I just don't need to compress it as much.)

Random consequences#

Because of the above, I had to re-encode the audio tracks of all videos on the site. And if I have to touch every video, I might as well make some other minor improvements too. Most videos (where it makes sense) now have a subtle fade-in/fade-out at the beginning/end. Just to make for a slightly more pleasant listening experience.

A second change I made was to re-encode every VP9-coded video at slightly lower quality, and at the same time officially call it medium quality. It didn't matter too much with 2D games at 640x480 or lower resolution, where barely anything changes between frames, but the 60 FPS Full HD Need for Speed videos, with a lot happening between frames, started to become a bit too big—especially since in the old setup I had high quality, slightly-lower-than-high quality in a different format, and low quality video files. The VP9 files just looked like a wasted duplicate for the few users out there who can't play AV1 videos. Now at least the quality scale is more even. (It's not perfect; for that I'd have to fine-tune CRF values for each video at each quality level, and I'm not gonna do that. VMAF is nice in theory, but it's not reliable for scoring across different videos, and it breaks down horribly when fed non-HD, non-real-life-like footage.) And to be honest, most video hosting sites would probably still call what I call medium quality high quality, and I didn't touch the AV1 videos, so high quality is unchanged. (It would be nice if I could host lossless videos, but they're huge.)

A third, completely unrelated change: while looking at the upscaled NFS4 intro, I wondered—if there are image upscaling AIs, is there an AI that could do something with the 22050 Hz, compressed-beyond-hell audio track? And the answer is yes, meet AudioSR! Well, the result was not as impressive with the 8 kHz samples on the website, so, um, whatever.