Transcoding video from one format to another seems to be a bit of a black art. There are many tools that can do this kind of thing, but one issue most of them seem to share is that they're not very well documented.
I ran into this a few years ago, when I was first doing video work for FOSDEM and did not yet have proper tools for the review and transcoding workflow.
At the time, I just used mplayer to look at the .dv files, and wrote a text file with a simple structure to remember exactly what to do with them. That file was then fed to a perl script which wrote out a shell script that would use the avconv command to combine and extract the "interesting" data from the source DV files into a single DV file per talk, and which would then call a shell script that used gst-launch and sox to do a multi-pass transcode of those intermediate DV files into a WebM file.
While all that worked properly, it was a rather ugly hack that I never cleaned up, and therefore never really documented properly either. Recently, however, someone asked me to do so anyway, so here goes. Before you complain about how this ate the videos of your firstborn child, however, note the above.
The perl script spends a fair amount of code on reading the text file and parsing it into an array of hashes. I'm not going to reproduce that, since the actual format of the file isn't all that important anyway. However, here are the interesting bits:
foreach my $pfile(keys %parts) {
    my @files = @{$parts{$pfile}};
    # print a banner so the generated shell script is easy to navigate
    say "#" x (length($pfile) + 4);
    say "# " . $pfile . " #";
    say "#" x (length($pfile) + 4);
    foreach my $file(@files) {
        my $start = "";
        my $stop = "";
        if(defined($file->{start})) {
            $start = "-ss " . $file->{start};
        }
        if(defined($file->{stop})) {
            $stop = "-t " . $file->{stop};
        }
        if(defined($file->{start}) && defined($file->{stop})) {
            # avconv's -t option takes a duration, not an end time, so
            # subtract start from stop: seconds first, then minutes,
            # then hours, borrowing as needed
            my @itime = split /:/, $file->{start};
            my @otime = split /:/, $file->{stop};
            $otime[2] -= $itime[2];
            if($otime[2] < 0) {
                $otime[1] -= 1;
                $otime[2] += 60;
            }
            $otime[1] -= $itime[1];
            if($otime[1] < 0) {
                $otime[0] -= 1;
                $otime[1] += 60;
            }
            $otime[0] -= $itime[0];
            $stop = "-t " . $otime[0] . ":" . $otime[1] . ":" . $otime[2];
        }
        if(defined($file->{start}) || defined($file->{stop})) {
            # trim the interesting part out of the source file
            say "ln " . $file->{name} . ".dv part-pre.dv";
            say "avconv -i part-pre.dv $start $stop -y -acodec copy -vcodec copy part.dv";
            say "rm -f part-pre.dv";
        } else {
            # no trimming needed; use the source file as-is
            say "ln " . $file->{name} . ".dv part.dv";
        }
        # DV frames can simply be concatenated, so append to the per-talk file
        say "cat part.dv >> /tmp/" . $pfile . ".dv";
        say "rm -f part.dv";
    }
    say "dv2webm /tmp/" . $pfile . ".dv";
    say "rm -f /tmp/" . $pfile . ".dv";
    say "scp /tmp/" . $pfile . ".webm video.fosdem.org:$uploadpath || true";
    say "mv /tmp/" . $pfile . ".webm .";
}
That script uses avconv to read one or more .dv files and transcode them into a single .dv file with all the start- or end-junk removed. It uses /tmp rather than the working directory, since the working directory was somewhere on the network, and if you're going to write several gigabytes of data to an intermediate file, it's usually a good idea to write them to a local filesystem rather than to a networked one.
Pretty boring.
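To make that a bit more concrete, here is roughly what the generated commands look like for one talk. The talk name and the source file names are made up for the sake of the example, and the upload path is just a placeholder:
##########
# mytalk #
##########
# first source file: only keep the part starting at 00:05:12
ln cam1-0042.dv part-pre.dv
avconv -i part-pre.dv -ss 00:05:12 -y -acodec copy -vcodec copy part.dv
rm -f part-pre.dv
cat part.dv >> /tmp/mytalk.dv
rm -f part.dv
# second source file: used in its entirety
ln cam1-0043.dv part.dv
cat part.dv >> /tmp/mytalk.dv
rm -f part.dv
# transcode the concatenated DV file, then upload and keep the result
dv2webm /tmp/mytalk.dv
rm -f /tmp/mytalk.dv
scp /tmp/mytalk.webm video.fosdem.org:/some/path || true
mv /tmp/mytalk.webm .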
It finally calls dv2webm on the resulting .dv file. That script looks like this:
#!/bin/bash
set -e
newfile=$(basename "$1" .dv).webm
wavfile=$(basename "$1" .dv).wav
wavfile=$(readlink -f "$wavfile")
normalfile=$(basename "$1" .dv)-normal.wav
normalfile=$(readlink -f "$normalfile")
oldfile=$(readlink -f "$1")
# the echo -e lines set the xterm title, so you can see at a glance which stage is running
echo -e "\033]0;Pass 1: $newfile\007"
# pass 1: encode the video to gather VP8 statistics, and tee the audio out to a wav file
gst-launch-0.10 webmmux name=mux ! fakesink \
  uridecodebin uri="file://$oldfile" name=demux \
  demux. ! ffmpegcolorspace ! deinterlace ! vp8enc multipass-cache-file=/tmp/vp8-multipass multipass-mode=1 threads=2 ! queue ! mux.video_0 \
  demux. ! progressreport ! audioconvert ! audiorate ! tee name=t ! queue ! vorbisenc ! queue ! mux.audio_0 \
  t. ! queue ! wavenc ! filesink location="$wavfile"
echo -e "\033]0;Audio normalize: $newfile\007"
# normalise the audio levels
sox --norm "$wavfile" "$normalfile"
echo -e "\033]0;Pass 2: $newfile\007"
# pass 2: reuse the statistics and mux in the normalised audio
gst-launch-0.10 webmmux name=mux ! filesink location="$newfile" \
  uridecodebin uri="file://$oldfile" name=video \
  uridecodebin uri="file://$normalfile" name=audio \
  video. ! ffmpegcolorspace ! deinterlace ! vp8enc multipass-cache-file=/tmp/vp8-multipass multipass-mode=2 threads=2 ! queue ! mux.video_0 \
  audio. ! progressreport ! audioconvert ! audiorate ! vorbisenc ! queue ! mux.audio_0
rm "$wavfile" "$normalfile"
... and is a bit more involved.
Multi-pass encoding of video means that we ask the encoder to first encode the file but store some statistics into a temporary file (/tmp/vp8-multipass, in our script), which the second pass can then reuse to optimize the transcoding. Since DV uses different ways of encoding things than VP8 does, we also need to do a color space conversion (ffmpegcolorspace) and deinterlacing (deinterlace), but beyond that the video line in the first gstreamer pipeline isn't very complicated.
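To see just the multi-pass mechanism in isolation, a stripped-down, video-only version of the two passes would look something like this (an untested sketch: talk.dv is a made-up file name, and the audio is simply thrown away here, unlike in the real script above):
# pass 1: gather statistics into /tmp/vp8-multipass; the encoded video itself is discarded
gst-launch-0.10 uridecodebin uri=file:///tmp/talk.dv name=demux \
  demux. ! queue ! ffmpegcolorspace ! deinterlace ! vp8enc multipass-cache-file=/tmp/vp8-multipass multipass-mode=1 ! fakesink \
  demux. ! queue ! fakesink
# pass 2: the same video chain, now reusing the statistics and writing a (video-only) WebM file
gst-launch-0.10 webmmux name=mux ! filesink location=/tmp/talk.webm \
  uridecodebin uri=file:///tmp/talk.dv name=demux \
  demux. ! queue ! ffmpegcolorspace ! deinterlace ! vp8enc multipass-cache-file=/tmp/vp8-multipass multipass-mode=2 ! queue ! mux.video_0 \
  demux. ! queue ! fakesink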
Since we're going over the file anyway and we need the audio data for sox, we add a tee plugin at an appropriate place in the audio line in the first gstreamer pipeline, so that we can later on pick up that same audio data and write it to a wav file containing linear PCM data. Beyond the tee, we go on and do a vorbis encoding, as is needed for the WebM format. This is not actually required for a first pass, but ah well. There are some more conversion plugins in the pipeline (specifically, audioconvert and audiorate), but those are not very important.
We next run sox --norm on the .wav file, which does a fully automated audio normalisation on the input. Audio normalisation is the process of adjusting volume levels so that the audio is not too loud, but also not too quiet. Sox has pretty good support for this; the default settings of its --norm parameter make it adjust the volume levels so that the highest peak will just about reach the highest value that the output format can express. As such, you have no clipping anywhere in the file, but also have an audio level that is actually useful.
Next, we run a second-pass encoding on the input file. This second pass uses the statistics gathered in the first pass to decide where to put its I- and P-frames so that they are placed at the optimal positions. In addition, rather than reading the audio from the original file, we now read the audio from the .wav file containing the normalized audio which we produced with sox, ensuring the audio in the final file is at a comfortable volume.
Finally, we remove the intermediate audio files we created; the shell script which was generated by perl also contained an rm command for the intermediate .dv file.
Some of this is pretty horrid, and I never managed to clean it up enough so it would be pretty (and now is not really the time). However, it Just Works(tm), and I am happy to report that it continues to work with gstreamer 1.0, provided you replace the ffmpegcolorspace by an equally simple videoconvert, which performs what ffmpegcolorspace used to perform in gstreamer 0.10.
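As a minimal sketch of what that change looks like in practice, the first pass of dv2webm would then start out along these lines (untested here, and assuming the vp8enc properties kept the same names in gstreamer 1.0; gst-inspect-1.0 vp8enc will tell you for sure):
gst-launch-1.0 webmmux name=mux ! fakesink \
  uridecodebin uri="file://$oldfile" name=demux \
  demux. ! videoconvert ! deinterlace ! vp8enc multipass-cache-file=/tmp/vp8-multipass multipass-mode=1 threads=2 ! queue ! mux.video_0 \
  demux. ! progressreport ! audioconvert ! audiorate ! tee name=t ! queue ! vorbisenc ! queue ! mux.audio_0 \
  t. ! queue ! wavenc ! filesink location="$wavfile"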