Faster tar

I have a new laptop. The new one is a Dell Latitude 5521, whereas the old one was a Dell Latitude 5590.

As both the old and the new laptops are owned by the people who pay my paycheck, I'm supposed to copy all my data off the old laptop and then return it to the IT department.

A simple way of doing this (and what I'd usually use) is to just rsync the home directory (and other relevant locations) to the new machine. However, for various reasons I didn't want to do that this time around; for one, my home directory on the old laptop is a bit of a mess, and a new laptop is an ideal moment to clean that up. If I were to just rsync over the old home directory, then, well.

So instead, I'm creating a tarball. The first attempt was quite slow:

tar cvpzf wouter@new-laptop:old-laptop.tar.gz /home /var /etc

The problem here is that the default compression algorithm, gzip, is quite slow, especially if you use the default non-parallel implementation.

So we tried something else:

tar cvpf wouter@new-laptop:old-laptop.tar.gz -Ipigz /home /var /etc

Better, but not quite great yet. The old laptop now has bursts of maxing out CPU, but it doesn't even come close to maxing out the gigabit network cable between the two.

Tar can also compress with the LZ4 algorithm. That algorithm doesn't compress very well, but it's one of the fastest options if "speed" is the most important consideration. So I could do that:

tar cvpf wouter@new-laptop:old-laptop.tar.lz4 -Ilz4 /home /var /etc

The trouble with that, however, is that the tarball will then be quite big.

So why not use the CPU power of the new laptop?

tar cvpf - /home /var /etc | ssh new-laptop "pigz > old-laptop.tar.gz"

Yeah, that's much faster. Except, now the network speed becomes the limiting factor. We can do better.

tar cvpf - -Ilz4 /home /var /etc | ssh new-laptop "lz4 -d | pigz > old-laptop.tar.gz"

This uses about 70% of the link speed, just over one core on the old laptop, and 60% of CPU time on the new laptop.

After also adding a bit of --exclude="*cache*", to avoid files we don't care about, things go quite quickly now: somewhere between 200 and 250G (uncompressed) was transferred into a 74G file, in 20 minutes. My first attempt hadn't even done 10G after an hour!
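To see what a pattern like that skips before starting a long transfer, it can be tried on a throwaway directory first. This is a minimal sketch, assuming GNU tar (where exclude patterns are unanchored by default and `*` also matches `/`); the directory layout is made up for illustration.

```shell
#!/bin/sh
# Build a tiny, made-up directory tree to try the exclude pattern on.
demo=$(mktemp -d)
mkdir -p "$demo/home/user/.cache/thumbnails" "$demo/home/user/docs"
echo thumb > "$demo/home/user/.cache/thumbnails/a.png"
echo notes > "$demo/home/user/docs/notes.txt"

# Create an archive on stdout with the same --exclude pattern, and list
# its contents instead of sending it over the network.
listing=$(tar -cf - --exclude='*cache*' -C "$demo" home | tar -tf -)
printf '%s\n' "$listing"

# Clean up the throwaway tree.
rm -rf "$demo"
```

Everything under .cache is left out, while docs/notes.txt is kept. The match is case-sensitive by default, so a directory called "Cache" would still end up in the archive.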

Different types of Backups

In my previous post, I explained how I recently set up backups for my home server to be synced using Amazon's services. I received a (correct) comment on that by Iustin Pop which pointed out that while it is reasonably cheap to upload data into Amazon's offering, the reverse -- extracting data -- is not as cheap.

He is right, in that extracting data from S3 Glacier Deep Archive costs over an order of magnitude more than it costs to store it there on a monthly basis -- in my case, I expect to have to pay somewhere in the vicinity of 300-400 USD for a full restore. However, I do not consider this to be a major problem, as these backups are only meant to cover the rarer of the two backup cases.
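To put "over an order of magnitude" into numbers for my case, here is a rough calculation. It assumes the monthly storage budget of 10 to 20 USD that I estimated when setting up these backups, and the 300 to 400 USD restore estimate above; both are estimates, not figures from an invoice.

```shell
#!/bin/sh
# Estimates in USD, taken from the two posts; not actual billing data.
restore_low=300
restore_high=400
monthly_low=10
monthly_high=20

# Smallest and largest ratio of a full restore to one month of storage.
ratio_min=$(( restore_low / monthly_high ))
ratio_max=$(( restore_high / monthly_low ))
echo "A full restore costs roughly ${ratio_min}x to ${ratio_max}x one month of storage"
```

In other words, a single full restore would cost about as much as one to three years of storage.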

There are two reasons why you should have backups.

The first is the most common one: "oops, I shouldn't have deleted that file". This happens reasonably often; people will occasionally delete or edit a file that they did not mean to, and then they will want to recover their data. At my first job, a significant part of my job was to handle recovery requests from users who had accidentally deleted a file that they still needed.

Ideally, backups to handle this type of situation are easily accessible to end users, and are performed reasonably frequently. A system that automatically creates and deletes filesystem snapshots (such as the zfsnap script for ZFS snapshots, which I use on my server) works well. The crucial bit here is to ensure that it is easier to copy an older version of a file than it is to start again from scratch -- if a user must file a support request that may or may not be answered within a day or so, it is likely they will not do so for a file they were working on for only half a day, which means they lose half a day of work in such a case. If, on the other hand, they can just go into the snapshots directory themselves and it takes them all of two minutes to copy their file, then they will also do that for files they only created half an hour ago, so they don't even lose half an hour of work and can get right back to it. This means that backup strategies to mitigate the "oops I lost a file" case ideally do not involve off-site file storage, and instead are performed online.
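For ZFS, that self-service copy can be as small as the sketch below. ZFS exposes snapshots read-only under a `.zfs/snapshot` directory at the root of each dataset; the function name, the snapshot name and the paths are made up for illustration, and the demo fakes the directory layout in a temporary directory so that it runs without a ZFS pool.

```shell
#!/bin/sh
# restore_from_snapshot MOUNTPOINT SNAPSHOT RELATIVE_PATH
# Copy one file from a snapshot back into the live filesystem.
# (Hypothetical helper, not part of ZFS or zfsnap.)
restore_from_snapshot() {
    cp -p "$1/.zfs/snapshot/$2/$3" "$1/$3"
}

# Demo on a fake layout: a temporary directory stands in for the dataset.
fs=$(mktemp -d)
mkdir -p "$fs/.zfs/snapshot/hourly-example/docs" "$fs/docs"
echo "version from half an hour ago" > "$fs/.zfs/snapshot/hourly-example/docs/report.txt"
echo "oops" > "$fs/docs/report.txt"

# Put the older version back in place of the damaged one.
restore_from_snapshot "$fs" hourly-example docs/report.txt
restored=$(cat "$fs/docs/report.txt")
echo "$restored"

rm -rf "$fs"
```

On a real system the first argument would be the dataset's mountpoint, and the snapshot name would be whatever the snapshot tool created.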

The second case is the much rarer one, but (when required) has the much bigger impact: "oops the building burned down". Variants of this can involve things like lightning strikes, thieves, earthquakes, and the like; in all cases, the point is that you want to be able to recover all your files, even if every piece of equipment you own is no longer usable.

That being the case, you will first need to replace that equipment, which is not going to be cheap, and it is also not going to be an overnight thing. In order to still be useful after you have lost all your equipment, these backups must be stored off-site, and should preferably be offline backups, too. Since replacing your equipment is going to cost you time and money, it's fine if restoring the backups is going to take a while -- you can't really restore from backup any time soon anyway. And since you will lose a number of days of content that you can't create when you can only fall back on your off-site backups, it's fine if you also lose a few days of content that you will have to re-create.

All in all, the two types of backups have opposing requirements: "oops I lost a file" backups should be performed often and should be easily available; "oops I lost my building" backups should not be easily available, and are ideally done less often, so you don't pay a high amount of money for storage of your off-sites.

In my opinion, if you have good "lost my file" backups, then it's also fine if the recovery of your off-site backups is a bit more expensive. You don't expect to have to ever pay for these; you may end up with a situation where you don't have a choice, and then you'll be happy that the choice is there, but as long as you can reasonably pay for the worst case scenario of a full restore, it's not a case you should be worried about much.

As such, and given that a full restore from Amazon Storage Gateway is going to be somewhere between 300 and 400 USD for my case -- a price I can afford, although it's not something I want to pay every day -- I don't think it's a major issue that extracting data is significantly more expensive than uploading data.

But of course, this is something everyone should consider for themselves...

Backing up my home server with Bacula and Amazon Storage Gateway

I have a home server.

Initially conceived and sized so I could digitize my (rather sizeable) DVD collection, I started using it for other things; I added a few play VMs on it, started using it as a destination for the deja-dup-based backups of my laptop and the Time Machine-based ones of the various Macs in the house, and used it as the primary location of all the photos I've taken with my cameras over the years (currently taking up somewhere around 500G) as well as those that were taken at our wedding (another 100G). To add to that, I've copied the data that my wife had on various older laptops and external hard drives onto this home server as well, so that we don't lose the data should something happen to one or more of these bits of older hardware.

Needless to say, the server was running full, so a few months ago I replaced the 4x2T hard drives that I originally put in the server with 4x6T ones, and there was much rejoicing.

But then I started considering what I was doing. Originally, the intent was for the server to contain DVD rips of my collection; if I were to lose the server, I could always re-rip the collection and recover that way (unless something happened that caused me to lose both at the same time, of course, but I consider that sufficiently unlikely that I don't want to worry about it). Much of the new data on the server, however, cannot be recovered like that; if the server dies, I lose my photos forever, with no way of recovering them. Obviously that can't be okay.

So I started looking at options to create backups of my data, preferably in ways that make it easily doable for me to automate the backups -- because backups that have to be initiated are backups that will be forgotten, and backups that are forgotten are backups that don't exist. So let's not try that.

When I was still self-employed in Belgium and running a consultancy business, I sold a number of lower-end tape libraries for which I then configured bacula, and I preferred a solution that would be similar to that without costing an arm and a leg. I did have a look at a few second-hand tape libraries, but even second hand these are still way outside what I can budget for this kind of thing, so that was out too.

After looking at a few solutions that seemed very hackish and would require quite a bit of handholding (which I don't think is a good idea), I remembered that a few years ago, I had a look at the Amazon Storage Gateway for a customer. This gateway provides a virtual tape library with 10 drives and 3200 slots (half of which are import/export slots) over iSCSI. The idea is that you install the VM on a local machine, you connect it to your Amazon account, you connect your backup software to it over iSCSI, and then it syncs the data that you write to Amazon S3, with the ability to archive data to S3 Glacier or S3 Glacier Deep Archive. I didn't end up using it at the time because it required a VMware virtualization infrastructure (which I'm not interested in), but I found out that these days, they also provide VM images for Linux KVM-based virtual machines (amongst others), so that changes things significantly.

After making a few calculations, I figured out that for the amount of data that I would need to back up, I would require a monthly budget of somewhere between 10 and 20 USD if the bulk of the data would be on S3 Glacier Deep Archive. This is well within my means, so I gave it a try.

The VM's technical requirements state that you need to assign four vCPUs and 16GiB of RAM, which just so happens to be the exact amount of RAM and CPU that my physical home server has. Obviously we can't do that. I tried getting away with 4GiB and 2 vCPUs, but that didn't work; the backup failed out after about 500G out of 2T had been written, due to the VM running out of resources. On the VM's console I found complaints that it required more memory, and I saw it mention something in the vicinity of 7GiB instead, so I decided to try again, this time with 8GiB of RAM rather than 4. This worked, and the backup was successful.

As far as bacula is concerned, the tape library is just a (very big...) normal tape library, and I got data throughput of about 30M/s while the VM's upload buffer hadn't run full yet, with things slowing down to pretty much my Internet line speed when it had. With those speeds, Bacula finished the backup successfully in "1 day 6 hours 43 mins 45 secs", although the storage gateway was still uploading things to S3 Glacier for a few hours after that.
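As a quick sanity check on those numbers, the average rate over the whole run can be worked out from the duration that Bacula reported. This is a rough sketch that assumes the backup was about 2T in total (the figure mentioned above for the first attempt) and uses decimal units; the exact size written to tape is not something I measured here.

```shell
#!/bin/sh
# Duration reported by Bacula: 1 day 6 hours 43 mins 45 secs.
secs=$(( 1*86400 + 6*3600 + 43*60 + 45 ))

# Assumed backup size: about 2T, expressed in (decimal) megabytes.
size_mb=$(( 2 * 1000 * 1000 ))

# Average throughput over the whole run, rounded down.
avg=$(( size_mb / secs ))
echo "$secs seconds, about $avg MB/s on average"
```

An average well below the initial 30M/s fits with the backup spending much of its time limited by the Internet line rather than by the local upload buffer.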

All in all, this seems like a viable backup solution for large(r) amounts of data, although I haven't yet tried to perform a restore.

GR procedures and timelines

A vote has been proposed in Debian to change the formal procedure in Debian by which General Resolutions (our name for "votes") are proposed. The original proposal is based on a text by Russ Allbery, which changes a number of rules to be less ambiguous and, frankly, less weird.

One thing Russ' proposal does, however, which I am absolutely not in agreement with, is to add an absolutely hard time limit after three weeks. That is, in the proposed procedure, the discussion time will be two weeks initially (unless the Debian Project Leader chooses to reduce it, which they can do by up to one week), and it will be extended if more options are added to the ballot; but after three weeks, no matter where the discussion stands, the discussion period ends and Russ' proposed procedure forces us to go to a vote, unless all proposers of ballot options agree to withdraw their option.

I believe this is a big mistake. I think any procedure we come up with should allow for the possibility that we may end up with a situation where everyone agrees that extending the discussion time a short time is a good idea, without necessarily resetting the whole discussion time to another two weeks (modulo a decision by the DPL).

At the same time, any procedure we come up with should try to avoid the possibility of process abuse by people who would rather delay a vote ad infinitum than to see it voted upon. A hard time limit certainly does that; but I believe it causes more problems than it solves.

I think instead that it is necessary for any procedure to allow for the discussion time to be extended as long as a strong enough consensus exists that this would be beneficial.

As such, I have proposed an amendment to Russ' proposal (a full version of my proposed constitution can be seen on salsa) that hopefully solves these issues in a novel way: it allows anyone to request an extension to the discussion time, which then needs to be sponsored according to the same rules as a new ballot option. If the time extension is successfully created, those who supported the extension can then also no longer propose any new ones. Additionally, after 4 weeks, the proposed procedure allows anyone to object, so that 4 weeks is probably the practical limit -- although going beyond that remains possible if there is enough support to extend the discussion time (or not enough to end it). The full rules involve slightly more than that (I don't like to put too much formal language in a blog post), but they're not too complicated, I think.

That proposal has received a number of seconds, but after a week it hasn't yet reached the constitutional requirement for the option to be on the ballot.

So, I guess this is a public request for more support for my proposal. If you're a Debian Developer and you agree with me that my proposed procedure is better than the alternative, please step forward and let yourself be heard.

Thanks!

SReview::Video is now Media::Convert

SReview, the video review and transcode tool that I originally wrote for FOSDEM 2017 but which has since been used for DebConfs and MiniDebConfs as well, has long had a sizeable component for inspecting media files with ffprobe, and generating ffmpeg command lines to convert media files from one format to another.

This component, SReview::Video (plus a number of supporting modules), is really not tied very much to the SReview webinterface or the transcoding backend. That is, the webinterface and the transcoding backend obviously use the ffmpeg handling library, but they don't provide any services that SReview::Video could not live without. It did use the configuration API that I wrote for SReview, but disentangling that turned out to be very easy.

As I think SReview::Video is actually an easy-to-use, flexible API, I decided to refactor it into Media::Convert, and have just uploaded the latter to CPAN itself.

The intent is to refactor the SReview webinterface and transcoding backend so that they will also use Media::Convert instead of SReview::Video in the near future -- otherwise I would end up maintaining everything twice, and then what's the point. This hasn't happened yet, but it will soon (this shouldn't be too difficult after all).

Unfortunately Media::Convert doesn't currently install cleanly from CPAN, since I made it depend on Alien::ffmpeg which currently doesn't work (I'm in communication with the Alien::ffmpeg maintainer in order to get that resolved), so if you want to try it out you'll have to do a few steps manually.

I'll upload it to Debian soon, too.

Planet Grep, now with https

It's been long overdue, but Planet Grep now does the https dance (i.e., if you try to use an unencrypted connection, it will redirect you to https). Thank you, Let's Encrypt!

I hadn't previously done this because some blogs that we carry might link to http-only images; but really, that shouldn't matter, and we can make Planet Grep itself be an https site even if some of the content is http-only.

Enjoy!

SReview and pandemics

The pandemic was a bit of a mess for most FLOSS conferences. The two conferences that I help organize -- FOSDEM and DebConf -- are no exception. In both conferences, I do essentially the same work: as a member of both video teams, I manage the postprocessing of the video recordings of all the talks that happened at the respective conference(s). I do this by way of SReview, the online video review and transcode system that I wrote, which essentially crowdsources the manual work that needs to be done, and automates as much as possible of the workflow.

The original version of SReview consisted of a database, a (very basic) Mojolicious-based webinterface, and a bunch of Perl scripts which would build and execute ffmpeg command lines using string interpolation. As a quick hack that I needed to get working while writing it in my spare time in half a year, that approach was workable and resulted in successful postprocessing after FOSDEM 2017, and a significant improvement in time from the previous years. However, I did not end development with that, and since then I've replaced the string interpolation with an object oriented API for generating ffmpeg command lines, as well as modularized the webinterface. Additionally, I've had help reworking the user interface into a system that is somewhat easier to use than my original interface, and have slowly but surely added more features to the system so as to make it more flexible, as well as support more types of environments for the system to run in.

One of the major issues that still remains with SReview is that the administrator's interface is pretty terrible. I had been planning on revamping that for 2020, but then massive numbers of people got sick, travel was banned, and both the conferences that I work on were converted to an online-only conference. These have some very specific requirements; e.g., both conferences allowed people to upload a prerecorded version of their talk, rather than doing the talk live; since preprocessing a video is, technically, very similar to postprocessing it, I adapted SReview to allow people to upload a video file that it would then validate (in terms of length, codec, and apparent resolution). This seems easy to do, but I decided to implement this functionality so that it would also allow future use for in-person conferences, where occasionally a speaker requests that modifications be made to the video file in a way that SReview is unable to do. This made it marginally more involved, but at least will mean that a feature which I had planned to implement some years down the line is now already implemented. The new feature works quite well, and I'm happy I've implemented it in the way that I have.

In order for the "upload" processing and the "post-event" processing to not be confused, however, I decided to import the conference schedules twice: once as the conference itself, and once as a shadow version of that conference for the prerecordings. That way, I could track the progress through the system of the prerecording completely separately from the progress of the postprocessing of the video (which adds opening/closing credits, and transcodes to multiple variants of the same video). Schedule parsing was something that had not been implemented in a generic way yet, however; since that made doubling the schedule in that way rather complex, I decided to bite the bullet and (finally) implement schedule parsing in a generic way. Currently, schedule parsers exist for two formats (Pentabarf XML and the Wafer variant of that same format which is almost, but not quite, entirely the same). The API for that is quite flexible, and I'm happy with the way things have been implemented there. I've also implemented a set of "virtual" parsers, which allow mangling the schedule in various ways (by either filtering out talks that we don't want, or by generating the shadow version of the schedule that I talked about earlier).

While the SReview settings have reasonable defaults, occasionally the output of SReview is not entirely acceptable, due to more complicated matters that then result in encoding artifacts. As a result, the DebConf video team has been doing a final review step, completely outside of SReview, to ensure that such encoding artifacts don't exist. That seemed suboptimal, so recently I've been working on integrating that into SReview as well. First tests have been run, and seem to be acceptable, but there are still a few loose ends to be finalized.

As part of this, I've also reworked the way comments could be entered into the system. Previously the presence of a comment would signal that the video has some problems that an administrator needed to look at. Unfortunately, that was causing some confusion, with some people even thinking it's a good place to enter a "thank you for your work" style of comment... which it obviously isn't. Turning it into a "comment log" system instead fixes that, and also allows for better two-way communication between administrators and reviewers. Hopefully that'll improve things in that area as well.

Finally, the audio normalization in SReview -- for which I've long used bs1770gain -- is having problems. First of all, bs1770gain will sometimes alter the timing of the video or audio file that it's passed, which is very problematic if I want to process it further. There is an ffmpeg loudnorm filter which implements the same algorithm, so that should make things easier to use. Secondly, the author of bs1770gain is a strange character that I'd rather not be involved with. Before I knew about the loudnorm filter I didn't really have a choice, but now I can just rip bs1770gain out and replace it by the loudnorm filter. That will fix various other bugs in SReview, too, because SReview relies on behaviour that isn't actually there (but which I didn't know at the time when I wrote it).
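A minimal sketch of what a loudnorm-based replacement could look like on the command line. This is not the code SReview uses; the file names are placeholders, and the target values (-23 LUFS integrated loudness, -1 dBTP true peak, loudness range 7) are assumptions chosen for illustration.

```shell
#!/bin/sh
# Filter description for ffmpeg's loudnorm filter.
# The target values below are assumptions, not SReview's settings.
loudnorm_filter="loudnorm=I=-23:TP=-1:LRA=7"

# normalize INPUT OUTPUT
# Copy the video stream untouched and run the audio through loudnorm.
# (Hypothetical helper name.)
normalize() {
    ffmpeg -i "$1" -c:v copy -af "$loudnorm_filter" "$2"
}

# Only run when ffmpeg is installed and the placeholder input exists.
if command -v ffmpeg >/dev/null 2>&1 && [ -f talk.mkv ]; then
    normalize talk.mkv talk-normalized.mkv
fi
```

Because the normalization happens inside the same ffmpeg invocation as the rest of the processing, there is no separate tool in between that could change the timing of the file.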

All in all, the past year-and-a-bit has seen a lot of development for SReview, with multiple features being added and a number of long-standing problems being fixed.

Now if only the pandemic would subside, allowing the whole "let's do everything online only" wave to cool down a bit, so that I can finally make time to implement the admin interface...

Freenode

Bye, Freenode

I have been on Freenode for about 20 years, since my earliest involvement with Debian in about 2001. When Debian moved to OFTC for its IRC presence way back in 2006, I hung around on Freenode somewhat since FOSDEM's IRC channels were still there, as well as for a number of other channels that I was on at the time (not anymore though).

This is now over and done with. What's happening with Freenode is a shitstorm -- one that could easily have been fixed if one particular person had stepped down a few days ago, but by now is a lost cause.

At any rate, I'm now lurking, mostly for FOSDEM channels, on libera.chat, under my usual nick, as well as on OFTC.

Twenty years of Debian

Ten years ago, I reflected on the fact that -- by that time -- I had been in Debian for just over ten years. This year, in early February, I've passed the twenty year milestone. As I'm turning 43 this year, I will have been in Debian for half my life in about three years. Scary thought, that.

In the past ten years, not much has changed, and yet at the same time, much has. I became involved in the Debian video team; I stepped down from the m68k port; and my organizing of the Debian devroom at FOSDEM resulted in me eventually joining the FOSDEM orga team, where I eventually ended up also doing video. As part of my video work, I wrote SReview, for which, in these COVID-19 times, I have had to spend much of my spare time writing new code and/or fixing bugs.

I was a candidate for the position of DPL one more time, without being elected. I was also a candidate for the technical committee a few times, also without success.

I also added a few packages to the list of packages that I maintain for Debian; most obviously this includes SReview, but there are also things like extrepo and policy-rcd-declarative, both fairly recent packages that I hope will improve Debian as a whole in the longer term.

On a more personal level, at one DebConf I met a wonderful girl with whom I have now just celebrated my first wedding anniversary. Before that could happen, I had to move to South Africa two years ago. Moving is an involved process at any one time; moving to a different continent altogether is even more so. As it would have been complicated and involved to remain a business owner of a Belgian business while living 9500km away from the country, I sold my shares to my (now ex) business partner; it turned the page of a 15-year chapter of my life, something I could not do without feelings one way or the other.

The things I do in Debian have changed over the past twenty years. I've been the maintainer of the second-highest number of packages in the project when I maintained the Linux Gazette packages; I've been an m68k porter; I've been an AM, and briefly even an NM frontdesk member; I've been a DPL candidate three times, and a TC candidate twice.

At the turn of my first decade of being a Debian Developer, I noted that people started to recognize my name, and that I started to be one of the Debian Developers who had been with the project longer than most. This has, obviously, not changed. New in the "I'm getting old" department is the fact that during the last DebConf, I noticed for the first time that there was a speaker who had been alive for less long than I had been a Debian Developer. I'm assuming these types of things will continue happening in the next decade, and that the future will bring more of these kinds of changes that will make me feel older as I and the project mature more.

I'm looking forward to it. Here's to you, Debian; may you continue to influence my life, in good ways and in bad (but hopefully mostly good), as well as continue to inspire me to improve the world, as you have over the past twenty years!

Dear Google

... Why do you have to be so effing difficult about a YouTube API project that is used for a single event per year?

FOSDEM creates 600+ videos on a yearly basis. There is no way I am going to manually upload 600+ videos through your webinterface, so we use the API you provide, using a script written by Stefano Rivera. This script grabs video filenames and metadata from a YAML file, and then uses your APIs to upload said videos with said metadata. It works quite well. I run it from cron, and it uploads files until the quota is exhausted, then waits until the next time the cron job runs. It runs so well that the first time we used it, we could upload 50+ videos on a daily basis, and so the uploads were done as soon as all the videos were created, which was a few months after the event. Cool!

The second time we used the script, it did not work at all. We asked one of our keynote speakers, who happened to be some hotshot at your company, to help us out. He contacted the YouTube people, and whatever had been broken was quickly fixed, so yay, uploads worked again.

I found out later that this is actually a normal thing if you don't use your API quota for 90 days or more. Because it's happened to us every bloody year.

For the 2020 event, rather than going through back channels (which happened to be unavailable this edition), I tried to use your normal ways of unblocking the API project. This involves creating a screencast of a bloody command line script and describing various things that don't apply to FOSDEM and ghaah shoot me now so meh, I created a new API project instead, and had the uploads go through that. Doing so gives me a limited quota that only allows about 5 or 6 videos per day, but that's fine, it gives people subscribed to our channel the time to actually watch all the videos while they're being uploaded, rather than being presented with a boatload of videos that they can never watch in a day. Also it doesn't overload subscribers, so yay.

About three months ago, I started uploading videos. Since then, every day, the "fosdemtalks" channel on YouTube has published five or six videos.

Given that, imagine my surprise when I found this in my mailbox this morning...

[Image: Google lies, claiming that my YouTube API project hasn't been used for 90 days and informing me that it will be disabled]

This is an outright lie, Google.

The project was created 90 days ago, yes, that's correct. It has been used every day since then to upload videos.

I guess that means I'll have to deal with your broken automatic content filters to try and get stuff unblocked...

... or I could just give up and not do this anymore. After all, all the FOSDEM content is available on our public video host, too.
