Wikipedia:Wikipedia Signpost/2021-11-29/Technology report

From Wikipedia, the free encyclopedia
Technology report

What does it take to upload a file?

The 2020 Picture of the Year was uploaded using the Commons UploadWizard.
Legoktm is a site reliability engineer for the Wikimedia Foundation. He wrote this in his volunteer capacity.

There is some irony in a piece of software being named MediaWiki while it struggles with media files, but it's not that surprising given that much of Wikipedia's focus and efforts go towards developing text. On the Main Page, you'll most likely have to scroll past multiple sections celebrating written text until you hit the day's featured photo.

Given the recent issues with uploading files, let's take a look into what it actually takes to upload a file to Wikimedia servers.

A brief history

In the very beginning, you needed to email a Bomis employee to place your photo on the server. The initial version of Magnus Manske's PHP-based wiki would accept any file from editors and administrators. Users had to select a checkbox which said, "I hereby affirm that this file is not copyrighted, or that I own the copyright for this file and donate it to Wikipedia." On the server, the only thing it checked was that the hard drive was not more than 96% full, and if so, it would disable all uploads. And it had a polite request for users, "You can upload as many files you like. Please don't try to crash our server, ha ha."

It was not until 2004 that tagging images with copyright statements became a convention. The Creative Commons licenses and templates were introduced, and Wikimedia Commons was first proposed by Eloquence in March of that same year.

In 2009, the Usability Initiative (see past Signpost coverage) brought grant funding for improving the multimedia experience, leading to UploadWizard on Commons and better metadata extraction, among other things. The English Wikipedia's own File Upload Wizard was developed in 2012, offering users a guided method to upload non-free files.

More recently there has been an increased focus on tools to facilitate mass GLAM contributions, such as bulk uploaders like GWToolset and Pattypan.

How it works today

A rough overview of how media storage is organized (from 2014).

Today all media files are stored in an OpenStack Swift cluster, a cloud storage system similar to Amazon S3, so it's unlikely the disks will actually fill up. These files are made available in both of Wikimedia's two primary data centers in Virginia and Texas for redundancy. Users will end up downloading these files from either the data centers, or one of Wikimedia's CDN servers in Amsterdam, San Francisco, and Singapore that is geographically closer to them.

Most users will never see the original file that was uploaded. Instead, a piece of software named Thumbor generates smaller versions of each image, so users viewing an article would only download the size of images they see (also stored in Swift). Videos are scaled in the same way, generating lower-quality versions of high quality uploads (just like on YouTube and other video-sharing sites).

There are three main ways MediaWiki accepts uploads in the backend, each with its pros and cons. First is a direct upload through Special:Upload, which is the original upload interface. This is the simplest form; the entire file is transferred in one go, and available for processing on the server immediately. However, because it is so direct, there's no opportunity for any nice user-facing progress bars, and any failure means the entire upload must be retried. It also only accepts files up to 100MB.

bigChunkedUpload.js is a gadget that lets users see the individual chunks being uploaded.

Then there's chunked uploading, in which a file is split into much smaller pieces, uploaded chunk-by-chunk, and then finally reassembled into one file (via the job queue) and processed. Tools like UploadWizard and bigChunkedUpload are able to provide progress bars for users, and individual chunks can be retried if there's a brief network interruption. It is more complex to implement, but more reliable and flexible, so most upload tools and bots use it. In theory, users can use chunked uploads to upload files up to the maximum file size of 4GB, but there may be some practical issues like server-side timeouts.

Finally, some editors and administrators can upload files by specifying its URL. MediaWiki will download the file from the remote website, and then process it as if the user uploaded it. This is convenient for users, as they don't need to download the file individually before re-uploading it. However, this comes with limitations, as the server needs to be able to download and finish the entire upload in 180 seconds, and downloading from some sources (especially the Internet Archive) might be too slow for that.

Once the file is actually on the server, MediaWiki does some security checks on each file. It's easy to hide arbitrary files (think malware or copyrighted stuff) inside JPEG images, which we want to reject. This was exploited by some users on Wikipedia Zero networks to share pirated films without having it count against their data plans. SVG files can be written in a way that allows triggering cross-site scripting (XSS) attacks—MediaWiki rejects those too. There are also some validity checks, like making sure a file named "Foo.png" is actually a PNG file.

MediaWiki displays the metadata it knows about at the bottom of each file description page.

At this point, MediaWiki will extract some metadata from the file, like its size, geolocation, and other exif fields. This data is stored separately from the file itself for quicker retrieval. PDF and DjVu files that contain text will have that extracted and indexed for search.

Finally, the original file is uploaded to Swift in both of the primary data centers (Virginia and Texas). Even though this step takes place between Wikimedia servers, it is encrypted using HTTPS in case an attacker is able to tap into cross-data center communications. MediaWiki will also instruct Thumbor to pre-generate thumbnails for common sizes, so users see no delay when trying to use them in an article. If a new version of an existing file was uploaded, MediaWiki would also rename the previous version in Swift, and delete all the old thumbnails.

Once a file has been uploaded, there's no way to modify it in MediaWiki itself. Performing basic functions like rotating or cropping needs to be done by external tools like CropTool.

A complex process

The process for uploading files has grown more and more complex as requirements and scale have increased. The current system is rather optimized for delivering users small images quickly, and less so for handling the upload and processing of very large files. For an encyclopedia that is still mostly focused on text and images, that may be fine. But as people ask for and expect more interactive and engaging elements, that may need to change.

There has been no dedicated Wikimedia Foundation development team focusing on backend media development in the past few years (it's debatable whether it ever had one), with critical components like Thumbor being entirely unmaintained at Wikimedia and outdated. For now, many of the gaps are being filled with various tools and gadgets by those who are interested.

And if you desire some nostalgia, you can still send an email file a Phabricator task and sysadmins will upload the files for you.