Timothy Lethbridge's ideas on Technology and Politics: Give downloadable files more meaningful names

More and more websites give the option to download a files containing such things as bills, software installers (or source code), receipts, posters, brochures, e-books and reports. Once downloaded, people need to file these for extended periods.

Unfortunately it is common for the downloaded files to have meaningless names such as 'Document', 'Results', 'File', 'Report', 'Data', some arbitrary internal identifier, or other names that are not as helpful as they should be.

I propose the following general guidelines for downloadable files. Any downloaded file name should include:

Some way to identify the file's origin, such as the company name or website
Some succinct way to identify the type of content, such as 'bill', 'receipt', the software product etc.
The date. Files stored on computers have a date created by the operating system. But this is not part of the filename and is reset when you copy or edit a file. When the date is intrinsic to the data in a file, such as the date of issue of a report or receipt, the month of a bill, the release date of software or a press release, etc. then this information should be part of the filename on a permanent basis. Very few downloaded filenames have the date. The date should be in yyyy-mm-dd format to facilitate sorting by date. The date should be the date the date being sent was created, not the date of download. If it is a monthly report, always issued at the end of the month, then the day can be omitted.
If appropriate, a version number. Many software downloads have this, but too many do not.
If necessary, some way to ensure downloads of the same type but with different content are distinct. For example, if you are an avid sports fan and regularly download spreadsheets with sports statistics, you might want the time to which the data is valid, not just the date, as part of the filename.

For example:

SurveyMonkey's downloads appear as 'Results.zip'. When you unzip the file, the internal contents are a little nicer, for example 'SurveySummary_' followed by the date at which data was last updated. However, there is no way to distinguish among the various different surveys I am running, nor the filters or collectors I have applied when generating the data.
A recent download of Silverlight just said 'Silverlight.dmg'. There was no version number or release date.
The paystubs I receive from my employer say 'paystub' followed by a serial number, but do not identify the date of the pay in the filename.
A report I downloaded from the CRTC yesterday (subject of a future blog post probably) at URL http://www.crtc.gc.ca/eng/archive/2011/2011-703.pdf resulted in a filename that just has the year and a serial number, and no identification of its source (CRTC) or content.
When I download my Rogers bill from EPost it says 'RogersBill', but does not include the month of the bill. RogersBill-2011-11-15 would be better. Adding the account number on the end, or perhaps the account-holder's name, might also be useful to account for situations where people have to manage the bills for several accounts. I would even go as far as to tack on the bottom-line total (dollars spent or owning) to bills, receipts and other forms of statements. When later reviewing a long list of receipts, for example, this can prove extremely useful.
When I download an investment report from my investment company, it just says 'ClientReport.pdf'. The company, date and content description are missing. Here, the name of the investor would also be useful, to handle situations where one is downloading reports for different family members.

And on and on. In all these cases, people are forced to edit the filename after download. I can't count how many times I have found a file in my downloads folder and had to open it to remember its contents before editing its filename and then putting it in the correct place in my disk.

This aspect of usability seems to be overlooked by a very high percentage of software engineers and web designers. All the designers seem to think about is the content of the files, and perhaps the URL and the location of the file on the server. They forget that the file will have a life of its own and need to be identifiable once it leaves the server and resides on a client's computer.

One can, of course, go to far. The filename must be kept to a manageable length; at one time the absolute maximum was 8 characters in DOS (not including the extension), then 32 characters, and now typically about 255 bytes (which means fewer than 255 characters if non-ASCII Unicode characters are used). A human-usable limit is, I think, about 60 characters. There is a useful Wikipedia page describing the absolute limits and the characters that can be used. The following are some fictitious examples of meaningful download filenames that are kept to a reasonable length, adhere to the guidelines above, and are hopefully self-explanatory.

TeledirectCableBill-2011-11--66.13--acct9876543.pdf
Supersoft-FuriousCowsInstaller-2011-11-16-v1.2.3.dmg
GovNY-TaxLaw-ProposedChgs-2011-11-16-rpt87924.pdf

I recommend not using spaces in names to make it easier for certain programs, and shell scripts to process files. CamelCase and hyphens are useful to separate elements.

Related situations

If the ability to download a pdf file is not available, it is common for people to save bills, receipts, and similar documents using various 'print to pdf' capabilities, such as those built into the MacOS X print function. In these situations, the created pdf file will have a name derived from the html page's title. All rules I described above should therefore also be applied to the html title for pages that are commonly printed (bills, receipts, etc). I would go so far as to use a filename convention without spaces in the title of such html pages, specifically to facilitate one-step printing to pdf.

When emailing files to people, many senders just grab a file from their filesystem and add it as an attachment. People should get in the habit of giving meaningful names to files when they store them in their filesystem, and not just rely on identifying the file by way of its location in the filesystem; that helps prevent sending attachments with meaningless names. All too often I have received and had to save attachments with names such as report.docx.

Even when sending JPEG or other media files, the naming convention comes into play. I love the Olympus style of naming jpegs that embeds the date of the picture and a serial number right in the filename, e.g. PB171234.JPG for a picture stored in the 11th month of the year (B), on day 17, and with serial-number 1234. Many other cameras use filename formats that are not nearly so useful, like IMG_9999.JPG (see this Wikipedia page for naming convention details). In fact, the file naming convention alone is enough to tilt me towards a certain brand of cameras. When sending or uploading images, people should also consider adding description of the contents to their filenames. All this metadata can be stored using EXIF attributes, but most people aren't sophisticated enough to browse using tool that is EXIF-savvy.

I note with interest that even my own projects are sometimes guilty of inadequate names for downloads. For example in UmpleOnline, you can download a directory of generated code. The download has a name like 'JavaFromUmple.zip'. The source is identified (Umple); the content is identified (Java), but neither the name of the model nor the date and time are currently identified. Both of the latter would be useful. We will work on this problem.

Timothy Lethbridge's ideas on Technology and Politics

Wednesday, November 16, 2011

Give downloadable files more meaningful names

1 comment: