Dot or Not? A type safety story about file extensions
Let’s write a function that takes in a file extension and returns a file type:
function fileType(fileExtension: string) {
  switch (fileExtension) {
    case ".mp4": return "video";
    // ...
  }
}
Our function assumes the file extension has a leading dot, but should it?
- The Python standard library includes the dot in file extensions.
- Rust takes the opposite approach and omits the dot.
No worries, we can be flexible and accept either format:
function fileType(fileExtension: string) {
  switch (fileExtension) {
    case "mp4":
    case ".mp4": return "video";
    // ...
  }
}
If this is the only place in the code base that will ever need to deal with file extensions, we’re safe. However, it’s more likely we’ve introduced a dormant logic bug that awakens when another function doesn’t follow this implicit convention and can’t handle both formats.
function uploadSupported(fileExtension: string) {
  switch (fileExtension) {
    case ".mp4": return true;
    // ...
  }
}
const extension = "mp4";
const ftype = fileType(extension);
const supported = uploadSupported(extension);
What happens now? If the functions don’t eagerly throw an error (i.e., maybe not all file types are expected to be uploadable), we may find ourselves far away from the source of the bug, wondering why the application misbehaves once an implicit invariant has been violated.
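To make the failure mode concrete, here is a small runnable sketch of the two functions above. The `default` branches and return types are my assumptions, since the original snippets elide everything past the first case.

```typescript
// Accepts both "mp4" and ".mp4".
function fileType(fileExtension: string): string | undefined {
  switch (fileExtension) {
    case "mp4":
    case ".mp4":
      return "video";
    default:
      return undefined; // assumed fallback; the original elides this branch
  }
}

// Only understands the dotted form.
function uploadSupported(fileExtension: string): boolean {
  switch (fileExtension) {
    case ".mp4":
      return true;
    default:
      return false; // assumed fallback; the original elides this branch
  }
}

const extension = "mp4";
const ftype = fileType(extension); // "video"
const supported = uploadSupported(extension); // false, and no error is raised
console.log(ftype, supported);
```

Both calls succeed, yet they disagree about whether the same value is a known extension, which is exactly the kind of silent mismatch that surfaces far from its cause.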
How can we solve this problem with type safety?
Does creating a file extension class help?
If we update our function signatures to take arguments typed as an Extension, we should be able to avoid this pitfall. Let’s write a class for this.
class Extension {
  constructor(readonly extension: string) {
    // ... snip canonicalization
  }
}
function fileType(fileExtension: Extension) {
  // ...
}
But I don’t recommend this approach.
The most pressing problem is referential equality:
const val = new Extension('.png');
const val2 = new Extension('.png');
console.log(val === val2); // false :(
So much of React and the web depends on referential equality that it’s a property we can’t lose.
We can work around it by reworking our class to use interned instances to emulate a value object:
const extensionIntern = new Map<string, Readonly<Extension>>();
class Extension {
  private constructor(readonly extension: string) {}
  private static addToMap(extension: string) {
    const value = Object.freeze(new Extension(extension));
    extensionIntern.set(extension, value);
    return value;
  }
  static from(extension: string) {
    // ... snip canonicalization
    return extensionIntern.get(extension) ?? Extension.addToMap(extension);
  }
}
const val = Extension.from('.png');
const val2 = Extension.from('.png');
console.log(val === val2); // true :)
We gained value semantics, but now we need to concern ourselves with cleaning up unused instances. If we’re willing to further complicate our code, JavaScript’s FinalizationRegistry lets us register a callback that runs after an object has been garbage collected.
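As a rough sketch of what that extra complexity might look like, the intern map can hold `WeakRef`s so it doesn't keep instances alive, with a `FinalizationRegistry` callback pruning dead entries. This is one possible arrangement and not something the original code does; the field names `interned` and `registry` are hypothetical, and it assumes an ES2021 (or later) runtime and `lib` setting.

```typescript
class Extension {
  // Weak references let the garbage collector reclaim unused instances.
  private static readonly interned = new Map<string, WeakRef<Extension>>();

  // The callback receives the held value (the map key) some time after
  // the registered instance has been collected. Timing is not guaranteed.
  private static readonly registry = new FinalizationRegistry<string>((key) => {
    const ref = Extension.interned.get(key);
    // Only delete if the entry is really dead; it may have been re-created.
    if (ref !== undefined && ref.deref() === undefined) {
      Extension.interned.delete(key);
    }
  });

  private constructor(readonly extension: string) {}

  static from(extension: string): Extension {
    // ... canonicalization would go here, as in the original
    const existing = Extension.interned.get(extension)?.deref();
    if (existing !== undefined) return existing;

    const value = Object.freeze(new Extension(extension));
    Extension.interned.set(extension, new WeakRef(value));
    Extension.registry.register(value, extension);
    return value;
  }
}

const val = Extension.from(".png");
const val2 = Extension.from(".png");
console.log(val === val2); // true while a strong reference is still held
```

That is a fair amount of machinery just to compare two file extensions, which supports the point that the class approach is costly.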
And we haven’t considered deserialization, which exposes our internal details and requires an additional abstraction to be able to convert, or at the very least name the conversion, between the wire and in-memory representation.
const req = JSON.stringify(Extension.from(".png"));
const res/*: ??? */ = JSON.parse(req);
const result = Extension.from(res.extension);
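One way to at least name that conversion is to give the wire shape its own type and funnel parsing through a single helper. The sketch below assumes a simplified interned class; `ExtensionWire` and `fromWire` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical wire shape: what JSON.stringify emits for the class below.
type ExtensionWire = { extension: string };

const interned = new Map<string, Extension>();

class Extension {
  private constructor(readonly extension: string) {}

  static from(extension: string): Extension {
    // ... canonicalization elided, as in the original
    let value = interned.get(extension);
    if (value === undefined) {
      value = Object.freeze(new Extension(extension));
      interned.set(extension, value);
    }
    return value;
  }

  // Hypothetical helper that names the wire -> in-memory conversion.
  static fromWire(wire: ExtensionWire): Extension {
    return Extension.from(wire.extension);
  }
}

const req = JSON.stringify(Extension.from(".png")); // '{"extension":".png"}'
const res = JSON.parse(req) as ExtensionWire;
const result = Extension.fromWire(res);
console.log(result === Extension.from(".png")); // true
```

The cast on `JSON.parse` is still unchecked, so this only documents the wire format; it does not validate it.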
So far adding type safety looks to be more effort than it’s worth. What else is there?
Do string literal types help?
Leveraging TypeScript’s template literal types (string literal types with placeholders), we can do something pretty slick and force the caller to include the leading period.
type Extension = `.${string}` | "";
function fileType(fileExtension: Extension) {
  switch (fileExtension) {
    case ".mp4": return "video";
    // ...
  }
}
fileType("mp4") // compilation error!
We keep our value semantics and serialization, but despite its beauty, we have ourselves a big problem. We have lost canonicalization.
It’s only a matter of time before a caller rushing to finish their Jira ticket adds an erroneous dot!
const myExt: string = ".mp4";
const ftype = fileType(`.${myExt}`);
// ^ we needed to add the dot to satisfy the typescript compiler
// but now we have too many leading dots
We can certainly add a canonicalization function that returns an Extension, but the Extension string literal type itself doesn’t convey validity. It’s more of a suggestion, and we’d probably need to re-canonicalize in each function. Sounds expensive and error prone.
fileType(".tar.gz");
// ".gz" is the correct file extension,
// but typescript doesn't complain
As far as I know, TypeScript template literal types can’t express “must start with a period and not contain a period anywhere else in the string.”
Tagged Types
Creating an abstraction that leads to a pit of success has been exasperating so far, but tagged types can save us. Tagged types, also known as opaque and branded types, are a way to introduce a nominal type system into TypeScript’s structural world. They aren’t quite as robust as Rust’s newtypes but are close enough.
In other words, tagged types allow us to create an Extension that, to the JavaScript runtime, acts and works like a string, but TypeScript will forbid assigning a plain string to an Extension.
First comes the setup.
// use type-fest if you don't want to reinvent it yourself
// and want a more powerful type
// import { type Tagged } from 'type-fest';
declare const tag: unique symbol;
export type Extension = string & {
  readonly [tag]: Extension;
};
export const Extension = {
  from: (input: string): Extension => {
    // ... snip canonicalization
    // Once we're ready, perform the cast
    return input as Extension;
  },
};
The code might seem a bit impenetrable, but the important parts are the type, which communicates that an Extension is not just a string, and the cast, which produces one without influencing the runtime.
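The canonicalization step is elided in the original, so the sketch below fills it with one hypothetical rule purely for illustration: keep only the text after the last dot and store it with a single leading dot (or as the empty string when there is no extension). Real code may want different rules, such as case folding.

```typescript
declare const tag: unique symbol;
type Extension = string & {
  readonly [tag]: Extension;
};

const Extension = {
  from: (input: string): Extension => {
    // Hypothetical canonicalization, not the author's:
    // lastIndexOf returns -1 when there is no dot, so slice(0) keeps everything.
    const bare = input.slice(input.lastIndexOf(".") + 1);
    const canonical = bare === "" ? "" : `.${bare}`;
    // The cast has no runtime effect; the value is still a plain string.
    return canonical as Extension;
  },
};

console.log(Extension.from("mp4")); // ".mp4"
console.log(Extension.from("..mp4")); // ".mp4"
console.log(Extension.from(".tar.gz")); // ".gz"
console.log(typeof Extension.from("mp4")); // "string"
```

Because every Extension passes through `from`, callers no longer have to remember whether the dot belongs in the value.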
Let’s see how to use this.
function fileType(fileExtension: Extension) {
  switch (fileExtension) {
    case '.mp4':
      return 'video';
    // ...
  }
}
fileType(Extension.from("mp4"));
Our function communicates that it only accepts valid extensions, but there are a couple issues.
If someone triggers IntelliSense on an extension, every string method shows up, so callers can perform any string operation on it. This seems unintended.
TypeScript exposes these methods because Extension extends the string type, so any string method is valid on it too.
But the bigger problem is that using an Extension still requires knowledge of its internal representation. Does one compare against an extension with a dot or not?
function fileType(fileExtension: Extension) {
  switch (fileExtension) {
    // spot the bug :o
    case 'mpeg':
    case '.mp4':
      return 'video';
    // ...
  }
}
The trick here is more TypeScript shenanigans: declaring that an Extension is really an unknown data type. It requires a bit more rigamarole with additional casting but achieves the developer experience we’ve been searching for:
declare const tag: unique symbol;
export type Extension = unknown & {
  readonly [tag]: Extension;
};
export const Extension = {
  from: (input: string): Extension => {
    // ... snip canonicalization
    return input as unknown as Extension;
  },
};
Keeping Extension opaque to users isn’t guaranteed. One can use runtime constructs (like typeof) to recover the inner type, but if developers are bending over backwards to use these constructs with an extension, we have bigger problems.
There are two ways we can make the developer experience better.
The first is to define constants to use:
export const Extension = {
  from: (input: string): Extension => { /* ... */ },
  formats: {
    MP4: '.mp4' as unknown as Extension,
  },
}
function fileType(fileExtension: Extension) {
  switch (fileExtension) {
    case Extension.formats.MP4:
      return 'video';
    // ...
  }
}
Another way is to add helper methods where the extension client explicitly dictates what format they want it in.
export const Extension = {
  from: (input: string): Extension => { /* ... */ },
  dotted: (input: Extension) => `.${input}` as const,
  dotless: (input: Extension) => input as unknown as string,
};
We even get a bit of type safety with the dotted method in case someone tries to compare against a string that doesn’t start with a dot.
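Putting the helper approach together, here is a minimal end-to-end sketch. It assumes the canonical in-memory form is dotless (as the `dotted` and `dotless` helpers imply) and uses a hypothetical canonicalization that strips leading dots. The explicit return type on `dotted` is my addition, so the template literal type doesn't depend on how the compiler infers `.${input}` for an opaque value.

```typescript
declare const tag: unique symbol;
type Extension = unknown & {
  readonly [tag]: Extension;
};

const Extension = {
  // Hypothetical canonicalization: store the extension without leading dots.
  from: (input: string): Extension =>
    input.replace(/^\.+/, "") as unknown as Extension,
  // Callers state which format they want instead of guessing.
  dotted: (input: Extension): `.${string}` => `.${input as unknown as string}`,
  dotless: (input: Extension): string => input as unknown as string,
};

function fileType(fileExtension: Extension): string | undefined {
  switch (Extension.dotted(fileExtension)) {
    case ".mp4":
      return "video";
    // case "mpeg": should be rejected by the compiler, because "mpeg"
    // is not comparable to the `.${string}` type being switched on.
    default:
      return undefined; // assumed fallback for unknown extensions
  }
}

console.log(fileType(Extension.from("mp4"))); // "video"
console.log(fileType(Extension.from(".mp4"))); // "video"
console.log(Extension.dotless(Extension.from(".mp4"))); // "mp4"
```

The internal representation could be flipped to dotted without touching any caller, since callers only ever see the output of `dotted` or `dotless`.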
It’s worth repeating that we achieved everything without a runtime cost. Serialization just works:
const data = JSON.stringify(Extension.from("mp4"));
const myVal: Extension = JSON.parse(data);
Though I’d still suggest a parsing stage when accepting any outside data to guarantee that all extensions are canonical.
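Such a parsing stage could be as small as the sketch below. The `parse` helper and the `canonicalize` rule (strip leading dots) are hypothetical; the point is only that untrusted input is checked and canonicalized before it is ever typed as an Extension.

```typescript
declare const tag: unique symbol;
type Extension = unknown & {
  readonly [tag]: Extension;
};

// Hypothetical canonicalization: store the extension without leading dots.
const canonicalize = (input: string): Extension =>
  input.replace(/^\.+/, "") as unknown as Extension;

const Extension = {
  from: canonicalize,
  // Hypothetical parser for data that crosses a trust boundary.
  parse: (input: unknown): Extension => {
    if (typeof input !== "string") {
      throw new TypeError("expected a string file extension");
    }
    return canonicalize(input);
  },
};

// Pretend this payload arrived over the network in a non-canonical form.
const data = JSON.stringify("..mp4");
const myVal = Extension.parse(JSON.parse(data));
console.log(myVal === Extension.from("mp4")); // true
```

Compared with assigning `JSON.parse(data)` directly, this costs one function call and guarantees the value obeys the same invariants as every other Extension.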
Conclusion
Given a sufficiently large code base, one will unknowingly stumble into being glue between two systems or components that operate under different assumptions.
This happened to me. I was tying two frontend sub-systems together and one used dotted extensions and the other didn’t, and I didn’t realize the discrepancy until I drilled through multiple layers to see how the extensions were treated.
The hour (or two) that I spent on this bug was an hour (or two) too long and I wanted to demonstrate techniques to avoid this pitfall in the future.
This isn’t the first time I’ve written about tagged types, but I find it worth repeating. If there is ever a time where it’s ambiguous how to use a value, what it means, or if it is even valid, reach for tagged types. Your future self will thank you.
You’ll start to notice opportunities all over the place. One common case is IDs. In an application there may be a dozen different kinds of IDs. Wouldn’t it be nice to add type safety to the equation so two kinds of ID are never conflated?
This type safety makes naming easier too. What’s more intuitive:
type MyObj = {
  fileId: string
}
// or ...
type MyObj = {
  id: FileId
}
Imagine interfacing with a new system that also has a file id. You’d have to risk conflation or disambiguate by renaming the field, which could mean breaking changes in code and wire format. Our type of FileId doesn’t need to change, as FileIdSystemX can’t be mistaken for a FileId.
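A minimal sketch of two such ID types follows. The tag names, the `from` constructors, and the `label` function are hypothetical and exist only to show that the compiler keeps the two kinds of ID apart even though both are plain strings at runtime.

```typescript
declare const fileIdTag: unique symbol;
declare const fileIdSystemXTag: unique symbol;

type FileId = string & { readonly [fileIdTag]: true };
type FileIdSystemX = string & { readonly [fileIdSystemXTag]: true };

// Hypothetical constructors; real code would validate before casting.
const FileId = { from: (raw: string): FileId => raw as FileId };
const FileIdSystemX = {
  from: (raw: string): FileIdSystemX => raw as FileIdSystemX,
};

type MyObj = {
  id: FileId;
};

function label(obj: MyObj): string {
  return `file ${obj.id}`;
}

const ours = FileId.from("abc");
const theirs = FileIdSystemX.from("abc");

console.log(label({ id: ours })); // "file abc"
console.log(typeof theirs); // "string"

// label({ id: theirs });
// ^ should be rejected by the compiler: a FileIdSystemX lacks the FileId tag,
//   even though both values are the same string at runtime.
```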
Primitives are for primitives.