Split Next.js across hosting providers and advocate for direct S3 uploads

I was migrating a Next.js application (pdx.tools) from a self-hosted instance to serverless and edge functions over on Vercel.

One serverless endpoint accepts an uploaded EU4 save file, parses it via a microservice, uploads to S3, and sticks the parsed data in Postgres. Conceptually, the endpoint can be written:

export async function POST(req: Request) {
  const bytes = new Uint8Array(await req.arrayBuffer());
  const fileId = genId();
  const info = await parseFile(bytes);
  await s3.upload(fileId, bytes);
  await db.insert({ fileId, ...info });
  return NextResponse.json({ fileId });
} 

Error handling has been omitted, but it’s easy to sprinkle in as we control everything.

Unfortunately, when I stressed the endpoint in testing, I encountered the 4.5MB body size limit on Vercel serverless functions. A bummer since only 5% of uploads exceeded this body size limit, and I don’t want to pass this limit onto users, or have to scramble if I ever needed to POST a large JSON payload.

“Pfft”, you say, “everyone knows you should be using presigned upload URLs instead of uploading directly to the server”

Every time someone complains that serverless/tRPC/NextJS/Vercel are bad “because multi-part file upload is hard” I lose a bit of sanity

You’re using S3. Please use presigned POST URLs. aaa[…]

@t3dotgg | Theo - ping.gg | Aug 28, 2022

Fine. It looks like if I want to host on Vercel, presigned uploads are the only workaround. Probably not worth looking at other hosting providers: better the provider you know than the provider you don’t know.

Too bad there’s not a way to seamlessly slice and dice how a Next.js application is packaged between hosting providers. Or is there… Stay tuned, but first I want to elaborate on all the downsides one is buying into when leveraging presigned uploads.

There are many providers of S3 compatible storage besides AWS: Cloudflare R2, Backblaze B2, and Wasabi. So while you may be able to protect your AWS S3 bucket from maliciously large uploads with S3 policies, other providers may not support such a feature (I know of none that do). Now we need to decide: do we want to be able to substitute other S3 providers for potential major cost savings and performance improvements, or are we locking ourselves to AWS S3 for the feature set?

There are plenty of attractive AWS S3 features, to be honest. Amazon S3 Event Notifications are an excellent way to have your code notified when a client side upload finishes. Again, I’m not aware of other S3 providers that have a similar feature.

Programmers may get clever and have the client instruct the server when it is done uploading. The server would then pull down the file and continue processing. But in a video, Theo, the author of the previous tweet, cautioned against these frontend callbacks:

Relying on frontend callbacks is a huge danger source. […] A problem that we encountered at Twitch, and what I’ve seen many codebases have, is that they use S3 presigned POST URLs but they don’t actually keep track of the creation of that file until the client finishes uploading, and tells the server “Hey, I’m done uploading”. Which means, you can put yourself in a state […] where you have ghost files sitting in your bucket that aren’t logged in your database.

Theo goes on to sketch out another problem where the client could lie about what file was done uploading. No wonder he created his own file upload service.

I’m not using AWS S3 and have no interest in migrating. What are my options if S3 event notifications aren’t available and frontend callbacks have issues?

Next.js Long Polling

It’s 2023 and long polling is back.

We can create a Next.js long polling route handler that emits the presigned URL, keeps the response open while polling for the s3 file, and once available, continues processing until the results can be sent back to the client.

// https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream#convert_an_iterator_or_async_iterator_to_a_stream
const iteratorToStream = (iterator) => {
  // [snipped]
};

export function POST() {
  const fileId = genId();
  const encoder = new TextEncoder();
  async function* responseIterator() {
    // give client presigned url to upload to
    const presignedUrl = s3.presignUpload(fileId);
    yield encoder.encode(presignedUrl);

    // poll until the file exists
    while (!(await s3.exists(fileId))) {
      await new Promise((res) => setTimeout(res, 1000));
    }

    const info = await parseFile(s3.urlFor(fileId));
    await db.insert({ fileId, ...info });
    yield encoder.encode(JSON.stringify({ msg: 'done' }));
  }

  return new Response(iteratorToStream(responseIterator()));
}

Lots to think about:

  • Should the client send the number of bytes they will be uploading so that we can verify that it is reasonable and then, if we are targeting AWS S3, create a signed upload url that contains the policy?
  • We need a polling timeout, otherwise we could be waiting forever if the client never completes uploading. What would be a reasonable timeout? Seems like we can use the expected file length to compute a timeout like 10 seconds per megabyte.
  • I’m not aware of any S3 implementation that enforces an upload timeout, so we can still end up with ghost files in S3 if the server timeout is tripped.
  • Is waiting 1 second the right retry logic? S3 HEAD requests are cheap, but depending on the S3 provider, it won’t be free. We now need to balance the optimal solution in terms of cost and latency.
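The timeout and retry bullets can be folded into one bounded loop. Here is a minimal sketch, assuming the existence check is passed in as a plain function and using the 10-seconds-per-megabyte budget from above (all names are illustrative, not real S3 SDK calls):

```typescript
// Poll until the S3 object appears, giving up after a deadline derived
// from the client-declared upload size.
async function waitForUpload(
  exists: (id: string) => Promise<boolean>,
  fileId: string,
  expectedBytes: number,
  intervalMs = 1000,
): Promise<boolean> {
  // Budget: 10 seconds per megabyte, with a floor for tiny files.
  const timeoutMs = Math.max(10_000, (expectedBytes / 1_000_000) * 10_000);
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await exists(fileId)) {
      return true;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // caller surfaces a timeout error to the client
}
```

The long polling handler would call this in place of its bare `while` loop and yield an error message when it returns false.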

From the client perspective we’ll assume the first chunk contains the URL to upload to and the second chunk is the result.

const resp = await fetch('/api/file', { method: "POST" });
const reader = resp.body.getReader();
const { value: presignedUrl } = await reader.read();
const decoder = new TextDecoder();
await uploadToS3(decoder.decode(presignedUrl));
const { value } = await reader.read();
const result = JSON.parse(decoder.decode(value));

And more to think about:

  • Nothing guarantees the chunk length (ie: the presigned URL could be split in two), so we’d probably want to consider delimiting messages and add some sort of parsing layer on top to reconstruct them.
  • What if the server is restarted before the upload finishes, how should the server resume polling for the S3 file? We could entrust the client (and one should never trust the client) to kickstart server polling if it notices the connection has been broken.
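The framing concern can be addressed with newline-delimited messages: the server appends a "\n" after each yield, and the client reassembles chunks into complete messages before acting on them. A sketch of that parsing layer (names are my own):

```typescript
// Reassemble newline-delimited messages from arbitrarily-sized byte chunks,
// since one chunk may hold a partial message or several messages at once.
async function* messages(
  chunks: AsyncIterable<Uint8Array>,
): AsyncGenerator<string> {
  const decoder = new TextDecoder();
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += decoder.decode(chunk, { stream: true });
    let newline;
    while ((newline = buffer.indexOf("\n")) !== -1) {
      yield buffer.slice(0, newline);
      buffer = buffer.slice(newline + 1);
    }
  }
  if (buffer.length > 0) {
    yield buffer; // trailing message without a final newline
  }
}
```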

Long polling solved the problem of relying on frontend callbacks for non-AWS S3 applications, but I’m feeling kinda meh about all the remaining details to tease out.

Everything is solvable. We can set up a proxy in front of our S3 provider to make sure the upload size is within reason (though it’ll cost us some of the “purity” of uploading directly to S3). We can store the IDs and timestamps of requests that haven’t been successfully detected and either continue polling S3 or clear them out periodically or on restart. This would also solve the conflict of two requests receiving the same generated ID, a problem that exists with direct uploads too but is more pronounced with presigned uploads.
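One shape the restart story could take: record each handed-out file ID with a timestamp, and on startup either resume polling the recent entries or expire the stale ones. A sketch with an in-memory map standing in for a real database table (all names are illustrative):

```typescript
// Uploads we've issued presigned URLs for but haven't confirmed in S3 yet.
type PendingUpload = { fileId: string; createdAt: number };

const pendingUploads = new Map<string, PendingUpload>();

function registerPending(fileId: string, createdAt = Date.now()): void {
  pendingUploads.set(fileId, { fileId, createdAt });
}

// On restart or on a timer: resume polling recent entries, expire stale ones.
function partitionPending(maxAgeMs: number, now = Date.now()) {
  const resume: string[] = [];
  const expire: string[] = [];
  for (const upload of pendingUploads.values()) {
    (now - upload.createdAt < maxAgeMs ? resume : expire).push(upload.fileId);
  }
  return { resume, expire };
}
```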

Sometimes the additional complexity is not worth the benefit.

Missed Optimizations

There are optimizations that we can add in when we control everything. I noticed that parsing a file and uploading a file have the same latency. Since they are independent operations, they can be done concurrently.

export async function POST(req: Request) {
  const bytes = new Uint8Array(await req.arrayBuffer());
  const fileId = genId();
  const [info, _] = await Promise.all([
    parseFile(bytes),
    s3.upload(fileId, bytes),
  ]);
  await db.insert({ fileId, ...info });
  return NextResponse.json({ fileId });
} 

In the omitted error handling, we can catch when something goes wrong and then delete the uploaded file, so we don’t have ghost files.
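A sketch of what that error handling could look like, with the S3, parser, and database calls passed in as plain functions (names are illustrative, not the actual code):

```typescript
// Run parse and upload concurrently; if anything downstream fails, delete
// whatever made it to S3 so no ghost file is left behind.
async function storeAndParse<Info>(ops: {
  upload: () => Promise<void>;
  parse: () => Promise<Info>;
  insert: (info: Info) => Promise<void>;
  remove: () => Promise<void>; // deletes the uploaded S3 object
}): Promise<Info> {
  const [parsed, uploaded] = await Promise.allSettled([ops.parse(), ops.upload()]);
  if (parsed.status !== "fulfilled" || uploaded.status !== "fulfilled") {
    if (uploaded.status === "fulfilled") {
      await ops.remove().catch(() => {}); // best-effort cleanup
    }
    throw parsed.status === "rejected"
      ? parsed.reason
      : (uploaded as PromiseRejectedResult).reason;
  }
  try {
    await ops.insert(parsed.value);
    return parsed.value;
  } catch (err) {
    await ops.remove().catch(() => {});
    throw err;
  }
}
```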

Response times are now cut in half, and the user sees a successful upload twice as fast. This wouldn’t be possible if we used presigned URL uploads that required the file to be uploaded prior to being parsed. Raise your hand if you want simpler code that has half the latency. ✋

This does hand-wave away the fact that the presigned URL method involves only a single upload, while the simple method has three, including uploading to the server and then uploading to S3. Benchmarking showed this shouldn’t be a large concern: the intermediate upload and database call are single-digit percentage contributors to the overall response latency.

There are further optimization opportunities too. If buffering the entire uploaded file in memory results in too much pressure, we have the option to tee() the file stream to S3 and our parsing service.
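A sketch of the tee() idea, with the S3 and parsing clients reduced to plain stream consumers (the real code would hand the branches to a streaming S3 upload and the parsing microservice; names here are illustrative). One caveat: tee() buffers internally if one branch reads slower than the other.

```typescript
// Split one body stream into two branches consumed concurrently, avoiding
// buffering the whole file as a single Uint8Array.
async function teeToConsumers<Info>(
  body: ReadableStream<Uint8Array>,
  uploadToS3: (s: ReadableStream<Uint8Array>) => Promise<void>,
  parseFile: (s: ReadableStream<Uint8Array>) => Promise<Info>,
): Promise<Info> {
  const [s3Branch, parseBranch] = body.tee();
  const [info] = await Promise.all([
    parseFile(parseBranch),
    uploadToS3(s3Branch),
  ]);
  return info;
}
```

In a route handler, `req.body` would be the stream passed in as `body`.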

Splitting Next.js endpoints

I teased this earlier, but there is a way through Next.js rewrites to make Vercel host our edge runtime and static assets but proxy node.js endpoint requests to a separate host:

// next.config.js
module.exports = {
  output: process.env.NEXT_OUTPUT,
  rewrites: async () => ({
    beforeFiles: process.env.PROXY_NODE_URL
      ? ["/api/saves"].map((source) => ({
          source,
          destination: `${process.env.PROXY_NODE_URL}${source}`,
        }))
      : undefined,
  })
}

In the above example, /api/saves is the endpoint that receives the direct upload and would fail if hosted on Vercel.

To use this config, we build the application twice: once for our self-hosted instance and once for Vercel.

NEXT_OUTPUT=standalone next build
# Create docker image with the `.next/standalone` 
# [...snip...]
# ssh example.com 'docker-compose pull && docker-compose up -d' 

# Create a vercel build with the proxy URL
PROXY_NODE_URL=https://example.com vercel build --prod && vercel deploy --prod --prebuilt

Even when we exceed the Vercel serverless body limit, Vercel is happy to proxy it to another backend. I don’t think there is a cost (either monetary or performance-wise) to rewrites. An alternative would be to use redirects instead of rewrites, but I kinda like keeping the CSP as simple as possible and not “leaking” the other host.

I was already self hosting the application so I naturally gravitated toward designating this as the other host, but you can explore other Next.js hosting providers.

One question left to answer is: should all serverless node.js endpoints be hosted elsewhere, or only the necessary ones? I do all of them, so there are no cold starts and there are fewer environment variables that need to be synced. If Vercel only runs edge code, then it doesn’t need to concern itself with database connection environment variables. I will admit that it is a bit of a pain to keep the list of endpoints in sync in next.config.js.

Conclusion

In the end, we have a choice of how we want to complicate our lives:

  • Migrate to AWS S3 for upload policies and events. We’d still need to solve how to let the client know the file has been processed by the backend, but I imagine client polling can be sufficient, as the AWS S3 callback may be received on a separate instance.
  • Create a poor man’s implementation of AWS S3 events with long polling and some sort of content length validating proxy in front of S3.
  • Split Next.js endpoints across hosting providers, and deal with keeping cross cutting concerns (eg: configuration, logging, security) between two backends in sync.

Not everyone’s choice will be the same, but the last option is the most appetizing to me.

Comments

If you'd like to leave a comment, please email [email protected]