
Two debugging notes from a Cloudflare Workers tarot app


Two unrelated bugs hit on the same day deploying lunarcana — a Next.js + Cloudflare Workers tarot app served via OpenNext. They share no root cause but each leaves a lesson worth pinning down.


Bug 1 — /api/reading hangs on Workers with "code had hung"

Symptom

On dev.lunarcana.app, drawing a reading on any non-free spread stalled for ~25–30 s, then the interpretation panel stayed blank and the request failed. wrangler tail:

POST https://dev.lunarcana.app/api/reading - Exception Thrown
✘ [ERROR] Error: The Workers runtime canceled this request because it
          detected that your Worker's code had hung and would never
          generate a response.

The 4 free spreads (FREE_SPREAD_IDS) worked. The hang only fired when the request reached the DeepSeek stream path.

Diagnosis

Ruled out four obvious hypotheses in order:

  1. DeepSeek key bad / provider down. curl to DeepSeek from the local machine returned HTTP 200 + a valid SSE stream in 1.9 s. Host reachable, key valid.
  2. Worker → DeepSeek egress blocked. Built a temporary /api/ping?mode=deepseek probe that did a plain non-streaming fetch from the Worker with the same key. It succeeded. Worker can talk to DeepSeek.
  3. Route-handler import blowing up at cold start. Built a zero-dep /api/ping streaming endpoint returning a hardcoded ReadableStream. It returned 200 with streamed bytes. OpenNext/Workers streaming works for a clean endpoint.
  4. Trace where in the route it stalls. Added console.log("[reading] enter +Nms") marks at every stage of the handler. None of them appeared in wrangler tail — see Bug 1.5 below. Had to reason structurally instead.
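
The zero-dep probe from step 3 can be sketched like this — a minimal illustration, not the exact code; the route path and chunk contents are assumptions:

```typescript
// Hypothetical zero-dep probe (app/api/ping/route.ts): returns a
// hardcoded streamed body, so the OpenNext/Workers streaming path is
// exercised with no provider fetch and no SSE parsing in the way.
export async function GET(): Promise<Response> {
  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    start(controller) {
      // Enqueue everything eagerly, then close -- nothing here can hang.
      for (const chunk of ["ping ", "pong ", "done\n"]) {
        controller.enqueue(encoder.encode(chunk));
      }
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

If this probe streams bytes back, the platform's streaming path is clean and the bug must live in the real handler's construction.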

With external deps cleared, attention moved to the Worker-side response construction. Diffing what /api/ping did (worked) against what /api/reading did (hung) isolated two differences: the response headers and the ReadableStream strategy.

Root cause — two contributing bugs

(1) Manual Transfer-Encoding: chunked response header.

return new Response(stream, {
  headers: {
    "Content-Type": "text/plain; charset=utf-8",
    "Transfer-Encoding": "chunked",   // ← don't do this
    "Cache-Control": "no-cache",
  },
});

Cloudflare Workers' runtime owns response framing. Setting Transfer-Encoding yourself confuses the runtime's buffering heuristics and contributes to the hang detector tripping on streaming bodies. The right "don't buffer me" hint is X-Accel-Buffering: no plus Cache-Control: no-cache, no-transform.

(2) pull-based ReadableStream for SSE passthrough.

new ReadableStream({
  async pull(controller) {          // consumer-driven
    const { done, value } = await reader.read();
    // … parse SSE, enqueue content …
  }
})

pull is consumer-driven: the runtime calls it when the Response body's downstream reader signals demand. On @opennextjs/cloudflare@1.19.x, back-pressure does not reliably propagate from the Worker's outgoing Response body into the user handler's ReadableStream. pull was never called → the stream had no demand → no bytes flowed → the hang detector fired ~25 s in.

Fix

Three changes, all landed together.

1. Remove Transfer-Encoding; keep only opt-out-of-caching headers.

src/app/api/reading/route.ts:

return new Response(stream, {
  headers: {
    "Content-Type": "text/plain; charset=utf-8",
    "Cache-Control": "no-cache, no-transform",
    "X-Accel-Buffering": "no",
  },
});

2. Rewrite the stream to producer-driven eager drain via start.

src/lib/ai/deepseek.ts:

// Eager-drain via `start`: a `pull`-based stream stalls on Workers when
// OpenNext doesn't propagate back-pressure into the Response body.
return new ReadableStream<Uint8Array>({
  async start(controller) {
    const reader = response.body!.getReader();
    let buffer = "";
    try {
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop() ?? "";
        for (const raw of lines) {
          if (!raw.startsWith("data: ")) continue;
          const data = raw.slice(6);
          if (data === "[DONE]") { controller.close(); return; }
          try {
            const content = JSON.parse(data).choices?.[0]?.delta?.content;
            if (content) controller.enqueue(encoder.encode(content));
          } catch {} // tolerate malformed SSE JSON lines
        }
      }
      controller.close();
    } catch (err) { controller.error(err); }
  },
});

start fires once when the stream is constructed and pumps upstream → downstream independently of consumer demand. The comment above the block is load-bearing — it's a guardrail against a future refactor reverting to pull "for idiomatic reasons".

3. Add an upstream timeout.

const response = await fetch(DEEPSEEK_API_URL, {
  // …
  signal: AbortSignal.timeout(25_000),
});

25 s is under the Workers wall-clock budget. A silently stalled upstream now throws a TimeoutError, which the route's try/catch turns into a clean 500, not a hang-detector kill.
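
The catch side of that can be sketched as follows — the wrapper name and error payload are assumptions, not the app's actual code:

```typescript
// Hypothetical shape of the route-level guard. A fetch aborted by
// AbortSignal.timeout rejects with a DOMException named "TimeoutError",
// so a stalled upstream becomes a clean JSON 500 instead of a hang.
async function guardedReading(
  streamFromProvider: () => Promise<Response>,
): Promise<Response> {
  try {
    return await streamFromProvider();
  } catch (err) {
    const timedOut =
      typeof err === "object" &&
      err !== null &&
      (err as { name?: unknown }).name === "TimeoutError";
    return Response.json(
      { error: timedOut ? "Upstream timed out" : "Upstream failed" },
      { status: 500 },
    );
  }
}
```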

Lesson

  • On Workers + OpenNext, SSE-proxy Route Handlers must use start-based eager drain, not pull. If you're constructing a ReadableStream to pipe a fetch response, the start variant is the safe default until OpenNext fixes back-pressure propagation.
  • Never set Transfer-Encoding manually on a Workers response.
  • Always put AbortSignal.timeout(...) on outbound fetch to streaming external APIs. "Hangs forever" is a worse failure mode than "fails in 25 s".

Bug 1.5 — console.log inside Route Handlers is swallowed by OpenNext

Discovered while debugging Bug 1. Worth calling out separately because it changes how you investigate any future Workers issue.

Symptom. console.log / console.error inside app/api/**/route.ts handlers do not appear in wrangler tail on the deployed dev worker. The HTTP request line shows up (POST /api/reading - Exception Thrown / - Ok), but nothing between the request line and the next request's line.

Confirmed by. A zero-dep /api/ping route with console.log("[ping] enter") produced no tail output despite returning 200. Meanwhile, logs from OpenNext's own runtime wrapper — e.g. (warn) env.IMAGES binding is not defined from the image optimizer — do surface.

Cause. @opennextjs/cloudflare@1.19.x does not forward route-handler console.* output to the Workers log pipeline that wrangler tail consumes. Wrapper logs pass through; user-handler logs are swallowed. If this changes in a later OpenNext release, this section needs an update.

What to do instead when debugging a deployed worker.

  • Return diagnostics in the response body or headers. A stage marker header like X-Stage: post-auth or an initial [stage:pre-provider]\n line in the stream before real content is visible to curl / DevTools without any log infrastructure.
  • Short-lived probe endpoint. For isolating a single hypothesis (e.g. "can the Worker even reach DeepSeek?"), a temporary /api/ping route beats structural guessing. Delete the probe as soon as the bug is identified — ours briefly exposed DEEPSEEK_API_KEY prefix/suffix/length on an unauthenticated route.
  • Reproduce locally. bun run cf:preview (Wrangler dev server on :8787) streams console.log from route handlers to the terminal correctly — only the deployed path swallows them.
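
The stage-marker idea can be made a two-line habit with a tiny helper — names here are entirely hypothetical, adapt to taste:

```typescript
// Hypothetical helper: remember the last stage the handler reached and
// stamp it onto the outgoing response, so `curl -i` shows how far the
// code got even when console logs never ship.
function makeStageTracker(initial = "enter") {
  let stage = initial;
  return {
    mark(next: string): void {
      stage = next;
    },
    respond(body: BodyInit | null, init: ResponseInit = {}): Response {
      const headers = new Headers(init.headers);
      headers.set("X-Stage", stage);
      return new Response(body, { ...init, headers });
    },
  };
}

// Usage inside a handler:
//   const t = makeStageTracker();
//   t.mark("post-auth"); ... t.mark("pre-provider");
//   return t.respond(stream, { status: 200 });
```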

Lesson. When a symptom suggests "my code didn't run", on OpenNext it might just mean "my logs didn't ship". Diffing two endpoints (one broken, one working) until the failure mode changes shape beats waiting for traces that will never arrive.


Bug 2 — "save reading" fails silently with 400 Invalid note

Symptom

The reading interpretation renders correctly, but clicking "seal into grimoire" shows the generic error toast ("The grimoire did not accept this seal, please try again") every time. No reading is saved.

Diagnosis

useSaveReading (src/hooks/useSaveReading.ts) catches the failed response and shows the generic toast without surfacing the body. The Network tab would have revealed 400 { error: "Invalid note" } — but reading the validator was faster:

// src/app/api/readings/route.ts (before fix)
if (note !== undefined && (typeof note !== "string" || note.length > 2000)) {
  return apiError("Invalid note", 400);
}

Meanwhile useSaveReading sends:

body: JSON.stringify({
  spreadId,
  cards: drawnCards.map((dc) => ({ id: dc.card.id, reversed: false })),
  note,            // ← typed as `string | null`
  interpretation,
}),

Root cause

The payload type of note is string | null. The client sends null when the user doesn't type a note. The validator used strict !== undefined, so for note === null:

  • null !== undefined → true
  • typeof null === "object", which fails the "string" check → validator returns 400 Invalid note.

Same-file sibling check on interpretation correctly used != null — accidental asymmetry between two adjacent validators in the same file.

Fix

One-line diff in src/app/api/readings/route.ts:

- if (note !== undefined && (typeof note !== "string" || note.length > 2000)) {
+ if (note != null && (typeof note !== "string" || note.length > 2000)) {

Lesson

  • For optional request-body fields typed as T | null | undefined, guard with != null (loose equality) — it covers both null and undefined. !== undefined alone admits null, which typeof then rejects.
  • JSON has no undefined. Over the wire, omitted fields arrive as missing keys (→ undefined after JSON.parse), and explicitly-absent fields typically arrive as null. Validators for optional fields must accept both.
  • Mirror the pattern used by sibling fields in the same validator. This bug was a single-line drift from the interpretation check right below it. Consistency within a file is free; diverging on optional-field handling is a recurring source of silent 400s.
  • Adjacent UX gap (not fixed here): useSaveReading swallows the server error body. If another field-validation bug like this slips through, consider surfacing await res.text() to the toast in dev, or at minimum console.error(...) client-side on !res.ok.
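
The null/undefined wire behavior from the first two bullets is easy to verify directly. A minimal sketch of the fixed guard (isValidNote is an illustrative name, not the app's function):

```typescript
// A field omitted from the payload parses to a missing key (undefined on
// access); a field sent explicitly as null parses to null. The loose
// `!= null` guard treats both as "note not provided".
function isValidNote(note: unknown): boolean {
  return note == null || (typeof note === "string" && note.length <= 2000);
}

const omitted = JSON.parse('{"spreadId":"s1"}');
const explicitNull = JSON.parse('{"spreadId":"s1","note":null}');

// isValidNote(omitted.note)       -> true  (undefined: field missing)
// isValidNote(explicitNull.note)  -> true  (null: the case the old check rejected)
// isValidNote(42)                 -> false (wrong type)
```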

Common thread

Both bugs were "the request goes out, nothing useful comes back." The CF Workers hang failed loudly but with the wrong signal (Workers runtime canceled made it look like a runtime bug, not a back-pressure bug). The note: null validation failed quietly with a generic toast and required reading the validator to find. Different shapes, same lesson: invest in the diagnostic surface before you need it — body-returned stage markers for opaque streaming failures, real error bodies in toasts for UX-level failures. Both bugs would have been 5-minute fixes with better error pipes; both took an hour without.