Datacap pages cut from seconds to milliseconds.

A tier-one retail bank’s IBM Datacap workflow had become painful to use. Every action in the Datacap Navigator Plugin — opening a batch, panning the document viewer, clicking a dropdown — took several seconds, with no obvious culprit on the server. Hardware was healthy, the database was fast, and the network team had signed off on the link. We were called in to find the cause and stabilise it.

§ II The symptom

Every action in the workflow paused for several seconds.

A browser HAR capture across a typical session showed 144 requests totalling 455 seconds — with 96% of that time spent in wait (server time-to-first-byte). Heavy 1.4 MB image fetches returned in 350 ms, but a 76-byte acknowledgement to a saveFile call took 15 seconds. The pattern had nothing to do with payload size; it was a uniform per-request overhead being added somewhere on the server side.

HAR waterfall showing repeated XHR requests of about 1 kB returning in 4.3 to 4.7 seconds, against a single cached PNG that returned in 4 ms. HAR waterfall showing 304 Not Modified responses and small POSTs taking 2.7 to 5.5 seconds, while a 1.5 MB JPEG returned in 344 ms.
Two slices of the same HAR capture. Payload size is uncorrelated with response time — tiny acknowledgements and 304s sit in the multi-second range while a 1.5 MB image returns in 344 ms. The uniform per-request floor near 5 seconds is the fingerprint of the OS resolver timeout.
§ III The diagnosis

A lock convoy inside the logger, fed by a 5-second DNS timeout.

The HAR ruled out the network, the database, and payload size — 96% of total time was server-side time-to-first-byte, and the wait floor was a suspiciously clean ~5 seconds. We asked the team to capture eight Java thread dumps on the WebSphere JVM hosting IBM Content Navigator, taken five seconds apart while the slowness was being reproduced.

HAR waterfall showing repeated XHR requests of about 1 kB returning in 4.3 to 4.7 seconds, against a single cached PNG that returned in 4 ms. HAR waterfall showing 304 Not Modified responses and small POSTs taking 2.7 to 5.5 seconds, while a 1.5 MB JPEG returned in 344 ms.
Two slices of the same HAR capture. Payload size is uncorrelated with response time — tiny acknowledgements and 304s sit in the multi-second range while a 1.5 MB image returns in 344 ms. The uniform per-request floor near 5 seconds is the fingerprint of the OS resolver timeout.

The dumps were decisive. In six of the eight, a single WebContainer thread held the global com/ibm/ejs/ras/SystemOutStream monitor — the JVM-wide lock the WAS logger acquires for every write — and was stuck in the same native frame: java/net/Inet6AddressImpl.getHostByAddr, called from InetAddress.getCanonicalHostName(), called from RasHelper.getFullHostName(), called from SystemOutStream.logRolled(). WAS was rotating its SystemOut.log, writing the new file's header, and resolving the host’s own IP back to an FQDN to put in that header — a reverse DNS lookup on itself. Behind that one thread, fourteen to twenty‑one other WebContainer threads were queued on the same monitor, every one of them coming through NavigatorContext.<init> → Logger.logDebug. Every JAX-RS request to ICN — CSS, JS, AJAX, image fetches, even 304 Not Modified responses — goes through that constructor. So while one thread sat in DNS, every user-facing request piled up behind it.

That is a textbook lock convoy: a fast lock made slow because the code holding it is waiting on something off-process. It also explained every prior puzzle — the bursty pattern (fast → 5 s stall → fast, in line with log‑rollover events), why static and database paths were equally affected (all paths log), and why the WebContainer pool looked under-utilised in Tivoli (the bottleneck was the lock, not the pool).

A targeted Wireshark capture confirmed the DNS hypothesis on the wire. The server had two NICs. The primary interface carried all real application traffic. A second interface, Ethernet2, held the address 172.21.113.130 and had no DNS server configured. Reverse-lookup queries that landed on that interface had nowhere to go and waited out the full 5-second OS resolver timeout before failing — the exact 5-second floor the HAR had been showing all along.

§ IV The change

A few lines in the hosts file.

The fastest, lowest-risk remediation was to short-circuit the resolver. We added entries to the Windows hosts file on the WAS server mapping every IP the host owns — both NICs — to its canonical FQDN. With that in place, getCanonicalHostName() returns from the local file in microseconds and never touches DNS, so the log-rollover thread can no longer stall holding the logger monitor. No JVM restart, no application change, no network ticket — a contained, reversible edit aligned with IBM’s own documented mitigation for this WebSphere behaviour. Longer-term hardening (a JVM property to skip the FQDN resolution, reducing ICN debug-trace volume, and a proper PTR record from the network team) was scheduled for the next maintenance window.

§ V The result

Page response dropped to under 30 ms.

After the hosts file change, every page in the workflow rendered in well under 30 ms. Follow-up thread dumps were clean — no thread anywhere in the stack of getCanonicalHostName or RasHelper.getFullHostName, no waiters on the SystemOutStream monitor — and the 5-second deltas in the Wireshark trace were gone. The fix held through subsequent peak cycles without intervention, the marker we use for a problem genuinely solved rather than temporarily relieved.

Seconds → < 30 ms

Per-page response time, end-to-end, on the same hardware.

Two lines

A static hosts file edit on a single server — no application or infrastructure change.

Same day

From HAR and Wireshark evidence to verified production fix, contained to one maintenance touch.