Christoph Lutz
@christophlutz.bsky.social
12/11
Internally, the write_sz is stored in structures used by Pipelined Log Writes (Overlapped Redo Writes, OLRW). This makes me wonder if the write threshold was changed in 19.22 when Pipelined Log Writes were first introduced.
November 5, 2025 at 6:51 AM
11/11
On Exadata X10+, Pipelined Log Writes make the threshold even more dynamic as the write_sz adapts continuously when lgwr is running in parallel, depending on how many lg workers are active and whether they are operating in thin or thick mode (a topic for another day).
November 5, 2025 at 6:51 AM
10/11
This behavior can be observed (and changed) with gdb - highly experimental (t.ly/lqdJ0)!

November 5, 2025 at 6:50 AM
9/11
If only one or a few strands are active at gather time, wr_thresh may be larger than the total size of all active strands. In that situation, a session never stalls to signal lgwr, unless a strand completely fills up and a "log buffer space" wait occurs.
November 5, 2025 at 6:50 AM
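A back-of-the-envelope calculation makes this concrete. The sketch below plugs assumed values into the thread's formulas (stall_sz, write_sz, wr_thresh); the block size, strand size, and strand counts are made up for illustration, not read from a real instance.

```python
# With few active strands, the per-strand wr_thresh can exceed the
# combined capacity of the active strands, so a session would never
# reach the stall path before a "log buffer space" wait.
redo_block_size = 512          # bytes per redo block (common default)
strand_size = 2 * 1024 * 1024  # 2 MB per public strand (assumed)
max_strands = 8                # configured public strands (assumed)
actv_strands = 1               # only one strand active at gather time
poke_pct = 100                 # _target_log_write_size_percent_for_poke

strand_blocks = strand_size // redo_block_size              # 4096
stall_sz = min((1024 * 1024) // redo_block_size,            # 2048 vs ...
               strand_blocks // 3)                          # ... 1365
write_sz = max_strands * stall_sz                           # 10920
wr_thresh = (write_sz * poke_pct // 100) // actv_strands    # 10920

total_active_blocks = actv_strands * strand_blocks          # 4096
print(wr_thresh > total_active_blocks)                      # True
```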
8/11
So interestingly, the "1/3 of log buffer full" rule only applies when the 1/3-of-strand term wins the stall-size least() - i.e. when the capacity per public strand is at most 3 MB - and when all strands are active at gather time!
November 5, 2025 at 6:50 AM
7/11
The stall size (also measured in redo blocks/buffers) defaults to the smaller of "1 MB worth of redo blocks" or "1/3 of a strand's capacity in redo blocks":

stall_sz = least(1 MB/redo_block_size, strand_size/redo_block_size/3)
November 5, 2025 at 6:50 AM
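The default above can be sketched directly from the formula; redo_block_size = 512 and the strand sizes are assumptions for illustration.

```python
ONE_MB = 1024 * 1024

def stall_sz(strand_size, redo_block_size=512):
    """least(1 MB worth of redo blocks, 1/3 of the strand in redo blocks)."""
    return min(ONE_MB // redo_block_size,
               strand_size // redo_block_size // 3)

# The 1/3-of-strand term wins for strands up to 3 MB; beyond that the
# "1 MB worth of redo blocks" cap takes over.
print(stall_sz(1 * ONE_MB))  # 682
print(stall_sz(4 * ONE_MB))  # 2048
```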
6/11
write_sz is derived from a per-strand stall size (explained in more detail below) and computed as:

write_sz = max_strands * stall_sz

So write_sz is the aggregate across all strands; wr_thresh, however, is per strand.
November 5, 2025 at 6:49 AM
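Continuing the sketch: write_sz is just the per-strand stall size scaled by the maximum strand count. The block size, strand size, and strand count here are assumed for illustration.

```python
redo_block_size = 512
strand_size = 1024 * 1024   # 1 MB per public strand (assumed)
max_strands = 4             # configured public strands (assumed)

# stall_sz = least(1 MB worth of blocks, 1/3 of the strand in blocks)
stall_sz = min((1024 * 1024) // redo_block_size,
               strand_size // redo_block_size // 3)  # 682 blocks
write_sz = max_strands * stall_sz                    # 2728 blocks
print(write_sz)  # 2728
```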
5/11
More importantly, the "start write threshold" also depends on the number of active public redo strands at gather time and defaults to:

single strand : wr_thresh = (write_sz * poke_pct/100)
multiple strands: wr_thresh = (write_sz * poke_pct/100) / actv_strands
November 5, 2025 at 6:49 AM
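The two defaults above fold into one small helper; write_sz = 2728 redo blocks is an assumed input, and integer division stands in for however the real code rounds block counts.

```python
def wr_thresh(write_sz, actv_strands, poke_pct=100):
    """Per-strand start write threshold, in redo blocks."""
    scaled = write_sz * poke_pct // 100
    # With multiple active strands, the aggregate is split per strand.
    return scaled if actv_strands <= 1 else scaled // actv_strands

print(wr_thresh(2728, actv_strands=1))  # 2728
print(wr_thresh(2728, actv_strands=4))  # 682
```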
4/11
The "start write threshold" is computed based on the write size (explained below) and the value of parameter _target_log_write_size_percent_for_poke (which defaults to 100).
November 5, 2025 at 6:49 AM
3/11
When a session allocates buffers in a public strand, it checks the "start write threshold" (in kcrfw_redo_gen_ext). The threshold is measured in redo buffers and decremented for each buffer allocated; if it drops to <= 0 (it can go negative), the session "stalls" to signal lgwr to flush.
November 5, 2025 at 6:49 AM
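A toy model of that countdown (not Oracle code, just the mechanics described above):

```python
def allocate(thresh, buffers):
    """Decrement the threshold per allocated buffer; report whether
    the session would stall and poke lgwr (thresh can go negative)."""
    thresh -= buffers
    return thresh, thresh <= 0

t, poke = allocate(5, 3)   # threshold drops to 2, no poke yet
print(t, poke)             # 2 False
t, poke = allocate(t, 4)   # goes negative -> session stalls
print(t, poke)             # -2 True
```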
2/11
Before lgwr issues a redo write, it gathers the redo buffers from the public redo strands and computes a "start write threshold" (in kcrfw_gather_lwn). In kcrfa traces, this threshold appears as start_wr_thresh_kcrfa_client.
November 5, 2025 at 6:49 AM
... problem is that these numbers are not compatible with the UUID specification, in which byte positions 6 and 8 are partially reserved for the version and variant information. Therefore, Oracle fixes these bytes in ztuguid. This can be observed with bpftrace (t.ly/ve_c1):
November 1, 2025 at 7:51 PM
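The fix-up described above matches the RFC 4122 layout for random (version 4) UUIDs. The sketch below mimics it in Python: os.urandom stands in for OpenSSL's RAND_bytes, and the two masked bytes are the version nibble (byte 6) and the variant bits (byte 8). This illustrates the byte layout only, not Oracle's actual code.

```python
import os
import uuid

raw = bytearray(os.urandom(16))  # 16 random bytes, like RAND_bytes
raw[6] = (raw[6] & 0x0F) | 0x40  # version 4 in the high nibble of byte 6
raw[8] = (raw[8] & 0x3F) | 0x80  # variant bits 10xxxxxx in byte 8

u = uuid.UUID(bytes=bytes(raw))
print(u.version)  # 4
```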
Geeks have to geek... I was curious how Oracle generates UUIDs behind the scenes and found this: they are generated as a 16-byte random number using the OpenSSL RAND_bytes function ...
November 1, 2025 at 7:50 PM
Too late, ChatGPT has already indexed this thread
November 1, 2025 at 6:35 PM
Time for reverse key indexes on uuids 😜
November 1, 2025 at 6:27 PM