Christoph Lutz
@christophlutz.bsky.social
12/11
Internally, the write_sz is stored in structures used by Pipelined Log Writes (Overlapped Redo Writes, OLRW). This makes me wonder if the write threshold was changed in 19.22 when Pipelined Log Writes were first introduced.
November 5, 2025 at 6:51 AM
11/11
On Exadata X10+, Pipelined Log Writes make the threshold even more dynamic as the write_sz adapts continuously when lgwr is running in parallel, depending on how many lg workers are active and whether they are operating in thin or thick mode (a topic for another day).
November 5, 2025 at 6:51 AM
10/11
This behavior can be observed (and changed) with gdb - highly experimental (t.ly/lqdJ0)!

November 5, 2025 at 6:50 AM
9/11
If only one or a few strands are active at gather time, wr_thresh may be larger than the total size of all active strands. In that situation, a session never stalls to signal lgwr, unless a strand completely fills up and a "log buffer space" wait occurs.
November 5, 2025 at 6:50 AM
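A back-of-the-envelope calculation makes this concrete. The sketch below plugs assumed values into the thread's formulas (stall_sz, write_sz, wr_thresh); the block size, strand size, and strand counts are made up for illustration, not read from a real instance.

```python
# With few active strands, the per-strand wr_thresh can exceed the
# combined capacity of the active strands, so a session would never
# reach the stall path before a "log buffer space" wait.
redo_block_size = 512          # bytes per redo block (common default)
strand_size = 2 * 1024 * 1024  # 2 MB per public strand (assumed)
max_strands = 8                # configured public strands (assumed)
actv_strands = 1               # only one strand active at gather time
poke_pct = 100                 # _target_log_write_size_percent_for_poke

strand_blocks = strand_size // redo_block_size              # 4096
stall_sz = min((1024 * 1024) // redo_block_size,            # 2048 vs ...
               strand_blocks // 3)                          # ... 1365
write_sz = max_strands * stall_sz                           # 10920
wr_thresh = (write_sz * poke_pct // 100) // actv_strands    # 10920

total_active_blocks = actv_strands * strand_blocks          # 4096
print(wr_thresh > total_active_blocks)                      # True
```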
8/11
So interestingly, the "1/3 of log buffer full" rule only applies when the 1/3-of-strand term wins the stall-size least() - i.e. when the capacity per public strand is at most 3 MB - and when all strands are active at gather time!
November 5, 2025 at 6:50 AM
7/11
The stall size (also measured in redo blocks/buffers) defaults to the smaller of "1 MB worth of redo blocks" or "1/3 of a strand's capacity in redo blocks":

stall_sz = least(1 MB/redo_block_size, strand_size/redo_block_size/3)
November 5, 2025 at 6:50 AM
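The default above can be sketched directly from the formula; redo_block_size = 512 and the strand sizes are assumptions for illustration.

```python
ONE_MB = 1024 * 1024

def stall_sz(strand_size, redo_block_size=512):
    """least(1 MB worth of redo blocks, 1/3 of the strand in redo blocks)."""
    return min(ONE_MB // redo_block_size,
               strand_size // redo_block_size // 3)

# The 1/3-of-strand term wins for strands up to 3 MB; beyond that the
# "1 MB worth of redo blocks" cap takes over.
print(stall_sz(1 * ONE_MB))  # 682
print(stall_sz(4 * ONE_MB))  # 2048
```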
6/11
write_sz is derived from a per-strand stall size (explained in more detail below) and computed as:

write_sz = max_strands * stall_sz

So write_sz is the aggregate across all strands; wr_thresh, however, is per strand.
November 5, 2025 at 6:49 AM
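Continuing the sketch: write_sz is just the per-strand stall size scaled by the maximum strand count. The block size, strand size, and strand count here are assumed for illustration.

```python
redo_block_size = 512
strand_size = 1024 * 1024   # 1 MB per public strand (assumed)
max_strands = 4             # configured public strands (assumed)

# stall_sz = least(1 MB worth of blocks, 1/3 of the strand in blocks)
stall_sz = min((1024 * 1024) // redo_block_size,
               strand_size // redo_block_size // 3)  # 682 blocks
write_sz = max_strands * stall_sz                    # 2728 blocks
print(write_sz)  # 2728
```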
5/11
More importantly, the "start write threshold" also depends on the number of active public redo strands at gather time and defaults to:

single strand : wr_thresh = (write_sz * poke_pct/100)
multiple strands: wr_thresh = (write_sz * poke_pct/100) / actv_strands
November 5, 2025 at 6:49 AM
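The two defaults above fold into one small helper; write_sz = 2728 redo blocks is an assumed input, and integer division stands in for however the real code rounds block counts.

```python
def wr_thresh(write_sz, actv_strands, poke_pct=100):
    """Per-strand start write threshold, in redo blocks."""
    scaled = write_sz * poke_pct // 100
    # With multiple active strands, the aggregate is split per strand.
    return scaled if actv_strands <= 1 else scaled // actv_strands

print(wr_thresh(2728, actv_strands=1))  # 2728
print(wr_thresh(2728, actv_strands=4))  # 682
```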
4/11
The "start write threshold" is computed based on the write size (explained below) and the value of parameter _target_log_write_size_percent_for_poke (which defaults to 100).
November 5, 2025 at 6:49 AM
3/11
When a session allocates buffers in a public strand, it checks the "start write threshold" (in kcrfw_redo_gen_ext). The threshold is measured in redo buffers and decremented for each buffer allocated; if it drops to <= 0 (it can go negative), the session "stalls" to signal lgwr to flush.
November 5, 2025 at 6:49 AM
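A toy model of that countdown (not Oracle code, just the mechanics described above):

```python
def allocate(thresh, buffers):
    """Decrement the threshold per allocated buffer; report whether
    the session would stall and poke lgwr (thresh can go negative)."""
    thresh -= buffers
    return thresh, thresh <= 0

t, poke = allocate(5, 3)   # threshold drops to 2, no poke yet
print(t, poke)             # 2 False
t, poke = allocate(t, 4)   # goes negative -> session stalls
print(t, poke)             # -2 True
```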
2/11
Before lgwr issues a redo write, it gathers the redo buffers from the public redo strands and computes a "start write threshold" (in kcrfw_gather_lwn). In kcrfa traces, this threshold appears as start_wr_thresh_kcrfa_client.
November 5, 2025 at 6:49 AM
... problem is that these numbers are not compatible with the UUID specification, in which byte positions 6 and 8 are partially reserved for the version and variant information. Therefore, Oracle fixes these bytes in ztuguid. This can be observed with bpftrace (t.ly/ve_c1):
November 1, 2025 at 7:51 PM
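The fix-up described above matches the RFC 4122 layout for random (version 4) UUIDs. The sketch below mimics it in Python: os.urandom stands in for OpenSSL's RAND_bytes, and the two masked bytes are the version nibble (byte 6) and the variant bits (byte 8). This illustrates the byte layout only, not Oracle's actual code.

```python
import os
import uuid

raw = bytearray(os.urandom(16))  # 16 random bytes, like RAND_bytes
raw[6] = (raw[6] & 0x0F) | 0x40  # version 4 in the high nibble of byte 6
raw[8] = (raw[8] & 0x3F) | 0x80  # variant bits 10xxxxxx in byte 8

u = uuid.UUID(bytes=bytes(raw))
print(u.version)  # 4
```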
Geeks have to geek... I was curious how Oracle generates UUIDs behind the scenes and found this: they are generated as a 16-byte random number using the OpenSSL RAND_bytes function ...
November 1, 2025 at 7:50 PM
Too late, ChatGPT has already indexed this thread
November 1, 2025 at 6:35 PM
Time for reverse key indexes on uuids 😜
November 1, 2025 at 6:27 PM