pbc/df: chunk kpt-pairs in CCGDFBuilder.outcore_auxe2 to avoid OOM #3257

Open
gauravharsha wants to merge 2 commits into
pyscf:master from
gauravharsha:ccgdf-kpts-chunking

Fixes #3111.

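_CCGDFBuilder.outcore_auxe2 allocates the int3c buffer for all nkpts**2 kpt-pairs at once: two arrays of doubles (real and imaginary parts), each of shape (nkpts_ij, max_buflen, nauxc). For a 6×6×6 k-mesh (216 k-points) with j_only=False, that is 216**2 = 46656 pairs, and the buffer reaches hundreds of GB.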
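
For a rough sense of scale, here is a back-of-the-envelope estimate of that buffer. The helper name `int3c_buffer_bytes` and the values `max_buflen=1000` and `nauxc=500` are illustrative assumptions, not numbers measured from the failing job.

```python
def int3c_buffer_bytes(nkpts, max_buflen, nauxc):
    """Estimate bytes for the real + imaginary int3c output buffers.

    Assumes j_only=False, so every ordered pair of k-points is kept
    (nkpts_ij = nkpts**2), and 8-byte doubles.
    """
    nkpts_ij = nkpts ** 2
    return nkpts_ij * max_buflen * nauxc * 2 * 8  # 2 arrays (R, I) x 8 bytes


nkpts = 6 * 6 * 6  # 216 k-points in a 6x6x6 mesh
size = int3c_buffer_bytes(nkpts, max_buflen=1000, nauxc=500)  # assumed sizes
print(f"{nkpts ** 2} pairs, about {size / 1e9:.0f} GB")
```

With these assumed sizes the estimate lands in the hundreds of GB, consistent with the failure described above.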

The existing memory check at gdf_builder.py:288 doesn't help, for three reasons: buflen clamps to 1 once nkpts_ij is large; _guess_shell_ranges rounds up to AO-shell granularity, so max_buflen can still be O(10³); and the existing "may be N times over max_memory" warning only logs, so the allocation proceeds anyway.

Fix

Split kikj_idx into chunks sized so each chunk's int3c output fits in max_memory, and call gen_int3c_kernel once per chunk with reindex_k restricted to that chunk. Whole kk_adapted groups are kept together so that the (ij_idx, ji_idx) pairing in merge_dd stays within one chunk; a local pair_pos map translates global kpt-pair indices to positions in the chunk's output. The fswap layout is unchanged, so downstream readers see no difference.
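
The grouping idea can be sketched in isolation as below. This is a simplified stand-in, not the code in this PR: `chunk_groups` and `build_pair_pos` are hypothetical helper names, the greedy packing rule is an assumption, and the groups are hand-written for a 3 k-point example where pair index = i*3 + j.

```python
def chunk_groups(groups, kpts_chunk):
    """Greedily pack whole groups of global kpt-pair indices into chunks.

    A group is never split, so an (ij, ji) pair always lands in one chunk.
    A group larger than kpts_chunk simply gets a chunk of its own.
    """
    chunks, current = [], []
    for group in groups:
        if current and len(current) + len(group) > kpts_chunk:
            chunks.append(current)
            current = []
        current.extend(group)
    if current:
        chunks.append(current)
    return chunks


def build_pair_pos(chunk):
    """Map each global kpt-pair index to its row in the chunk-local output."""
    return {pair: pos for pos, pair in enumerate(chunk)}


# Stand-in for kk_adapted groups: (ij, ji) partners share a group.
groups = [[0], [1, 3], [2, 6], [4], [5, 7], [8]]
chunks = chunk_groups(groups, kpts_chunk=4)
for chunk in chunks:
    print(chunk, build_pair_pos(chunk))
```

Because results are still written out by global pair index, the on-disk layout does not depend on how the groups were packed.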

Also: define nauxc = self.fused_cell.nao and use it in buflen (master uses naux, which underestimates per-pair workspace when CCDF's compensating basis is non-trivial).

Compatibility

When kpts_chunk >= nkpts_ij, nchunks == 1 and the loop body is functionally identical to master.
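
To illustrate when the single-chunk path is taken, here is a sketch of one plausible sizing rule. The function `plan_chunks` and its formula are assumptions made for illustration (the rule in the PR also keeps whole kk_adapted groups together, so its chunk count can differ); max_memory is taken in MB, as is usual in PySCF.

```python
import math


def plan_chunks(nkpts_ij, max_buflen, nauxc, max_memory_mb):
    """Return (kpts_chunk, nchunks) under an assumed sizing rule.

    Assumption: each kpt-pair costs max_buflen * nauxc doubles for each of
    the real and imaginary outputs, and max_memory_mb is the whole budget.
    """
    bytes_per_pair = max_buflen * nauxc * 2 * 8
    kpts_chunk = max(1, int(max_memory_mb * 1e6) // bytes_per_pair)
    nchunks = math.ceil(nkpts_ij / kpts_chunk)
    return kpts_chunk, nchunks


print(plan_chunks(8, 1000, 500, 4000))      # small problem: (500, 1)
print(plan_chunks(46656, 1000, 500, 4000))  # 6x6x6 mesh: (500, 94)
```

In the first call the budget covers far more pairs than exist, so a single chunk is used and nothing changes relative to the unchunked loop.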

Verified

All test_gdf_builder cases pass; forced multi-chunk runs (up to 14 chunks on 27 kpts / 729 pairs) match a single-chunk reference to roundoff; the original failing 6×6×6 CCDF build completes.

The CCDF int3c kernel was invoked for all nkpts^2 kpt-pairs at once,
producing an output buffer of (nkpts_ij, max_buflen, nauxc) doubles
(R + I). For a 6x6x6 k-mesh with j_only=False this is 46656 pairs and
the buffer reaches hundreds of GB, causing OOM even when the formula
at line 295 correctly returns buflen ~ 1 (the AO-shell granularity
prevents shrinking further, and the "memory usage may be N times over
max_memory" warning then prints but the loop allocates anyway).

Split kikj_idx into chunks sized so each chunk's int3c output fits in
max_memory, and rebuild gen_int3c_kernel per chunk with reindex_k
restricted to that chunk. The shell-block granularity (sh_ranges) is
kept fixed across chunks so fswap row writes remain consistent, and
the pre-allocated fswap layout (indexed by global kpt-pair index) is
unchanged so downstream readers see no difference.

The merge_dd path that pairs (ij_idx, ji_idx) within a kk_adapted
group requires both indices to live in the same chunk; that path now
groups whole kk_adapted groups together. A local pair_pos map
translates global kpt-pair indices to positions in the chunk's
outR/outI arrays.

When kpts_chunk >= nkpts_ij (small problems or ample memory) nchunks
== 1 and the execution path is identical to before.

The chunk loop computes reindex_k_chunk inline per chunk, so the
top-level reindex_k variable became dead. Remove it to fix ruff F841
(local variable assigned but never used).

Development

Successfully merging this pull request may close these issues.

[UNEXPECTED BEHAVIOR] Memory errors while generating density-fitted integrals with PBC
