Skip to content

Segfault in dfpt_nsteltwf (strain DFPT post-SCF) when npband > 1: array slice mixes local and global band index #96

Description

@SchrodingersCattt

Summary

In src/72_response/m_dfpt_scfcv.F90, subroutine dfpt_nsteltwf (the
non-self-consistent elastic-tensor accumulator that runs after a strain
SCF cycle converges) builds two wave-function buffers with an array
slice whose lower bound uses the local band index iband_me but whose
upper bound uses the global band index iband. When MPI band
parallelism is active (npband > 1) the two indices differ, the slice
becomes wider than the destination buffer, and ABINIT segfaults inside
this routine immediately after the first strain perturbation finishes
its SCF cycle.

This bug is present in v9.8.2 (the version we hit it on) and in the
current master branch (v10.6.7, file content as of
master/src/72_response/m_dfpt_scfcv.F90 lines 2774-2775).

Buggy lines

src/72_response/m_dfpt_scfcv.F90 (master, v10.6.7), around line 2774:

iband_me = 0
do iband=1,nband_k
  if(mpi_enreg%proc_distrb(ikpt, iband, isppol) /= mpi_enreg%me_kpt) then
    cycle
  end if
  iband_me = iband_me + 1
!  Get ground-state and first-order wavefunctions
  cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband*npw_k*nspinor+icg)
  cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband*npw1_k*nspinor+icg1)
  ...

The destination arrays were allocated with width npw_k*nspinor
(resp. npw1_k*nspinor):

ABI_MALLOC(cwave0,(2,npw_k*nspinor))
ABI_MALLOC(cwavef,(2,npw1_k*nspinor))

For the slice width to match the destination width we need
iband == iband_me, which is only true when npband == 1 (each rank
sees every band of its k-point).

When npband > 1, iband_me is the position in the local band list
while iband is the global band number, so the slice reads
(iband - iband_me + 1) * npw_k * nspinor columns of cg instead of
one column. For a typical run with nproc=32, nkpt=2, ABINIT
auto-selects npband=16 and the first kept band on a rank has
iband_me=1, iband=16, so the slice reads 16x the buffer width. This
falls off the end of the allocated cg array on most ranks and crashes.

The same module already uses the local index consistently elsewhere
(line 4219-4220 in master) for a similar loop:

do ipw=1,npw_k*nspinor
  cwave0(1,ipw)=cg(1,ipw+(iband_me-1)*npw_k*nspinor+icg)
  cwave0(2,ipw)=cg(2,ipw+(iband_me-1)*npw_k*nspinor+icg)
end do

confirming that iband_me is the intended index.

Proposed fix (2 lines)

-  cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband*npw_k*nspinor+icg)
-  cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband*npw1_k*nspinor+icg1)
+  cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband_me*npw_k*nspinor+icg)
+  cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband_me*npw1_k*nspinor+icg1)

I am happy to open a PR if that is helpful; just say the word. Please
let me know whether the active development branch on gitlab also wants
the fix mirrored.

Reproducer

System: 164-atom MOF, orthorhombic, sg=19 (P 2_1 2_1 2_1), 760 valence
electrons, nband=400, ecut=50 Ha, ngkpt=3 2 1, PseudoDojo NC SR
v0.4 PBE pseudos. The bug is independent of the cell -- any strain
DFPT run on >~30 ranks with paral_kgb=0 reproduces it.

Minimum reproducer is a strain-only DFPT (DS4) reading a pre-existing
GS WFK + DDK files via irdwfk=1 irdddk=1:

ndtset 1
jdtset 4

irdwfk 1
irdddk 1

paral_kgb 0
paral_rf  0
npband    1     ! IGNORED by ABINIT in paral_kgb=0 -- see "Workaround" below
npfft     1
bandpp    1
nband     400

rfstrs4   3
rfdir4    1 1 1
nqpt4     1
qpt4      0 0 0
kptopt4   2
tolvrs4   1.0d-8
nstep4    50

Launched with mpirun -np 32 abinit dma-1_strain.abi > log.

Actual behaviour

The first strain perturbation (ipert=natom+3, idir=1, uniaxial xx)
converges normally in ~20 SCF iterations. When ABINIT then proceeds to
the non-SC elastic step it crashes:

Backtrace:
#3  0x846228 in dfpt_nsteltwf
        at src/72_response/m_dfpt_scfcv.F90:2882
#4  0x846228 in dfpt_nselt
        at src/72_response/m_dfpt_scfcv.F90:2581
#5  0x846228 in __m_dfpt_scfcv_MOD_dfpt_scfcv
        at src/72_response/m_dfpt_scfcv.F90:1365
#6  0x512da1 in __m_dfpt_loopert_MOD_dfpt_looppert
        at src/95_drive/m_dfpt_looppert.F90:1902
...
mpirun noticed that process rank 27 with PID 0 on node n1 exited on
signal 11 (Segmentation fault).

The crash is at line 2882 in v9.8.2 (line 2774 in master) -- the array
slice with iband as the upper bound.

ABINIT's own log right above the crash shows the parallel decomposition
that triggered it:

-   mpi_nproc: 32, omp_nthreads: 1
 Present parallel dimensions: nkpt= 2 nsppol 1 nband per processor= 25 npband= 16

i.e. the user-supplied npband=1 was overridden because paral_kgb=0.

Workarounds (none satisfactory)

  1. Force npband=1 via input. paral_kgb=0 (the only mode
    supported by DFPT response calculations) silently ignores the
    npband input variable. We confirmed this both by reading the log
    above and by the fact that setting paral_rf 0 npband 1 npfft 1 in
    the input had no effect.

  2. Set nproc == nkpt (i.e. NP=2 for this system). This is the
    only way to force npband=1 from the user side, but it makes the
    per-iter wall ~16x slower (no band parallelism). For our 164-atom
    MOF that turns a 24 h job into a ~70 h job which no longer fits the
    batch slot.

  3. Use kptopt 3 (full BZ, no symmetry reduction) to increase
    nkpt from 2 to 6. Same issue: choosing nproc=6 works, anything
    higher re-introduces band parallelism. Plus the 3x extra k-points
    triples the compute cost.

  4. Patch the source as above -- what we ultimately did locally.
    Both DS3 (E-field) and DS4 (strain) converge cleanly afterwards.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions