Summary
In src/72_response/m_dfpt_scfcv.F90, subroutine dfpt_nsteltwf (the
non-self-consistent elastic-tensor accumulator that runs after a strain
SCF cycle converges) builds two wave-function buffers with an array
slice whose lower bound uses the local band index iband_me but whose
upper bound uses the global band index iband. When MPI band
parallelism is active (npband > 1) the two indices differ, the slice
becomes wider than the destination buffer, and ABINIT segfaults inside
this routine immediately after the first strain perturbation finishes
its SCF cycle.
This bug is present in v9.8.2 (the version we hit it on) and in the
current master branch (v10.6.7, file content as of
master/src/72_response/m_dfpt_scfcv.F90 lines 2774-2775).
Buggy lines
src/72_response/m_dfpt_scfcv.F90 (master, v10.6.7), around line 2774:
iband_me = 0
do iband=1,nband_k
if(mpi_enreg%proc_distrb(ikpt, iband, isppol) /= mpi_enreg%me_kpt) then
cycle
end if
iband_me = iband_me + 1
! Get ground-state and first-order wavefunctions
cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband*npw_k*nspinor+icg)
cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband*npw1_k*nspinor+icg1)
...
The destination arrays were allocated with width npw_k*nspinor
(resp. npw1_k*nspinor):
ABI_MALLOC(cwave0,(2,npw_k*nspinor))
ABI_MALLOC(cwavef,(2,npw1_k*nspinor))
For the slice width to match the destination width we need
iband == iband_me, which is only true when npband == 1 (each rank
sees every band of its k-point).
When npband > 1, iband_me is the position in the local band list
while iband is the global band number, so the slice reads
(iband - iband_me + 1) * npw_k * nspinor columns of cg instead of
one column. For a typical run with nproc=32, nkpt=2, ABINIT
auto-selects npband=16 and the first kept band on a rank has
iband_me=1, iband=16, so the slice reads 16x the buffer width. This
falls off the end of the allocated cg array on most ranks and crashes.
The same module already uses the local index consistently elsewhere
(line 4219-4220 in master) for a similar loop:
do ipw=1,npw_k*nspinor
cwave0(1,ipw)=cg(1,ipw+(iband_me-1)*npw_k*nspinor+icg)
cwave0(2,ipw)=cg(2,ipw+(iband_me-1)*npw_k*nspinor+icg)
end do
confirming that iband_me is the intended index.
Proposed fix (2 lines)
- cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband*npw_k*nspinor+icg)
- cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband*npw1_k*nspinor+icg1)
+ cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband_me*npw_k*nspinor+icg)
+ cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband_me*npw1_k*nspinor+icg1)
I am happy to open a PR if that is helpful; just say the word. Please
let me know whether the active development branch on gitlab also wants
the fix mirrored.
Reproducer
System: 164-atom MOF, orthorhombic, sg=19 (P 2_1 2_1 2_1), 760 valence
electrons, nband=400, ecut=50 Ha, ngkpt=3 2 1, PseudoDojo NC SR
v0.4 PBE pseudos. The bug is independent of the cell -- any strain
DFPT run on >~30 ranks with paral_kgb=0 reproduces it.
Minimum reproducer is a strain-only DFPT (DS4) reading a pre-existing
GS WFK + DDK files via irdwfk=1 irdddk=1:
ndtset 1
jdtset 4
irdwfk 1
irdddk 1
paral_kgb 0
paral_rf 0
npband 1 ! IGNORED by ABINIT in paral_kgb=0 -- see "Workaround" below
npfft 1
bandpp 1
nband 400
rfstrs4 3
rfdir4 1 1 1
nqpt4 1
qpt4 0 0 0
kptopt4 2
tolvrs4 1.0d-8
nstep4 50
Launched with mpirun -np 32 abinit dma-1_strain.abi > log.
Actual behaviour
The first strain perturbation (ipert=natom+3, idir=1, uniaxial xx)
converges normally in ~20 SCF iterations. When ABINIT then proceeds to
the non-SC elastic step it crashes:
Backtrace:
#3 0x846228 in dfpt_nsteltwf
at src/72_response/m_dfpt_scfcv.F90:2882
#4 0x846228 in dfpt_nselt
at src/72_response/m_dfpt_scfcv.F90:2581
#5 0x846228 in __m_dfpt_scfcv_MOD_dfpt_scfcv
at src/72_response/m_dfpt_scfcv.F90:1365
#6 0x512da1 in __m_dfpt_loopert_MOD_dfpt_looppert
at src/95_drive/m_dfpt_looppert.F90:1902
...
mpirun noticed that process rank 27 with PID 0 on node n1 exited on
signal 11 (Segmentation fault).
The crash is at line 2882 in v9.8.2 (line 2774 in master) -- the array
slice with iband as the upper bound.
ABINIT's own log right above the crash shows the parallel decomposition
that triggered it:
- mpi_nproc: 32, omp_nthreads: 1
Present parallel dimensions: nkpt= 2 nsppol 1 nband per processor= 25 npband= 16
i.e. the user-supplied npband=1 was overridden because paral_kgb=0.
Workarounds (none satisfactory)
-
Force npband=1 via input. paral_kgb=0 (the only mode
supported by DFPT response calculations) silently ignores the
npband input variable. We confirmed this both by reading the log
above and by the fact that setting paral_rf 0 npband 1 npfft 1 in
the input had no effect.
-
Set nproc == nkpt (i.e. NP=2 for this system). This is the
only way to force npband=1 from the user side, but it makes the
per-iter wall ~16x slower (no band parallelism). For our 164-atom
MOF that turns a 24 h job into a ~70 h job which no longer fits the
batch slot.
-
Use kptopt 3 (full BZ, no symmetry reduction) to increase
nkpt from 2 to 6. Same issue: choosing nproc=6 works, anything
higher re-introduces band parallelism. Plus the 3x extra k-points
triples the compute cost.
-
Patch the source as above -- what we ultimately did locally.
Both DS3 (E-field) and DS4 (strain) converge cleanly afterwards.
Summary
In
src/72_response/m_dfpt_scfcv.F90, subroutinedfpt_nsteltwf(thenon-self-consistent elastic-tensor accumulator that runs after a strain
SCF cycle converges) builds two wave-function buffers with an array
slice whose lower bound uses the local band index
iband_mebut whoseupper bound uses the global band index
iband. When MPI bandparallelism is active (
npband > 1) the two indices differ, the slicebecomes wider than the destination buffer, and ABINIT segfaults inside
this routine immediately after the first strain perturbation finishes
its SCF cycle.
This bug is present in v9.8.2 (the version we hit it on) and in the
current
masterbranch (v10.6.7, file content as ofmaster/src/72_response/m_dfpt_scfcv.F90lines 2774-2775).Buggy lines
src/72_response/m_dfpt_scfcv.F90(master, v10.6.7), around line 2774:The destination arrays were allocated with width
npw_k*nspinor(resp.
npw1_k*nspinor):For the slice width to match the destination width we need
iband == iband_me, which is only true whennpband == 1(each ranksees every band of its k-point).
When
npband > 1,iband_meis the position in the local band listwhile
ibandis the global band number, so the slice reads(iband - iband_me + 1) * npw_k * nspinorcolumns ofcginstead ofone column. For a typical run with
nproc=32, nkpt=2, ABINITauto-selects
npband=16and the first kept band on a rank hasiband_me=1, iband=16, so the slice reads 16x the buffer width. Thisfalls off the end of the allocated
cgarray on most ranks and crashes.The same module already uses the local index consistently elsewhere
(line 4219-4220 in master) for a similar loop:
confirming that
iband_meis the intended index.Proposed fix (2 lines)
I am happy to open a PR if that is helpful; just say the word. Please
let me know whether the active development branch on gitlab also wants
the fix mirrored.
Reproducer
System: 164-atom MOF, orthorhombic, sg=19 (
P 2_1 2_1 2_1), 760 valenceelectrons,
nband=400,ecut=50 Ha,ngkpt=3 2 1, PseudoDojo NC SRv0.4 PBE pseudos. The bug is independent of the cell -- any strain
DFPT run on >~30 ranks with
paral_kgb=0reproduces it.Minimum reproducer is a strain-only DFPT (DS4) reading a pre-existing
GS WFK + DDK files via
irdwfk=1 irdddk=1:Launched with
mpirun -np 32 abinit dma-1_strain.abi > log.Actual behaviour
The first strain perturbation (
ipert=natom+3, idir=1, uniaxial xx)converges normally in ~20 SCF iterations. When ABINIT then proceeds to
the non-SC elastic step it crashes:
The crash is at line 2882 in v9.8.2 (line 2774 in master) -- the array
slice with
ibandas the upper bound.ABINIT's own log right above the crash shows the parallel decomposition
that triggered it:
i.e. the user-supplied
npband=1was overridden becauseparal_kgb=0.Workarounds (none satisfactory)
Force
npband=1via input.paral_kgb=0(the only modesupported by DFPT response calculations) silently ignores the
npbandinput variable. We confirmed this both by reading the logabove and by the fact that setting
paral_rf 0 npband 1 npfft 1inthe input had no effect.
Set
nproc == nkpt(i.e. NP=2 for this system). This is theonly way to force
npband=1from the user side, but it makes theper-iter wall ~16x slower (no band parallelism). For our 164-atom
MOF that turns a 24 h job into a ~70 h job which no longer fits the
batch slot.
Use
kptopt 3(full BZ, no symmetry reduction) to increasenkptfrom 2 to 6. Same issue: choosingnproc=6works, anythinghigher re-introduces band parallelism. Plus the 3x extra k-points
triples the compute cost.
Patch the source as above -- what we ultimately did locally.
Both DS3 (E-field) and DS4 (strain) converge cleanly afterwards.