fix: segfault in DFPT strain non-self-consistent elastic-tensor accumulator #97

Open
SchrodingersCattt wants to merge 1 commit into abinit:master from SchrodingersCattt:fix/dfpt-nsteltwf-iband-me-slice


@SchrodingersCattt


In dfpt_nsteltwf (m_dfpt_scfcv.F90) the slices that copy ground-state and first-order wavefunctions out of cg/cg1 mixed the global band index (iband) on the upper bound with the local-to-MPI-process index (iband_me) on the lower bound:

  cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband*npw_k*nspinor+icg)
  cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband*npw1_k*nspinor+icg1)

Whenever DFPT band parallelism is active (npband > 1, which ABINIT auto-selects in many strain-response runs even with paral_kgb = 0), iband > iband_me on every rank that does not own the lowest band group, so the slice is longer than the destination buffer (npw_k*nspinor) and overruns cg/cg1. This produces a SIGSEGV inside dfpt_nsteltwf during the strain (rfstrs) dataset of a 5-dataset DFPT pipeline; the crash was observed in a 200-atom MOF DFPT run on ABINIT 9.8.2 and reproduces on current master (v10.6.7).
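
To make the size mismatch concrete, the following standalone sketch evaluates the slice bounds from the two statements above for one made-up set of values. The numbers (npw_k = 100, nspinor = 1, icg = 0, iband = 5, iband_me = 2) are purely illustrative assumptions, not taken from the crashing run, and the program is not part of ABINIT.

```fortran
program slice_extent_demo
  implicit none
  ! Illustrative values only; not taken from the crashing run.
  integer, parameter :: npw_k = 100, nspinor = 1, icg = 0
  integer :: iband, iband_me, lo, hi_mixed, hi_local

  iband    = 5   ! global band index (hypothetical)
  iband_me = 2   ! index among the bands stored on this rank (hypothetical)

  ! Lower bound shared by both variants
  lo       = 1 + (iband_me-1)*npw_k*nspinor + icg
  ! Upper bound as written in the original slice (global index)
  hi_mixed = iband*npw_k*nspinor + icg
  ! Upper bound when the local index is used on both sides
  hi_local = iband_me*npw_k*nspinor + icg

  print '(a,i0)', 'destination extent : ', npw_k*nspinor
  print '(a,i0)', 'mixed-index extent : ', hi_mixed - lo + 1
  print '(a,i0)', 'local-index extent : ', hi_local - lo + 1
end program slice_extent_demo
```

With these values the destination holds 100 columns, the mixed-index slice spans 400, and the local-index slice spans 100. In general the mixed-index extent is (iband - iband_me + 1)*npw_k*nspinor, so it matches the destination only when iband equals iband_me.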

Related to #96

…lator

In dfpt_nsteltwf (m_dfpt_scfcv.F90) the slices that copy ground-state
and first-order wavefunctions out of cg/cg1 mixed the global band index
(iband) on the upper bound with the local-to-MPI-process index
(iband_me) on the lower bound:

  cwave0(:,:)=cg(:,1+(iband_me-1)*npw_k*nspinor+icg:iband*npw_k*nspinor+icg)
  cwavef(:,:)=cg1(:,1+(iband_me-1)*npw1_k*nspinor+icg1:iband*npw1_k*nspinor+icg1)

Whenever DFPT band parallelism is active (npband > 1, which ABINIT
auto-selects in many strain-response runs even with paral_kgb = 0),
iband > iband_me on every rank that does not own the lowest band group,
so the slice is longer than the destination buffer (npw_k*nspinor) and
overruns cg/cg1. This produces a SIGSEGV inside dfpt_nsteltwf during the
strain (rfstrs) dataset of a 5-dataset DFPT pipeline; the crash was
observed in a 200-atom MOF DFPT run on ABINIT 9.8.2 and reproduces on
current master (v10.6.7).

Fix: use iband_me consistently on both bounds, so the slice length is
always exactly npw_k*nspinor (resp. npw1_k*nspinor), matching the
allocation of cwave0/cwavef.
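
As a rough sketch of what "iband_me on both bounds" looks like, the toy program below applies that slice form to a small dummy array. The actual diff is not reproduced here, so the slice expression is inferred from the description above; the array sizes, the nband_me name, and the fill values are assumptions chosen only for illustration.

```fortran
program fixed_slice_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Toy sizes (hypothetical): 3 locally stored bands of 4 coefficients each
  integer, parameter :: npw_k = 4, nspinor = 1, nband_me = 3, icg = 0
  real(dp), allocatable :: cg(:,:), cwave0(:,:)
  integer :: iband_me, ii

  allocate(cg(2, npw_k*nspinor*nband_me), cwave0(2, npw_k*nspinor))
  ! Fill the "real part" row with 1, 2, 3, ... so each band's block is recognisable
  cg(1,:) = [(real(ii, dp), ii = 1, size(cg, 2))]
  cg(2,:) = 0.0_dp

  do iband_me = 1, nband_me
    ! Both bounds use the local index, so the slice always has npw_k*nspinor columns
    cwave0(:,:) = cg(:, 1+(iband_me-1)*npw_k*nspinor+icg : iband_me*npw_k*nspinor+icg)
    print '(a,i0,a,f5.1,a,f5.1)', 'band ', iband_me, ': ', cwave0(1,1), ' .. ', cwave0(1,npw_k*nspinor)
  end do
end program fixed_slice_demo
```

Each iteration copies exactly one band-sized block, so the right-hand side always conforms to cwave0 regardless of which global band the local index corresponds to.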

Related to the issue describing this segfault and reproducer.