The following holds for the pw.x code and for non-US PPs. For US PPs there are additional terms to be calculated, which may add anywhere from a few percent up to 30-40% to the execution time. For phonon calculations, each of the 3Nat modes requires a CPU time of the same order as that of a self-consistent calculation on the same system. For cp.x, the CPU time required by each time step is of the order of Th + Torth + Tsub, defined below.
The computer time required for the self-consistent solution at fixed ionic positions, Tscf, is

   Tscf ≈ Niter Titer + Tinit

where Niter = niter = number of self-consistency iterations, Titer = CPU time for a single iteration, Tinit = initialization time. Usually Tinit << Niter Titer.
The time required for a single self-consistency iteration, Titer, is

   Titer ≈ Nk Tdiag + Trho + Tpot

where Nk = number of k-points, Tdiag = CPU time per Hamiltonian iterative diagonalization, Trho = CPU time for the charge density calculation, Tpot = CPU time for the Hartree and exchange-correlation potential calculation.
The time for a Hamiltonian iterative diagonalization, Tdiag, is

   Tdiag ≈ Nh Th + Torth + Tsub

where Nh = number of Hψ products needed by the iterative diagonalization, Th = CPU time per Hψ product, Torth = CPU time for orthonormalization, Tsub = CPU time for subspace diagonalization.
The time Th required for an Hψ product is

   Th ≈ a1 M N + a2 M N1 N2 N3 log(N1 N2 N3) + a3 M P N.

The first term comes from the kinetic term and is usually much smaller than the others. The second and third terms come from the local and the nonlocal potential, respectively. a1, a2, a3 are prefactors; M = number of valence bands; N = number of plane waves (basis set dimension); N1, N2, N3 = dimensions of the FFT grid for wavefunctions (N1 N2 N3 ∼ 8N); P = number of projectors for PPs (summed over all atoms, all values of the angular momentum l, and m = 1, ..., 2l+1).
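As an order-of-magnitude illustration (all sizes below are invented for the sake of the example, and the machine-dependent prefactors are left symbolic), take M = 200 bands, N = 30000 plane waves, a 64x64x64 wavefunction FFT grid (N1 N2 N3 ≈ 2.6·10^5 ≈ 8N) and P = 500 projectors. Then

   a1 M N ≈ 6·10^6 a1
   a2 M N1 N2 N3 log(N1 N2 N3) ≈ 7·10^8 a2
   a3 M P N ≈ 3·10^9 a3

so, unless the prefactors differ wildly, the FFT (local potential) and nonlocal-PP terms dominate, consistent with the remark that the kinetic term is usually much smaller.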
The time Torth required by orthonormalization is

   Torth ≈ b1 N Mx^2

and the time Tsub required by subspace diagonalization is

   Tsub ≈ b2 Mx^3

where b1 and b2 are prefactors, Mx = number of trial wavefunctions (this will vary between M and a few times M, depending on the algorithm).
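A simple consequence of the two expressions above, taking them at face value and leaving the prefactors symbolic: Torth/Tsub ≈ (b1/b2)(N/Mx). Since N is normally much larger than Mx, orthonormalization tends to be the more expensive of the two; on the other hand Tsub grows as the cube of Mx, so it becomes progressively more important as the number of bands is increased at fixed basis-set size.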
The time Trho for the calculation of the charge density from the wavefunctions is

   Trho ≈ c1 M Nr1 Nr2 Nr3 log(Nr1 Nr2 Nr3) + c2 M Nr1 Nr2 Nr3 + c3 Nr1 Nr2 Nr3 + Tus

where c1, c2, c3 are prefactors, Nr1, Nr2, Nr3 = dimensions of the FFT grid for the charge density (Nr1 Nr2 Nr3 ∼ 8Ng, where Ng = number of G-vectors for the charge density), and Tus = CPU time required by the ultrasoft contribution (if any).
The time Tpot for the calculation of the potential from the charge density is

   Tpot ≈ d1 Nr1 Nr2 Nr3 + d2 Nr1 Nr2 Nr3 log(Nr1 Nr2 Nr3)

where d1, d2 are prefactors.
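Putting the pieces together (purely as a reading aid; Nh depends on the diagonalization algorithm and on how good the starting wavefunctions are), a single self-consistency iteration costs

   Titer ≈ Nk ( Nh Th + Torth + Tsub ) + Trho + Tpot.

Every term inside the parentheses is multiplied by the number of k-points, so the cost grows roughly linearly with the number of k-points actually computed; exploiting symmetry to reduce Nk is therefore one of the most effective ways to cut the total time.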
A typical self-consistency or molecular-dynamics run requires a maximum memory of the order of O double precision complex numbers, where

   O ≈ m M N + P N + p Nr1 Nr2 Nr3 + q N1 N2 N3

with m, p, q = small factors; all other variables have the same meaning as above. Note that if only the Γ point (k = 0) is used to sample the Brillouin Zone, the value of N will be cut in half.
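As a rough worked example (every number here is invented for illustration, and the small factors are simply guessed): with M = 200, N = 30000, P = 500, a 96^3 charge-density grid (Nr1 Nr2 Nr3 ≈ 8.8·10^5), a 64^3 wavefunction grid (N1 N2 N3 ≈ 2.6·10^5), and m = 3, p = q = 4,

   O ≈ 3·200·30000 + 500·30000 + 4·8.8·10^5 + 4·2.6·10^5 ≈ 3.8·10^7

complex numbers, i.e. about 3.8·10^7 × 16 bytes ≈ 0.6 GB of RAM.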
The memory required by the phonon code follows the same patterns, with somewhat larger factors m, p, q .
A typical pw.x run will require an amount of temporary disk space of the order of O double precision complex numbers:

   O ≈ Nk M N + q M N

where q = 2 · mixing_ndim (number of iterations used in self-consistency, default value = 8) if disk_io is set to 'high' or not specified; q = 0 if disk_io='low' or 'minimal'.
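Continuing with the same invented numbers (Nk = 10 k-points, M = 200, N = 30000) and the default mixing_ndim = 8, so that q = 16 with disk_io='high':

   O ≈ 10·200·30000 + 16·200·30000 ≈ 1.6·10^8

complex numbers, about 2.5 GB of scratch space; with disk_io='low' only the first term remains, about 1 GB.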
pw.x and cp.x can in principle run on any number of processors. The effectiveness of parallelization is ultimately judged by the scaling, i.e. how the time needed to perform a job scales with the number of processors, and depends upon:

- the size and type of the system under study;
- the judicious choice of the various levels of parallelization;
- the availability of fast interprocess communications (or lack thereof).

Ideally one would like to have linear scaling, i.e. T ∼ T0/Np for Np processors, where T0 is the single-processor execution time. In addition, one would like to have linear scaling of the RAM per processor, O ∼ O0/Np, so that large-memory systems fit into the RAM of each processor.
As a general rule, image parallelization:

- may give good scaling, but the slowest image will determine the overall performance (load balancing may be a problem);
- requires very little communication (suitable for ethernet communication).

Parallelization on k-points:

- guarantees (almost) linear scaling if the number of k-points is a multiple of the number of pools;
- requires little communication (suitable for ethernet communication);
- reduces the required memory per processor by distributing wavefunctions (but not other quantities).

Parallelization on plane-waves:

- yields good to very good scaling, especially if the number of processors in a pool is a divisor of N3 and Nr3 (the dimensions along the z axis of the FFT grids for wavefunctions and charge density);
- requires heavy communication (a fast network is needed beyond a few processors per pool);
- yields an almost linear reduction of the memory per processor with the number of processors in the pool.
A note on scaling: optimal serial performance is achieved when the data are kept as much as possible in the cache. As a side effect, plane-wave parallelization may yield superlinear (better than linear) scaling, thanks to the increase in serial speed coming from the reduction of data size (making it easier for the machine to keep data in the cache).
For each system there is an optimal range of numbers of processors on which to run the job. Too large a number of processors will yield performance degradation. The size of the pools is especially delicate: Np should not exceed N3 and Nr3 by much. For large jobs it is convenient to further subdivide a pool of processors into "task groups": when the number of processors exceeds the number of FFT planes, data can be redistributed to task groups so that each group can process several wavefunctions at the same time.
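As an illustrative example only (file names and processor counts are made up, and the MPI launcher syntax depends on your installation), a 64-processor run split into 4 pools of 16 processors, with each pool further divided into 2 task groups, could be launched as

   mpirun -np 64 pw.x -nk 4 -nt 2 -inp pw.in > pw.out

where -nk (also -npool) sets the number of k-point pools and -nt (also -ntg) the number of task groups; the 16 processors per pool should be compared with the N3 and Nr3 dimensions of the FFT grids. The size of the linear-algebra group can also be set explicitly with -nd, but, as noted below, it is chosen automatically by the code.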
The optimal number of processors for the ortho (cp.x) or ndiag (pw.x) parallelization, which takes care of linear algebra operations involving matrices, is automatically chosen by the code.
Actual parallel performance will also depend a lot on the available software (MPI libraries) and on the available communication hardware. For Beowulf-style machines (clusters of PCs), recent versions (1.1 and later) of the OpenMPI libraries (http://www.openmpi.org/) seem to yield better performance than other implementations (info by Kostantin Kudin). Note however that you need decent communication hardware (at least Gigabit ethernet) in order to have acceptable performance with PW parallelization. Do not expect good scaling with cheap hardware: plane-wave calculations are by no means an “embarrassingly parallel” problem.
Also note that multiprocessor motherboards for Intel Pentium CPUs typically have just one memory bus for all processors. This dramatically slows down any code that accesses memory heavily (as most codes in the Quantum-ESPRESSO package do) when it runs on several processors of the same motherboard.