15 May 2013

414. Frequency vs cores? Crude benchmarking on AMD FX 8150

I'm thinking about building my next computational node, and one issue which is preoccupying me is whether to go for lots of cores (e.g. a dual sock mobo with two 16 core 2.1 GHz cpus) or for a balance of cores and frequency (e.g. single-socket mobo with a 3.8 GHz 8 core cpu). Remember, this is built with private money -- not research grants -- so the budget is tight.

I mean, I can't look at something like this without wanting to buy it: http://www.newegg.com/Product/Product.aspx?Item=N82E16819113036. The question is whether I'm better off buying another one or two fx8150 for the price of 16x2 down-clocked cores.

Benchmarking with the FX 8150 actually makes some sense here if one of the newegg reviewers is to be believed, since the Opteron 6272 is described as two 8150s glued together and down-clocked.

The system: 32 gb ram, fx 8150, nwchem 6.1.1 with acml 5.3.1 (gfortran,int64, fma4) and openmpi.

Short of finding benchmarks for the type of applications that interest me (nwchem, mostly), I figure I could get a rough idea by throttling the frequency of my eight-core FX8150 and compare with unthrottled runs where the number of cores is limited.

Two things to take into account when looking at the times below:
  • modern processors are complex beasts -- I don't claim to fully understand threads vs virtual threads and integer vs FPU. In the FX8150 there are four fpus but eight cores. What this really means in practical terms when doing these particular test calculations, I don't know.
  • This isn't my job, and I need my nodes for running job-related calcs, so by necessity I had to use a short test job. There's inevitably some variability in the results, and using longer test jobs might affect the results somewhat.
  • The execution times vary A LOT for 'identical' conditions (see raw data), hence why I repeated the runs in bold ten times at 3.6 GHz to get reasonably solid comparison values. Still not perfect since the distribution isn't properly gaussian.

The specific question I wanted answered is:
Are 8 threads at 2.1 GHz significantly better than 4 threads at 3.6 GHz?
Short answer: No.
Looks like I won't be investing in 2 x 16 core 2.1 GHz cpus after all.


Optimization
c/f     3.60    3.30    2.70    2.10    1.40
8       44/3    49/6    58/1    75/6    110/5  
7       48/3                     72
6       52/1                    106
5       59/4            85       97
4       67/8            93     113/10    156
3       85/7
2      117/10
1      237/24
c=number of cores; f= frequency in GHz.

(times in seconds. 44/3 means 44 s +/- 3 s)

The way I read this is that it's better to have a 4-core 3.6 GHz cpu than an 8-core 2.1 GHz CPU. The whole 4 FPU/8 cores has me confused though, so I'm not sure whether that's affecting the results in a significant way.

The other thing to take into account is that there isn't normally a linear relationship between number of cores and execution times anyway -- doubling the number of cores doesn't normally lead to a halving of the execution time, so 16 cores at 2.10 GHz wouldn't necessarily be anywhere near 75/2=37 s. (again, that's ignoring the 2 cores/1 fpu issue)

-------------
c/f: raw data
--------------
8/3.6: 37.7,47.4,46.9,38.8, 46.8, 42.4,46.6, 43.9,44.7,42.8 => 44+/-3 s
7/3.6: 41.3,48.7,47.9,48.8,47.0,48.8,50.8,42.4,52.1,47.9 => 48+/-3 s
6/3.6: 49.5,53.4,50.5,53.4,52.4,53.3,51.3,53.4,52.5,53.55 => 52+/-1 s
5/3.6: 54.1,57.1, 67.7,52.2,59.6,58.4,59.8,57.6,59.4,58.6 => 59+/-4 s
4/3.6: 83.1,63.5,73.7,70.0,68.6,58.1,58.1,67.2,69.9,58.2 => 67 +/-8 s
3/3.6: 89.5, 86.0, 82.8, 97.9, 74.4,86.2,89.7, 86.3, 74.5, 86.2 => 85 +/-7 s
2/3.6: 114.1,137.4, 118.6, 108.3, 116.3, 123.6, 104.4,124.3,104.7, 120.6 => 117+/-10 s
1/3.6: 242.6,201.9,232.9,242.7, 233.2,202.0,233.1,265.2, 278.9,233.5 => 237+/- 24
8/3.3: 51.9, 42.4,42.7,55.3,43.3,55.8,54.6,48.1,42.4,48.1 => 49+/-6 s
8/2.7: 59.4, 57.3,59.1,57.8,58.9,56.8,59.0,58.5,59.2,56.9 => 58+/-1
8/2.1: 75.6,82.9,73.7,65.1,76.9,84.3,65.4,73.9,76.4,78.1 => 75+/-6 s
8/1.4: 112.5,110.5,112.1,108.6,113.1,114.4,112.4,109.1,97.9 => 110+/-5
4/2.1: 124.9,103.7,104.1, 92.4, 117.6,115.5,117.5,120.1,115.6,120.2 => 113+/-10 s

An alternative would be to report the fastest time (out of e.g. 10 tries) since it represents maximum capacity.



optimization input
scratch_dir /scratch
start benzeneopt 

geometry units angstroms
C  0.100  1.396  0.000
C  1.209  0.698  0.000
C  1.209 -0.698  0.000
C  0.000 -1.396  0.000
C -1.209 -0.698  0.000
C -1.209  0.698  0.000
H  0.000  2.479  0.000
H  2.147  1.240  0.000
H  2.147 -1.240  0.000
H  0.000 -2.479  0.000
H -2.147 -1.240  0.000
H -2.147  1.240  0.000
end

basis
 H library "6-31+g*" 
 c library "6-31+g*"
end
dft
 direct
end

task dft optimize



Setting frequency
The following script was called with the frequency in GHz, e.g. sudo setfreq 3.6

setfreq
/usr/bin/cpufreq-set -c 0 -g userspace
/usr/bin/cpufreq-set -c 1 -g userspace
/usr/bin/cpufreq-set -c 2 -g userspace
/usr/bin/cpufreq-set -c 3 -g userspace
/usr/bin/cpufreq-set -c 4 -g userspace
/usr/bin/cpufreq-set -c 5 -g userspace
/usr/bin/cpufreq-set -c 6 -g userspace
/usr/bin/cpufreq-set -c 7 -g userspace
/usr/bin/cpufreq-set -c 0 -f $1G
/usr/bin/cpufreq-set -c 1 -f $1G
/usr/bin/cpufreq-set -c 2 -f $1G
/usr/bin/cpufreq-set -c 3 -f $1G
/usr/bin/cpufreq-set -c 4 -f $1G
/usr/bin/cpufreq-set -c 5 -f $1G
/usr/bin/cpufreq-set -c 6 -f $1G
/usr/bin/cpufreq-set -c 7 -f $1G

No comments:

Post a Comment