Ryzen 3900X
Grid Hardware
Published: 2019-07-12

The Firepear computing stack is currently all Ryzen. It’s simple cost-benefit analysis: I’m interested in crunching as much data as possible per unit time, at a reasonable cost (in both money and electricity). Right now, that’s Ryzen.

I’m upgrading all my machines to the new R9 3900X CPU. This provides an opportunity to compare all three generations of the Ryzen family.

3900X vs 3950X

I want to throw as many cores as I can at the problems that I volunteer compute time for, but I also have a finite budget. The 3950X will be beyond my reach at launch, so my strategy is to upgrade to 3900X starting in early July.

Extrapolating from the price history of the 1X00 and 2X00 processors, between 6 months and a year after release, prices on the 3X00 CPUs will start to slide downward. When the 3950X reaches a price that works for me, I’ll do a second round of upgrades – assuming, of course, that the pattern holds!

Ryzen by the numbers

Here’s how every Ryzen I’ve ever had stacks up, by specs and metrics. Temperature data for 3X00 is not yet available, due to lack of initial system support.

Series Model Threads BaseClk LoadClk IPC Scale TDP LoadP Tdie
Zen 1600 12 3.2GHz 3.40GHz N/A 1.00x 65W 74.0W 57.9
Zen 1700 16 3.0GHz 3.20GHz N/A 1.25x 65W No data No data
Zen+ 2700 16 3.2GHz 3.30GHz +3% 1.33x 65W 80.0W 64.2
Zen+ 2700L 16 3.2GHz 3.20GHz +3% 1.29x 65W 67.5W 57.2
Zen 2 3900X 24 3.8GHz 4.01GHz +18% 2.78x 105W 156.5W No data
Zen 2 3900L 24 3.8GHz 3.55GHz +18% 2.46x 105W 78.0W No data
  1. LoadClk is the observed, all-core frequency under 100% load
  2. Scale is how one CPU compares to another, with the 1600 being defined as the baseline (1.00)
    • Formula: 1 * (LoadClk / 3.40GHz) * (1 + IPC) * (Threads / 12), i.e. each factor is the ratio to the 1600’s value
  3. LoadP is the observed power usage under 100% load, minus the system idle load of 24.5W
  4. Tdie temps were measured by lm-sensors, at 100% load, with stock coolers
  • The notional 2700L model represents values obtained by clocking the 2700 to 3200MHz and setting Vcore offset to -100mV
  • The notional 3900L model represents values obtained by underclocking the 3900X for lower power usage (details below)
  • There are no metrics for the 1700 because I no longer have any of them
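The Scale formula can be sketched in a few lines of Python. This is a minimal illustration, assuming each factor is a plain ratio against the 1600's table values; the function and constant names are made up for the example.

```python
# Baseline values for the Ryzen 1600, taken from the table above.
BASE_LOAD_CLK_GHZ = 3.40
BASE_THREADS = 12


def scale(load_clk_ghz: float, ipc_gain: float, threads: int) -> float:
    """Theoretical throughput relative to the 1600, which scores 1.00."""
    clk_ratio = load_clk_ghz / BASE_LOAD_CLK_GHZ
    ipc_ratio = 1.0 + ipc_gain  # e.g. +18% becomes 1.18
    thread_ratio = threads / BASE_THREADS
    return clk_ratio * ipc_ratio * thread_ratio


# (model, LoadClk in GHz, IPC gain over Zen, threads), all from the table.
CPUS = [
    ("1600", 3.40, 0.00, 12),
    ("1700", 3.20, 0.00, 16),
    ("2700", 3.30, 0.03, 16),
    ("2700L", 3.20, 0.03, 16),
    ("3900X", 4.01, 0.18, 24),
    ("3900L", 3.55, 0.18, 24),
]

for model, clk, ipc, threads in CPUS:
    print(f"{model}: {scale(clk, ipc, threads):.2f}x")
```

Rounded to two decimals, the results match the Scale column.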

All CPUs were tested using ASRock B450 mini-ITX boards, NVMe SSDs, and 450W STX PSUs. There are memory speed variances between the machines: initially there was a 2400MHz/2666MHz split, later the split was 3000MHz/3200MHz.

Underclocking the 3900X

Config LoadClk LoadP LoadPΔ% Tdie TdieΔ%
Stock 4.01GHz 156.5W
Test 1 3.80GHz 96.5W -38.3%
Test 2 3.60GHz 91.5W -41.5%
Test 3 3.40GHz 87.9W -43.8%
Test 4 3.20GHz 83.5W -46.6%
Test 5 3.00GHz 80.5W -48.6%
Test 6 2.72GHz 76.0W -51.4%
  1. LoadP is the observed power usage under 100% load, minus the system idle load of 24.5W
  2. Tdie temps were measured by lm-sensors, at 100% load, with the stock cooler
  • Test 1 (and all following tests) turns off AMD CBS and changes the Vcore offset to -100mV
  • Test 4 is where the 3900X is running at the same clockspeed as the 2700L. Equal clocks, but the 3900X still has 15% IPC uplift
  • Test 6 is where each 3900X thread should be doing roughly the same amount of work per second as a 2700L thread: the 15% lower clock (2.72GHz vs 3.20GHz) roughly cancels the 15% IPC uplift
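The LoadPΔ% column is each test's load power relative to the stock figure. A minimal sketch of the arithmetic, with illustrative names and the wattages copied from the table:

```python
STOCK_LOAD_W = 156.5  # stock 3900X LoadP from the table


def load_power_delta_pct(load_w: float, stock_w: float = STOCK_LOAD_W) -> float:
    """Percentage change in load power versus the stock configuration."""
    return (load_w - stock_w) / stock_w * 100.0


TESTS = {
    "Test 1": 96.5,
    "Test 2": 91.5,
    "Test 3": 87.9,
    "Test 4": 83.5,
    "Test 5": 80.5,
    "Test 6": 76.0,
}

for name, watts in TESTS.items():
    print(f"{name}: {load_power_delta_pct(watts):+.1f}%")
```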

Undervolting the 3900X

My goal is to find the intersection of high performance and low power. My initial goal, with respect to power, was to meet or beat 80W (the draw of a 2700 at 3.3GHz). At stock voltages, that would require dropping below 3GHz, which would throw away too much performance. Here’s the path I took to defining the 3900L configuration.

Vcore offset was -100mV. All tests were performed with a full load of WCG workunits running.

Clock CPU V LoadP Notes
3.20GHz 1.10000V 83.5W Stable; > 80W
3.40GHz 1.05000V 82.5W Stable; > 80W
3.60GHz 1.00000V Crash on boot
3.60GHz 1.02500V 82.0W Stable; > 80W
3.60GHz 1.00625V Crash on boot
3.60GHz 1.01250V 76.5W OS stable; processes segfault
3.55GHz 1.01875V 78.0W OK; 3900L config
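The selection rule behind that table can be written down compactly: among configurations that are fully stable and draw no more than the power budget, keep the highest clock. The sketch below encodes the table rows by hand. The `Trial` type and the `stable` flag are illustrative, and the segfaulting 3.60GHz run is treated as unstable.

```python
from typing import NamedTuple, Optional


class Trial(NamedTuple):
    clock_ghz: float
    vcore: float
    load_w: Optional[float]  # None when the machine crashed on boot
    stable: bool  # True only if both the OS and the workload were stable


TRIALS = [
    Trial(3.20, 1.10000, 83.5, True),
    Trial(3.40, 1.05000, 82.5, True),
    Trial(3.60, 1.00000, None, False),
    Trial(3.60, 1.02500, 82.0, True),
    Trial(3.60, 1.00625, None, False),
    Trial(3.60, 1.01250, 76.5, False),  # OS stable, but processes segfault
    Trial(3.55, 1.01875, 78.0, True),
]


def best_under_budget(trials, budget_w: float = 80.0) -> Optional[Trial]:
    """Highest-clocked stable trial whose load power is within the budget."""
    ok = [
        t for t in trials
        if t.stable and t.load_w is not None and t.load_w <= budget_w
    ]
    return max(ok, key=lambda t: t.clock_ghz) if ok else None


print(best_under_budget(TRIALS))
```

With the 80W budget, only the 3.55GHz / 1.01875V row qualifies.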

World Community Grid: WUs in 24h

Now it’s time to look at how the CPUs compare when running actual scientific computing workloads.

These tables show the number of workunits crunched in 24 hours for each processor type, for several World Community Grid subprojects. During these periods, the CPUs were only working on one type of WU.

These are the WU counts for 24 hours, with SMT disabled. There are no non-SMT counts for the 3900X/L because – spoiler alert – there was no point in spending three days gathering that data. SMT is good for all these applications.

There is information on the performance of the 3900 with varying numbers of threads in the Stockfish and OpenFOAM sections of this document.

Subproject 1600:6 2700:8 2700L:8
WCG OpenZika (AutoDock Vina) 195 237 254
WCG FAH2 (IMPACT/BEDAM/ASyncRe) 46 58 56
WCG Mapping Cancer Markers 69 88 87

Note: I do not have an explanation for the 2700 underperforming the 2700L when crunching Zika workunits in this instance. It had a slight performance lead in all other head-to-head tests, including Zika with all threads enabled. I double-checked the logs: that machine crunched 237 Zika WUs and zero other WUs in 24 hours.

These are the counts for 24h in the stock configuration (SMT enabled; SMT uplift in parens). There is no OpenZika data for 3X00 CPUs yet because there are currently no WUs available. 3900X data will be available later.

Subproject 1600 2700 2700L 3900X 3900L
WCG OpenZika 233 (1.19x) 308 (1.30x) 294 (1.16x) No data No data
WCG FAH2 60 (1.30x) 80 (1.38x) 80 (1.43x) No data 139
WCG MCM1 86 (1.25x) 110 (1.25x) 110 (1.26x) No data 195
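The SMT uplift figures in parentheses are the 24h count with SMT enabled divided by the 24h count with SMT disabled. A small sketch using the counts from the two tables (the dictionary layout is only for illustration):

```python
# 24h workunit counts: (SMT disabled, SMT enabled), copied from the tables.
WU_COUNTS = {
    "OpenZika": {"1600": (195, 233), "2700": (237, 308), "2700L": (254, 294)},
    "FAH2": {"1600": (46, 60), "2700": (58, 80), "2700L": (56, 80)},
    "MCM1": {"1600": (69, 86), "2700": (88, 110), "2700L": (87, 110)},
}


def smt_uplift(wus_smt_off: int, wus_smt_on: int) -> float:
    """All-threads throughput as a multiple of all-cores throughput."""
    return wus_smt_on / wus_smt_off


for project, cpus in WU_COUNTS.items():
    for cpu, (off, on) in cpus.items():
        print(f"{project} {cpu}: {smt_uplift(off, on):.2f}x")
```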

World Community Grid: WU timings

There is no OpenZika data for 3X00 CPUs yet because there are currently no WUs available. 3900X testing will be done later.

WCG OpenZika WU timings: min/max and by quintile

CPU Min Max Avg
1600 00h 30min 22s 01h 44min 27s 01h 13min 36s
2700 00h 32min 54s 02h 40min 41s 01h 14min 40s
2700L 00h 27min 07s 03h 05min 24s 01h 18min 37s
3900X
3900L

OpenZika is problematic for benchmarking, as not all Zika WUs contain similar amounts of work. Therefore, in addition to min/max numbers, here is a table of WU runtimes, bucketed by quintile.

Quintiles are not equal across models, but split the span from minimum to maximum time for each CPU. This is bad for pure statistics, but 1:1 comparisons are impossible here due to the many variables in play. Treat these as simply more datapoints about each CPU’s performance under real-world conditions.

CPU 1st 2nd 3rd 4th 5th
1600 4 (01.7%) 9 (03.8%) 111 (46.8%) 108 (45.6%) 5 (02.1%)
2700 7 (02.4%) 289 (92.6%) 15 (04.7%) 0 (00.0%) 1 (00.3%)
2700L 8 (02.7%) 273 (92.9%) 2 (00.7%) 2 (00.7%) 9 (03.1%)
3900X
3900L
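For reference, here is one way the bucketing described above could be done, assuming the five buckets are equal-width slices of each CPU's own min-to-max runtime span (which is how the description reads). The sample runtimes are invented for the demonstration and are not measured data.

```python
def bucket_runtimes(runtimes_s, buckets: int = 5):
    """Split the min..max span into equal-width buckets.

    Returns a list of (count, percent_of_total) tuples, one per bucket.
    """
    lo, hi = min(runtimes_s), max(runtimes_s)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for t in runtimes_s:
        if width == 0:
            idx = 0  # all runtimes identical: everything lands in bucket 1
        else:
            # The maximum value would index one past the end, so clamp it.
            idx = min(int((t - lo) / width), buckets - 1)
        counts[idx] += 1
    total = len(runtimes_s)
    return [(c, 100.0 * c / total) for c in counts]


# Hypothetical runtimes in seconds, chosen only to exercise the function.
sample = [0, 10, 20, 30, 40, 50, 100]
for n, (count, pct) in enumerate(bucket_runtimes(sample), start=1):
    print(f"bucket {n}: {count} ({pct:04.1f}%)")
```

Because the bucket edges depend on each CPU's own extremes, a single very slow WU stretches the span and pushes most results into the lower buckets, which is consistent with the 2700 and 2700L rows.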

WCG FightAIDS@Home WU timings

CPU Min Max Avg
1600 04h 21min 10s 04h 41min 28s 04h 33min 12s
2700 04h 23min 45s 04h 49min 35s 04h 37min 00s
2700L 04h 33min 19s 04h 54min 14s 04h 45min 32s
3900X
3900L 03h 29min 35s 05h 15min 49s 04h 08min 00s

Despite having an 8% higher maximum runtime, the 3900L is 10% faster on average.

WCG Mapping Cancer Markers WU timings

CPU Min Max Avg
1600 03h 02min 22s 03h 29min 01s 03h 21min 46s
2700 03h 03min 25s 03h 35min 13s 03h 25min 36s
2700L 03h 01min 06s 03h 35min 07s 03h 26min 04s
3900X
3900L 02h 00min 59s 04h 23min 36s 02h 56min 30s

Again we see the 3900L with a higher maximum runtime but a lower minimum and average. This time it’s 14% faster on average.

Microbiome Immunity Project Workunit timings

MIP1 uses the Rosetta suite of molecular dynamics tools, which wants 4MB of L3 cache per running thread. Even the 3900, with its 64MB of L3 (70MB if you count L2 as well), would be oversubscribed if 24 instances of MIP1 were running. MIP1 performance suffers when L3 misses are common and the working data needs to be reloaded from system memory.
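The oversubscription arithmetic is simple enough to check directly. This sketch takes the 4MB-per-thread figure from the text as given; the cache sizes passed in are inputs to the calculation, not measurements, and the function names are illustrative.

```python
L3_PER_INSTANCE_MB = 4  # per-thread L3 appetite quoted in the text


def l3_demand_mb(instances: int) -> int:
    """Total L3 that the given number of MIP1 instances would like to have."""
    return instances * L3_PER_INSTANCE_MB


def oversubscribed(instances: int, l3_mb: int) -> bool:
    """True when the combined demand exceeds the available cache."""
    return l3_demand_mb(instances) > l3_mb


def max_instances_in_cache(l3_mb: int) -> int:
    """How many instances fit before the cache is oversubscribed."""
    return l3_mb // L3_PER_INSTANCE_MB


print(l3_demand_mb(24))            # demand with all 24 threads on MIP1
print(oversubscribed(24, 64))      # against 64MB of L3
print(oversubscribed(24, 70))      # against the 70MB combined cache figure
print(max_instances_in_cache(64))
```

Twenty-four instances want 96MB, so the conclusion is the same whether the cache is counted as 64MB or 70MB.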

Therefore, instead of running only MIP WUs on each CPU, this chart shows the times for MIP1 WUs which ran concurrently with other types of workunits over a period of 144h. The low end was (probably) when only 1 or 2 MIP1 WUs were running concurrently. The upper end was (probably) when more were in-flight.

This is also why there is no count of MIP1 WUs completed in a 24h window; the CPUs were crunching an effectively random mix of subproject WUs during the sample period, so counts would be meaningless.

CPU Min Max Avg
1600 00h 26min 04s 03h 48min 53s 01h 43min 07s
2700 00h 28min 47s 04h 22min 55s 02h 07min 37s
2700L 00h 28min 54s 04h 06min 25s 02h 07min 32s
3900X
3900L 00h 30min 08s 01h 54min 43s 01h 11min 46s

CPU 1st 2nd 3rd 4th 5th
1600 29 (13.2%) 94 (42.7%) 73 (33.2%) 21 (09.5%) 3 (01.4%)
2700 43 (09.4%) 172 (37.6%) 179 (39.0%) 47 (10.3%) 16 (03.7%)
2700L 16 (06.5%) 16 (06.5%) 88 (35.9%) 45 (18.4%) 9 (03.7%)
3900X
3900L 24 (08.8%) 66 (24.3%) 101 (37.1%) 57 (21.0%) 24 (08.8%)

Here the enormous L3 cache of the 3900 comes into play. Its average WU runtime is 30% to 44% shorter than those of the other CPUs, and its maximum runtime is about half of theirs.

Stockfish chess engine

These numbers were generated by the version of Stockfish used widely for benchmarking CPUs, which you can find here. The command used was:

./asmFishL_2017-05-22_popcnt bench 1024 [THREADS] 26

CPU:threads Nodes/s Scale SMT Uplift
1600:6 12_522_525
1600:12 17_424_915 1.00x 1.39x
2700:8 16_675_725
2700:16 23_800_828 1.37x 1.43x
2700L:8 15_583_508
2700L:16 22_767_312 1.31x 1.46x
3900X:12 30_436_136
3900X:24 43_169_392 2.47x 1.42x
3900L:12 27_270_950
3900L:24 39_009_503 2.24x 1.43x
  • Scale is performance in multiples of the 1600:12
  • SMT Uplift is all-threads performance of a CPU in multiples of its own all-cores performance

During the 24 thread test, the 3900X was showing sustained boosts of 4.01GHz, with a power draw of 155.5W.

Note that the real-world differential between the 1600 and the 3900X, in this test, was 2.47X. The by-the-numbers theoretical differential was 2.78X. That’s pretty close to on-paper performance.
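To put a number on "pretty close", the measured Stockfish scaling can be divided by the theoretical Scale value from the specs table. A rough sketch, with the nodes/s and Scale figures copied from the tables above and the names chosen for illustration:

```python
# All-threads Stockfish results (nodes/s) from the table above.
NODES_PER_S = {
    "1600": 17_424_915,
    "2700": 23_800_828,
    "2700L": 22_767_312,
    "3900X": 43_169_392,
    "3900L": 39_009_503,
}

# Theoretical Scale values from the specs table.
THEORETICAL_SCALE = {"2700": 1.33, "2700L": 1.29, "3900X": 2.78, "3900L": 2.46}


def measured_scale(cpu: str, baseline: str = "1600") -> float:
    """Measured throughput as a multiple of the baseline CPU."""
    return NODES_PER_S[cpu] / NODES_PER_S[baseline]


def fraction_of_theoretical(cpu: str) -> float:
    """Measured scaling divided by the on-paper scaling."""
    return measured_scale(cpu) / THEORETICAL_SCALE[cpu]


for cpu in THEORETICAL_SCALE:
    print(f"{cpu}: measured {measured_scale(cpu):.3f}x, "
          f"{fraction_of_theoretical(cpu):.0%} of theoretical")
```

For the 3900X, this works out to roughly 89% of the on-paper figure. The unrounded ratio is about 2.477, so it differs slightly in the last digit from the 2.47x shown in the table.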

OpenFOAM CFD

These numbers were generated with the current Docker image (of_v1906) of the OpenFOAM computational fluid dynamics package, using the motorBike simpleFoam tutorial, which is the standard for benchmarking. The decomposition method was set to scotch, the end time was set to 100, and the tutorial’s Allrun script was used for execution. The command used to extract the performance data was:

grep Execution log.simpleFoam | tail -n 1 | cut -d " " -f 3
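For anyone scripting several runs, the same extraction can be done in Python. This is a sketch that mirrors the pipeline step by step: keep lines containing "Execution", take the last one, and return its third space-separated field. The sample log text is an assumption about what the solver's timing lines look like, not output copied from these runs.

```python
from typing import Optional


def last_execution_time(log_text: str) -> Optional[float]:
    """Third space-separated field of the last line containing 'Execution'."""
    matches = [line for line in log_text.splitlines() if "Execution" in line]
    if not matches:
        return None
    # Split on single spaces, like `cut -d " "`, and take field 3 (index 2).
    return float(matches[-1].split(" ")[2])


# Hypothetical excerpt in the assumed log format.
sample_log = (
    "Time = 99\n"
    "ExecutionTime = 23.50 s  ClockTime = 24 s\n"
    "Time = 100\n"
    "ExecutionTime = 23.86 s  ClockTime = 24 s\n"
    "End\n"
)

print(last_execution_time(sample_log))
```

To use it on a real run, read the file first, for example `last_execution_time(open("log.simpleFoam").read())`.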

All CPUs were tested with 2 and 4 processes, to provide a baseline for comparison. Then each was tested with processes equal to its actual number of cores and threads, to show maximum performance.

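CPU Time:2 Time:4 Time:6 Time:8 Time:12 Time:16 Time:24
1600 85.80 51.21 41.84 N/A 40.42 N/A N/A
2700L 93.37 54.56 N/A 42.08 N/A 43.88 N/A
3900X 68.86 34.89 N/A N/A 23.86 N/A 24.00
3900L 75.93 39.32 N/A N/A 23.37 N/A 24.10
  • Time:N is the execution time in seconds with N processes; N/A marks process counts that were not run on that CPU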

Incredible numbers from the 3900: 1.7X the performance of the 1600 at 12 processes. And again we see an isolated case where an underclocked/undervolted Ryzen (barely) outperforms one running at stock configuration.

There is very little ILU parallelism in this test, but I suppose that makes sense given the nature of fluid dynamics simulations. Jumping from all-cores to all-threads resulted in a tiny (3.5%) speedup on the 1600, and tiny regressions on the 2700L and 3900X (4.2% and 1%, respectively).
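The all-cores versus all-threads comparison can be recomputed from the table. The sketch assumes the third and fourth values in each row are the runs at that CPU's core count and thread count respectively. Names are illustrative.

```python
# Execution times in seconds: {CPU: {process count: time}}.
# Assumes the last two values per row are the all-cores and all-threads runs.
TIMES_S = {
    "1600": {6: 41.84, 12: 40.42},
    "2700L": {8: 42.08, 16: 43.88},
    "3900X": {12: 23.86, 24: 24.00},
    "3900L": {12: 23.37, 24: 24.10},
}


def threads_vs_cores_pct(cpu: str) -> float:
    """Runtime change going from all-cores to all-threads.

    Negative means the all-threads run was faster.
    """
    (_, t_cores), (_, t_threads) = sorted(TIMES_S[cpu].items())
    return (t_threads - t_cores) / t_cores * 100.0


def speedup(cpu: str, baseline: str, procs: int) -> float:
    """How many times faster `cpu` is than `baseline` at the same process count."""
    return TIMES_S[baseline][procs] / TIMES_S[cpu][procs]


for cpu in TIMES_S:
    print(f"{cpu}: {threads_vs_cores_pct(cpu):+.1f}%")
print(f"3900X vs 1600 at 12 processes: {speedup('3900X', '1600', 12):.2f}x")
```

Computed this way, the changes come out within a few tenths of a percentage point of the figures quoted above, and the 3900X's lead over the 1600 at 12 processes is about 1.7X.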

This test was also a good chance to get a look at frequency response and power usage in the 3900X under varying amounts of load.

Processes LoadFreq IdleFreq Power
2 4.50GHz 2.2GHz 49.5W
4 4.20GHz 2.1GHz 55.5W
12 4.15GHz 105.5W
24 4.00GHz 117.5W
  • LoadFreq is the sustained clock on loaded cores
  • IdleFreq is the clock on unloaded cores
  • Power was sampled during the actual simpleFoam run portion of the test, which was the most demanding. The 24.5W system idle power has been subtracted from this value.

Conclusions

The 3900X is…

  • A monster that can churn through incredible amounts of work
  • HEDT performance for $500
  • Absolutely worth it, if you’ve got the work to do
  • But hot and power-hungry in a stock configuration

You probably don’t want to disable SMT

The technique known as Simultaneous Multithreading (SMT) dates back to 1982, but it didn’t appear in consumer-grade CPUs until 2002, when Intel introduced it in the Pentium 4 (calling it Hyper-Threading). Intel’s initial implementation – coupled with the terrible design of the P4 – meant that some workloads, in some cases, would see performance degradations with HT enabled.

There are examples in gaming, even today, of many cores impacting performance. But this seems to have more to do with the nature of game engines, which are coded to be as responsive as possible to the senses of the human playing the game. Concerns like accuracy, provability, and/or reproducibility are distant seconds at best. It is an entirely separate regime of programming from the tasks this review is concerned with. (To be clear, that regime is neither better nor worse, nor any more or less legitimate: simply very, very different.)

In these tests, I saw only one instance of a performance regression due to threads exceeding the number of cores: the OpenFOAM runs. Given the small scale of these regressions, my hunch is that the issue was that there are no “extra” FPU/vector units in the Zen or Zen 2 cores, so oversubscribing execution threads led to scheduling hold-ups.

I don’t believe there was an actual performance penalty due to SMT being enabled. 2X the threads running in 1.01X the time sounds like scheduling overhead to me. It is a clear sign, however, that if you know your workload is extremely FPU/vector heavy, you should test to be sure of the response you’ll get.

Generally speaking, disabling SMT throws away an uplift of between 15% and 45% over what the same CPU delivers with one thread per core.
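The size of that loss depends on which baseline the percentage is taken against, so it helps to compute it both ways. A small sketch using the lowest and highest SMT uplifts observed in the tables above (1.16x and 1.46x); the function names are illustrative.

```python
def uplift_pct(uplift: float) -> float:
    """Extra throughput from SMT, relative to running one thread per core."""
    return (uplift - 1.0) * 100.0


def share_lost_pct(uplift: float) -> float:
    """Throughput given up by disabling SMT, relative to all-threads output."""
    return (1.0 - 1.0 / uplift) * 100.0


for u in (1.16, 1.46):
    print(f"{u:.2f}x uplift: +{uplift_pct(u):.0f}% over cores-only, "
          f"{share_lost_pct(u):.1f}% of all-threads throughput")
```

Measured against cores-only performance, the range is roughly 16% to 46%. Measured against what the CPU can do with every thread busy, it is roughly 14% to 32%.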

Make sure your RAM is fast enough

During the initial rounds of testing, my stock 2700 was slightly but consistently slower than my clock-locked and undervolted 2700 (the 2700L, above). This made absolutely no sense, as I could see that the 2700’s clock speeds were a minimum of 100MHz faster than the 2700L’s at all times, and the systems were otherwise identical. At least, that’s what I thought.

It turned out that the 2700’s BIOS was clocking its memory at 2133MHz, rather than its rated 3000MHz. Changing this setting resulted in a 5% speedup for the 2700. The lesson here is that if you have many cores to feed, make sure you have memory fast enough to keep them fed.

Given that 3200MHz/CL16 RAM is currently ~$70 for 16GB, do yourself a favor and run decently fast RAM.

You can save power without sacrificing much performance

I was worried that the performance impacts would be severe, but you’ve seen the results with real-world testing. For the 2700/2700L:

  • Performance delta between 0% and 5%
  • Temperatures 11% cooler
  • Using 16% less power

And for the 3900X/3900L it’s even better:

  • Performance delta between 0% and 11% (about 9% on average)
  • Temperatures: no data yet
  • Using 50% less power
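One way to combine the performance and power figures is throughput per watt. The sketch below divides the all-threads Stockfish nodes/s by the LoadP values from the specs table. Pairing those two numbers is an assumption made for illustration, since LoadP was observed under a WCG load rather than during the Stockfish runs.

```python
# (all-threads Stockfish nodes/s, LoadP in watts), both from the tables above.
RESULTS = {
    "2700": (23_800_828, 80.0),
    "2700L": (22_767_312, 67.5),
    "3900X": (43_169_392, 156.5),
    "3900L": (39_009_503, 78.0),
}


def nodes_per_watt(cpu: str) -> float:
    """Stockfish throughput per watt of load power above system idle."""
    nodes, watts = RESULTS[cpu]
    return nodes / watts


def efficiency_gain(tuned: str, stock: str) -> float:
    """How many times more work per watt the tuned config does."""
    return nodes_per_watt(tuned) / nodes_per_watt(stock)


print(f"2700L vs 2700:   {efficiency_gain('2700L', '2700'):.2f}x")
print(f"3900L vs 3900X:  {efficiency_gain('3900L', '3900X'):.2f}x")
```

On these numbers the 2700L does about 1.13x the work per watt of the stock 2700, and the 3900L about 1.81x that of the stock 3900X.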