Ryzen 3900X

Published: 2019-07-12

The Firepear computing stack is currently all Ryzen. It’s simple cost-benefit analysis: I’m interested in crunching as much data as possible per unit time, at a reasonable cost (in both money and electricity). Right now, that’s Ryzen.

I’m upgrading all my machines to the new R9 3900X CPU. This provides an opportunity to compare all three generations of the Ryzen family.

3900X vs 3950X

I want to throw as many cores as I can at the problems that I volunteer compute time for, but I also have a finite budget. The 3950X will be beyond my reach at launch, so my strategy is to upgrade to 3900X starting in early July.

Extrapolating from the price history of the 1X00 and 2X00 processors, between 6 months and a year after release, prices on the 3X00 CPUs will start to slide downward. When the 3950X reaches a price that works for me, I’ll do a second round of upgrades – assuming, of course, that the pattern holds!

Ryzen by the numbers

Here’s how every Ryzen I’ve ever had stacks up, by specs and metrics. Temperature data for 3X00 is not yet available, due to lack of initial system support.

Series	Model	Threads	BaseClk	LoadClk	IPC	Scale	TDP	LoadP	Tdie
Zen	1600	12	3.2GHz	3.40GHz	N/A	1.00x	65W	74.0W	57.9
Zen	1700	16	3.0GHz	3.20GHz	N/A	1.25x	65W	No data	No data
Zen+	2700	16	3.2GHz	3.30GHz	+3%	1.33x	65W	80.0W	64.2
Zen+	2700L	16	3.2GHz	3.20GHz	+3%	1.29x	65W	67.5W	57.2
Zen 2	3900X	24	3.8GHz	4.01GHz	+18%	2.78x	105W	156.5W	No data
Zen 2	3900L	24	3.8GHz	3.55GHz	+18%	2.46x	105W	78.0W	No data

LoadClk is the observed, all-core frequency under 100% load
Scale is how one CPU compares to another, with the 1600 being defined as the baseline (1.00)
- Formula: 1 * LoadClkΔ% * IPCΔ% * ThreadΔ%
LoadP is the observed power usage under 100% load, minus the system idle load of 24.5W
Tdie temps were measured by lm-sensors, at 100% load, with stock coolers
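
As a rough sketch of how the Scale column can be reproduced from the formula above, here is the arithmetic in Python. The function and variable names are my own, and the sketch assumes the IPC column is an uplift relative to first-generation Zen and that the 1600's observed 3.40GHz load clock is the clock baseline.

```python
# Baseline is the 1600: 12 threads, 3.40GHz observed all-core clock, no IPC uplift.
BASE_THREADS = 12
BASE_LOAD_CLK_GHZ = 3.40


def theoretical_scale(load_clk_ghz, ipc_uplift_pct, threads):
    """Return on-paper throughput as a multiple of the 1600."""
    clk_ratio = load_clk_ghz / BASE_LOAD_CLK_GHZ
    ipc_ratio = 1.0 + ipc_uplift_pct / 100.0  # assumed: uplift is relative to Zen
    thread_ratio = threads / BASE_THREADS
    return clk_ratio * ipc_ratio * thread_ratio


# (LoadClk in GHz, IPC uplift in %, threads), copied from the table above.
CPUS = {
    "1600": (3.40, 0, 12),
    "1700": (3.20, 0, 16),
    "2700": (3.30, 3, 16),
    "2700L": (3.20, 3, 16),
    "3900X": (4.01, 18, 24),
    "3900L": (3.55, 18, 24),
}

for name, (clk, ipc, threads) in CPUS.items():
    print(f"{name}: {theoretical_scale(clk, ipc, threads):.2f}x")
```

Because every factor is a ratio against the 1600, the baseline row comes out as exactly 1.00x by construction.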

The notional 2700L model represents values obtained by clocking the 2700 to 3200MHz and setting Vcore offset to -100mV
The notional 3900L model represents values obtained by underclocking the 3900X for lower power usage (details below)
There are no metrics for the 1700 because I no longer have any of them

All CPUs were tested using ASRock B450 mini-ITX boards, NVMe SSDs, and 450W STX PSUs. There are memory speed variances between the machines: initially there was a 2400MHz/2666MHz split, later the split was 3000MHz/3200MHz.

Underclocking the 3900X

NB: I am leaving this section intact for historical purposes, but it was written before Zen 2’s clock stretching issues were known. For current advice and accurate numbers, please see this update

Config	LoadClk	LoadP	LoadPΔ%	TdieΔ%
Stock	4.01GHz	156.5W	–	–
Test 1	3.80GHz	96.5W	-38.3%	No data
Test 2	3.60GHz	91.5W	-41.5%	No data
Test 3	3.40GHz	87.9W	-43.8%	No data
Test 4	3.20GHz	83.5W	-46.6%	No data
Test 5	3.00GHz	80.5W	-48.6%	No data
Test 6	2.72GHz	76.0W	-51.4%	No data

LoadP is the observed power usage under 100% load, minus the system idle load of 24.5W
Tdie temps were measured by lm-sensors, at 100% load, with the stock cooler

Test 1 (and all following tests) turns off AMD CBS and changes the Vcore offset to -100mV
Test 4 is where the 3900X is running at the same clockspeed as the 2700L. Equal clocks, but the 3900X still has a roughly 15% IPC uplift over Zen+ (1.18 / 1.03 ≈ 1.15)
Test 6 is where the 3900X should be doing the same amount of work per clock as the 2700L. 15% lower clock results in equivalent performance.
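
For anyone wanting to redo the LoadP and LoadPΔ% columns with their own measurements, this is a minimal sketch of the arithmetic. The function names are hypothetical; the 24.5W idle figure and the 156.5W stock load figure come from the notes and table above.

```python
SYSTEM_IDLE_W = 24.5   # system idle draw, subtracted from every reading
STOCK_LOAD_W = 156.5   # LoadP of the stock 3900X


def load_power(observed_w, idle_w=SYSTEM_IDLE_W):
    """LoadP: observed draw at 100% load minus the system idle draw."""
    return observed_w - idle_w


def load_power_delta_pct(load_w, stock_w=STOCK_LOAD_W):
    """LoadPΔ%: change in LoadP relative to the stock configuration."""
    return (load_w - stock_w) / stock_w * 100.0


# LoadP values for Tests 1 through 6, copied from the table above.
for test, watts in enumerate([96.5, 91.5, 87.9, 83.5, 80.5, 76.0], start=1):
    print(f"Test {test}: {load_power_delta_pct(watts):+.1f}%")
```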

Undervolting the 3900X

My goal is to find the intersection of high performance and low power. My initial goal, with respect to power, was to meet or beat 80W (the draw of a 2700 at 3.3GHz). At stock voltages, that would require dropping below 3GHz, which was throwing away too much performance. Here’s the path I took to defining the 3900L configuration.

Vcore offset was -100mV. All tests were performed with a full load of WCG workunits running.

Clock	CPU V	LoadP	Notes
3.20GHz	1.10000V	83.5W	Stable; > 80W
3.40GHz	1.05000V	82.5W	Stable; > 80W
3.60GHz	1.00000V	–	Crash on boot
3.60GHz	1.02500V	82.0W	Stable; > 80W
3.60GHz	1.00625V	–	Crash on boot
3.60GHz	1.01250V	76.5W	OS stable; processes segfault
3.55GHz	1.01875V	78.0W	OK; 3900L config

World Community Grid: WUs in 24h

Now it’s time to look at how the CPUs compare when running actual scientific computing workloads.

These tables show the number of workunits crunched in 24 hours for each processor type, for several World Community Grid subprojects. During these periods, the CPUs were only working on one type of WU.

These are the WU counts for 24 hours, with SMT disabled. There are no non-SMT counts for the 3900X/L because – spoiler alert – there was no point in spending three days gathering that data. SMT is good for all these applications.

There is information on the performance of the 3900 with varying numbers of threads in the Stockfish and OpenFOAM sections of this document.

Subproject	1600:6	2700:8	2700L:8
WCG OpenZika (AutoDock Vina)	195	237	254
WCG FAH2 (IMPACT/BEDAM/ASyncRe)	46	58	56
WCG Mapping Cancer Markers	69	88	87

Note: I do not have an explanation for the 2700 underperforming the 2700L when crunching Zika workunits in this instance. It had a slight performance lead in all other head-to-head tests, including Zika with all threads enabled. I double-checked the logs: that machine crunched 237 Zika WUs and zero other WUs in 24 hours.

These are the counts for 24h in the stock configuration (SMT enabled; SMT uplift in parens). There is no OpenZika data for 3X00 CPUs yet because there are currently no WUs available. 3900X data will be available later.

Subproject	1600	2700	2700L	3900X	3900L
WCG OpenZika	233 (1.19x)	308 (1.30x)	294 (1.16x)	–	–
WCG FAH2	60 (1.30x)	80 (1.38x)	80 (1.43x)	–	139
WCG MCM1	86 (1.25x)	110 (1.25x)	110 (1.26x)	–	195

World Community Grid: WU timings

There is no OpenZika data for 3X00 CPUs yet because there are currently no WUs available. 3900X testing will be done later.

WCG OpenZika WU timings: min/max and by quintile

CPU	Min	Max	Avg
1600	00h 30min 22s	01h 44min 27s	01h 13min 36s
2700	00h 32min 54s	02h 40min 41s	01h 14min 40s
2700L	00h 27min 07s	03h 05min 24s	01h 18min 37s
3900X	–	–	–
3900L	–	–	–

OpenZika is problematic for benchmarking, as not all Zika WUs contain similar amounts of work. Therefore, in addition to min/max numbers, here is a table of WU runtimes, bucketed by quintile.

Quintiles are not equal across models, but split the span from minimum to maximum time for each CPU. This is bad for pure statistics, but 1:1 comparisons are impossible here due to the many variables in play. Treat these as simply more datapoints about each CPU’s performance under real-world conditions.

CPU	1st	2nd	3rd	4th	5th
1600	4 (01.7%)	9 (03.8%)	111 (46.8%)	108 (45.6%)	5 (02.1%)
2700	7 (02.4%)	289 (92.6%)	15 (04.7%)	0 (00.0%)	1 (00.3%)
2700L	8 (02.7%)	273 (92.9%)	2 (00.7%)	2 (00.7%)	9 (03.1%)
3900X	–	–	–	–	–
3900L	–	–	–	–	–
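
The bucketing described above (five equal-width buckets spanning each CPU's own minimum-to-maximum runtime) can be sketched like this. It is an illustration of the approach rather than the script that produced the table, the function names are hypothetical, and the sample runtimes are made-up numbers in seconds.

```python
def quintile_counts(runtimes_s):
    """Count runtimes in five equal-width buckets spanning min..max."""
    lo, hi = min(runtimes_s), max(runtimes_s)
    width = (hi - lo) / 5
    counts = [0] * 5
    for t in runtimes_s:
        # The maximum lands exactly on the upper edge, so clamp it into the
        # fifth bucket; a zero-width span puts everything in the first bucket.
        idx = 0 if width == 0 else min(int((t - lo) / width), 4)
        counts[idx] += 1
    return counts


def as_table_cells(counts):
    """Format counts the way the table does, e.g. '4 (01.7%)'."""
    total = sum(counts)
    return [f"{c} ({100 * c / total:04.1f}%)" for c in counts]


# Made-up sample: one slow outlier stretches the span, as with real Zika WUs.
sample = [0, 10, 20, 30, 40, 50, 100]
print(as_table_cells(quintile_counts(sample)))
```

Since the bucket edges depend on each CPU's own extremes, a single long-running WU shifts most of the others into the lower buckets, which is why the rows are not directly comparable.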

WCG FightAIDS@Home WU timings

CPU	Min	Max	Avg
1600	04h 21min 10s	04h 41min 28s	04h 33min 12s
2700	04h 23min 45s	04h 49min 35s	04h 37min 00s
2700L	04h 33min 19s	04h 54min 14s	04h 45min 32s
3900X	–	–	–
3900L	03h 29min 35s	05h 15min 49s	04h 08min 00s

Despite having an 8% higher maximum runtime, the 3900L is 10% faster on average.

WCG Mapping Cancer Markers WU timings

CPU	Min	Max	Avg
1600	03h 02min 22s	03h 29min 01s	03h 21min 46s
2700	03h 03min 25s	03h 35min 13s	03h 25min 36s
2700L	03h 01min 06s	03h 35min 07s	03h 26min 04s
3900X	–	–	–
3900L	02h 00min 59s	04h 23min 36s	02h 56min 30s

Again we see the 3900L with a higher maximum runtime but a lower minimum and average. This time it’s 14% faster on average.
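
To make the "faster on average" figures reproducible, here is a small sketch that parses the runtime format used in these tables and compares two averages. The helper names are hypothetical, and taking the 2700 as the comparison baseline is my assumption, since the text does not name one.

```python
import re


def to_seconds(text):
    """Parse a runtime such as '04h 33min 12s' into whole seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)min (\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised runtime: {text!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def pct_faster(new_avg, old_avg):
    """Percent reduction in average runtime of new_avg relative to old_avg."""
    return (1 - to_seconds(new_avg) / to_seconds(old_avg)) * 100


# Average runtimes copied from the tables above (3900L vs. 2700).
print(f"FightAIDS@Home: {pct_faster('04h 08min 00s', '04h 37min 00s'):.0f}% faster")
print(f"MCM:            {pct_faster('02h 56min 30s', '03h 25min 36s'):.0f}% faster")
```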

Microbiome Immunity Project Workunit timings

MIP1 uses the Rosetta suite of molecular dynamics tools, which wants 4MB of L3 cache per running thread. Even the 3900, with its 64MB of L3 (70MB counting L2), would be oversubscribed if 24 instances of MIP1 were running, since 24 × 4MB = 96MB. MIP1 performance suffers when L3 misses are common and the working data needs to be reloaded from system memory.

Therefore, instead of running only MIP WUs on each CPU, this chart shows the times for MIP1 WUs which ran concurrently with other types of workunits over a period of 144h. The low end was (probably) when only 1 or 2 MIP1 WUs were running concurrently. The upper end was (probably) when more were in-flight.

This is also why there is no count of MIP1 WUs completed in a 24h window; the CPUs were crunching an effectively random mix of subproject WUs during the sample period, so counts would be meaningless.

CPU	Min	Max	Avg
1600	00h 26min 04s	03h 48min 53s	01h 43min 07s
2700	00h 28min 47s	04h 22min 55s	02h 07min 37s
2700L	00h 28min 54s	04h 06min 25s	02h 07min 32s
3900X	–	–	–
3900L	00h 30min 08s	01h 54min 43s	01h 11min 46s

CPU	1st	2nd	3rd	4th	5th
1600	29 (13.2%)	94 (42.7%)	73 (33.2%)	21 (09.5%)	3 (01.4%)
2700	43 (09.4%)	172 (37.6%)	179 (39.0%)	47 (10.3%)	16 (03.7%)
2700L	16 (06.5%)	16 (06.5%)	88 (35.9%)	45 (18.4%)	9 (03.7%)
3900X	–	–	–	–	–
3900L	24 (08.8%)	66 (24.3%)	101 (37.1%)	57 (21.0%)	24 (08.8%)

Here the enormous L3 cache of the 3900 comes into play. Its average WU runtime is roughly 56% of the 2700's (and about 70% of the 1600's), and its maximum runtime is between 2.0x and 2.3x shorter than those of the other CPUs.

Stockfish chess engine

These numbers were generated by the version of Stockfish used widely for benchmarking CPUs, which you can find here. The command used was:

./asmFishL_2017-05-22_popcnt bench 1024 [THREADS] 26

CPU:threads	Nodes/s	Scale	SMT Uplift
1600:6	12_522_525
1600:12	17_424_915	1.00x	1.39x
2700:8	16_675_725
2700:16	23_800_828	1.37x	1.43x
2700L:8	15_583_508
2700L:16	22_767_312	1.31x	1.46x
3900X:12	30_436_136
3900X:24	43_169_392	2.47x	1.42x
3900L:12	27_270_950
3900L:24	39_009_503	2.24x	1.43x

Scale is performance in multiples of the 1600:12
SMT Uplift is all-threads performance of a CPU in multiples of its own all-cores performance
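
As a sketch of how the two derived columns follow from the raw Nodes/s figures, the definitions above translate to the Python below. The dictionary layout and function names are my own. Results can differ from the table in the last digit depending on how the values are rounded.

```python
# (all-cores nodes/s, all-threads nodes/s), copied from the table above.
NODES_PER_S = {
    "1600": (12_522_525, 17_424_915),
    "2700": (16_675_725, 23_800_828),
    "2700L": (15_583_508, 22_767_312),
    "3900X": (30_436_136, 43_169_392),
    "3900L": (27_270_950, 39_009_503),
}

# Scale is defined against the 1600 running all 12 threads.
BASELINE = NODES_PER_S["1600"][1]


def smt_uplift(cpu):
    """All-threads performance in multiples of all-cores performance."""
    cores_only, all_threads = NODES_PER_S[cpu]
    return all_threads / cores_only


def observed_scale(cpu):
    """All-threads performance in multiples of the 1600:12 result."""
    return NODES_PER_S[cpu][1] / BASELINE


for cpu in NODES_PER_S:
    print(f"{cpu}: scale {observed_scale(cpu):.2f}x, SMT uplift {smt_uplift(cpu):.2f}x")
```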

During the 24 thread test, the 3900X was showing sustained boosts of 4.01GHz, with a power draw of 155.5W.

Note that the real-world differential between the 1600 and the 3900X, in this test, was 2.47X. The by-the-numbers theoretical differential was 2.78X. That’s about 89% of on-paper performance, which is pretty close.

OpenFOAM CFD

These numbers were generated with the current Docker image (of_v1906) of the OpenFOAM computational fluid dynamics package, using the motorBike simpleFoam tutorial, which is the standard for benchmarking. The decomposition method was set to scotch, end time set to 100, and the tutorial’s Allrun script was used for execution. The command used to extract the performance data was:

grep Execution log.simpleFoam | tail -n 1 | cut -d " " -f 3
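
If you would rather collect the numbers from a script, the same extraction can be sketched in Python. This mirrors the pipeline above step by step. The function name is hypothetical, and the sample text assumes log lines of the form "ExecutionTime = 23.86 s  ClockTime = 24 s", which is what the field-3 cut implies.

```python
def last_execution_time(log_text):
    """Return the final ExecutionTime value in a simpleFoam log, or None."""
    # grep Execution: keep only lines mentioning "Execution"
    matches = [line for line in log_text.splitlines() if "Execution" in line]
    if not matches:
        return None
    # tail -n 1, then cut -d " " -f 3: third space-separated field of the last match
    return float(matches[-1].split(" ")[2])


# Assumed log excerpt, for illustration only.
sample_log = (
    "Time = 99\n"
    "ExecutionTime = 23.50 s  ClockTime = 24 s\n"
    "Time = 100\n"
    "ExecutionTime = 23.86 s  ClockTime = 24 s\n"
    "End\n"
)
print(last_execution_time(sample_log))
```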

All CPUs were tested with 2 and 4 processes, to provide a baseline for comparison. Then each was tested with processes equal to its actual number of cores and threads, to show maximum performance.

CPU	Time:2	Time:4	Time:6	Time:8	Time:12	Time:16	Time:24
1600	85.80	51.21	41.84	–	40.42	–	–
2700L	93.37	54.56	–	42.08	–	43.88	–
3900X	68.86	34.89	–	–	23.86	–	24.00
3900L	75.93	39.32	–	–	23.37	–	24.10

Incredible numbers from the 3900: 1.7X the performance of the 1600 at 12 processes. And again we see an isolated case where an underclocked/undervolted Ryzen (barely) outperforms one running at stock configuration.

There is very little ILU parallelism in this test, but I suppose that makes sense given the nature of fluid dynamics simulations. Jumping from all-cores to all-threads resulted in a tiny (3.5%) speedup on the 1600, and tiny regressions on the 2700L and 3900X (4.2% and 1%, respectively).

This test was also a good chance to get a look at frequency response and power usage in the 3900X under varying amounts of load.

Processes	LoadFreq	IdleFreq	Power
2	4.50GHz	2.2GHz	49.5W
4	4.20GHz	2.1GHz	55.5W
12	4.15GHz	–	105.5W
24	4.00GHz	–	117.5W

LoadFreq is the sustained clock on loaded cores
IdleFreq is the clock on unloaded cores
Power was sampled during the actual simpleFoam run portion of the test, which was the most demanding. The 24.5W system idle power has been subtracted from this value.

Conclusions

The 3900X is…

A monster that can churn through incredible amounts of work
HEDT performance for $500
Absolutely worth it, if you’ve got the work to do
But hot and power-hungry in a stock configuration

You probably don’t want to disable SMT

The technique known as Simultaneous Multithreading (SMT) dates back to 1982, but it didn’t appear in consumer-grade CPUs until 2002, when Intel introduced it in the Pentium 4 (calling it HyperThreading). Intel’s initial implementation – coupled with the terrible design of the P4 – meant that some workloads, in some cases, would see performance degradations with HT enabled.

There are examples in gaming, even today, of many cores impacting performance. But this seems to have more to do with the nature of game engines, which are coded to be as responsive as possible to the senses of the human playing the game. Concerns like accuracy, provability, and/or reproducibility are distant seconds at best. It is an entirely separate regime of programming from the tasks this review is concerned with. (To be clear, that regime is neither better nor worse, nor any more or less legitimate: simply very, very different.)

In these tests, I saw only one instance of a performance regression due to threads exceeding the number of cores: the OpenFOAM runs. Given the small scale of these regressions, my hunch is that the issue was that there are no “extra” FPU/vector units in the Zen or Zen 2 cores, so oversubscribing execution threads led to scheduling hold-ups.

I don’t believe there was an actual performance penalty due to SMT being enabled. 2X the threads running in 1.01X the time sounds like scheduling overhead to me. It is a clear sign, however, that if you know your workload is extremely FPU/vector heavy, you should test to be sure of the response you’ll get.

Generally speaking, disabling SMT is throwing away an uplift of between 15% and 45% over your CPU’s cores-only performance.

Make sure your RAM is fast enough

During the initial rounds of testing, my stock 2700 was slightly but consistently slower than my clock-locked and undervolted 2700 (the 2700L, above). This made absolutely no sense, as I could see that the 2700’s clock speeds were a minimum of 100MHz faster than the 2700L’s at all times, and the systems were otherwise identical. At least, that’s what I thought.

It turned out that the 2700’s BIOS was clocking its memory at 2133MHz, rather than its rated 3000MHz. Changing this setting resulted in a 5% speedup for the 2700. The lesson here is that if you have many cores to feed, make sure you have memory fast enough to keep them fed.

Given that 3200MHz/CL16 RAM is currently ~$70 for 16GB, do yourself a favor and run decently fast RAM.

You can save power without sacrificing much performance

I was worried that the performance impacts would be severe, but you’ve seen the results with real-world testing. For the 2700/2700L:

Performance delta between 0% and 5%
Temperatures 11% cooler
Using 16% less power

And for the 3900X/3900L it’s even better:

Performance delta between 0% and 11% (about 9% on average)
Temperatures ??% cooler
Using 50% less power