Grid
Published: 2020-08-25

## 2020-08-25: Wrapping up Rosetta

With WCG’s OPN project in full swing, we are finishing up our existing work for Rosetta@Home, and then detaching our nodes for now. Rosetta is very heavy, and it’s also extremely popular so it’ll be fine while we put those cycles toward WCG’s projects.

## 2020-08-18: RPis back online

I forgot to post about their existence, but we now have 4 Raspberry Pi 4Bs as part of our farm. They were added to test Homefarm’s ability to deploy and manage ARM machines (coming in v2.8.0) and, of course, to crunch.

Last week three of the four went down due to SD card failures during their second system update after installation. I had never experienced this problem before, but I also had never used no-name SD cards which cost less than $3 before. Even when they worked, they were incredibly slow since they weren’t SDHC cards.

Yesterday I cloned the base Arch install onto replacement cards, and last night I completed the install and Homefarm deployment. They’re all back in service now.

Don’t buy cheap SD cards. Unlike cheap SSDs, there does appear to be an appreciable difference in their quality and durability.

## 2020-07-30: 150 years of WCG compute

Today we passed 150 years of CPU time for World Community Grid! From here on out, only centuries will be called out as milestones.

## 2020-06-13: Top 500

As of this evening’s stats run, we are ranked 499th in total points for World Community Grid. I’ve always felt that results returned was more meaningful, but some people try to game that one by running lots of short WUs. Onward!

## 2020-05-23: Ryzen and PPT (Pt 2: Mobos matter)

If you didn’t read part 1 of this, just scroll down one update and catch yourself up. If you did but have forgotten, here’s the nutshell version: I switched one of my 3900X nodes from a manual underclock/undervolt of 3400MHz and 1.01825 Vcore, to full-auto everything with a PPT of 70W. This resulted in slightly higher clocks, lower temperatures, and lower Icore measurements.

Today I decided it was time to clean the HSFs of the other three 3900X nodes, and give them the same settings. Now, these four nodes are identical except for one thing: some months back the first node’s motherboard died and I replaced it with the only mITX board I could get my hands on at the moment.

• node01 is on a Gigabyte X570 Aorus Pro Wifi
• nodes 02-04 all use the ASRock Fatal1ty B450 Gaming-ITX/ac

Those B450 boards were rock solid for me for over a year, with both R5 1600 and R7 2700 CPUs. When I upgraded to the 3900X though… it was a fight to get them to boot reliably. People online were talking about VRM designs and how some B450 boards weren’t architected to handle a 3900X – which made sense, as the AM4 socket had topped out at 8 cores and 95W when the B450 was designed.

Well, it looks like the issue of mobo electrical supply design has reared its head again, though not in quite so troublesome a fashion. It turns out that my B450 boards simply can’t run the 3900X at the same frequency that the X570 board can, given the same PPT limit.

When I cleaned off node02 and reset its BIOS options to match node01’s (auto clocks and voltages; PPT limit of 70W), clock speeds were averaging almost 3200MHz, and occasionally dipping just below 3000MHz. Meanwhile, node01 was running very consistently at ~3450MHz. Temperatures were nice and low (very much in line with node01’s), but clocks were slower by 250-350MHz.

I upped node02’s PPT limit to 75W and that helped a lot. Clocks now looked to bottom out at 3200MHz, and often approached 3400MHz (my original, manual underclocking setpoint).

I decided to leave node02 as it was, clean node03 and set its PPT limit to 80W, and leave node04 alone (for testing), so I could get a complete picture of what’s going on. Here’s the setup for the data that follows:

• node01: X570; clocks and voltages at auto; PPT limit 70W
• node02: B450; clocks and voltages at auto; PPT limit 75W
• node03: B450; clocks and voltages at auto; PPT limit 80W
• node04: B450; clocks manual (3400MHz); Vcore manual (1.01825); core boost off

Average core clocks:

# farmctl cmd 'cat /proc/cpuinfo | grep MHz | awk "{total += \$4; count++} END {print total/count}"'
------------------------------------------------------------------------- node01
3524.11

------------------------------------------------------------------------- node02
3311.31

------------------------------------------------------------------------- node03
3399.64

------------------------------------------------------------------------- node04
3391.54


Ambient temperature is a little lower today than when I last tested, so you can see the X570 pushing just north of 3.5GHz (at 70W!) because it has the thermal and electrical headroom. Meanwhile node03 is sticking so close to 3.4GHz that I dropped back into BIOS to make sure my old underclocking settings weren’t still in use (they were not).

Now the temperatures, voltages, and amperages:

# farmctl cmd 'sensors | grep -E "(die|core)"'
------------------------------------------------------------------------- node01
Vcore:       944.00 mV
Tdie:         +57.1 C
Icore:        60.00 A

------------------------------------------------------------------------- node02
Vcore:       938.00 mV
Tdie:         +56.2 C
Icore:        79.00 A

------------------------------------------------------------------------- node03
Vcore:       988.00 mV
Tdie:         +60.1 C
Icore:        79.00 A

------------------------------------------------------------------------- node04
Vcore:       975.00 mV
Tdie:         +58.0 C
Icore:        95.00 A


Two things stand out here. First, my manual underclocking settings are actually causing the CPU to pull many more amps than using PPT limits. Second, the X570’s board design manages to do a lot more work with a lot less electricity. I don’t know enough about the engineering of motherboards to posit why this is, but my data clearly shows that it’s true.
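One rough way to put the readings above on a common footing is to multiply Vcore by Icore, which gives an approximate core power figure per node. The sketch below is an illustration only: `core_watts` is a made-up helper name, it assumes `sensors` prints `Vcore:` in mV and `Icore:` in A exactly as shown above, and it assumes both boards report those values comparably, which is not guaranteed.

```shell
# Hypothetical helper: estimate core power (W) from `sensors` output on stdin.
# Assumes lines shaped like "Vcore:  944.00 mV" and "Icore:  60.00 A".
core_watts() {
    awk '
        $1 == "Vcore:" { mv = $2 }      # core voltage in millivolts
        $1 == "Icore:" { amps = $2 }    # core current in amps
        END { printf "%.1f\n", (mv / 1000) * amps }
    '
}

# Usage on a single node:
#   sensors | core_watts
```

Fed the four readings above, this works out to roughly 56.6W for node01, 74.1W for node02, 78.1W for node03, and 92.6W for node04. These are estimates derived from board sensors, not measurements at the wall.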

I hope that B550 boards share this level of performance, and it’s clear to me that anyone who wants to limit a modern Ryzen chip should simply set the PPT limit and leave the chip to figure out how much work it can get done.

Now I’m gonna go set nodes 02 and 04 to an 80W PPT and be done for the day.

## 2020-05-15: OpenPandemics is live

Yesterday WCG’s OPN project went from beta to live. We just barely got some work done before the daily update, giving us 2 results done, for a total of 3h33min runtime on the official first day.

With about an hour to go before today’s update, my tools are showing almost 40 days of runtime in the past day. WCG will show less, since I’m looking at completed WUs and they’re concerned with validated WUs, but we’re still getting a lot of WUs for day one of a project.

ARP was our first subproject launch, and it initially moved at a glacial pace, taking us days to get a single WU. In contrast, we’ve crunched almost 450 WUs for OPN in the first day, which is nice.

But Wuprop is showing over 81k WUs crunched in the past 24h globally. That’s fantastic.

Update: after processing, our total for the day was 250 WUs and 21d07h of CPU time. Looks like we’re way ahead of our wingmen on these WUs.

## 2020-05-10: Ryzen and PPT

TLDR: Anyone who wants to power-limit a Ryzen CPU should be using PPT instead of manual settings. It will do a better job than you will.

When the R9 3900X was released, I was there on release day. But weeks before, I had figured out from leaks and previews that I was going to be dipping my toes into underclocking/undervolting. My tiny apartment was not going to handle four 3900Xs turning and burning at 100% utilization. (The 3900X is rated at 105W nominal TDP, but under full load it will dissipate 135W+.)

I did a whole day of fiddling with voltages and clocks, followed by roughly another day’s worth once the clock stretching/vdroop issues became known. Finally things settled down to a stable 3400MHz at 1.0125Vcore.

This was a win because my 3900Xs ran decently cool (let’s say 62C +/- 3C, 95% of the time) while delivering good performance. I didn’t want to clock them below the 3.4GHz of the R7 2700 because I wanted to retain all of Zen2’s 12-15% IPC gains.

Several months later, an AGESA update accompanying the release of the 3900 enabled “Eco Mode”. It turned out that a 3900 was just a 3900X with its CPU limited to 65W instead of 105W. And more importantly, it turned out that this AGESA update enabled all Zen 2 CPUs to have a PPT (Package Power Tracking) limit set.

This is far easier and more reliable than underclocking/undervolting. It lets the chip’s CBS/PBO stay on, which lets the chip make its own moment-to-moment decisions about clock speed, given thermal and power envelopes. And that, in turn, eliminates worries about voltage droop and clock stretching. The CPU will be stable, and the reported clock speed will be the actual clock speed.

But it took until today for me to enable PPT on the first of my 3900Xs. The “why” of this is a two-parter: one, my machines were all running stably as-is; and two, the motherboards on three of my 3900Xs are kinda twitchy and flaky any time the BIOS needs to be accessed. So I left well enough alone. However, today I did a 16GB-to-32GB upgrade on the first of them, and decided this was a good time to experiment with using PPT limits instead of manual clock and voltage settings.

I first set the PPT to 60W, which turned out to be a little low for my purposes. Clock speeds were around 2.8GHz, and temps were a balmy ~48C. So I tried 70W, since I already knew how 65W would perform from reviews of the 3900. Bingo!

$ uptime
 14:08:09 up 42 min, 1 user, load average: 24.70, 24.91, 23.44
$ sensors | grep -E '(die|core)'
Vcore:       944.00 mV
Tdie:         +57.0 C
Icore:        58.00 A

# average clock speed across all cores
$ cat /proc/cpuinfo | grep MHz | awk '{ total += $4; count++ } END { print total/count }'
3477.14


After almost an hour of crunching, I’m seeing temperatures ~5C cooler, with clock speeds around 70MHz higher. The lowest I’ve seen in 40 minutes of checking was 3430MHz, and the highest was 3480MHz.
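The low and high readings can be pulled out alongside the average in a single pass. This is a minimal sketch, not the command used for the numbers in this post: `clock_stats` is a hypothetical name, and it assumes the `cpu MHz` line format that x86 Linux exposes in /proc/cpuinfo.

```shell
# Hypothetical helper: min / average / max core clock from cpuinfo text on stdin.
# Assumes lines shaped like "cpu MHz         : 3477.140" (value in field 4).
clock_stats() {
    awk '
        /^cpu MHz/ {
            mhz = $4 + 0
            if (count == 0 || mhz < min) min = mhz
            if (count == 0 || mhz > max) max = mhz
            total += mhz
            count++
        }
        END {
            if (count > 0)
                printf "min=%.0f avg=%.0f max=%.0f\n", min, total / count, max
        }
    '
}

# Usage:
#   clock_stats < /proc/cpuinfo
```

Note that this is a snapshot across cores at one instant. A range observed over time, like the one quoted above, would need the command run repeatedly.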

The current 30-day average of WCG results returned for this device (excluding today) is 236.069. As you can see from its device page, its numbers fluctuate a lot day-to-day because it’s crunching all available subprojects and they have wildly different runtimes (also, because there’s sometimes validation lag). But I assume that 30 days is enough time to smooth things out for this purpose. I’ll report back in a month with the updated average.
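For a follow-up comparison, a trailing average is simple to recompute from a list of daily counts. The sketch below assumes a hypothetical input of one day’s results-returned count per line, oldest first. `trailing_mean` is an illustrative name, and nothing here comes from WCG’s own tooling.

```shell
# Hypothetical helper: mean of the last N values read from stdin (one per line).
# $1 is N, the number of trailing days to average.
trailing_mean() {
    tail -n "$1" | awk '
        { total += $1; count++ }
        END { if (count > 0) printf "%.3f\n", total / count }
    '
}

# Usage, assuming daily_results.txt is a file you maintain yourself:
#   trailing_mean 30 < daily_results.txt
```

Comparing the 30 days before the PPT change against the 30 days after it keeps the window the same on both sides, which matters given how much the daily numbers swing.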

## 2020-05-03: WCG Quasquicentennial!

Today we passed 125 years of CPU time for World Community Grid! Onward!

Later this year (probably Q4) the Ryzen 4000 series CPUs and the B550 chipset will be released. When this happens, we’ll be rebuilding the entire fleet.

Between now and then, we’re going to upgrade all 3900X machines to 32GB of RAM. The new memory for the first of these has been ordered, and when it arrives it will also enable node05 to be bumped up to 16GB. Nodes 05 and 06 were more-or-less built from spare parts after the previous round of upgrades, so they’re a little below-spec in various ways.

After this, all nodes will be identical, and likely, all nodes will have a GPU.

## 2020-04-28: WCG workunits milestone

Today we passed 500,000 WUs returned for World Community Grid!

Our official count as of today’s team stats update is: 500,838.

We’re closing in on 125 years of CPU time for WCG, as well.

## 2020-03-21: Covid19

As grid computing efforts toward combatting Covid19 have come to the forefront, we have devoted more of our CPU time to Rosetta@Home.

We’ve also dialed back the GPU work we were doing for Einstein@Home and Asteroids@Home; we are now only crunching WUs for them when there are no WUs available from GPUGrid.

This means that our current blend of compute resources is 50/50 Rosetta and WCG on the CPU side of things, and 100% GPUGrid (again, unless they do not have work available) on the GPU side. We’re all in on biomedical research right now.

## 2020-01-13: One century of WCG compute time

Just over two years since founding our team, we’ve hit 100 years of CPU time in support of World Community Grid projects. We’re very happy and proud of this, but the work goes ever onward.

## 2020-01-13: 50 million points; GPGPU

Points in BOINC are a lot like points in Whose Line, but it’s worth noting when a big number goes by anyway. Sometime last week, we passed 50 million points across all projects.

Also, in the past 2 weeks we’ve gotten back into GPGPU crunching, meaning that we’ve returned to GPUGrid and Einstein@Home.