Voltara
Member
I have just released the source code for vcube, a fast optimal Rubik's Cube solver. It is a rewrite of one of the solvers behind the "optimal scrambler" that I posted about several months ago. The code, which is licensed under GPLv3, is available on GitHub at:
https://github.com/Voltara/vcube
This new version borrows heavily from Tomas Rokicki's nxopt solver, using his pruning table design, which is superior to the 1.6-bit format that I used before, and incorporating many of his optimizations.
From my old solver, it inherits a SIMD-optimized cube model that takes advantage of the AVX2 instruction set introduced with Intel's Haswell microarchitecture. One notable result is that it can compose (multiply) two cube positions in 5 CPU instructions; an edges-only cube requires only 3.
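For readers unfamiliar with what "composing" two positions means, here is a minimal scalar sketch in Python for the edges-only case. It is not vcube's code: the representation (a list of 12 `(permutation, orientation)` pairs), the composition convention, and the names `compose_edges`, `identity`, and `random_state` are all assumptions made for illustration. The idea it tries to show is that composition is a table lookup of one state through the other plus an orientation fix-up, which is the kind of operation a byte shuffle can do for all 12 edges at once.

```python
import random

NUM_EDGES = 12  # a 3x3x3 cube has 12 edge cubies


def compose_edges(a, b):
    """Compose two edge states (hypothetical convention).

    Each state is a list of 12 (perm, orient) pairs. Entry i of the
    result is found by looking up b's permutation index in a, then
    adding the two orientation bits modulo 2 (XOR).
    """
    result = []
    for perm_b, orient_b in b:
        perm_a, orient_a = a[perm_b]
        result.append((perm_a, orient_a ^ orient_b))
    return result


def identity():
    """The solved state: every edge in place, unflipped."""
    return [(i, 0) for i in range(NUM_EDGES)]


def random_state(rng):
    """A random edge arrangement for exercising compose_edges.

    This ignores the parity and flip constraints of a physical cube,
    so the state may not be reachable by legal moves.
    """
    perm = list(range(NUM_EDGES))
    rng.shuffle(perm)
    return [(p, rng.randrange(2)) for p in perm]


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = random_state(rng), random_state(rng)
    print(compose_edges(a, b))
```

A SIMD version would pack each pair into one byte and replace the Python loop with a shuffle followed by an orientation correction; the exact byte layout and instruction sequence used by vcube are in the repository and are not reproduced here.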
On my 32GB i7-7700K (4.20 GHz, 4 core, 8 thread) desktop, using a 22GB pruning table, vcube solves random cube positions at a rate of 6.0 cubes/second. This is an improvement over nxopt (modified to support 1GB huge pages), which I measured at 3.8 cubes/second on the same hardware and data set.
Some additional speed tests, run on Linode virtual servers:
64GB, E5-2680 v3 @ 2.50 GHz, 16 core virtual server:
- vcube with 32GB pruning table: 6.9 cubes/second
- vcube with 58GB pruning table: 15.1 cubes/second
192GB, E5-2697 v4 @ 2.30 GHz, 32 core virtual server:
- vcube with 170GB pruning table: 55.0 cubes/second
- nxopt with 170GB pruning table: 49.2 cubes/second
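To put the figures above in relative terms, the ratios can be worked out directly. This short Python sketch uses only the rates quoted in this post; the helper name `speedup` is just for illustration.

```python
def speedup(rate, baseline_rate):
    """Ratio of two throughput figures given in cubes/second."""
    return rate / baseline_rate


# vcube vs. nxopt on the desktop, 22GB table
desktop_ratio = speedup(6.0, 3.8)
# vcube vs. nxopt on the E5-2697 v4 server, 170GB table
server_ratio = speedup(55.0, 49.2)
# vcube on the E5-2680 v3 server: 58GB table vs. 32GB table
table_ratio = speedup(15.1, 6.9)

print(f"desktop, vcube vs nxopt: {desktop_ratio:.2f}x")
print(f"170GB server, vcube vs nxopt: {server_ratio:.2f}x")
print(f"58GB vs 32GB table: {table_ratio:.2f}x")
```

That works out to roughly 1.58x on the desktop, 1.12x on the 170GB server, and 2.19x from moving to the larger table on the 64GB server.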
I also ran a series of tests of the 22GB pruning table on the E5-2697 v4 @ 2.30 GHz virtual server at varying concurrency levels, to get an idea of how that hardware scales and how it compares to my desktop:
- 32 threads: 9.8 cubes/second
- 16 threads: 6.4 cubes/second
- 8 threads: 3.5 cubes/second
- 4 threads: 1.8 cubes/second
The results for my desktop were 6.0 cubes/second at 8 threads and 3.9 cubes/second at 4 threads. Much of the speed difference can be attributed to the widely differing CPU frequencies. There was also certain tuning that I was unable to perform on the virtual servers. If I get the opportunity, it would be interesting to test the larger tables on a physical server.
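The scaling numbers above are easier to compare as per-thread throughput and as desktop-to-server ratios set against the clock ratio. The following Python sketch does that arithmetic with the figures from this post; the variable and function names are illustrative only, and it says nothing about the tuning differences, which are not quantified here.

```python
# Measured rates in cubes/second with the 22GB table, keyed by thread count.
server_rates = {32: 9.8, 16: 6.4, 8: 3.5, 4: 1.8}   # E5-2697 v4 @ 2.30 GHz
desktop_rates = {8: 6.0, 4: 3.9}                     # i7-7700K @ 4.20 GHz


def per_thread(rates):
    """Throughput divided by thread count, for each measurement."""
    return {threads: rate / threads for threads, rate in rates.items()}


server_per_thread = per_thread(server_rates)

# Desktop-to-server ratio at the thread counts measured on both machines.
desktop_vs_server = {
    threads: desktop_rates[threads] / server_rates[threads]
    for threads in desktop_rates
}

clock_ratio = 4.20 / 2.30

for threads in sorted(server_per_thread):
    print(f"{threads:2d} threads: {server_per_thread[threads]:.3f} cubes/second/thread")
for threads in sorted(desktop_vs_server):
    print(f"desktop/server at {threads} threads: {desktop_vs_server[threads]:.2f}x")
print(f"clock ratio: {clock_ratio:.2f}x")
```

On the server, per-thread throughput drops from 0.45 cubes/second at 4 threads to about 0.31 at 32 threads. The desktop is about 1.71x faster at 8 threads and 2.17x faster at 4 threads, against a clock ratio of about 1.83x.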