standup5x5

Solutions to Stand Up Maths 5x5 Unique 25 letter problem

119
11
C

Stand Up Maths 5 x 5 unique letters solutions

My solutions to the problem posed by Matt Parker of Stand Up Maths

See: https://www.youtube.com/watch?v=_-AfhLQfb6w

Building and Running

Just run make and execute ./a25 -v, ./s25 -v, ./v25 -v or ./525 -v

NB: Omit the -v option if planning to use the time shell call

I’ve set the default compiler to clang-12 in the Makefile, as this produces
the fastest executables in my testing, but edit the Makefile and switch to
gcc if that’s what you have

a25, s25, v25, and 525 all support the -v option which will emit run-time metrics
If measuring with the time shell call, be aware that it typically takes about 1ms
for an executable to get up and running, the times reported by time will be slightly
higher than the internally measured times. Also, the time taken to print the metrics
can become a non-neglible factor at these low execution times.

For speed, all solutions are written to a file named solutions.txt in the
current directory

[a25|s25|v25|525] [-v] [-t num_threads] [-f word-file]

  • -v : Normally no console output is produced. -v allows the executable to emit metrics
  • -t : Allows the user to specify the number of threads to use. By default the executables will use 1 or 2 less threads than there are CPUs on the system
  • -f : Allows the user to specify an input word file to use. By default the executables will use the words-alpha.txt file

Execution Times

My development systems are a desktop AMD 5950x based PC, and an Intel i7-1165G7
based laptop. On my AMD system, which runs at a fixed at 4.7GHz CPU clock with
hyper-threading disabled, the following (internal) times are seen, which include
the loading of the full 4MB words_alpha.txt file:

a25 takes around 18ms to complete, using 16 threads

s25 takes around 0.98ms to complete using 16 threads

v25 takes around 0.82ms to complete using 16 threads

525 won’t run on my AMD system. On my Intel laptop it takes 1.35ms to complete
using 8 threads. I estimate that on a full desktop runtimes of 0.80ms should be achievable

All algorithms use a bit-wise representation of the words for efficiency of comparing

All algorithms also implement lockless thread co-ordination for speed.

History

My first attempt at a solution (source not included) focused on the New York Times
wordle list, for which there is 10 valid solutions. It split the word set into
2-vowel and 1-vowel sub-sets, and utilised the fact that all actual words must have
one vowel in them (vowel being a, e, i, o , u, y). This meant that there could be
either 5 words with 1-vowel, or 4 x 1-vowel words and a single 2-vowel word as
valid solutions. That code just did a straight reduction search on the 1-vowel
set, which for the NYT Wordle List, only meant a little over 1-thousand 1-vowel
words. The single 2-vowel set was then run over all 4 x 1-vowel sets to find
solutions. This ran extremely quickly, being able to find all 10 solutions in
around 1.5ms

After commenting on the Youtube video, I then discovered that most people were
using the full words-alpha.txt file as a starting set, and even crazier to me,
people were using the full ~4MB word set, instead of at least using just a set
of only the 5 letter words. Okay, I guess?

My first solution did not work properly on the words-alpha input set as that set
contains many words with no-vowels in it, thereby breaking my assumption that
all valid words should have at least 1 vowel.

So I needed to redesign

a25

This was my 2nd attempt at a full solution. It implements a more generic version
of my first attempt. Instead of using 2-vowel and 1-vowel words, it split the
input set into one set containing the most frequently character, which also
handily happens to be a, and everything else. Using my first approach, this
generates a set of 4-word partial solutions from the non-a words. For
completeness it also tries to find 5-word solutions using all letters but a,
but none exist.

Once all 4-word partial solutions are found, the a set is combined with the
4-word solutions to find all 5-word solutions.

This approach works very well and is very fast, but the idea of splitting the
input into multiple sets grew into my third solution attempt.

s25

I’ve included combos.c which is non-functional pseudo-code that formed the
basis of s25. I wrote a frequency implementation that originally ordered
all words from most the frequently occurring letter to the least. This
approach produced just 9 word sets to represent all words and was easy to
combinate over.

After putting it all together, it worked, but quite slowly. Taking about 50s
to run. I then remembered some of the comments in the Youtube video about
trying from least-frequent to most-frequent, so I reversed the sort order, and
the solution now completed in about 15s. This was still disappointing.

I then realised that we only need to use 25 letters out of 26, and so I tried
stopping the combination sequence 1 set short of the full combination set.
Doing this got the execution time down to under 1s while still finding all 538
solutions.

After this, I starting looking at other solutions online, and came across
Sylvester Hesp’s solution, and to my surprise, we had both essentially developed
the same solution. Where I was focusing on combinating over sets, Sylvester was
combinating over letters, but the actual implementation was extremely similar
in a remarkable example of convergent design.

Sylvester’s work is here: https://github.com/oisyn/parkerwords

I then saw how Sylvester had solved the 25 instead of 26 approach with his
clever skipping implementation that skips at the start of the combinator,
instead of my approach that skipped at the end. His approach actually
applied perfectly to my code (I owe a huge debt of gratitude to his solution
to this issue), and with that the initial implementation of s25 was running
in around 4-5ms for the actual main algorithm loop, and ~5-7ms to load in the
the words-alpha file, with overall run-times ranging from 9-14ms depending on
what the system is doing. Run times were signiciantly less if using the
words-alpha-five.txt or NYT files as input.

v25

v25.c is just s25.c but with AVX2 instructions added for processing the
key sets. The use of AVX-2 saw the main algorithm loop be sped up by almost
3x. This is my first time ever playing around with AVX instructions, so it
was a fun learning experience.

525

525.c is v25.c with AVX-512 instructions, instead of AVX2. This will not
compile successfully on any platform that does not support AVX-512.

AVX-512 will run the main algorithm in around 65% of the time of the AVX2
version. Of course, file load times are still a significant factor, and so
I’d expect to see overall run-times at ~75-80% of v25 run times. Sadly I
do not have a high-end AVX-512 capable desktop to verify this claim, although
these ratios were true on the AVX-512 laptop.

Words Alpha File Reading

A large problem I dealt with was how to quickly read in and process the 4MB
word file. My first attempt just processed the file with a straight mmap()
and that saw the file get read in and processed in around 7ms. My second
attempt used multiple threads, which each thread working on their own portion
of the input file. This worked fairly well, but hash table building was a
large source of contention. Additionally, the system doesn’t really like
many threads jumping about all over a single file.

My final solution was to set the readers to incrementing an atomic integer to
gain access to the next 8KB block of the file. This improved locality and
also ensured a (mostly) sequential step through the file, which boosted file
reading times dramatically. I then had the readers focus on finding 5 letter
words and inserting them into an array, with an atomic integer defining where
in the array any particular reader would insert their word.

I then had the main thread process the word array that was being built by the
reader threads to build the hash table concurrently. Doing all this together
managed to get file load and hash table build times to under 0.7ms on my AMD
system.

Frequency Rescanning

About mid-way through the frequency set build, we rescan the frequencies and
re-order the remaining sets from most-frequent to least. The works because
the bulk of the benefit from the least to most frequent ordering has already
been gained by this stage, and reducing the length of the combinatorial “tail”
has an amortizing effect of more than making up for the increased set sizes
mid-algorithm

Clean-Up

I moved all the file reading and other utility functions to a common utilities.h
file that is included by the main algorithms. In this way, any future tweaks
should be consistent across all implementations

Major Update Sep 1st

I received an email from Landon Kryger about his solution here:
https://github.com/GuiltyBystander/5words

Landon had come up with the most efficient implementation of the search problem
I’d ever seen. Truly fantastic work! Landon says that he arrived at his
solution also fairly independently, but at its core it uses the same basic
algorithm that Sylvester and myself had found.

Where Landon went one step further is that he had added a further breakdown of
the search space, where the presence (or not) of the most common letters is
used to divide each set further, such that when progressing through the sets,
characters that are already seen are used to eliminate sub-sets of the sets
from being considered. This comes at a cost of a somewhat high amount of set
up overhead, but for a single-threaded solution, his was simply the fastest
around.

Landon was looking for someone to collaborate with to help make his solution
faster. I spent some time analysing how his algorithm works, and while it is
extremely clever and elegant, it is somewhat difficult to parallelise the
setup steps, and the bit-mask remapping he implements which is essential to
the speed of his approach consumes a significant amount of CPU time, and also
forces a point of linearity to derive the mapping. The remapping of the bitmaps
can be parallelised, as can the set building, but spatial locality issues arise
from the data being scattered across many different arrays/vectors.

As a direct comparison, even though Landon’s solution makes 30x less comparisons
during the search phase, his algorithm is just 3x faster when using a single
thread. I think we can implement a solution true to his original vision in a
highly parallelised manner, but it would be a fair amount of work.

So instead I worked with Landon to arrive at a hybrid solution that isn’t as
elegant or as flexible as Landon’s core solution, but preserves the strong
spatial locality inherent in the “pointers within a single set” approach that
my solution implements.

With that, I updated s25.c, v25.c and 525.c to implement this hybrid approach,
and huge speedups were seen in low thread numbers, but smaller speedups as
threads are increased.

v25.c is now capable of finding all 538 solutions on my system in under 1ms
including loading the 4MB words-alpha.txt file, solving, and writing the
results out. Even using just a single thread takes under 6ms.

Where Landon’s approach uses the 6 most frequently occurring characters to
create subsets from, the hybrid solution only uses 2, and it is still slightly
slower than Landon’s solution for a single thread scenario for non-AVX mode.

Update: The master branch has 6 tier code now

I was able to parallelise the tier setup paths. This has resulted in the setup
for 6 tiers taking only 10us more than the setup for 3 tiers. As a result, the
6 tier path is now the fastest in all scenarios.

Update 2: I discovered a low-CPU-cost way to calculate the best characteres to
use to split subsets for each character set. Main algorithm compares were
reduced by 33%.

Conclusion

I’m including the words-alpha file, plus a version of words-alpha that only has
the 5-letter words in it, as well as the NYT Wordle sets. It seems silly to me
for us to be spending time finding ~80KB of 5 letter words from a ~4MB file but
since that’s what people seem to be doing, so I included the full set here.

Thank you to Matt Parker for making the problem public, and all the intelligent
and robust discussion of solutions in the Youtube comments, and a big thank you
to Sylvester Hesp for his approach to the set skipping issue, and to Landon
Kryger for his insightful algorithm improvements!