Parallelizing N-Queens

Vanadium 50 · Jun 8, 2024

A colleague has a parallel algorithm problem. There is a similar, more familiar problem that also uses backtracking, so lets discuss that.

The problem is called n-queens. You have an nxn chessbaord and the goal is to place n queens on them so that no queen attacks another: no two queens are on the same row, the same column, or the same diagonal.

The classic solution involves backtracking. Given x queens have been placed, try all possible places for queen x+1 that satisfy the conditions. If you find one, go on to queen x+2. If you don't, go back and try another position for queen x.

The question is, "how do you best parallelize this?"

I have two possible solutions, neither of which I really like.

(1) Imagine you have n threads. The first thread places the queen on square 1, the second on square 2 and so on, and then the n threads continue on from that point. At the end, you collect all the solutions.

The first problem is that this is limited to n-way parallelim. If I am solving the 8 queens problem and my computer has 64 cores, this is not very efficient. "Ah!" you say. "Split this up using the 1st two queens!" This is better - that's n(n-1) threads, or 28 for 8 queens. But that exposes the second problem: load balancing. you might have some threads terminate quickly because they have only a few solutions and others go on and on. So its not very efficient.

(2) This comes about it from the other direction. Suppose I have all but two queens placed. I have two unused rows and two unused columns, so I have two possibilities. One thread tries one, and another tries the others. That's the idea, but for efficiency you want maybe "all but 5 queens" or some other large number.

The issue here is that you spend a lot of time figuring out what the permutations are - the checking of potential solutions goes quickly compared to this. (I have columns 2, 3,5,and 7 open, along with rows 1, 2, 4 and 5. Quick! What is the 11th permutation of queens?) So while the parallel part looks great, the price is a big serial part.

The best I can come up with is a variation on (1) - launch 1000 threads on the 64 cores. Way, way overcommit them and balance the load that way. You still have some long-running threads, but they have less work and balance the load that way.

This feels very brute force-ish. Can someone do better?

Baluncore · Jun 8, 2024

How would a quantum computer solve the problem?
https://arxiv.org/abs/1803.00735

Vanadium 50 · Jun 8, 2024

Baluncore said:

How would a quantum computer solve the problem?

Well, first it would need to exist. (i.e. be big enough and hold coherence long enough)

Ibix · Jun 8, 2024

You can "name" a partial solution to N-queens as a list of queen positions - e.g. [1,3,5]. Naively, there are N possible next moves, [1,3,5,1], [1,3,5,2] ... [1,3,5,N]. I know when I've found a solution because I have a list of length N and I can tell when I've failed.

Does an algorithm something like this work? A node checks if its current partial list is valid and exits if not. If it's valid, though, it checks to see if any other nodes are free. If there are any, it divides its next search level into equal shares and hands them out to the idle nodes.

So with 8 queens and 64 nodes you'd use one node to check [], but that node would share out [1], [2], [3] ... [8] to 8 nodes, and those nodes would share [1,1] ... [8,8] among 64 nodes, but thereafter each node would check all its own list's "children" until one node terminates and divides one other lucky node's work with it. (Yes, I'm aware checking all 8 positions after rank 1 is naive.)

Vanadium 50 · Jun 8, 2024

So let me see if I understand.

Consider N = 4.

Launch 4 threads each with the first queen in positions 1-4.
Each of those threads launches 4 threads. [1,1], [1,2] etc. so there are 16.
We check each of the 16. Four fail because two are on the same row: [1,1], [2,2], etc. and 6 fail because they are on the same diagonal. We are left with 6 threads: [1,3], [1-4], [2-4], [3,1], [4,1], and [4,2], plus their 4 parents which are waiting.

We then launch 4 more threads per survivor. For example, [1,3] launches [1,3,1], [1,3,2], [1,3,3] and [1,3,4]. None of these work, so they die. Since [1,3] no longer has parents, it dies as well. So we're actually left with only two leaf threads: [2, 4, 1] and [3, 1, 4] and their parents: six threads, of which 4 are waiting.

The two active threads launch 4 more each. Six die immediately, leaving two successful, which then get reported up the chain.

This peaks at 24 active threads.

Baluncore · Jun 8, 2024

Vanadium 50 said:

Well, first it would need to exist.

Vanadium 50 said:

Consider N = 4.

How many queens can you place on a 4 x 4 board ?
What about a 3 x 3 ?

Ibix · Jun 8, 2024

Vanadium 50 said:

So let me see if I understand.

Seems like what I had in mind. I think it's only practical because the states form a directed acyclic graph and there's an obvious indexing that makes it easy to divide up work on the fly without worrying about duplication or missing something.

Note that it's a breadth-first search as I proposed it, which may not be what you want. You could perhaps re-jig it as breadth-first while there are idle processors and depth-first otherwise?

Vanadium 50 · Jun 8, 2024

Ibix said:

re-jig

I think some re-jigging is required anyway.

As described, the amount of calculation each thread needs to do is minimal. If it is more expensive to launch a thread (it likely will be), the parallelism won't scale well. So to better balance, you'd probably toss the obvious cases sooner - i.e. the parent thread checks for row and columns, and then the daughter thread does the diagonals.

As before, consider N = 4.

Launch 4 threads each with the first queen in positions 1-4.
Each of those threads launches three threads. [1,2], [1,4] etc. so there are 12.
We check each of the 12. Six fail because they are on the same diagonal. We are left with 6 threads: [1,3], [1-4], [2-4], [3,1], [4,1], and [4,2], plus their 4 parents which are waiting.

We then launch 2 more threads per survivor. For example, [1,3] launches ], [1,3,2] and [1,3,4]. Neither of these work, so they die. Since [1,3] no longer has parents, it dies as well. So we're actually left with only two leaf threads: [2, 4, 1] and [3, 1, 4] and their parents: six threads, of which 4 are waiting.

Then finishing up is trivial.

This peaks at 16 threads, 12 active and 2 waiting.

pbuk · Jun 10, 2024

Vanadium 50 said:

So to better balance, you'd probably toss the obvious cases sooner

What is the difference between a test for an obvious dead end and a not obvious one? For n-queens I don't think there is one, however this may exist for "your colleague's" problem (I've heard that one before

)

Vanadium 50 said:

- i.e. the parent thread checks for row and columns, and then the daughter thread does the diagonals.

The way the algorithm is usually implemented guarantees that the column (or row) is unique and the test for diagonals is no more expensive than (and usually implemented in the same loop as) the test for rows (or columns):

JavaScript:

function check(board, currentCol) {
  const currentRow = board[currentCol];
  for (let col = 0; col < currentCol; ++col) {
    // Check row.
    if (board[col] === currentRow) return false;
    // Check upper diagonal.
    if (board[col] === currentRow + currentCol - col) return false;
    // Check lower diagonal.
    if (board[col] === currentRow - currentCol + col) return false;
  }
  return true;
}

Vanadium 50 said:

The best I can come up with is a variation on (1) - launch 1000 threads on the 64 cores. Way, way overcommit them and balance the load that way.

Each parallel thread will require its own memory for maintaining state. It is therefore usually better (for L1/L2/L3 cache efficiency) to only launch as many threads as you have cores. There are at least two ways to do this for a processor with C cores:

Launch C threads utilising a shared state for depths up to D to ensure they operate on unique branches, therafter using their own state.
Launch one controlling thread operating on depths up to D which launches C parallel threads, and then generates a new partial candidate at depth D when each one terminates.

Vanadium 50 · Jun 10, 2024

pbuk said:

however this may exist for "your colleague's" problem (I've heard that one before

Their problem is to create a ML training dataset of particle trajectories, and the hope is that they can simulate them without using a full physics-based simulation, which is slow. They are looking at special cases, which makes doing it the right way even slower. Their code uses backtracking: n-queens is another, better known, backtracking problem.

(For the record, my suggestion is to plow ahead with the old, slow algorithm - by the time a better one is developed, the jobs will be finished. CPU time is cheaper than scientist time.)

pbuk said:

to only launch as many threads as you have cores

That is totally counter to my experience!

I agree that is a logical starting point. But I have had cases where it was faster the leave one core idle because now everything fit in cache. Toss another core at it, and the cache miss rate skyrocketed and everything ran slower.

I've had the opposite - overcommit the cores by one, so while one core is waiting for something to happen, another can do work. On one machine, running 80 threads on 64 cores was ~8-10% faster than running 64.

In the n-queens case, the memory needed to specify the state is small, and the computational intensity relative to thread-creation overhead is small. (Neither of these is necessarily true in the real-life case) To me that suggests a master-slave design: you start all the worker threads first, and one thread hands the others data to process. It suggests something more like MPI than OpenMP.

pbuk · Jun 11, 2024

Vanadium 50 said:

Their code uses backtracking: n-queens is another, better known, backtracking problem.

But apart from that, the problem sounds very different to n-queens: I'm assuming the state vector is pretty big and therefore expensive to clone for a new thread.

It does sound like an interesting problem, although using a simulation to generate training data for a neural solution is scary! Presumably the simulation validates well against experimental data.

Vanadium 50 said:

(For the record, my suggestion is to plow ahead with the old, slow algorithm - by the time a better one is developed, the jobs will be finished. CPU time is cheaper than scientist time.)

Probably the best answer - if a problem can be solved by simply throwing resources at an existing trusted method then that is usually the best way. Novel methods can be saved for intractable problems.

Vanadium 50 said:

But I have had cases where it was faster the leave one core idle because now everything fit in cache. Toss another core at it, and the cache miss rate skyrocketed and everything ran slower.

Yes, that can happen.

Vanadium 50 said:

I've had the opposite - overcommit the cores by one, so while one core is waiting for something to happen, another can do work.

So can that - this is the default situation where you have a marshalling thread which sleeps until one of the C backtracking threads terminates.

Vanadium 50 said:

In the n-queens case, the memory needed to specify the state is small, and the computational intensity relative to thread-creation overhead is small. (Neither of these is necessarily true in the real-life case)

Yes, this seems to match my observation at the top of this post.

Vanadium 50 · Jun 11, 2024

The actual problem is more like lining queens up than not lining them up, but I didn't want to spend many messages explaining it. I also don't know all the details, since the colleague is focusing on a subset of events where conventional reconstruction does poorly.

It's not my code, but I can see a use case where the ML works "just well enough" to get the conventional code started on the right path.

The state is still small: say 100 queens. It is also not necessary to find every solution, but to find enough to populate a training dataset. And as you know, when the number of queens increases, the number of solutions goes up, but the density goes way down.

My 80 vs, 64 example ran identical code on all 64 nodes. I assume what happened is when #64 was waiting on a cache miss, #65 would start work. So long as a context switch is faster than a memory line, this works. If #65 had a cache miss too, it would fire up #66, and so on until someone was able to resume work.

I was actually running just under 4 million threads. Most of my time was not spent in algorihm development. It was keeping the cores fed.

Parallelizing N-Queens

Similar threads

Hot Threads

Recent Insights