Your parallelization strategy is wrong. The problem is the block delimited with this within the function checkQueen
#pragma omp parallel num_threads(n)
The content code for this block will check the entire board. Since you are casting n
threads there, each of these n
threads will check the entire board, which means a lot of redundant work. This block will only end when the last of these threads ends. Add that with the complexity and performance impact to maintain the atomic assignments and with that the result will be much slower than a version with a single thread.
In other words, the problem is that instead of you dividing the work into n
threads, you multiplied with%
Worse, the n
function is called multiple times within the checkQueen
recursive function. This means that this overhead of creating and synchronizing threads will be multiplied by the number of times play
is called, making everything even slower.
I strongly suggest you to parallelize the checkQueen
function, not the play
function. To benefit from parallelism, it is important that you depend on as few sequential parts as possible < in> single-thread , and that's exactly how the checkQueen
function is.
In addition, to have a good performance gain when paralleling, it's important to avoid threads having to interact with each other. It is perfectly possible to let them interact and is often unavoidable, but in doing so, you will probably have some performance cost, and so this is something to be avoided / minimized. The way you have implemented play
has a lot of thread interactions through the checkQueen
variable.
The a
within for
to initialize everything with zero also does not have an appreciative gain of performance when being parallelized. On the contrary, the overhead that thread management should impose will probably be greater than the gain. Incidentally, for performance to gain, the amount of workload that each thread has to earn has to be large enough for it to bypass that overhead, and none of the parts you have paralleled have that feature.
I also have two observations:
-
Make main
return 1 when it is not possible to put a queen and 0 when it sounds a bit strange, since it implies that 1 = no and 0 = yes, the opposite of the convention. By moving this, you can remove the checkQueen
operator that is in the !
of the if
function. Better yet, you can rename this function to play
to be clearer.
-
You can allocate a single memory area with queen_ok
cells and use n * n
to access the elements. That way, you only have to allocate memory once to mount the board and also ensures that it is contiguous, which offers better performance.
-
To prevent one thread from interfering with another while changing the board, create one board per thread and use a function to copy the boards.
-
To avoid having to work with queen[linha * n + coluna]
and int *queen
in all places, I suggest you create a int n
for this.
-
To prevent threads from competing with struct
when displaying solutions, use locking. The solution below uses the concept of SOen .
-
Perhaps when using printf
, you will end up having a good performance gain considering that not all partially filled trays have the same weight to be analyzed.
-
These variables schedule(dynamic,1)
, sol
and count
will get in the way of parallelization. But it is not very difficult to eliminate them. The maxQueens
can be placed in the count
I mentioned. struct
is only used in a context where you have lock due to the need to use sol
. printf
can be updated lazily at this point only.
-
Do not forget to give maxQueens
to everything you give free
.
-
In C, most functions use the notation malloc
, not the separada_por_underscores
notation.
I suggest that your code looks like this, paralleling camelCase
:
#include <stdio.h>
#include <stdlib.h>
#include "omp.h"
typedef struct Tabuleiro {
int n;
int *casas;
int queens_count;
} Tabuleiro;
// Funções básicas sobre tabuleiro.
Tabuleiro *criar_tabuleiro(int n);
Tabuleiro *copiar_tabuleiro(Tabuleiro *original);
void destruir_tabuleiro(Tabuleiro *tabuleiro);
void print_tabuleiro(Tabuleiro *tabuleiro);
// Implementação do n queens.
int queen_ok(Tabuleiro *tabuleiro, int linha, int coluna);
void play(Tabuleiro *tabuleiro, int *solucoes, int *max_queens, omp_lock_t *lock);
void play_in(Tabuleiro *tabuleiro, int coluna, int *solucoes, int *max_queens, omp_lock_t *lock);
// Função main.
int main() {
int n = 10, solucoes = 0, max_queens = 0;
omp_lock_t lock;
omp_init_lock(&lock);
Tabuleiro *tabuleiro = criar_tabuleiro(n);
play(tabuleiro, &solucoes, &max_queens, lock);
printf("Maximo de rainhas: %d \n", max_queens);
printf("Numero de solucoes: %d \n", solucoes);
omp_destroy_lock(&lock);
destruir_tabuleiro(tabuleiro);
return 0;
}
Tabuleiro *criar_tabuleiro(int n) {
Tabuleiro *tabuleiro = malloc(sizeof(Tabuleiro));
tabuleiro->n = n;
tabuleiro->queens_count = 0;
tabuleiro->casas = malloc(n * n * sizeof(int));
for (int i = 0; i < n * n; n++) {
tabuleiro->casas[i] = 0;
}
return tabuleiro;
}
Tabuleiro *copiar_tabuleiro(Tabuleiro *original) {
int n = original->n;
Tabuleiro *tabuleiro = malloc(sizeof(Tabuleiro));
tabuleiro->n = n;
tabuleiro->queens_count = original->queens_count;
tabuleiro->casas = malloc(n * n * sizeof(int));
for (int i = 0; i < n * n; n++) {
tabuleiro->casas[i] = original->casas[i];
}
return tabuleiro;
}
void destruir_tabuleiro(Tabuleiro *tabuleiro) {
free(tabuleiro->casas);
free(tabuleiro);
}
void print_tabuleiro(Tabuleiro *tabuleiro) {
int n = tabuleiro->n;
for (int i = 0; i < n; ++i) {
printf("\n");
for (int j = 0; j < n; ++j) {
printf("%c", tabuleiro->casas[i * n + j] ? 'X' : '.');
}
}
printf("\n\n");
}
int queen_ok(Tabuleiro *tabuleiro, int linha, int coluna) {
int n = tabuleiro->n;
int *queens = tabuleiro->casas;
if (queens[linha * n + coluna] == 1) return 0;
// Diagonal para cima e para a esquerda.
for (int i = linha, j = coluna; j >= 0 && i >= 0; --i, --j) {
if (queens[i * n + j]) return 0;
}
// Diagonal para baixo e para a esquerda.
for (int i = linha, j = coluna; j >= 0 && i < n; ++i, --j) {
if (queens[i * n + j]) return 0;
}
// Diagonal para cima e para a direita.
for (int i = linha, j = coluna; j < n && i >= 0; --i, ++j) {
if (queens[i * n + j]) return 0;
}
// Diagonal para baixo e para a direita.
for (int i = linha, j = coluna; j < n && i < n; ++i, ++j) {
if (queens[i * n + j]) return 0;
}
// Para baixo.
for (int i = linha; i < n ; ++i) {
if (queens[i * n + coluna]) return 0;
}
// Para cima.
for (int i = linha; i >= 0; --i) {
if (queens[i * n + coluna]) return 0;
}
// Para a direita.
for (int j = coluna; j < n; ++j) {
if (queens[linha * n + j]) return 0;
}
// Para a esquerda.
for (int j = coluna; j >= 0; --j) {
if (queens[linha * n + j]) return 0;
}
return 1;
}
void play(Tabuleiro *tabuleiro, int *solucoes, int *max_queens, omp_lock_t *lock) {
int n = tabuleiro->n;
#pragma omp parallel num_threads(n) schedule(dynamic,1)
for (int i = 0; i < n; ++i) {
Tabuleiro *copia = copiar_tabuleiro(tabuleiro);
copia->casas[i * n] = 1;
copia->queens_count++;
play_in(copia, 0, solucoes, max_queens, lock);
destruir_tabuleiro(copia);
}
}
void play_in(Tabuleiro *tabuleiro, int coluna, int *solucoes, int *max_queens, omp_lock_t *lock) {
int n = tabuleiro->n;
if (coluna == n) {
omp_set_lock(lock);
printf("\nSolucao #%d\n", ++*solucoes);
print_tabuleiro(tabuleiro);
if (*max_queens < tabuleiro->queens_count) {
*max_queens = tabuleiro->queens_count;
}
omp_unset_lock(lock);
return;
}
for (int i = 0; i < n; ++i) {
if (queen_ok(tabuleiro, i, coluna)) {
Tabuleiro *copia = copiar_tabuleiro(tabuleiro);
copia->casas[i * n + coluna] = 1;
copia->queens_count++;
play_in(copia, coluna + 1, solucoes, max_queens, lock);
destruir_tabuleiro(copia);
}
}
}
Note: I have not tested this code, so I do not know if it will work, maybe there is some silly mistake somewhere. However, even if something goes wrong, the correct result is certainly something close to that.
You could still make an improvement by trying to avoid repeated allocations and deallocations by pre-allocating a pool of trays and getting them / returning them to that pool when needed. However, I'm not sure what the pool size is, but an estimate of the upper limit size of this pool would be something of the order of n² / 4 . Note also that this pool would be a stack.