How to remove duplicate elements from a list in C?

2

I have a list of contacts and need to remove duplicates, I was trying to solve using the code below:

Elemento* remove_duplicados(Elemento* agenda)
{
    Elemento* a = NULL; //POnteiro do Elemento anterior
    Elemento* p;
    Elemento* q;

    for(p = agenda; p != NULL; p = p -> prox){//Analisa do primeiro elem até o ultimo
        for(q = p -> prox; q != NULL; q = q -> prox){//Analisa proximo elem de p até o fim
            if(0 != strcmp(p -> info.nome, q -> info.nome)){
                a = q; 
            }
        }

        if(a == NULL){ //Remove elemento do início e aponta para o 2º
            agenda = p -> prox;
        } 
        else
            a -> prox = p -> prox; //Ponteiro anterior levará para o prox

        free(p); //Libera o espaço
    }
    return agenda;
}

The problem is that when you run the program, when you reach this stage it quits: /

    
asked by anonymous 30.11.2017 / 18:55

1 answer

2

Problems

There are some flaws in the logic that is implemented:

  • You are not removing repeated multiple successively because you only use a at the end of the second for
  • You can not free(p) and then use for to advance to the next with p = p -> prox
  • agenda = p -> prox is not necessary if you remove the duplicates from the first instead of the first one itself. This causes your statement to be agenda->prox = p->prox or p->prox =q->prox since to a be NULL p must be exactly in the next element

Correcting problems

Using the logic and variables I have suggested I do this:

Elemento* remove_duplicados(Elemento* agenda){
    Elemento* a;
    Elemento* p;
    Elemento* q;

    for(p = agenda; p != NULL; p = p -> prox ){
        a = NULL; //a cada varrimento a começa a NULL

        for(q = p -> prox; q != NULL; ){ //sem incremento
            if(0 == strcmp(p -> info.nome, q -> info.nome)){ //teste de igual
                if(a == NULL){ //remove do inicio
                    p -> prox = q -> prox;
                }
                else{ //ou do meio/fim
                    a -> prox = q -> prox;
                }

                Elemento* remover = q;
                q = q -> prox;
                free(remover);
            }
            else { //so atualiza o anterior quando não é igual
                a = q;
                q = q->prox;
            }
        }
    }

    return agenda;
}

I removed the comments that I had to show the ones I put, and with the explanations associated with the changes I made.

See this code working on Ideone

Refactoring for better readability

The variable names you have are very uninspiring and difficult to read. You should always try to give names representative of the content that the variables have. In addition, the first element is never changed since duplicates are removed to the front, so the return type is also not required, and must be void .

Taking this into account we can make the function much clearer:

void remove_duplicados(Elemento* agenda){
    Elemento *anterior, *corrente, *seguinte;

    for(corrente = agenda; corrente != NULL; corrente = corrente -> prox ){
        anterior = NULL;

        for(seguinte = corrente -> prox; seguinte != NULL; ){
            if(0 == strcmp(corrente -> info.nome, seguinte -> info.nome)){
                if(anterior == NULL)
                    corrente -> prox = seguinte -> prox;
                else
                    anterior -> prox = seguinte -> prox;

                Elemento* remover = seguinte;
                seguinte = seguinte -> prox;
                free(remover);
            }
            else {
                anterior = seguinte;
                seguinte = seguinte->prox;
            }
        }
    }
}

Efficiency with Hash Table

If there is concern about efficiency, the solution you presented yourself, which I picked up on, can cause problems if the amount of data is large, since it is a quadratic solution, with time complexity in the order of O (n²).

There are several ways to improve the solution but with a hash table and ensuring few collisions we get a solution in the order of O (n).

We begin with the structure of each hash table entry and the corresponding array that represents it:

typedef struct entrada {
    struct info *dados;
    struct entrada *prox;
} entrada;

#define TAMANHO_HASH 100
entrada *hash_nomes[TAMANHO_HASH];

In order to find the entry for a given value, the hash function is required, which in this case we can use as strings:

unsigned long hash(char *str){
    unsigned long hash = 5381;
    int c;
    while (c = *str++)
        hash = ((hash << 5) + hash) + c;

    return hash;
}
Creating a good hash function is by itself a science, and will reflect on the performance of the algorithm because it will create many or few collisions depending on how it is built. For simplicity I used an already known .

The function that remove_duplicados is now simplified since it is based entirely on the hash table to know if the current value already exists:

void remove_duplicados(Elemento* agenda){
    Elemento *corrente = agenda, *anterior = NULL;

    while (corrente != NULL){
        if (adiciona_se_inexistente(&corrente->info) == 0){ //0 não adicionou por já existir
            anterior->prox = corrente->prox;
            Elemento* remover = corrente;
            corrente = corrente -> prox;
            free(remover);
        }
        else {
            anterior = corrente;
            corrente = corrente -> prox;
        }
    }
}

Here, we used the adiciona_se_inexistente function to add to the hash table if it does not exist:

int adiciona_se_inexistente(struct info* dados){
    unsigned long key = hash(dados->nome) % TAMANHO_HASH; //achar a entrada

    entrada *corrente = hash_nomes[key];
    while (corrente != NULL){ //verificar se já existe
        if (strcmp(corrente->dados->nome, dados->nome) == 0){
            return 0;
        }
        corrente = corrente->prox;
    }

    //não existe adiciona a nova entrada
    entrada *nova = malloc(sizeof(entrada));
    nova->prox = hash_nomes[key];
    nova->dados = dados;
    hash_nomes[key] = nova;
    return 1;
}
The hash table has to be initialized with all the NULL entries, which I did using the memset .

Comments:

  • Implemented the hash table with list of internal collisions instead of open addressing, which ends up using more space, but is potentially simpler to implement.
  • This solution is much more performance-efficient if we ensure that we have few collisions, depending on the hash function and the size of the table. However it will use more memory than its original solution. Note that the size of 100 I chose for the table is good for the number of names I have used ( 5 ), but if you have more names you can already generate many collisions, so it is something that has to adapt to your needs .
  • It is also worth remembering that if you only need this function at a point in the program and you are sure that later it will not be necessary, you must free up the space of all the filled entries in the hash table, with free .

Example of this Ideone implementation

    
01.12.2017 / 02:16