How to compare set of numbers with another set?

3

How to mount query to compare whether one set A of numbers is equal to the other, of B, with the same numbers of set A, and so on.

Example:

  

Table data

id  conjunto    numero      ordem
1   1           1           1
2   1           12          2
3   1           4           3
4   1           6           4
5   2           1           1
6   2           12          2
7   2           4           3
8   2           6           4
9   3           15          1
10  3           17          2
11  3           20          3
12  4           15          1
13  4           17          2

The result after comparison: duplicate sets 1 and 2

I have a table with more than one million sets and with multiple numbers for each set and it's taking too long to return the result. I did not find a way to compare other than using For XML and creating temp table TEMP. Is there another way to efficiently and performatively compare?

Here is a script that compares:

select
    t.*
INTO 
#TABELA
from
    (
        select 1 id, 1 conjunto, 1 numero, 1 ordem
        union all
        select 2 id, 1 conjunto, 12 numero, 2 ordem
        union all
        select 3 id, 1 conjunto, 4 numero, 3 ordem
        union all
        select 4 id, 1 conjunto, 6 numero, 4 ordem
        union all
        select 5 id, 2 conjunto, 1 numero, 1 ordem
        union all
        select 6 id, 2 conjunto, 12 numero, 2 ordem
        union all
        select 7 id, 2 conjunto, 4 numero, 3 ordem
        union all
        select 8 id, 2 conjunto, 6 numero, 4 ordem
        union all
        select 9 id, 3 conjunto, 15 numero, 1 ordem
        union all
        select 10 id, 3 conjunto, 17 numero, 2 ordem
        union all
        select 11 id, 3 conjunto, 20 numero, 3 ordem
        union all
        select 12 id, 4 conjunto, 15 numero, 1 ordem
        union all
        select 13 id, 4 conjunto, 17 numero, 2 ordem
    ) t   


select DISTINCT
T1.conjunto,
(SELECT RIGHT('' + CONVERT(varchar, numero), 2) as [text()] from #TABELA i where i.conjunto=t1.conjunto order by i.conjunto, i.ordem for xml path('')) numeros
INTO
#TEMP
from 
#TABELA T1



SELECT
t1.conjunto
FROM
(
    select
    t1.numeros numeros
    from 
    #TEMP t1
    group by
    t1.numeros
    having
    count(*) > 1
) t2
inner join #TEMP t1 on t2.numeros = t1.numeros
    
asked by anonymous 18.02.2016 / 20:10

2 answers

2

Considering that:

  • the order of the elements (numbers) in the set does not matter;
  • there are no repeated elements (ie, the same number can not occur more than once in the same set); and
  • the values of the numbers range from 0 to 99,

I suggest that you store the sets in bitmap form. For each set a vector of 100 positions is created, where each position represents a number; if the number is present in the set, the position is set to 1, but if the number is not part of the set, the position is 0.

Assuming that each set can have elements ranging from 1 to 10, to represent the set {4, 1, 9} we would have 1001000010, where the first bit represents the number 1, the second bit represents the number 2 and so on up to 10.

Considering that there may be repeating sets, the bitmap is in a table. That is, each existing combination of elements occupies a single bitmap. In another table, a relation indicating which bitmap each set uses.

The initial idea was to implement the bitmap with a column declared as bit (100). That is, a string of bits. But unfortunately, this is not allowed in T-SQL. To work around this restriction, the first version then uses a string of bytes: char (100).

CREATE TABLE Tab_Bloco (
  ID int identity,
  Bloco char(100) not null default replicate('0', 100)
);

The other table contains, for each set, which bitmap it uses.

CREATE TABLE Conjunto_Bloco (
  ID_Conjunto int not null,
  ID_Bloco int not null 
           references Tab_Bloco(ID)
);

For example, sets 1: {4, 1, 9}, 2: {1, 2, 3, 5, 7} and 3: {1, 4, 9} would be represented as:

Tab_Bloco
1 1001000010
2 1110101000

Conjunto_Bloco
1 1
2 2
3 1

Note that sets 1 and 3 share the same bitmap.

At first, if it were possible to use the bit string, each set would occupy only 21 bytes (4 of Set_ID, 4 of Block_ID, Block_13 of Block_Tab). It is obvious that this account does not include indices and other controls. But for now, the proposed solution spends 108 bytes (4 of SetID, 4 of Tab_Block ID, 100 of Block in Tab_Block).

There is a way to implement the bitmap in 4 variables of type int, because each one occupies 32 bits. Or maybe binary (13). But it is necessary to use bitwise operations.

As the initial goal is performance, the new storage structure will ensure fastness to find repeated sets as well as searches such as "such a set exists?".

To validate the new structure it is necessary to know what the main manipulations are with the data. At first, it is not necessary to store the sets in the old form, but only as a bitmap. That is, each new set to be added is stored directly as a bitmap. From each vector it is possible to return the numbers that make up each element. But if it is necessary to keep the sets in the two representations (numbers and bitmap) then it is recommended to use trigger procedure to ensure consistency between the two tables.

To check for duplicate sets, this is a very simple process.

-- código #2
with Repetidos as (
SELECT ID_Bloco, Count(*) as Qtd
  from Conjunto_Bloco
  group by ID_Bloco
  having Count(*) > 1
)
SELECT T1.ID_Conjunto, T1.ID_Bloco
  from Conjunto_Bloco as T1
       inner join Repetidos as T2 on T1.ID_Bloco = T2.ID_Bloco
  order by ID_Bloco;

The following code # 1 converts the current table to the two new tables. I chose to use a cursor to read the data, as it seemed to me to be more efficient in this case than using set based code.

-- código #1
-- cria as duas novas estruturas de dados
CREATE TABLE Tab_Bloco (
  ID int identity,
  Bloco char(100) not null default replicate('0', 100)
);
CREATE clustered index I1_Tab_Bloco on Tab_Bloco (Bloco);
CREATE unique nonclustered index I2_Tab_Bloco on Tab_Bloco (ID);


CREATE TABLE Conjunto_Bloco (
  ID_Conjunto int not null,
  ID_Bloco int not null
               references Tab_Bloco (ID)
);
CREATE clustered index I1_Conjunto_Bloco on Conjunto_Bloco (ID_Conjunto, ID_Bloco);
CREATE unique nonclustered index I2_Conjunto_Bloco on Conjunto_Bloco (ID_Bloco, ID_Conjunto);
go

--
DECLARE Lê_Número CURSOR 
   forward_only
   for SELECT Conjunto, Numero
         from Conjunto order by Conjunto;
declare @ID_Conjunto int, @Número int;
declare @ID_atual int;

declare @ID_Bloco int, @Bloco char(100);

OPEN Lê_Número;

-- obtém primeiro número do primeiro conjunto
FETCH NEXT
  from Lê_Número 
  into @ID_Conjunto, @Número;

while @@fetch_status = 0
   begin
   -- novo conjunto
   set @ID_atual= @ID_conjunto;
   set @Bloco= replicate('0', 100);

   -- coleciona números do conjunto atual
   while @@fetch_status = 0 and @ID_Conjunto = @ID_atual
      begin
      -- liga bit respectivo do número
      set @Bloco= Stuff(@Bloco, @Número+1, 1, '1');

      -- obtém próximo número 
      FETCH NEXT
        from Lê_Número 
        into @ID_Conjunto, @Número;
      end;
      -- fim números do conjunto atual

   -- registra bloco
   set @ID_Bloco= (SELECT ID from Tab_Bloco where Bloco = @Bloco);
   IF @ID_Bloco is null
      begin
      -- novo bloco
      INSERT into Tab_Bloco (Bloco) values (@Bloco);
      set @ID_Bloco= scope_identity();
      end;

   -- estabele ligação de conjunto e bloco
   INSERT into Conjunto_Bloco (ID_Conjunto, ID_Bloco)
     values (@ID_atual, @ID_Bloco);

   end;
   -- fim conjunto

CLOSE Lê_Número;

DEALLOCATE Lê_Número;
go

This template has to be validated with a large amount of data.

    
30.10.2016 / 22:58
2

Despite having doubts about whether performance will improve, another way of comparing the records is by using INTERSECT. Example:

declare @conjunto int, @conjMin int

select  @conjunto = max(conjunto),
        @conjMin = min(conjunto)
from #TABELA

select  *, num_rows = (select count(*) from #TABELA where conjunto = t.conjunto),
        is_duplicated = 0
into #tempTable
from #TABELA t

while @conjunto >= @conjMin begin
    if((select top 1 1
        from(   SELECT numero, ordem, num_rows
                FROM #tempTable t1
                where t1.conjunto = @conjunto
                INTERSECT
                SELECT numero, ordem, num_rows
                FROM #tempTable t2
                where t2.conjunto <> @conjunto)
            t))>0 begin

            update #tempTable
            set is_duplicated = 1
            where conjunto = @conjunto 
        end

    set @conjunto = @conjunto-1
end

select distinct conjunto from #tempTable where is_duplicated = 1
    
27.10.2016 / 16:37