Considering that:
- the order of the elements (numbers) in the set does not matter;
- there are no repeated elements (ie, the same number can not occur
more than once in the same set); and
- the values of the numbers range from 0 to 99,
I suggest that you store the sets in bitmap form. For each set a vector of 100 positions is created, where each position represents a number; if the number is present in the set, the position is set to 1, but if the number is not part of the set, the position is 0.
Assuming that each set can have elements ranging from 1 to 10, to represent the set {4, 1, 9} we would have 1001000010, where the first bit represents the number 1, the second bit represents the number 2 and so on up to 10.
Considering that there may be repeating sets, the bitmap is in a table. That is, each existing combination of elements occupies a single bitmap. In another table, a relation indicating which bitmap each set uses.
The initial idea was to implement the bitmap with a column declared as bit (100). That is, a string of bits. But unfortunately, this is not allowed in T-SQL. To work around this restriction, the first version then uses a string of bytes: char (100).
CREATE TABLE Tab_Bloco (
ID int identity,
Bloco char(100) not null default replicate('0', 100)
);
The other table contains, for each set, which bitmap it uses.
CREATE TABLE Conjunto_Bloco (
ID_Conjunto int not null,
ID_Bloco int not null
references Tab_Bloco(ID)
);
For example, sets 1: {4, 1, 9}, 2: {1, 2, 3, 5, 7} and 3: {1, 4, 9} would be represented as:
Tab_Bloco
1 1001000010
2 1110101000
Conjunto_Bloco
1 1
2 2
3 1
Note that sets 1 and 3 share the same bitmap.
At first, if it were possible to use the bit string, each set would occupy only 21 bytes (4 of Set_ID, 4 of Block_ID, Block_13 of Block_Tab). It is obvious that this account does not include indices and other controls. But for now, the proposed solution spends 108 bytes (4 of SetID, 4 of Tab_Block ID, 100 of Block in Tab_Block).
There is a way to implement the bitmap in 4 variables of type int, because each one occupies 32 bits. Or maybe binary (13). But it is necessary to use bitwise operations.
As the initial goal is performance, the new storage structure will ensure fastness to find repeated sets as well as searches such as "such a set exists?".
To validate the new structure it is necessary to know what the main manipulations are with the data. At first, it is not necessary to store the sets in the old form, but only as a bitmap. That is, each new set to be added is stored directly as a bitmap. From each vector it is possible to return the numbers that make up each element. But if it is necessary to keep the sets in the two representations (numbers and bitmap) then it is recommended to use trigger procedure to ensure consistency between the two tables.
To check for duplicate sets, this is a very simple process.
-- código #2
with Repetidos as (
SELECT ID_Bloco, Count(*) as Qtd
from Conjunto_Bloco
group by ID_Bloco
having Count(*) > 1
)
SELECT T1.ID_Conjunto, T1.ID_Bloco
from Conjunto_Bloco as T1
inner join Repetidos as T2 on T1.ID_Bloco = T2.ID_Bloco
order by ID_Bloco;
The following code # 1 converts the current table to the two new tables. I chose to use a cursor to read the data, as it seemed to me to be more efficient in this case than using set based code.
-- código #1
-- cria as duas novas estruturas de dados
CREATE TABLE Tab_Bloco (
ID int identity,
Bloco char(100) not null default replicate('0', 100)
);
CREATE clustered index I1_Tab_Bloco on Tab_Bloco (Bloco);
CREATE unique nonclustered index I2_Tab_Bloco on Tab_Bloco (ID);
CREATE TABLE Conjunto_Bloco (
ID_Conjunto int not null,
ID_Bloco int not null
references Tab_Bloco (ID)
);
CREATE clustered index I1_Conjunto_Bloco on Conjunto_Bloco (ID_Conjunto, ID_Bloco);
CREATE unique nonclustered index I2_Conjunto_Bloco on Conjunto_Bloco (ID_Bloco, ID_Conjunto);
go
--
DECLARE Lê_Número CURSOR
forward_only
for SELECT Conjunto, Numero
from Conjunto order by Conjunto;
declare @ID_Conjunto int, @Número int;
declare @ID_atual int;
declare @ID_Bloco int, @Bloco char(100);
OPEN Lê_Número;
-- obtém primeiro número do primeiro conjunto
FETCH NEXT
from Lê_Número
into @ID_Conjunto, @Número;
while @@fetch_status = 0
begin
-- novo conjunto
set @ID_atual= @ID_conjunto;
set @Bloco= replicate('0', 100);
-- coleciona números do conjunto atual
while @@fetch_status = 0 and @ID_Conjunto = @ID_atual
begin
-- liga bit respectivo do número
set @Bloco= Stuff(@Bloco, @Número+1, 1, '1');
-- obtém próximo número
FETCH NEXT
from Lê_Número
into @ID_Conjunto, @Número;
end;
-- fim números do conjunto atual
-- registra bloco
set @ID_Bloco= (SELECT ID from Tab_Bloco where Bloco = @Bloco);
IF @ID_Bloco is null
begin
-- novo bloco
INSERT into Tab_Bloco (Bloco) values (@Bloco);
set @ID_Bloco= scope_identity();
end;
-- estabele ligação de conjunto e bloco
INSERT into Conjunto_Bloco (ID_Conjunto, ID_Bloco)
values (@ID_atual, @ID_Bloco);
end;
-- fim conjunto
CLOSE Lê_Número;
DEALLOCATE Lê_Número;
go
This template has to be validated with a large amount of data.