Create a VIEW or a new TABLE in Postgres?

1

Hello! I have a situation ... I have a table that will have about 3 * 10 ^ 12 lines (3 trillion), but with only 3 attributes. In this table you will have the IDs of 2 individuals and the similarity between them (it is a number between 0 and 1 that I multiplied by 100 and put as a smallint to decrease the space). It turns out that I need to perform, for a certain individual that I want to do the research, the summarization of these columns and returning how many individuals have up to 10% similarity, 20%, 30% ... These values are fixed (every 10) in identical individuals. However, as you may know, the query will be very slow, so I thought about: - Create a new table to save summarized values - Create a view to save these values. As individuals are about 1.7 million, the search would not be so time consuming (if indexed, returns quite fast). So, what can I do? I would like to point out that my population will be almost fixed (after the DB is fully populated, it is expected that almost no increase will be made).

    
asked by anonymous 06.03.2018 / 11:55

2 answers

1

In your case an index could perfectly solve your problem, without the need to create a VIEW .

Assuming your data structure is something like:

-- INDIVIDUO
CREATE TABLE tb_individuo
(
    id BIGINT NOT NULL,
    nome TEXT NOT NULL,
    PRIMARY KEY( id )
);

-- SEMELHANCA
CREATE TABLE tb_semelhanca
(
    id_a BIGINT NOT NULL,
    id_b BIGINT NOT NULL,
    valor SMALLINT NOT NULL,
    PRIMARY KEY( id_a, id_b ),
    FOREIGN KEY ( id_a ) REFERENCES tb_individuo ( id ),
    FOREIGN KEY ( id_b ) REFERENCES tb_individuo ( id )
);

Containing the following sample data:

-- INSERE 5 INDIVIDUOS
INSERT INTO tb_individuo ( id, nome ) VALUES ( 1, 'JOSE' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 2, 'JOAO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 3, 'ANTONIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 4, 'MARIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 5, 'PEDRO' );

-- INSERE SEMELHANCAS ENTRE OS INDIVIDUOS 
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 2, 5 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 3, 13 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 4, 27 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 5, 98 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 3, 54 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 4, 63 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 5, 77 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 4, 85 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 5, 42 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 4, 5, 33 );

I suggest creating an index in the valor field of the tb_semelhanca table:

-- CRIACAO DE INDICE NO GRAU DE SEMELHANCA
CREATE INDEX ON tb_semelhanca ( valor );

What would allow queries using a BETWEEN to retrieve the desired integer with enough agility:

-- CONSULTA USANDO O INDICE
SELECT
    a.id,
    a.nome,
    b.id,
    b.nome,
    s.valor
FROM
    tb_semelhanca AS s
JOIN
    tb_individuo AS a ON ( a.id = s.id_a )
JOIN
    tb_individuo AS b ON ( b.id = s.id_b )
WHERE
    s.valor BETWEEN 20 AND 50;

Output:

| id |    nome | id |  nome | valor |
|----|---------|----|-------|-------|
|  1 |    JOSE |  4 | MARIO |    27 |
|  4 |   MARIO |  5 | PEDRO |    33 |
|  3 | ANTONIO |  5 | PEDRO |    42 |

SQLFiddle: link

EDIT

If the need to create a "summary" is mandatory, I still do not see the need to create a VIEW or auxiliary table.

As the cardinality of sumário with semelhanca would be 1-para-1 , simple creation of a field in tb_semelhanca could solve the problem, let's see:

-- INDIVIDUO
CREATE TABLE tb_individuo
(
    id BIGINT NOT NULL,
    nome TEXT NOT NULL,
    PRIMARY KEY( id )
);

-- SEMELHANCA
CREATE TABLE tb_semelhanca
(
    id_a BIGINT NOT NULL,
    id_b BIGINT NOT NULL,
    valor SMALLINT NOT NULL,
    sumario SMALLINT NOT NULL,
    PRIMARY KEY( id_a, id_b ),
    FOREIGN KEY ( id_a ) REFERENCES tb_individuo ( id ),
    FOREIGN KEY ( id_b ) REFERENCES tb_individuo ( id )
);

The index would be created in the sumario field of tb_semelhanca :

-- CRIACAO DE INDICE NO sumario
CREATE INDEX ON tb_semelhanca ( sumario );

A TRIGGER FUNCTION would be responsible for the summary calculation:

CREATE OR REPLACE FUNCTION calcular_sumario() RETURNS TRIGGER AS
$BODY$
BEGIN 
    NEW.sumario = (NEW.valor / 10)::int * 10; 
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

A TRIGGER would be created to trigger the calcular_sumario() function every time a modification of tb_semelhança occurs:

-- CRIANDO TRIGGER
CREATE TRIGGER trigger_sumario BEFORE INSERT OR UPDATE ON tb_semelhanca FOR EACH ROW EXECUTE PROCEDURE  calcular_sumario();

Populating tables:

-- INSERE 10 INDIVIDUOS
INSERT INTO tb_individuo ( id, nome ) VALUES ( 1, 'JOSE' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 2, 'JOAO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 3, 'ANTONIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 4, 'MARIO' );
INSERT INTO tb_individuo ( id, nome ) VALUES ( 5, 'PEDRO' );

-- INSERE SEMELHANCAS ENTRE OS INDIVIDUOS 
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 2, 5 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 3, 13 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 4, 27 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 1, 5, 98 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 3, 54 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 4, 63 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 2, 5, 77 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 4, 85 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 3, 5, 42 );
INSERT INTO tb_semelhanca ( id_a, id_b, valor ) VALUES ( 4, 5, 33 );

Checking tb_semelhanca :

| id_a | id_b | valor | sumario |
|------|------|-------|---------|
|    1 |    2 |     5 |       0 |
|    1 |    3 |    13 |      10 |
|    1 |    4 |    27 |      20 |
|    1 |    5 |    98 |      90 |
|    2 |    3 |    54 |      50 |
|    2 |    4 |    63 |      60 |
|    2 |    5 |    77 |      70 |
|    3 |    4 |    85 |      80 |
|    3 |    5 |    42 |      40 |
|    4 |    5 |    33 |      30 |

Query using the table of contents:

-- CONSULTA USANDO O SUMARIO
SELECT
    a.id,
    a.nome,
    b.id,
    b.nome,
    s.valor
FROM
    tb_semelhanca AS s
JOIN
    tb_individuo AS a ON ( a.id = s.id_a )
JOIN
    tb_individuo AS b ON ( b.id = s.id_b )
WHERE
    s.sumario = 20;

Output:

| id | nome | id |  nome | valor |
|----|------|----|-------|-------|
|  1 | JOSE |  4 | MARIO |    27 |

SQLFiddle: link

    
06.03.2018 / 16:20
0

I had almost the same problem, but my bank arrives to be much larger than yours and suffers updates and insert every minute, tables that have gone from 1gb I used partition based on a group related column.

The best solution after your reply in the comments, use Table partition

Why? 1 - will split your table into subtables and will not interfere with your select (it will continue the same but much faster).

2 - Since you do not use the update your trigger will not have problem with "on  conflicts, "Trigger will only be used in the parent table to redirect the insert.

3- Table partition does not divide table by date only !!! you can use some elaborate coding, that is "similarity between them."

How will your db be, for example:

TabelaMaster (Tabela original)
TabelaMaster-10 (Tabela gerada apartir do INHERITS )
TabelaMaster-20
...

The selects can be made direct in the partition or in the table "father" the result and almost the same thing, but use the father to avoid extra code.

    
07.03.2018 / 07:17