Why do PostgreSQL cost estimates go so wrong?

Question

Why do PostgreSQL cost estimates go so wrong?

Navigation

24

Someone could help me understand why Postgresql missed several cost estimates in an experiment I did.

I'm doing an experiment with 22 queries from the TPCH Benchmark [1], to check the performance of indexes in queries.

Of 22 queries, only 5 queries were verified that the optimizer used secondary indexes. In another experiment, the 5 queries were executed in a database without the presence of indexes, in order to identify if the execution time would increase due to the absence of indices.

But to my surprise, the experiment without the presence of indexes in the database was faster than using the indexes (for the 22 queries).

I would like to understand, because the total cost parameter was wrong in all cases, ie all the queries that spent the most time, indicated a lower cost - in all 5 cases, which is incorrect in my opinion.

See below, that the first line refers to query 6 that you used index, the cost was 3335809, lower than cost 5255959, the same query 6 that did not use index.

Also see the time spent. The query that used index spent 7 minutes while without using indices took 55 seconds. This behavior has spread to other cases.

My question is, Why does the total cost (execution plan) incorrectly calculate the cost for all cases where you have indexes?

Índices  |Query   |Tempo_Gasto     |Custo_Total
===============================================
Sim       6        00:07:56         3335809.61
Nao       6        00:00:55         5255959.00
Sim       7        00:09:16         5847359.97
Nao       7        00:02:08         6793148.45
Sim       10       00:07:04         40743017.17
Nao       10       00:02:14         41341406.62
Sim       15       00:10:03         6431359.90
Nao       15       00:01:56         9608659.87
Sim       20       00:12:48         8412159.69
Nao       20       00:05:49         9537835.93
=============================================

For issues of using a specific Patch from my research project, I'm using an old version 9.0.1 (from year 2012).

I have not modified the default parameters, only the random_page_cost to 1, since I am using an SSD disk, where the random access cost is less than in an HDD. Here's what's in my postgresql.conf

max_connections = 100 
effective_io_concurrency = 5 
#seq_page_cost = 1.0                    
random_page_cost = 1.0                  
#cpu_tuple_cost = 0.01                  
#cpu_index_tuple_cost = 0.005        
#cpu_operator_cost = 0.0025             
#effective_cache_size = 128MB

Did any of these parameters interfere with this error in cost statistics?

Follow query 6 and its Explain Analyze with and without indexes.

------------- QUERY 6 ---------------------------------------------
select
    sum(l_extendedprice * l_discount) as revenue
from
    lineitem
where
    l_shipdate >= date '1995-01-01'
    and l_shipdate < date '1995-01-01' + interval '1' year
    and l_discount between 0.09 - 0.01 and 0.09 + 0.01
    and l_quantity < 24;

---- COM ÍNDICE (idx_l_shipdatelineitem000)-------------------
Plano Execucao:Aggregate  (cost=3335809.59..3335809.61 rows=1
width=16) (actual time=476033.847..476033.847 rows=1 loops=1)
 ->  Bitmap Heap Scan on lineitem  (cost=376416.20..3330284.29
rows=2210122 width=16) (actual time=375293.183..471695.391
rows=2282333 loops=1)
       Recheck Cond: ((l_shipdate >= _1995-01-01_::date) AND
(l_shipdate < _1996-01-01 00:00:00_::timestamp without time zone))
        Filter: ((l_discount >= 0.08) AND (l_discount <= 0.10) AND
(l_quantity < 24::numeric))
        ->  Bitmap Index Scan on idx_l_shipdatelineitem000
(cost=0.00..375863.67 rows=17925026 width=0) (actual
time=375289.456..375289.456 rows=18211743 loops=1)
              Index Cond: ((l_shipdate >= _1995-01-01_::date) AND
(l_shipdate < _1996-01-01 00:00:00_::timestamp without time
zone))Total runtime: 476034.306 ms


------------------ SEM USO DE ÍNDICE ----------------------------
Plano Execucao:Aggregate  (cost=5255958.99..5255959.00 rows=1
width=16) (actual time=55051.051..55051.051 rows=1 loops=1)
  ->  Seq Scan on lineitem  (cost=0.00..5250433.68 rows=2210122
width=16) (actual time=0.394..52236.276 rows=2282333 loops=1)
        Filter: ((l_shipdate >= _1995-01-01_::date) AND (l_shipdate <
_1996-01-01 00:00:00_::timestamp without time zone)
        AND (l_discount >= 0.08) AND (l_discount <= 0.10) AND
(l_quantity < 24::numeric))Total runtime: 55051.380 ms

Any tip is welcome !!

postgresql

asked by anonymous 23.08.2017 / 06:24

0 answers

What HTTP status should I use for when a parameter is missing from POST? Is there a sudo for Windows?