First, I created a scenario similar to the one you described: the same structure and a random mass of data:
--
-- MODEL OF THE app_event TABLE
--
create table app_event
(
ae_id serial primary key,
ae_driver_id integer,
ae_trip_item_id integer,
ae_date timestamp not null
);
--
-- MODEL OF THE trip_item TABLE
--
create table trip_item
(
ti_id serial primary key
);
--
-- INSERT 3 MILLION ROWS INTO THE app_event TABLE
--
INSERT INTO app_event( ae_id, ae_date, ae_driver_id, ae_trip_item_id )
SELECT
generate_series( 1, 3000000 ),
now() - ((random() * 3650) * '1 day'::interval),
(random() * 100)::int,
(random() * 100)::int;
--
-- INSERT 1 MILLION ROWS INTO THE trip_item TABLE
--
INSERT INTO trip_item( ti_id )
SELECT
generate_series( 1, 1000000 );
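If you reproduce this locally, it may help to refresh the planner statistics right after the bulk load, so that the plans below come out the same (autovacuum would eventually do this on its own); an optional step:
--
-- REFRESH PLANNER STATISTICS (OPTIONAL)
--
ANALYZE app_event;
ANALYZE trip_item;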
Query under review:
SELECT
ae_id
FROM
app_event
JOIN
trip_item ON (ti_id = ae_trip_item_id)
WHERE
ae_driver_id = 63
ORDER BY
ae_date DESC
LIMIT
10;
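A plan like the one below can be obtained by prefixing the query with EXPLAIN, which only shows the estimated costs; EXPLAIN ANALYZE would also execute the query and report the actual times:
EXPLAIN
SELECT ae_id
FROM app_event
JOIN trip_item ON (ti_id = ae_trip_item_id)
WHERE ae_driver_id = 63
ORDER BY ae_date DESC
LIMIT 10;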
The plan of your query looks something like this:
Limit  (cost=87050.55..87050.58 rows=10 width=12)
  ->  Sort  (cost=87050.55..87089.49 rows=15574 width=12)
        Sort Key: app_event.ae_date
        ->  Hash Join  (cost=28320.00..86714.00 rows=15574 width=12)
              Hash Cond: (app_event.ae_trip_item_id = trip_item.ti_id)
              ->  Seq Scan on app_event  (cost=0.00..58043.59 rows=15574 width=16)
                    Filter: (ae_driver_id = 63)
              ->  Hash  (cost=15045.00..15045.00 rows=1062000 width=4)
                    ->  Seq Scan on trip_item  (cost=0.00..15045.00 rows=1062000 width=4)
Notice that the plan reveals two points of attention:
1) A sequential scan with a fairly high cost on the app_event table, caused by the filter on the ae_driver_id field.
2) Another sequential scan with a fairly high cost on the trip_item table, caused by the join condition on the ae_trip_item_id field.
A sequential scan on a table with a large volume of data tends to be slow for read operations. That is why indexing is crucial for good performance.
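Before creating anything, it can be useful to check which indexes already exist on the two tables (in this scenario, only the primary keys); a quick query against the pg_indexes catalog view:
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('app_event', 'trip_item');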
Creating Indexes:
CREATE INDEX idx_ae_trip_item_id ON app_event (ae_trip_item_id);
CREATE INDEX idx_ae_driver_id ON app_event (ae_driver_id);
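A side note in case these tables already receive writes in production: a plain CREATE INDEX blocks INSERT/UPDATE/DELETE on the table while the index is built, so the CONCURRENTLY variant is usually preferable there (it is slower and cannot run inside a transaction block):
CREATE INDEX CONCURRENTLY idx_ae_trip_item_id ON app_event (ae_trip_item_id);
CREATE INDEX CONCURRENTLY idx_ae_driver_id ON app_event (ae_driver_id);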
New query plan:
Limit  (cost=24050.07..24050.09 rows=10 width=12)
  ->  Sort  (cost=24050.07..24120.51 rows=28178 width=12)
        Sort Key: app_event.ae_date
        ->  Merge Join  (cost=23021.55..23441.15 rows=28178 width=12)
              Merge Cond: (trip_item.ti_id = app_event.ae_trip_item_id)
              ->  Index Only Scan using trip_item_pkey on trip_item  (cost=0.42..30408.42 rows=1000000 width=4)
              ->  Sort  (cost=23014.70..23085.14 rows=28178 width=16)
                    Sort Key: app_event.ae_trip_item_id
                    ->  Bitmap Heap Scan on app_event  (cost=530.81..20932.02 rows=28178 width=16)
                          Recheck Cond: (ae_driver_id = 63)
                          ->  Bitmap Index Scan on idx_ae_driver_id  (cost=0.00..523.76 rows=28178 width=0)
                                Index Cond: (ae_driver_id = 63)
The effect of creating the indexes is the complete absence of sequential scans in the query plan and a drastic reduction in the cost of the operations.
In practice, in my tests, the query took about 450 ms to execute before the indexes were created; after creating them, the execution time dropped to around 12 ms.
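If you want to measure this on your own data, psql's \timing switch prints the elapsed time of each statement, for example:
\timing on
SELECT ae_id
FROM app_event
JOIN trip_item ON (ti_id = ae_trip_item_id)
WHERE ae_driver_id = 63
ORDER BY ae_date DESC
LIMIT 10;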
I hope this helps!