How to create multiple entries in an index based on columns on the same row?

12

I've never found a good way to index multiple columns of a row as an index entry or to simulate this feature in MySQL.

The problem arises when you have fields working as tags or a similar concept, e.g. columns named tag1, tag2, tag3. Searching quickly for rows with a specific tag requires, in the most basic and obvious approach, 3 indexes and 3 separate queries.

Is there some way to index these three fields as entries of a single index, so that only one search is needed?

To illustrate what it would look like:

ID tag1 tag2 tag3
-- ---- ---- ----
01 abc  xyz  bla
02 foo  bar  ble
03 xyz  bla  bar

How the index would behave:

abc -> 01
bar -> 02 03
bla -> 01 03
ble -> 02
foo -> 02
xyz -> 01 03

So if I do a search for "bla", I'll have access to the IDs "01" and "03".

Is there any way to do this efficiently, even if it means changing the structure shown?

    
asked by anonymous 30.01.2014 / 05:03

3 answers

11

Some time ago (2013? 2012?) I developed a system very similar to the one in the question. I had a few million objects and a dozen tags, and each object could have 0 or more tags associated with it. I had to filter these objects in searches by sets of tags. Similar, no?

Since the number of tags per object was, in my case, theoretically unlimited (new tags could be added to the system after release), the layout proposed in the question did not work for me, i.e. I could not use a table with columns "tag1", "tag2", etc. Also, this scheme does not allow indexing (see my first comment on @mgibsonbr's answer).

Since I needed a lot of performance (i.e. queries answered in "seconds"), I compared several solutions at the time, including the two proposed by @mgibsonbr in their answer.

Here are my results, if my memory serves me!

Trade-offs

Solution "1." from @mgibsonbr has the disadvantage of possibly taking up a lot of disk space (since the characters of each tag are repeated countless times across the whole table). This is a disadvantage because it forces the database to read many pages from the hard disk, so the disk has to spin and the read head has to move a lot, which can add significant latency. The advantage is that you need only 1 select to get your result.

Solution "2." from @mgibsonbr, on the other hand, uses less disk space (because only the ids of the tags are stored in the giant table, and if you pick a numeric type sized for the maximum number of tags, you can get down to 4, 2 or even 1 byte per row). This way more rows fit in each page read from disk, which reduces latency. In exchange, your select would probably need a join:

select tags_objects.* from tags_objects, tags
 where tags.id = tags_objects.tagId
   and (tags.name = 'tag-buscado-1'
     or tags.name = 'tag-buscado-2') -- etc...

This join is what causes the performance problems with this solution.
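As a rough illustration of the join-based query above, here is a minimal sketch in Python using an in-memory SQLite database as a stand-in for MySQL. The table names `tags` and `tags_objects` and the column `tagId` come from the answer; the column `objectId`, the index name, the helper function, and the sample data (borrowed from the question) are assumptions made only for this example.

```python
import sqlite3

# In-memory SQLite standing in for MySQL (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE tags_objects (objectId INTEGER NOT NULL, tagId INTEGER NOT NULL);
    CREATE INDEX idx_tags_objects_tagId ON tags_objects (tagId);
""")

# Sample data taken from the question: object 1 = abc/xyz/bla, and so on.
conn.executemany(
    "INSERT INTO tags (id, name) VALUES (?, ?)",
    [(1, "abc"), (2, "xyz"), (3, "bla"), (4, "foo"), (5, "bar"), (6, "ble")],
)
conn.executemany(
    "INSERT INTO tags_objects (objectId, tagId) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (3, 2), (3, 3), (3, 5)],
)


def objects_with_any_tag(names):
    """Return the ids of objects carrying at least one of the given tag names."""
    placeholders = ", ".join("?" for _ in names)
    rows = conn.execute(
        f"""SELECT DISTINCT tags_objects.objectId
            FROM tags_objects
            JOIN tags ON tags.id = tags_objects.tagId
            WHERE tags.name IN ({placeholders})
            ORDER BY tags_objects.objectId""",
        list(names),
    ).fetchall()
    return [row[0] for row in rows]


print(objects_with_any_tag(["bla", "foo"]))
```

The `IN (...)` list plays the same role as the chain of `or` conditions in the query above.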

Most efficient solution (for my specific use case)

At the end of the day, the most effective solution I could find was to use solution "2." with 2 different selects. The first select looks up the ids of the tags, and the second select uses those ids against the giant table. It's as if I had done the join "manually".

This was advantageous because, in my case, I could cache the ids of the tags in my application. The cache was refreshed by a background thread (doing a "full scan" of the tiny table that holds the tags and their ids every "X" seconds). In the end, in practical terms, the "synchronous" work was just a select on the giant table, with the "tagId" column being some numeric type, which is far cheaper than having to do joins.

Obviously, for performance reasons, the "tagId" column of the giant table needs an index.
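The two-select approach could look roughly like the sketch below, again using in-memory SQLite in place of MySQL. The `objectId` column, the function names, and the sample data are assumptions for illustration, and the cache is loaded once here instead of being refreshed by a background thread every "X" seconds.

```python
import sqlite3

# In-memory SQLite standing in for MySQL (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE tags_objects (objectId INTEGER NOT NULL, tagId INTEGER NOT NULL);
    CREATE INDEX idx_tags_objects_tagId ON tags_objects (tagId);
""")
conn.executemany(
    "INSERT INTO tags (id, name) VALUES (?, ?)",
    [(1, "abc"), (2, "xyz"), (3, "bla"), (4, "foo"), (5, "bar"), (6, "ble")],
)
conn.executemany(
    "INSERT INTO tags_objects (objectId, tagId) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (3, 2), (3, 3), (3, 5)],
)


def load_tag_cache(conn):
    """First select: full scan of the small tags table, returned as name -> id."""
    return {name: tag_id for tag_id, name in conn.execute("SELECT id, name FROM tags")}


def objects_with_any_tag_cached(conn, tag_cache, names):
    """Second select: hit only the big table, using numeric ids from the cache."""
    tag_ids = [tag_cache[name] for name in names if name in tag_cache]
    if not tag_ids:
        return []  # none of the requested tags exist
    placeholders = ", ".join("?" for _ in tag_ids)
    rows = conn.execute(
        f"""SELECT DISTINCT objectId
            FROM tags_objects
            WHERE tagId IN ({placeholders})
            ORDER BY objectId""",
        tag_ids,
    ).fetchall()
    return [row[0] for row in rows]


tag_cache = load_tag_cache(conn)
print(objects_with_any_tag_cached(conn, tag_cache, ["bla", "foo"]))
```

With the cache held in the application, the only query on the request path is the one against `tags_objects`, filtered on the indexed numeric column.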

Before implementing this solution, my queries took ~1min or ~2min with, if I remember correctly, 5 tags. After all that, I managed to bring the query time down to around ~10s!

Considerations

It is quite hard to tell beforehand which solution will perform best in this case, because it really depends on the characteristics of your project. I hope this answer gives some guidance in your search for the most efficient solution for your specific case.

    
30.01.2014 / 08:05
9

I can't speak to efficiency, but one way to look up a value with a single query is to use IN in an unconventional way, with the columns on the right side. Example:

select * from minha_tabela
where 'foo' in (tag1, tag2, tag3)
  and 'bar' in (tag1, tag2, tag3);

Source: "SQL Antipatterns".

P.S. According to the same book, modeling data this way (Multi-Column Attributes) is bad practice and causes problems beyond indexing (it also complicates insertion, removal, ensuring each tag appears only once per row, etc.). Ideally, create a separate table (or an intersection / join table, if each tag is more than a single string) to associate the tags with the table in an N-to-N relationship. Unless you have reasons for keeping this design, I suggest you change it.
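To see the `IN` trick at work on the question's sample rows, here is a small sketch in Python with in-memory SQLite standing in for MySQL. The table name `minha_tabela` and the tag columns come from the query above; the `id` column type and the helper `ids_with_all_tags` are assumptions made for this example. As the answer notes, this says nothing about efficiency: a query of this shape may well end up scanning the whole table.

```python
import sqlite3

# In-memory SQLite standing in for MySQL (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE minha_tabela (id INTEGER PRIMARY KEY, tag1 TEXT, tag2 TEXT, tag3 TEXT)"
)
conn.executemany(
    "INSERT INTO minha_tabela (id, tag1, tag2, tag3) VALUES (?, ?, ?, ?)",
    [(1, "abc", "xyz", "bla"), (2, "foo", "bar", "ble"), (3, "xyz", "bla", "bar")],
)


def ids_with_all_tags(conn, wanted):
    """Return ids of rows that carry every tag in `wanted`, in any tag column."""
    if not wanted:
        raise ValueError("at least one tag is required")
    # One "value IN (columns)" condition per requested tag, joined with AND.
    conditions = " AND ".join("? IN (tag1, tag2, tag3)" for _ in wanted)
    sql = f"SELECT id FROM minha_tabela WHERE {conditions} ORDER BY id"
    return [row[0] for row in conn.execute(sql, list(wanted))]


print(ids_with_all_tags(conn, ["bla"]))
print(ids_with_all_tags(conn, ["foo", "bar"]))
```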

Update: As the question was edited to allow changing the structure shown, I will complement with examples of dependent table and intersection table:

  • Dependent table

    PK                Index   FK, Index
    ID Etc            Tag     ID_Tabela
    -- ---            ---     ---------
    01                abc     01
    02                xyz     01
    03                bla     01
                      foo     02
                      bar     02
                      ...
    

    In this case the tags have been moved to a separate table, where each tag entry is associated with only one row in the original table. The "Tag" column is indexed, so the Tag -> IDs query is fast. The "ID_Tabela" column, the foreign key to the original table, is also indexed, so the ID -> Tags query is fast as well.

    (Note: the "Tag" column is not UNIQUE, since each tag can appear more than once in the table.)

  • Intersection / join table

    PK                FK        FK              PK
    ID Etc            ID_Tabela ID_Tag          ID Tag
    -- ---            --------- ------          -- ---
    01                01        01              01 abc
    02                01        02              02 xyz
    03                03        02              03 bla
                      01        03              04 foo
                      03        03              05 bar
                      ...                       ...
    

    In this case (useful if you want to store more information about a tag, or perhaps rename a tag in a single query without touching the associations) the tags were moved to a separate table, and an intersection table relates the two (in an N-to-N relationship). All that remains is to add indexes where needed, depending on the specific queries you intend to run.

  • These two techniques were also adapted from the book, and are generally good practices that also keep your model normalized. As for performance issues, which I can't comment on since I have never operated a large-scale system, and cases where denormalization or other optimization techniques (caching, offloading to the application layer, etc.) are warranted, see for example @Bruno Reis's answer.
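The dependent table and the intersection table described above might be set up as in the following sketch, written in Python with in-memory SQLite standing in for MySQL. Every table, column, index, and function name here (`tabela`, `tabela_tag`, `tags`, `tabela_x_tag`, and so on) is a hypothetical choice for illustration, and the data comes from the question.

```python
import sqlite3

# In-memory SQLite standing in for MySQL (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tabela (id INTEGER PRIMARY KEY);

    CREATE TABLE tabela_tag (
        tag       TEXT    NOT NULL,
        id_tabela INTEGER NOT NULL REFERENCES tabela (id)
    );
    CREATE INDEX idx_tabela_tag_tag ON tabela_tag (tag);
    CREATE INDEX idx_tabela_tag_id_tabela ON tabela_tag (id_tabela);

    CREATE TABLE tags (id INTEGER PRIMARY KEY, tag TEXT NOT NULL UNIQUE);
    CREATE TABLE tabela_x_tag (
        id_tabela INTEGER NOT NULL REFERENCES tabela (id),
        id_tag    INTEGER NOT NULL REFERENCES tags (id),
        PRIMARY KEY (id_tabela, id_tag)
    );
    CREATE INDEX idx_tabela_x_tag_id_tag ON tabela_x_tag (id_tag);
""")

# Rows from the question, loaded into both layouts.
rows = {1: ["abc", "xyz", "bla"], 2: ["foo", "bar", "ble"], 3: ["xyz", "bla", "bar"]}
pairs = [(row_id, tag) for row_id, row_tags in rows.items() for tag in row_tags]
conn.executemany("INSERT INTO tabela (id) VALUES (?)", [(row_id,) for row_id in rows])
conn.executemany("INSERT INTO tabela_tag (id_tabela, tag) VALUES (?, ?)", pairs)
conn.executemany(
    "INSERT INTO tags (tag) VALUES (?)",
    [(tag,) for tag in sorted({tag for _, tag in pairs})],
)
conn.executemany(
    "INSERT INTO tabela_x_tag (id_tabela, id_tag) SELECT ?, id FROM tags WHERE tag = ?",
    pairs,
)


def ids_for_tag_dependent(tag):
    """Tag -> IDs using the dependent table (served by the index on tag)."""
    sql = "SELECT id_tabela FROM tabela_tag WHERE tag = ? ORDER BY id_tabela"
    return [row[0] for row in conn.execute(sql, (tag,))]


def tags_for_id_dependent(row_id):
    """ID -> Tags using the dependent table (served by the index on id_tabela)."""
    sql = "SELECT tag FROM tabela_tag WHERE id_tabela = ? ORDER BY tag"
    return [row[0] for row in conn.execute(sql, (row_id,))]


def ids_for_tag_intersection(tag):
    """Tag -> IDs using the intersection table joined to the tags table."""
    sql = """SELECT x.id_tabela
             FROM tabela_x_tag AS x
             JOIN tags AS t ON t.id = x.id_tag
             WHERE t.tag = ?
             ORDER BY x.id_tabela"""
    return [row[0] for row in conn.execute(sql, (tag,))]


def rename_tag(old, new):
    """Rename a tag with a single UPDATE; the associations are left untouched."""
    conn.execute("UPDATE tags SET tag = ? WHERE tag = ?", (new, old))


print(ids_for_tag_dependent("bla"), tags_for_id_dependent(2), ids_for_tag_intersection("bla"))
```

In the intersection layout, `rename_tag` changes one row of `tags` no matter how many rows use the tag, whereas the dependent table would need every matching row of `tabela_tag` updated.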

        
    30.01.2014 / 06:03
    7

    In PostgreSQL, if you create a multi-column index as in the figure's example and run the query as illustrated, the planner will be able to use the index and will resolve the query efficiently by running a bitmap operation over the computed partial results. It remains to be seen whether MySQL's query planner will handle the query execution the same way.

        
    30.01.2014 / 13:20