Alternatives to the (anti) standard Entity-Attribute-Value

17

I'm working on two systems that have a common characteristic: entities whose attributes are dynamic (ie can not be "hard-coded"), need to be queried in search operations (result filtering) , but are otherwise decoupled from the rest of the system (ie there is nothing referring to them, nor are they used in any particular calculation). I'm having trouble finding an appropriate way to represent them.

So far, I've been using the (anti-) EAV . There is a Entity table that identifies the entity, a Tag ( NxN with Entity ) table that says the "type" - or "class" - of the entity, a Attribute table Nx1 ) that identifies the property, its type, and whether or not it can be absent, and finally several tables Tag ( XValue , TextValue , IntValue etc) that say "entity E possesses to the attribute A is the value V ".

  

Contextualizing an entity could be a product for sale, its tag the type of product (eg "car"), attributes being the characteristics of a car (eg "brand", "model", " kilometers ") and the values of the application of those characteristics to the product for sale (eg" Fiat "," Palio "," 20000 "). Types of products can be created, altered and deleted (dynamic), you can consult a product type by its characteristics (eg filter cars by the "Fiat" brand), but there is no foreign key for a specific attribute, and ever there will be any "non-generic" calculation involving product features.

     

And, as I said, I currently implement this in the form of EAV templates - in a PostgreSQL database, cross-platform at first but more towards Linux.

Usually using EAV is a "potential WTF alert," but I believe that this would be one of its legitimate uses (the data structures modeled in this way are actually data , not < strong> meta-data ). The vast majority of the bank is formalized, only a few entities were modeled in this way (and I took steps to ensure that it would be possible to formalize any model that might be coupled to the rest of the system.)

However, I'm looking for alternatives to this model - since it's laborious to deal with and full of pitfalls . I thought of representing each entity by a simple XML, but there is the requirement of the filtered search - which I do not know how to do [efficiently]. I have no experience with non-relational DBMS ( "NoSQL" ), so I do not know if its performance characteristics would be acceptable (I am providing for a high number of readings - filtered searches - simultaneous as well as simultaneous writes - even if not in the same entity. Other ideas came to mind, but they seem to me to be "crazy" to use in practice (eg dynamically change the DB schema whenever a "type" is modified).

I ask that answers be based on previous experience involving systems with similar requirements, not just theoretical opinion and / or background, if possible. Suggestions for specific technologies are welcome, but what I'm looking for is a solution strategy , not a software recommendation .

Update: For those not familiar with "Entity-Attribute-Value" (unfortunately I did not find any Portuguese material on the subject)

In a traditional (formalized) modeling, a "Car" entity could be represented as:

Carros
ID      Marca     Modelo     Km
----------------------------------
1       Fiat      Palio      20000
2       Honda     Fit        10000

In an EAV modeling, this is done like this:

Entidades         Atributos            Valores
ID   Tipo         ID    Nome           Entidade Atributo Valor
----------        -------------        -----------------------
1    Carro        1     Marca          1        1        Fiat
2    Carro        2     Modelo         1        2        Palio
                  3     Km             1        3        20000
                                       2        1        Honda
                                       2        2        Fit
                                       2        3        10000

The advantage of this representation is that the "Car" type is not hardcoded: if I want to add a "Color" attribute, I do not need to tinker with the table structure, create new queries to deal with this attribute, etc. - just create one more line in the DateValue table and for each car one more line in the Atributos table. If in addition to cars I want to represent something else (eg real estate), just create lines in the Valores table with this new type, and give it attributes in the same way as "Cars".

The disadvantages are ... well, all possible and imaginable ! Generally, this is a standard to be avoided at all costs (i.e., an "anti-standard"). However, even the greatest critics admit that - in certain cases - it is inevitable. My question is: 1) Does this apply to my case? or is there a formalized way to meet my requirements? 2) if it is inevitable, if there are good alternatives that have acceptable performance.

    
asked by anonymous 12.05.2014 / 08:25

1 answer

12

This is a typical use case for NoSQL tools, especially for tools in the document-oriented database (such as MongoDB ). Other possible / complementary alternatives are enterprise search tools (such as Elasticsearch ).

The key word here is schema-less , a feature with advantages and disadvantages (see this presentation a> by Martin Fowler on the subject).

I've had good experience with large systems:

  • E-commerce ⇒ Generally have very complicated relational models to represent attributes / groups of attributes / category hierarchy / category attributes. etc.
  • Jobs and Resumes ⇒ Usually deal with poorly structured textual information. Algorithms to determine relevancy are extremely important.
  • Telco ⇒ Store a variety of configurations that are difficult to structure in a relational model.

In fact, from a given volume of data, as well as for certain read and write features, enterprise search , NoSQL or both things are practically required to keep the system responsive.

How it works

In both cases it is possible to create documents with various attributes (eg, in JSON format), including complex attributes ( arrays , nested types, etc.). Documents with different attributes can be grouped in the same category (that is, the type of a document does not define its attributes).

When you search for a particular document (for example, by id ), you receive all the information that persisted on that document (i.e., all attributes).

When you make a query using a given attribute engines search only the documents that contain that attribute.

In practice your application ends up knowing and defining the rules for the attributes (eg, a car will always have id , modelo and ano , but only imported cars have the country attribute of origin) and you do "sane" consultations on these rules. This is what Martin Fowlwer calls implicit schemas .

Features

Performance: The correct use of these solutions is far more efficient than the standard EVA in relational databases. These solutions are made to scale and quickly return query results in gigantic data sets .

Queries: Searches for types, ranges, logical searches, sorting, scoring, full text search, be done. Of course, the tools have their own characteristics. It takes a while before you understand how to structure information efficiently, as well as lose addiction to aggregate functions, having clauses, and the like (even though this type of query can be done with MapReduce ).

ACID / BASE : Just like in relational databases, NoSQL tools can usually be configured to support different consistency / availability / fault tolerance characteristics (see CAP Theorem ). Each tool also has its transaction and serialization mechanisms *.

Convenience: Keep your documents and make all kinds of queries (as well as updates, deletions, updates, partial, etc.). Everything is very simple (the learning curve of Relational DBMS, in my opinion, is much more pronounced).

* Specific to the tools mentioned: You will have no problem with competing readings or writes. I only recommend that you keep at least two knots in the air the whole time (especially for the MongoDB). The intuition here is not (only) to guarantee high availability, but to extract the maximum of the tools. I recommend doing this even for the development environment ... This task is much simpler than it looks (I did local configurations with VMs on my laptop, I lost no more than an hour configuring MongoDB and Elasticsearch).

    
22.05.2014 / 10:27