MD5 is good enough?

1

I'm working on a legacy system, which has a database with some 5 years of cumulative records without any normalization. Among other things, its purpose is to allow users to write and post posts in the same style of Twitter, but without the limitation of 140 characters. Some users are complaining that their posts are appearing multiple times. The system had a serious problem at the time the posts were registered, causing the same content to be inserted multiple times.

At system level I was able to fix the problem, but now I have a huge table of posts, many of the records have 2 ... 3 ... up to 6 copies. The correct one would be to remove the duplicates, but at first, what I want to do is simply select a post of each.

To illustrate:

Notethatrecords1and2arefromthesameuser,andhaveexactlythesamebody.

Theonlycolumnthatcanidentifythatonepostisequaltotheotheronewith100%certaintyisthecorpo,whichisoftypeTEXT.Unliketheexample(whichwasjusttoillustrate),thiscolumncaneasilyhaveits1000characters,sousingSELECTDISTINCTON(body)*FROMposts;doesnotseemlikeacoolidea.

SowhatIthoughtwastocreateaSTOREDPROCEDUREtogeneratetheMD5fromthebodyofallposts,whichwouldbestoredinanindexedfield.SoIcoulddoSELECTDISTINCTbasedonthishashMD5.

Thetablethenlookslikethis:

With body_hash being an indexed field, I could do something like

SELECT DISTINCT ON(body_hash) * FROM posts;

I'm in doubt because this table has millions of records. MD5 is good enough for my situation? Or is there something better I can do to select only one record of each?

    
asked by anonymous 25.06.2016 / 14:01

2 answers

1

MD5-Message Digest 5 is a hash algorithm.

Hash algorithms aim to maintain the integrity of the data and work something like this:

There is an M entry, an Hash H function, and an output (hash code) H (M).

M-> H-> H (M)

A Hash algorithm is considered safe if different inputs result in different outputs, ie if you change a bit, the output must be totally different.

The MD5 algorithm some years ago was safe, but today there is better algorithm.

The weak point in the MD5 algorithm is in the generated hash output, which is 128 bits, if small, can generate conflicts, that is, different entries can generate equal outputs, thus breaking the security of the algorithm.

What is the alternative to MD5?

There are several alternatives, but one of the best known is the SHA-2 Family algorithm.

It's safer than MD5. The generated hash code has a longer range, which can be 224, 256, 512 etc bits, with the larger code being harder is different inputs generating equal outputs.

Conclusion

I think it might be a solution to your problem.

    
27.06.2016 / 18:40
2

Reliability of MD5 as Unique

If you want to trust 100% MD5 to find out if it's the same, you can not trust it. The MD5 can be a base for already making sure it is different, it is reliable when it is different. But two MD5 codes may be the same for different content.

I would say that MD5 has well over 99.99999999% chance of succeeding, but not 100%. If it is a record full of problems, it will solve everything or in a very exceptional situation will let something happen, which is already an expressive gain. But if it's just to fix the problem once and it will not be useful for other things, the cost of generating the MD5 does not even outweigh the effort. Generating an MD5 code is "expensive."

Actual solution

Index the same text column, solve the problem, and delete this index. It's easier and more reliable.

You will end up generating false positives with this. What if the user wanted to post the same message more than once? It will erase a legitimate message. And look what I'm considering in this DISTINCT the user identifier will be considered.

If you want to avoid new duplicates

This is more hypothetical, does not seem to be the case.

If you want to prevent new columns from being repeated you can do this to save time. But if you find a duplicate you have to check the full text to make sure there was no collision. This check in practice will almost never occur, after all almost 100% of the time% wont return 0 different lines and you will not have to do anything else, even in this case, it will be quick to check the ones that return, if everything is right, it will only return one line. So it can be efficient. I would not use it.

In fact using the text field will be fairly reasonably efficient in processing as well. Not in space occupied by the index and the time spent on your access. It can cause damage to the cache. This is the real problem of using the very large column.

But understand that any insertion or update of data in this column will be considerably slower because of MD5. But that does not mean it's a problem. Take a test if you can afford it.

It has a trick that can help, but I do not know if that's the case, it will not guarantee 100% anyway. Index the hash along with a short excerpt from the text. It may be the first character (s), or another snippet that is more likely to be different, but it must be a stretch that is guaranteed to be present in all texts to be useful. It is much rarer to collide in texts with equal parts. This only improves the chance. And it will increase the consumption of space, so the utility is questionable.

    
25.06.2016 / 15:46