I'm working on a legacy system that has a database with some 5 years of accumulated records and no normalization whatsoever. Among other things, its purpose is to let users write and publish posts in the same style as Twitter, but without the 140-character limit. Some users are complaining that their posts appear multiple times. The system had a serious problem at the time posts were saved, which caused the same content to be inserted multiple times.
At the application level I was able to fix the problem, but now I have a huge posts table in which many records have 2, 3, even up to 6 copies. The right thing to do would be to remove the duplicates, but for now all I want is to select one post from each group.
To illustrate:
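(Here is a hypothetical, simplified version of the table, just to make the problem concrete; the real table has more columns and these names are only placeholders.)

CREATE TABLE posts (
    id         bigserial PRIMARY KEY,
    user_id    bigint NOT NULL,
    body       text   NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);

-- Records 1 and 2: same user, exactly the same body (inserted twice by the old bug).
INSERT INTO posts (user_id, body) VALUES
    (42, 'My first post!'),
    (42, 'My first post!'),
    (7,  'A completely different post');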
Note that records 1 and 2 are from the same user and have exactly the same body.
The only column that can identify with 100% certainty that one post is equal to another is body, which is of type TEXT. Unlike the example (which was just to illustrate), this column can easily hold 1000 characters, so using SELECT DISTINCT ON (body) * FROM posts; does not seem like a good idea.
So what I thought of doing was to create a STORED PROCEDURE to generate the MD5 hash of the body of every post, which would be stored in an indexed field. Then I could do a SELECT DISTINCT based on this MD5 hash.
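A minimal sketch of the idea (instead of a full STORED PROCEDURE it could simply be an UPDATE using Postgres's built-in md5() function; the column and index names here are just placeholders):

-- Add the hash column and fill it from the existing bodies.
ALTER TABLE posts ADD COLUMN body_hash char(32);

UPDATE posts SET body_hash = md5(body);

-- Index the hash so queries that group or deduplicate by it stay cheap.
CREATE INDEX posts_body_hash_idx ON posts (body_hash);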
The table then gains the body_hash column. With body_hash being an indexed field, I could do something like
SELECT DISTINCT ON(body_hash) * FROM posts;
I'm in doubt because this table has millions of records. Is MD5 good enough for my situation, or is there something better I can do to select only one record from each group?
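For reference, the fuller version of the query I have in mind (the ORDER BY only makes it deterministic which copy of each duplicate group is returned, assuming an id primary key as in the sketch above):

SELECT DISTINCT ON (body_hash) *
FROM posts
ORDER BY body_hash, id;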