Group query for similar but not identical records

2

I need to migrate a database (from SQL Server to PostgreSQL and I'm creating an ETL interactive in PHP to make this change), which is about 10 years, some 2 million records in the table clientes no has verification or single registration logic, in the case CPF or CNPJ .

And the big problem is that I have the following entries in the cliente table: [ banco bradesco s.a. , banco bradesco , banco bradesco sa , bradesco sa , banco do bradesco , bradesco seguros , bradesco curitiba , bradesco ] and another case: [ walmart , supermercado walmart ] , and none of them has CPF or CNPJ these being an example of thousands possible. My goal is to identify them and show to the user that he can turn them all into one in the new database (but that's another thing).

In the example above is the cliente table, and the value (name) is what is in the razao_social column, this column being the starting point to start the check.

I have already researched, already studied a schematic where I create a table, put the names there and look for the names in that table using LIKE % , but this is very costly. But an alternative.

My question here is, can I get all similar but not identical results without setting a parameter in a query ?

Thank you.

Obs.¹ I have already assessed the risk of this business with the issue with common names, such as panificadora , mercado , supermercado , posto and still need to persist in this solution that I mentioned in topic.

Note:

/ p>     

asked by anonymous 03.09.2016 / 20:32

1 answer

1

In my opinion, the first thing you should get is the list of terms used in a particular column ( whose answer I took from here ):

CREATE FUNCTION dbo.Split
(
    @RowData nvarchar(2000),
    @SplitOn nvarchar(5)
)  
RETURNS @RtnValue table 
(
    Id int identity(1,1),
    Data nvarchar(100)
) 
AS  
BEGIN 
    Declare @Cnt int
    Set @Cnt = 1

    While (Charindex(@SplitOn,@RowData)>0)
    Begin
        Insert Into @RtnValue (data)
        Select 
            Data = ltrim(rtrim(Substring(@RowData,1,Charindex(@SplitOn,@RowData)-1)))

        Set @RowData = Substring(@RowData,Charindex(@SplitOn,@RowData)+1,len(@RowData))
        Set @Cnt = @Cnt + 1
    End

    Insert Into @RtnValue (data)
    Select Data = ltrim(rtrim(@RowData))

    Return
END

CREATE FUNCTION dbo.SplitAll(@SplitOn nvarchar(5))
RETURNS @RtnValue table
(
    Id int identity(1,1),
    Data nvarchar(100)
)
AS
BEGIN
DECLARE My_Cursor CURSOR FOR SELECT Nome FROM dbo.clientes
DECLARE @description varchar(50)

OPEN My_Cursor
FETCH NEXT FROM My_Cursor INTO @description
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO @RtnValue
    SELECT Data FROM dbo.Split(@description, @SplitOn)
   FETCH NEXT FROM My_Cursor INTO @description
END
CLOSE My_Cursor
DEALLOCATE My_Cursor

RETURN

END

SELECT DISTINCT Data FROM dbo.SplitAll(N' ')

With this word list, you can bring partitions of your table from the terms found. This is a human job, unfortunately, but with it, LIKE operations are more reasonable.

Also, if you do not want to use LIKE , my suggestion is to implement the Levenshtein distance function in SQL Server , which I transcribe below, of the response mentioned:

-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and sustitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case-
-- sensitivity configured in SQL Server (case-insensitive by default).
-- http://blog.softwx.net/2014/12/optimizing-levenshtein-algorithm-in-tsql.html
-- 
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0 -- return variable
          , @v0 nvarchar(4000)-- running scratchpad for storing computed distances
          , @start int = 1      -- index (1 based) of first non-matching character between the two string
          , @i int, @j int      -- loop counters: i for s string and j for t string
          , @diag int          -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int          -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar      -- character at index i from s string
          , @thisJ int          -- temporary storage of @j to allow SELECT combining
          , @jOffset int      -- offset used to calculate starting value for j loop
          , @jEnd int          -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
          , @lenDiff int      -- difference in length between the two strings
    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
        SELECT @s = @t, @sLen = @tLen
        SELECT @t = @v0, @tLen = @i
    END
    SELECT @max = ISNULL(@max, @tLen)
         , @lenDiff = @tLen - @sLen
    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1)) 
        SELECT @start = @start + 1
    IF (@start > 1) BEGIN
        SELECT @sLen = @sLen - (@start - 1)
             , @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SELECT @s = SUBSTRING(@s, @start, @sLen)
             , @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SELECT @v0 = '', @j = 1
    WHILE (@j <= @tLen) BEGIN
        SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SELECT @j = @j + 1
    END

    SELECT @jOffset = @max - @lenDiff
         , @i = 1
    WHILE (@i <= @sLen) BEGIN
        SELECT @distance = @i
             , @diag = @i - 1
             , @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
             , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
             , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                 , @thisJ = @j
            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                    --match, no change
                     ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag    --substitution
                                   WHEN @left < @distance THEN @left                    -- insertion
                                   ELSE @distance                                        -- deletion
                                END    END
            SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                 , @diag = @left
                 , @j = case when (@distance > @max) AND (@thisJ = @i + @lenDiff) then @jEnd + 2 else @thisJ + 1 end
        END
        SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END
    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END

Usage:

... WHERE dbo.Levenshtein(@Coluna1, @Coluna2, 5) <= 5

The maximum coefficient of difference need not be 5. By testing, you will know which coefficient to apply.

    
03.09.2016 / 20:56