SQL Server Full-Text-Search

4

I am developing a search using the full text search of SQL Server 2008. The search is working correctly, but there has been a need to include a feature.

When you search for "Joes", the message "you meant Joseph" has to appear. I did some research but did not find much information about the full text search anyone knows if it is possible to do this with this tool.

    
asked by anonymous 11.08.2015 / 16:09

1 answer

4

I'll put some alternatives that I know.

Alternative 1: Similarity Testing of Strings

I removed this solution from here , which uses diffuse search . CompareText returns an index between 0 and 100, being 100 for two strings identical and 0 for two strings completely different:

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[MatchText]') 

and xtype in (N'FN', N'IF', N'TF'))
drop function [dbo].[MatchText]
GO

CREATE function MatchText (@InputString varchar (50))
returns varchar (50)

begin
--function MatchText
--blindman, 7/2005
--Strips a string of all vowels and non-alphanumeric characters.

-- --test parameters
-- declare    @InputString varchar(50)
-- set    @InputString = 'Bruce a. Lindman'

declare @TempString varchar (50)
declare @OutputString varchar (50)
declare @CharNum integer
declare @TestChar CHAR(1)

--Convert to uppercase and remove noise characters
set @TempString = UPPER(@InputString)
set @TempString = replace(@TempString, 'A', '')
set @TempString = replace(@TempString, 'E', '')
set @TempString = replace(@TempString, 'I', '')
set @TempString = replace(@TempString, 'O', '')
set @TempString = replace(@TempString, 'U', '')

--Build @OutputString with only alphanumeric characters
set @CharNum = 1
set @OutputString = ''
while @CharNum <= len(@TempString)
    begin
        set @TestChar = substring(@TempString, @CharNum, 1)
        if (@TestChar between 'A' and 'Z') OR (@TestChar between '0' and '9') 

set @OutputString = @OutputString + @TestChar
        set @CharNum = @CharNum + 1
    end

return @OutputString
end
GO

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[CompareText]') 

and xtype in (N'FN', N'IF', N'TF'))
drop function [dbo].[CompareText]
GO

create function CompareText  (@String1 varchar (100), @String2 varchar (100))
returns integer
--Function CompareText
--     This function accepts two string values and returns an integer value between
--zero and one hundred indicating the similarity between the two string.  This
--algorithm was developed specifically to search for similar names, spelling
--variations, and typos in proper names and business names.  For this purpose,
--it is superior to SOUNDEX, which searches only for similar sounding words.
--     Proper name pairs which yield a CompareText value above 80 are very likely to
--represent the same person.  Pair values greater than 60 should be reviewed
--manually to confirm the match.  For greater accuracy and efficiency, run the names
--through the MatchText function to remove spaces and vowels before submitting them
--to comparetext.
--     For efficiency in comparing two large lists of names, it is best to join
--the sets on another column as well, such as zip code, or city name.

--Usage: select dbo.CompareText('Alan Smith', 'Smith, Alan J.')

--blindman 4/2005
--Adapted from MS Access algorithm developed 1997

begin

declare @Possibles integer
declare @Hits integer
declare @Counter integer

set @Possibles = len(@String1) + len(@String2) - 2
set @Hits = 0

set @Counter = len(@String1)-1
while @Counter > 0
    begin
      if charindex(substring(@String1, @Counter, 2), @String2) > 0 set @Hits = 

@Hits + 1
      set @Counter = @Counter - 1
    end

set @Counter = len(@String2)-1
while @Counter > 0
    begin
      if charindex(substring(@String2, @Counter, 2), @String1) > 0 set @Hits = @Hits + 1
      set @Counter = @Counter - 1
    end

return (100*@Hits)/@Possibles
end

Usage:

select dbo.CompareText('José', 'Joes')

Returned 66 in my test. Below are some considerations on how to use this.

Alternative 2: Phonetic Search

These questions can help .

Alternative 3: Distance from Levenshtein

This, until then, was the best algorithm I found. I took it from :

-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and sustitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case-
-- sensitivity configured in SQL Server (case-insensitive by default).
-- http://blog.softwx.net/2014/12/optimizing-levenshtein-algorithm-in-tsql.html
-- 
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0 -- return variable
          , @v0 nvarchar(4000)-- running scratchpad for storing computed distances
          , @start int = 1      -- index (1 based) of first non-matching character between the two string
          , @i int, @j int      -- loop counters: i for s string and j for t string
          , @diag int          -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int          -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar      -- character at index i from s string
          , @thisJ int          -- temporary storage of @j to allow SELECT combining
          , @jOffset int      -- offset used to calculate starting value for j loop
          , @jEnd int          -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
          , @lenDiff int      -- difference in length between the two strings
    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
        SELECT @s = @t, @sLen = @tLen
        SELECT @t = @v0, @tLen = @i
    END
    SELECT @max = ISNULL(@max, @tLen)
         , @lenDiff = @tLen - @sLen
    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1)) 
        SELECT @start = @start + 1
    IF (@start > 1) BEGIN
        SELECT @sLen = @sLen - (@start - 1)
             , @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SELECT @s = SUBSTRING(@s, @start, @sLen)
             , @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SELECT @v0 = '', @j = 1
    WHILE (@j <= @tLen) BEGIN
        SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SELECT @j = @j + 1
    END

    SELECT @jOffset = @max - @lenDiff
         , @i = 1
    WHILE (@i <= @sLen) BEGIN
        SELECT @distance = @i
             , @diag = @i - 1
             , @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
             , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
             , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                 , @thisJ = @j
            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                    --match, no change
                     ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag    --substitution
                                   WHEN @left < @distance THEN @left                    -- insertion
                                   ELSE @distance                                        -- deletion
                                END    END
            SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                 , @diag = @left
                 , @j = case when (@distance > @max) AND (@thisJ = @i + @lenDiff) then @jEnd + 2 else @thisJ + 1 end
        END
        SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END
    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END

Usage:

select dbo.Levenshtein('José', 'Jose', 100)

You returned 1 for this case.

Considerations

For the first alternative, you need to define a similarity coefficient in your application, as well as a list of known words, which should be the column of the table to be compared.

For the second, it's cool to define a phonetic algorithm of your choice to use and go testing. I do not recommend Soundex. It does not work well for Portuguese.

For the third, it's the same idea as the first alternative.

    
11.08.2015 / 16:36