Algorithm to find common word sequence

Question

Algorithm to find common word sequence

Navigation

#1 by (6 votes)

4

I'm looking for a logic where I can find in an exorbitant amount of data, phrases that are common. Sentence in this context is a sequence of words.

For example in these sentences:

I'm traveling to Orlando from the South
I really wanted to go to South Orlando
The day is beautiful here in South Orlando

The phrase, or sequence of words in common would be Orlando of the South

Searching I found this question in StackOverflow EN though I could not understand the answer.

algoritmo

asked by anonymous 07.05.2016 / 17:07

1 answer

How can I create array in SQL Server? How to compare the structure of two SQL Server databases?

score 6 · Answer 1

You do not mention exactly what you did not understand in SOEN's answer. But one thing the author repeats a lot is "linear time." The idea of an algorithm capable of processing the data in linear time is that the time it takes to process increases linearly with the amount of input data. Think ideally in a straightforward proportion: the more data, the longer it takes. It is expected for a sufficiently good algorithm. On the other hand, a bad algorithm would take exponential time to process the data. Thus, the more data, the much longer it would take to process - to the point of becoming unfeasible for a quantity of data that does not even need to be "exorbitant."

To better understand this concept, I suggest this other question and also that other issue .

Well, you have a lot of data (words or characters to process), so you have to worry about your algorithm's performance. The SOEN response suggests an approach based on some algorithms that I do not know, but from what I could understand the essential idea is to sweep the data (in linear time) and mount a $ means the end of the word):

Apparently,usingstructureslikethisyoucanalsobuildtheprefixtree(allpossibleprefixesforyourtext,consideredasastringonly)andthenfindthelongestandmostcommonprefixinit.

Theconceptsinvolveddonotseemtobesocomplex,butwillrequireyoutoreadandstudythemprobablyinEnglish.TheWikipediaarticlemaybeinsufficient,butyoumayfindmoremateriallookingfor"suffix tree".