How to count words in a string while ignoring prepositions?

8

Is there a service that recognizes whether a given word is a preposition?

I want to build a word ranking from an RSS feed, but ignoring prepositions.

Ignoring words with fewer than N characters is a good start, but probably not enough, as a lot of prepositions still remain. Here are two lists:

Essential prepositions: the ante, after, until, with, against, from, in, between, to, before, through, without, under, behind.

Accidental Prepositions: (= in the quality of), according (= according to), second (= conform), consonant (= conform), during, saved, out, by, tie, except, otherwise, p>

Do you know of any service that does this identification, or do you have any idea how to implement a reasonable method? It does not have to be 100% comprehensive, but it should cover a significant share of the words.

It can be in any language.

Thank you.

Here is a snippet of the C# code that I'm using in the prototype, but it has proven to be inefficient:

private static IEnumerable<IGrouping<string, string>> MostCommonWords(string str, int maxNumWords)
{
    var prepositions = new string[] {/*...*/};
    var mostCommonWords =
        Regex.Split(str.ToLower(), @"\W+")
            .Where(s => s.Length > 3 && !prepositions.Contains(s))
            .GroupBy(s => s)
            .OrderByDescending(g => g.Count()).Take(maxNumWords);
    return mostCommonWords;
}
    
asked by anonymous 24.02.2015 / 18:30

2 answers

6

Slightly unimportant version ...

xmllint --xpath '//description'  'http://.../news.rss'  |
grep -Po '(*UTF8)(*UCP)\b[\w\d_][\w\d_\-.*#]*[\w\d_]\b|\w|\.\.\.|[,.:;()[\]?!]|\S' |
grep ... |
sort | uniq -c | sort -nr
  • line 1 - extract the description elements from the (remote) file news.rss (adapt the XPath to your concrete needs; see the namespace-related option -setns)
  • line 2 - tokenizer - one token per line (the notion of a word is more complicated than it sounds)
  • line 3 - select words with a minimum of 3 chars (remove if not relevant)
  • line 4 - count occurrences and sort in reverse order

If needed, add something like

grep -wvf  stopwords.txt  |

as line 3.5 to remove the words contained in the stopwords.txt file.
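To try the idea on local text first, here is a minimal sketch of the same tokenize, filter, count pipeline wrapped in a function. The name rank_words, its arguments and the simplified tr-based tokenizer are assumptions made for illustration (it is cruder than the grep -Po tokenizer above and is ASCII-oriented), and the length filter is only one possible way to do the "minimum of 3 chars" step:

```shell
# Hypothetical helper: rank the words read from stdin, skipping stop-words.
# Usage: rank_words STOPWORDS_FILE [MIN_LENGTH]
# STOPWORDS_FILE holds one lowercase word per line; MIN_LENGTH defaults to 3.
rank_words() {
  local stopwords="$1" min_len="${2:-3}"
  tr '[:upper:]' '[:lower:]' |               # normalise case
    tr -cs '[:alnum:]' '\n' |                # simplified tokenizer: one alphanumeric run per line
    awk -v n="$min_len" 'length($0) >= n' |  # keep tokens with at least n characters
    grep -vxFf "$stopwords" |                # drop tokens equal to a line of the stop-word file
    sort | uniq -c |                         # count occurrences of each token
    sort -k1,1nr -k2                         # most frequent first, ties in alphabetical order
}
```

For example, `rank_words stopwords.txt 4 < descriptions.txt` would rank the words of at least 4 characters found in a (hypothetical) descriptions.txt. Since each token sits on its own line, -x (whole-line match) with -F (fixed strings) is used here in place of -w, so a stop-word is never treated as a regular expression.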

Edit 1, footnotes: Stop-words

  

@Pedreiro commented: ... one of the points that I wanted to raise with this question was how to cover a good part of the stop-words, and I will take the opportunity to share this link that I found: code.google.com/p/stop-words (it contains a collection of stop-word lists for several languages)

Normally stop-words turn out to be

  • "grammatical" words (e.g. prepositions, articles, pronouns, some adverbs, conjunctions); it is useful to start from a list like the one mentioned by @Peter,
  • plus some words that are too common in the context in question (e.g. "Folha" and "Paulo" if the RSS feed is news from "Folha de S.Paulo"),
  • minus the words that are informative in our context (e.g. the Portuguese "visto", which means both "seen" and "visa", is very important in an RSS feed about travel bureaucracy: "you need a visa to enter Iran").
  • In other words, stop-words = 1 + 2 - 3
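The "1 + 2 - 3" rule can be sketched as plain set arithmetic on word-list files. This is only an illustration: the function name build_stopwords and the three file arguments are hypothetical, and each file is assumed to hold one lowercase word per line:

```shell
# Hypothetical helper: stop-words = (grammatical + context-common) - informative.
# Usage: build_stopwords GRAMMATICAL_FILE COMMON_FILE INFORMATIVE_FILE
build_stopwords() {
  local grammatical="$1" common="$2" informative="$3"
  sort -u "$grammatical" "$common" |  # 1 + 2: union of both lists, duplicates removed
    grep -vxFf "$informative"         # - 3: remove words that are informative in this context
}
```

The output can be redirected to a file such as stopwords.txt and then used by the filtering step of the pipeline.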

    Finally: in many cases (1) you should not remove the stop-words at all, and (2) it makes sense to use stop-phrases.

        
    24.02.2015 / 20:25
    1

    This is related to data mining.

    I found a presentation on SlideShare ( link ) that might help you.

    Here are other links related to the subject:

    link

    link

    link

        
    24.02.2015 / 18:43