How to count words from a string ignoring prepositions?

Question

score 8

Is there a service that recognizes whether a given word is a preposition?

I want to build a word ranking from an RSS feed, but ignoring prepositions.

Ignoring words with fewer than N characters is a good start, but probably not enough, since many prepositions remain. Here are two lists:

Essential prepositions: the ante, after, until, with, against, from, in, between, to, before, through, without, under, behind.

Accidental prepositions: as (= in the quality of), according (= according to), second (= conform), consonant (= conform), during, saved, out, by, tie, except, otherwise, ...

Do you know of any service that does this identification, or do you have any idea how to implement a reasonable method? It does not have to be 100% comprehensive, but it should cover a significant part of the words.

It can be in any language.

Thank you.

Here is a snippet of the C# code I'm using in the prototype, but it has proven to be inefficient:

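// Requires: System.Collections.Generic, System.Linq, System.Text.RegularExpressions
private static IEnumerable<IGrouping<string, string>> MostCommonWords(string str, int maxNumWords)
{
    var prepositions = new string[] {/*...*/};
    var mostCommonWords =
        Regex.Split(str.ToLower(), @"\W+")                          // split on runs of non-word characters
            .Where(s => s.Length > 3 && !prepositions.Contains(s))  // drop short words and prepositions
            .GroupBy(s => s)                                        // one group per distinct word
            .OrderByDescending(g => g.Count()).Take(maxNumWords);   // most frequent first
    return mostCommonWords;
}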

algorithm c#

asked by anonymous 24.02.2015 / 18:30

2 answers

score 1

This is related to data mining.

I found a presentation on SlideShare ( link ) that might help you.

Here are other links related to the subject:

link

24.02.2015 / 18:43

score 6 · Accepted Answer

A rather unpretentious version ...

xmllint --xpath '//description'  'http://.../news.rss'  |
grep -Po '(*UTF8)(*UCP)\b[\w\d_][\w\d_\-.*#]*[\w\d_]\b|\w|\.\.\.|[,.:;()[\]?!]|\S' |
grep ... |
sort | uniq -c | sort -nr

line 1 - extracts the description elements from the (remote) news.rss file (adapt the XPath to your concrete needs; for namespaces, see the -setns option)
line 2 - tokenizer: one token per line (the notion of a word is more complicated than it sounds)
line 3 - selects words with a minimum of 3 characters (remove if not relevant)
line 4 - counts occurrences and sorts in descending order

If needed, add something like

grep -wvf  stopwords.txt  |

as line 3.5, to remove the words contained in the stopwords.txt file.
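
To try the stop-word filter without an RSS feed, here is a minimal sketch that ranks words read from standard input. It is not the pipeline above: the function name rank_words and its arguments are made up for illustration, the tokenizer is a simplified ASCII-only one (tr instead of the PCRE expression), and it uses grep -x (whole-line match) rather than -w, which is a reasonable fit when there is one token per line.

```shell
# Hypothetical helper (illustration only): rank the words read from stdin.
# $1 = stop-word file, one lowercase word per line
# $2 = minimum word length (assumed default: 3)
rank_words() {
    stopfile=$1
    minlen=${2:-3}
    tr '[:upper:]' '[:lower:]' |            # normalize case
    tr -cs '[:alpha:]' '\n' |               # simplified tokenizer: one alphabetic run per line
    awk -v n="$minlen" 'length($0) >= n' |  # keep words with at least n characters
    grep -vxFf "$stopfile" |                # drop lines that exactly match a stop-word
    sort | uniq -c | sort -k1,1nr -k2       # count, then sort by frequency (ties alphabetically)
}

# Usage, assuming a stopwords.txt file exists in the current directory:
#   printf 'The visa office and the embassy' | rank_words stopwords.txt 3
```

Because tr works on bytes in common implementations, accented words would be split by this sketch; the PCRE tokenizer in the answer handles them properly.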

Edit 1 - Footnotes: stop-words

@Pedreiro commented: ... one of the points I wanted to raise with this question was how to cover a good part of the stop-words, and I'll take the opportunity to share this link I found: code.google.com/p/stop-words (it contains a collection of stop-word lists for several languages)

Normally the stop-words turn out to be:

1. "grammatical" words (e.g. prepositions, articles, pronouns, some adverbs, conjunctions); it is useful to start from a list like the one mentioned by @Pedreiro,

2. to which we add some words that are too common in the context in question (e.g. "Folha" and "Paulo" if the RSS feed is news from "Folha de S.Paulo"),

3. and from which we remove words that are informative in our context (e.g. "visto", Portuguese for both "seen" and "visa", is very important in an RSS feed about travel bureaucracy: "you need a visa to enter Iran").

In other words, stop-words = 1 + 2 - 3
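
The "1 + 2 - 3" rule can be sketched as set operations on word-list files. This is only an illustration: the function name build_stopwords and the three file arguments are assumptions, and each file is expected to hold one lowercase word per line.

```shell
# Hypothetical helper (illustration only): stop-words = (1 + 2) - 3
# $1 = grammatical words, $2 = words too common in this context,
# $3 = words that are informative in this context and must be kept
build_stopwords() {
    sort -u "$1" "$2" |   # 1 + 2: union of both lists, duplicates removed
    grep -vxFf "$3"       # - 3: drop every line that exactly matches an informative word
}

# Usage, assuming the three list files exist:
#   build_stopwords grammatical.txt common.txt informative.txt > stopwords.txt
```

The resulting stopwords.txt can then be fed to the grep filter of line 3.5.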

Finally: in many cases (1) you should not remove the stop-words at all, and (2) stop-phrases make sense too.