How do I preprocess text for use with Weka's classification algorithms in Java?


I'm doing my final-year project, which roughly consists of collecting tweets and training a machine learning algorithm to classify them.

How should I preprocess these tweets? The idea is to train a machine learning algorithm on labeled examples: tweets that indicate an intent to buy and tweets that do not. Then, given a new tweet as input, the trained algorithm should tell me whether or not it refers to a purchase.

I already have the database of the collected tweets, and I have already incorporated the Weka API into my project.

    
asked by anonymous 28.04.2014 / 15:57

2 answers


WEKA reads files in the ARFF format.

To create an ARFF file, you must define the following header sections:

Relation declaration

A name for the relation, defined on the first line of the file. It is declared as:

@relation <relation name>

If the relation name contains spaces, it must be enclosed in quotation marks.

Attribute declarations

Attributes are declared by an ordered sequence of @attribute statements. Each attribute in the dataset has its own @attribute declaration, which uniquely defines the attribute's name and data type. The order of the declarations determines the order in which the attribute values appear in the data section.

Each one is declared as:

@attribute <attribute name> <data type>

The attribute name must begin with a letter, and if it contains spaces, it must be enclosed in quotation marks.

The data types supported by WEKA are:

  • Numbers (real or integer): numeric
  • "Free" text: string
  • Nominal attributes (a predefined list of values)
  • Date: date [<date-format>]
  • Relational attributes

Numeric attributes

Used for both integers and real numbers. It is declared as:

@attribute idade numeric

Nominal Attributes

Nominal values are defined by providing a list of the possible values. For example:

@attribute classe {comprador, possivel-comprador, nao-comprador}

Attributes of type String

Used for arbitrary text. It is declared as:

@attribute tweet string

Note: In the data section, a string value must be enclosed in quotation marks if it contains spaces.

Data section declaration

The start of the data section is declared on a single line:

@data

It marks where the instance data actually begins.

Instance data

Instances are declared one per line, with the attribute values separated by commas.

To answer your question directly, a possible ARFF file for your problem would look like this:

% Lines that begin with % are ignored. They can be used to insert comments
@relation compradores

@attribute tweet string
@attribute classe {compraria, nao-compraria}

@data
"To e morto Galaxy S5 por R$ 2,600", nao-compraria
"Preciso de um galaxy s5", compraria
"Configurando meu Galaxy s5", compraria
"Prefiro um iphone do que um galaxy s5", nao-compraria
    
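Since you already have the tweets in a database, a file like the one above could be generated from Java. What follows is only a minimal sketch under some assumptions: `TweetArffSketch`, `quote` and `toArff` are hypothetical names (they are not part of the Weka API), the tweets and labels are passed in as a plain map instead of being read from your database, and embedded double quotes are escaped with a backslash, which I believe is what Weka's ARFF reader accepts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Weka): builds ARFF text for labeled tweets.
class TweetArffSketch {

    // Collapses line breaks and other whitespace runs into single spaces, then
    // wraps the text in double quotes, escaping backslashes and double quotes.
    static String quote(String text) {
        String oneLine = text.replaceAll("\\s+", " ").trim();
        String escaped = oneLine.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Assumes the relation name has no spaces and that the labels are the two
    // nominal values used in the example above.
    static String toArff(String relation, Map<String, String> labeledTweets) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute tweet string\n");
        sb.append("@attribute classe {compraria, nao-compraria}\n\n");
        sb.append("@data\n");
        for (Map.Entry<String, String> entry : labeledTweets.entrySet()) {
            // One instance per line: quoted tweet text, comma, class label.
            sb.append(quote(entry.getKey())).append(", ")
              .append(entry.getValue()).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tweets = new LinkedHashMap<>();
        tweets.put("Preciso de um galaxy s5", "compraria");
        tweets.put("Prefiro um iphone do que um galaxy s5", "nao-compraria");
        System.out.print(toArff("compradores", tweets));
    }
}
```

The resulting text can be written to a .arff file and opened in Weka like any other ARFF file.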
29.04.2014 / 16:01

So, man, I'm doing something similar and ran into the same problem. I collected the tweets with Python and saved them in a JSON file, but when I tried to load the JSON in Weka, it was not recognized. I solved it as follows:

I converted the JSON to CSV and removed all line breaks, commas, and single and double quotation marks, and stripped the accents from the words. Then I tried opening it in Weka, and it worked.
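I did this cleanup outside of Java, but since the question is about Java, the same steps might look roughly like the sketch below. `TweetCleanerSketch` and `clean` are hypothetical names, and stripping accents through Unicode decomposition is just one way to do it, not necessarily how it should be done for every language.

```java
import java.text.Normalizer;

// Hypothetical helper: applies the cleanup steps described above to one tweet.
class TweetCleanerSketch {

    static String clean(String raw) {
        // Decompose accented letters into base letter + combining mark,
        // then drop the combining marks.
        String noAccents = Normalizer.normalize(raw, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        // Replace line breaks, commas, and single/double quotes with spaces.
        String noPunct = noAccents.replaceAll("[\\r\\n,'\"]", " ");
        // Collapse the whitespace runs left behind by the replacements.
        return noPunct.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // \u00e3 is the letter a with a tilde.
        System.out.println(clean("N\u00e3o compraria,\n \"Galaxy\" S5"));
    }
}
```

Each cleaned tweet then fits on a single CSV line without needing any quoting.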

After opening the file in Weka, you can save it in ARFF format. I then had to edit the file by hand, since Weka was not recognizing the text field as a string: I changed the attribute line near the top of the file, right after @relation, to read:

@attribute text string
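I made that change in a text editor, but it could also be scripted. Here is a rough sketch, assuming the ARFF file has already been read into a list of lines and that the attribute is named text as in my file; `ArffHeaderPatchSketch` and `forceStringAttribute` are hypothetical names, not Weka API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: rewrites one @attribute line so its type becomes string.
class ArffHeaderPatchSketch {

    static List<String> forceStringAttribute(List<String> arffLines, String attributeName) {
        String prefix = "@attribute " + attributeName.toLowerCase(Locale.ROOT) + " ";
        List<String> patched = new ArrayList<>();
        boolean inData = false;
        for (String line : arffLines) {
            String lower = line.trim().toLowerCase(Locale.ROOT);
            if (lower.equals("@data")) {
                inData = true; // never touch instance lines after @data
            }
            if (!inData && lower.startsWith(prefix)) {
                patched.add("@attribute " + attributeName + " string");
            } else {
                patched.add(line);
            }
        }
        return patched;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "@relation tweets",
                "@attribute text {a,b}",
                "@data",
                "a");
        forceStringAttribute(lines, "text").forEach(System.out::println);
    }
}
```

For a real file, the lines could be loaded with `Files.readAllLines` and written back with `Files.write` from `java.nio.file`.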

You can apply Weka filters to the file, such as RemoveDuplicates to remove duplicate instances. After the procedure above, you can apply the StringToWordVector filter, which will help you do sentiment analysis.
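To give an idea of what those two filters do conceptually, here is a plain-Java illustration. It does not call Weka at all: `BagOfWordsSketch`, `removeDuplicates` and `wordCounts` are hypothetical names, and the real StringToWordVector filter has many more options (as far as I know it can output either word presence or word counts, depending on how it is configured).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Conceptual illustration only; the real work would be done by Weka's filters.
class BagOfWordsSketch {

    // Drops exact duplicate tweets, keeping the first occurrence of each.
    static List<String> removeDuplicates(List<String> tweets) {
        return new ArrayList<>(new LinkedHashSet<>(tweets));
    }

    // Turns one tweet into a word -> count map (a "bag of words").
    static Map<String, Integer> wordCounts(String tweet) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a letter or a digit.
        for (String token : tweet.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "Preciso de um galaxy s5",
                "Configurando meu Galaxy s5",
                "Preciso de um galaxy s5");
        for (String tweet : removeDuplicates(tweets)) {
            System.out.println(wordCounts(tweet));
        }
    }
}
```

Each distinct word becomes one numeric attribute, which is the kind of input the classifiers can work with.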

    
31.03.2017 / 16:39