How to use deep-learning to parse forms with addresses?


I have an application into which I need to import personal data. I often receive Excel, CSV, or TXT files with fields like name, address, email, phone, etc. The formatting of the files varies, the column order varies, and sometimes there are blank fields. What can help an algorithm understand the fields is that within any given file with N entries, all entries share the same column organization; what varies is the format from one file to another, not within a file.

I can do this by hand, usually with regex-based scripts, but they always end up with a large custom-made component, i.e. I still have to handle the data manually.

How could I use JavaScript and deep learning to teach a program to recognize the fields, format them so my application can consume them, and also flag poorly filled fields when the program is confident about what type of field it should be?

Example input, where each line is an example of how columns can be in a given file:

// nome 1, nome 2, telefone, email, campos de morada
["joao", "pereira", "215548808", "[email protected]", "rua das peras", "2890", "campo alegre"]

// nome 1, nome 2, data de nascimento, email, codigo postal, morada, telefone
["maria", "conceição", "10051978", "[email protected]", "2400", "rua de porto alegre", "98337449"]

// nome completo, morada completa, mail pessoal, mail trabalho, telefone fixo, telemovel
["andreia pires", "rua do jardim nr10 3988 porto", "[email protected]", "[email protected]", "070234382", "013387484"]

And the fields that my application uses are:

nome 1 | nome 2 | email | telefone | morada | codigo postal 
    
asked by anonymous 14.05.2017 / 06:27

1 answer


This question is great, but the answer amounts almost to a full project. With deep learning you can solve your problem; with JavaScript, I don't know enough to answer. I'm going to give an answer in R that is easily adaptable to Python, and then point to some libs that may let you do this with JavaScript.

Let's go.

Data collection

As with any machine learning project, you will need a database with some information already classified. Fortunately, in your case it should be simple to build a large one without much effort.

Here is what I gathered:

  • I took the list of candidates approved in the 2014 FUVEST exam and split the names into first and last names
  • I took a list of street names in SP

I created a database that looks like this:

# A tibble: 10 × 2
        tipo                              valor
       <chr>                              <chr>
1        rua rua doutor heitor pereira carrilho
2        rua               rua hipólito vieites
3  sobrenome                     fogaca galdino
4       nome                             rafael
5        rua             rua ida câmara stefani
6  sobrenome                       alves duraes
7  sobrenome                       keiko sonoda
8  sobrenome             barcellos mano rinaldi
9       nome                             victor
10       rua                rua angelo catapano

In the end, this database has 60k observations divided among first names, last names, and streets. You can add other types of data as you wish, such as phone numbers and zip codes; I did not do that here for simplicity.
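For reference, a minimal sketch of how such a table could be assembled, assuming you already have plain-text lists with one value per line (the file names here are hypothetical, and the lowercasing just mirrors the sample above):

library(dplyr)
library(tibble)

# Hypothetical input files: one name / surname / street per line.
nomes      <- readLines("nomes.txt")
sobrenomes <- readLines("sobrenomes.txt")
ruas       <- readLines("ruas.txt")

# One row per observation: the label goes in 'tipo', the text in 'valor'.
df <- bind_rows(
  tibble(tipo = "nome",      valor = tolower(nomes)),
  tibble(tipo = "sobrenome", valor = tolower(sobrenomes)),
  tibble(tipo = "rua",       valor = tolower(ruas))
)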

Data processing

As they stand, the data are not suitable for consumption by a deep learning model. We need an array that we'll call X . This array must have 3 dimensions: (n, maxlen, len_char_tab), where:

  • n is the number of observations you have
  • maxlen is the maximum number of characters an observation can have
  • len_char_tab is the number of distinct characters in the whole database

That is, each string such as "abc" is turned into a one-hot matrix of the form

1 0 0
0 1 0
0 0 1
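Just to make that concrete, here is a tiny standalone sketch with a toy character table of only a, b and c:

chars <- c("a", "b", "c")             # toy character table
s <- unlist(strsplit("abc", ""))      # characters of one observation

# One row per character of "abc", one column per entry of the character
# table: this reproduces the 3 x 3 identity-like matrix above.
one_hot <- t(sapply(s, function(ch) as.integer(ch == chars), USE.NAMES = FALSE))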

I transformed the database into that shape as follows:

library(purrr)
library(stringr)
library(keras)   # pad_sequences() comes from here

# All distinct characters in the database.
char_table <- str_split(df$valor, "") %>%
  unlist() %>%
  unique()

# Each string becomes a vector of character indices into char_table.
vec <- map(
  df$valor,
  ~unlist(str_split(.x, "")) %>%
    map_int(~which(.x == char_table))
)

# Pad every sequence to the same length, one-hot encode (index 0 is the
# padding) and reorder the axes to (n, maxlen, len_char_tab).
maxlen <- max(map_int(vec, length))
vec <- pad_sequences(vec, maxlen = maxlen)
vec <- apply(vec, c(1, 2), function(x) as.integer(x == 0:64))
vec <- aperm(vec, c(2, 3, 1))

Here my object vec is the X array I mentioned; it has dimensions 60023 x 58 x 65.

We also need an array called Y with dimensions (n, n_types), where n is the size of your sample and n_types is the number of distinct types. Entry (i, j) of this array is 1 if observation i is of type j and 0 otherwise.

I did it this way:

# One-hot encode the labels: one column per distinct type.
all_res <- unique(df$tipo)
res <- sapply(df$tipo, function(x) as.integer(x == all_res)) %>% t()

The object res is the Y array I mentioned; it has dimensions 60023 x 3.

Model definition

Now let's use keras to define an LSTM. I will not try to explain what an LSTM is, because it is quite involved and Colah has already explained it far better than I could. Read more here.

The code to set the template is below:

library(keras)
model <- keras_model_sequential()
model %>%
  layer_lstm(units = 128, input_shape = c(58, 65)) %>%  # (maxlen, len_char_tab)
  layer_dense(3) %>%                                    # one unit per type
  layer_activation("softmax")

model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
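If you want to inspect the resulting architecture before training, you can print it:

summary(model)   # layer shapes and parameter counts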

Model training

Training the model is the easiest part:

model %>% fit(
  x = vec, y = res,
  validation_split = 0.1,   # hold out 10% of the data for validation
  shuffle = TRUE,
  batch_size = 32
)

Result:

Train on 54020 samples, validate on 6003 samples
Epoch 1/10
54020/54020 [==============================] - 372s - loss: 0.0966 - acc: 0.9707 - val_loss: 0.0070 - val_acc: 0.9992

In a single epoch the model correctly classified practically all the observations I held out for validation. Of course, my database is much simpler than yours: it is quite clean, and most of the streets start with "rua", which helps a lot. Expect worse results than this, but perhaps not much worse.

Usage

In your case, after training the model on such a database, I would apply the model to every cell of each column, see which prediction appears most often (name, surname, or address), and mark the column as being of that type.
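A minimal sketch of that idea, assuming the objects model, char_table, maxlen and all_res from the code above, and assuming the new columns only contain characters already present in char_table (unseen characters would need extra handling):

library(purrr)
library(stringr)
library(keras)

# Encode a character vector of cell values exactly as the training data was
# encoded: indices into char_table, padded, one-hot, axes reordered.
encode_values <- function(values, char_table, maxlen) {
  idx <- map(values, ~unlist(str_split(.x, "")) %>%
               map_int(~which(.x == char_table)))
  x <- pad_sequences(idx, maxlen = maxlen)
  x <- apply(x, c(1, 2), function(v) as.integer(v == 0:length(char_table)))
  aperm(x, c(2, 3, 1))
}

# Classify one column: predict every cell and keep the most frequent type.
classify_column <- function(values, model, char_table, maxlen, all_res) {
  x <- encode_values(values, char_table, maxlen)
  probs <- predict(model, x)                    # n x n_types probability matrix
  types <- all_res[apply(probs, 1, which.max)]  # predicted type per cell
  names(sort(table(types), decreasing = TRUE))[1]
}

# Example: decide what the first column of a new file holds.
# classify_column(c("joao", "maria", "andreia"), model, char_table, maxlen, all_res)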

But what about javascript?

  • You can convert trained keras models to JavaScript (for inference only) using this lib: link. I've never used it, so I do not know how good it is.

  • There's this one too: link, but I do not think it can train LSTMs.

Database

I made the database available at this link: link

You can read it in R with df <- readRDS("df.rds") .

    
11.07.2017 / 01:32