This is a cool question, but a complete answer would almost amount to a big work project. Deep learning can solve your problem. As for JavaScript, I don't know how to answer that directly. I'm going to give an answer in R that is easily adapted to Python, and then point to some libraries that might let you do the same in JavaScript.
Let's go.
Data collection
As with any machine learning project, you will need a dataset with some information already classified. Fortunately, in your case it should be simple to get a large dataset without much effort.
Here is what I gathered:
- I took the list of names of candidates approved in FUVEST 2014 and split it into first names and surnames
- I took a list of street names in SP
I built a dataset that looks like this:
# A tibble: 10 × 2
   tipo      valor
   <chr>     <chr>
 1 rua       rua doutor heitor pereira carrilho
 2 rua       rua hipólito vieites
 3 sobrenome fogaca galdino
 4 nome      rafael
 5 rua       rua ida câmara stefani
 6 sobrenome alves duraes
 7 sobrenome keiko sonoda
 8 sobrenome barcellos mano rinaldi
 9 nome      victor
10 rua       rua angelo catapano
In the end this dataset has about 60k observations split among first names (nome), surnames (sobrenome) and streets (rua). You can add other types of data as you wish, such as phone numbers, zip codes, and so on. I did not do this here for simplicity.
Data processing
In their current form, the data are not suitable for a deep-learning model. We need an array that we'll call X. This array must have 3 dimensions, (n, maxlen, len_char_tab), where:
- n is the number of observations you have
- maxlen is the maximum number of characters an observation can have
- len_char_tab is the number of distinct characters in the whole dataset
That is, I transform each sequence such as "abc" into a one-hot array of the form
1 0 0
0 1 0
0 0 1
with one row per character and one column per distinct character.
I converted the dataset into this shape as follows:
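library(keras)
library(purrr)
library(stringr)
char_table <- str_split(df$valor, "") %>%
  unlist() %>%
  unique()
# Map every character of every value to its index in char_table
vec <- map(
  df$valor,
  ~unlist(str_split(.x, "")) %>%
    map_int(~which(.x == char_table))
)
maxlen <- max(map_int(vec, length))
vec <- pad_sequences(vec, maxlen = maxlen)  # pads with 0
# One-hot encode: 0 is padding, 1 to 64 are the distinct characters
vec <- apply(vec, c(1, 2), function(x) as.integer(x == 0:64))
vec <- aperm(vec, c(2, 3, 1))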
Here my object vec is the array X described above, and it has the following dimensions: 60023 × 58 × 65 (65 because index 0 is used for padding, in addition to the 64 distinct characters).
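Since the R code is meant to be easy to adapt to Python, here is a minimal pure-Python sketch of the same encoding steps. It assumes a tiny made-up list of values rather than the real dataset, and the function name `one_hot_encode` is only illustrative. Index 0 is reserved for padding at the front, which I believe matches the default behaviour of pad_sequences, so the last dimension is the number of distinct characters plus one.

```python
def one_hot_encode(values):
    """Encode strings as a nested list of shape
    (n, maxlen, len(char_table) + 1), left-padded with index 0."""
    # Distinct characters in order of first appearance (like unique() in R).
    char_table = list(dict.fromkeys("".join(values)))
    # 1-based indices, so that 0 stays free for padding.
    index = {ch: i + 1 for i, ch in enumerate(char_table)}
    maxlen = max(len(v) for v in values)
    n_classes = len(char_table) + 1

    encoded = []
    for v in values:
        ids = [index[ch] for ch in v]
        ids = [0] * (maxlen - len(ids)) + ids  # pad at the front
        # One row per position, one column per possible index.
        encoded.append([[int(i == k) for k in range(n_classes)] for i in ids])
    return encoded, char_table


# Hypothetical sample values, not the real dataset.
X, char_table = one_hot_encode(["abc", "ab"])
shape = (len(X), len(X[0]), len(X[0][0]))
```

With a real dataset you would probably build the array with NumPy instead of nested lists, but the logic stays the same.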
We also need an array called Y with dimensions (n, n_types), where n is the size of your sample and n_types is the number of distinct types. Entry (k, i) of this array is 1 if observation k is of type i and 0 otherwise.
I built it this way:
all_res <- unique(df$tipo)
res <- sapply(df$tipo, function(x) as.integer(x == all_res)) %>% t()
The object res is the Y array I mentioned, and it has dimensions 60023 × 3.
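For reference, the same label encoding could be sketched in plain Python as below. The function name `encode_labels` and the four sample labels are assumptions made for illustration; the column order follows the order in which each type first appears, as unique() does in R.

```python
def encode_labels(types):
    """One-hot encode a list of type labels.

    Returns the (n, n_types) matrix and the list of distinct types,
    in order of first appearance.
    """
    classes = list(dict.fromkeys(types))
    Y = [[int(t == c) for c in classes] for t in types]
    return Y, classes


# Hypothetical sample labels, not the real dataset.
Y, classes = encode_labels(["rua", "rua", "sobrenome", "nome"])
```

Keeping `classes` around matters: it is what lets you map a predicted column index back to a type name later.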
Model definition
Now let's use keras to define an LSTM.
I will not try to explain what an LSTM is, because it is quite difficult and Colah has already explained it 100x better than I could. Read more here.
The code to define the model is below:
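model <- keras_model_sequential()
model %>%
  layer_lstm(units = 128, input_shape = c(58, 65)) %>%  # (maxlen, len_char_tab)
  layer_dense(3) %>%                                    # one unit per type
  layer_activation("softmax")                           # outputs class probabilities
model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)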
Model training
Training the model is the easiest part:
model %>% fit(
  x = vec, y = res,
  validation_split = 0.1,
  shuffle = TRUE,
  batch_size = 32
)
Train on 54020 samples, validate on 6003 samples
Epoch 1/10
54020/54020 [==============================] - 372s - loss: 0.0966 - acc: 0.9707 - val_loss: 0.0070 - val_acc: 0.9992
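The sample counts in the log follow from validation_split = 0.1 applied to the 60023 observations. Here is a small sketch of that arithmetic, assuming the training size is truncated to an integer and the remainder is held out for validation (the helper `split_sizes` is hypothetical, not a keras function):

```python
def split_sizes(n_samples, validation_split):
    """Return (n_train, n_validation) for a given validation fraction.

    Assumes the training size is truncated to an integer and the
    remainder is used for validation.
    """
    n_train = int(n_samples * (1.0 - validation_split))
    return n_train, n_samples - n_train


n_train, n_val = split_sizes(60023, 0.1)
```

As far as I know, keras takes the validation samples from the end of the data before shuffling, so it is worth making sure the rows are not sorted by type.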
In a single epoch the model got practically all of the validation observations right. Of course, my dataset is much simpler than yours: it is nice and tidy, and most of the streets start with "rua", which helps a lot. Expect worse results than this, but maybe not that much worse.
In your case, after training the model on a dataset, I would run the predictions on each of the columns, see which result appears most often (first name, surname or address), and label the column with that type.
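The column-labelling step can be sketched as a simple majority vote. This is a minimal illustration that assumes you already have one predicted type per cell; the function names, column names and predictions below are all made up.

```python
from collections import Counter


def label_column(predicted_types):
    """Return the most frequent predicted type and its share of the column."""
    counts = Counter(predicted_types)
    best, n = counts.most_common(1)[0]
    return best, n / len(predicted_types)


def label_columns(predictions_by_column):
    """Assign to each column the type predicted most often for its cells."""
    return {
        col: label_column(preds)[0]
        for col, preds in predictions_by_column.items()
    }


# Hypothetical per-cell predictions for two columns.
predictions = {
    "col_a": ["nome", "nome", "sobrenome", "nome"],
    "col_b": ["rua", "rua", "rua", "nome"],
}
column_types = label_columns(predictions)
```

The share returned by `label_column` can serve as a rough confidence: a column where the winning type barely reaches half of the cells probably deserves a manual check.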
But what about javascript?
You can port models trained in keras to JavaScript (for inference only) using this library: link
I've never used it, so I don't know whether it's any good.
There is also this one: link
but I don't think it can train LSTMs.
I've made the dataset available at this link
You can read it in R with df <- readRDS("df.rds")