Simultaneous threads (parallel processing) in R and serialized writing in SQLite

2

Hello everyone. I'm trying to develop a code that makes it possible to parser HTML files using the R language and, consecutively, write the extracted HTML data to the SQLite database in a serialized way.

In order to perform parallel processing, I am using the furrr package (this package is similar to purrr , with a focus on functional programming, yet its advantage is that it allows execution in parallel (multi-thread) of the script. Now to make the connection to SQLite and send the queries, I am using RSQLite and DBI respectively.

In the code below, the execution occurs perfectly if I put the plan(sequential) function that establishes a serialized execution (in my case, it would promote the handling of the HTML files and the recording in SQLite in a serialized way). But when I enable plan(multiprocess) the script only executes about 25% of the writes and returns an error: Error in result_bind(res@ptr, params) : database is locked In addition: Warning message: . Apparently, this indicates, in my view, that some process (thread) gave error after finding SQLite blocked by another process (thread) running.

In the furrr documentation the "plan" can be configured as follows: # You set a "plan" for how the code should run. The easiest is 'multiprocess' # On Mac this picks plan(multicore) and on Windows this picks plan(multisession) ( link )

THE CHALLENGE: My idea is that the handling of HTML files be done in parallel and the recordings serialized in SQLite, since it does not allow simultaneous writes. I believe this will lead to performance gains, especially when there is a need to handle more complex HTML data parser that is slow to complete.

Some possible paths to think about:

1) Do you have to configure the RSQLite::SQLite() parameters to solve the problem?

2) Is it necessary to use some function for error handling like purrr::safely during the execution of the DBI::dbExecute function so that it stays trying to execute the query in SQLite until it can? Or purrr::safely would have to be used during the SQLite connection in drive_sqlite() function?

3) Is there a function of the DBI package to help solve this problem?

4) Any type of caching in SQLite so that it stores the recursed R queries and executes them serialized? ( link )

I've made several attempts and walked different paths ... but to no avail so far. Some of the articles I've browsed on the internet ( link1 , #, , , link4 ).

library(furrr)
library(RSQLite)
library(DBI)

drive_sqlite <- function(){ DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db") }

DBI::dbExecute(drive_sqlite(),"CREATE TABLE tabela_teste (coluna1  TEXT NOT NULL,
                                                          coluna2  TEXT NOT NULL,
                                                          coluna3  TEXT NOT NULL,
                                                          coluna4  TEXT NOT NULL);")

tabela_dados <- tibble::tibble(coluna1 = c(1:3000),
                               coluna2 = c(1:3000),
                               coluna3 = c(1:3000),
                               coluna4 = c(1:3000))

funcao <- function(coluna1, coluna2, coluna3, coluna4) { 

            DBI::dbExecute(drive_sqlite(), "INSERT INTO tabela_teste VALUES(:coluna1,
                                                                            :coluna2,
                                                                            :coluna3,
                                                                            :coluna4);",
                                         params = list(coluna1 = as.character(coluna1),
                                                       coluna2 = as.character(coluna2),
                                                       coluna3 = as.character(coluna3),
                                                       coluna4 = as.character(coluna4)))

            DBI::dbDisconnect(drive_sqlite())

}

#plan(sequential)

plan(multiprocess)

#plan(multisession)

furrr::future_pmap(tabela_dados, funcao, .progress = TRUE)
    
asked by anonymous 21.10.2018 / 21:22

0 answers