Hello everyone. I'm trying to develop code that parses HTML files using the R language and then writes the extracted data to a SQLite database in a serialized way.
To perform the processing in parallel, I am using the furrr package (it is similar to purrr, with a focus on functional programming, but with the advantage that it allows the script to run in parallel, i.e. multi-threaded). To connect to SQLite and send the queries, I am using RSQLite and DBI, respectively.
In the code below, execution works perfectly if I set plan(sequential), which establishes serialized execution (in my case, handling the HTML files and writing to SQLite one at a time). But when I enable plan(multiprocess), the script completes only about 25% of the writes and returns an error:
Error in result_bind(res@ptr, params) : database is locked In addition: Warning message:
Apparently this indicates that some process (thread) failed after finding SQLite locked by another running process (thread).
In the furrr documentation, the plan can be configured as follows:
# You set a "plan" for how the code should run. The easiest is 'multiprocess'
# On Mac this picks plan(multicore) and on Windows this picks plan(multisession)
( link )
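For what it is worth, I understand that instead of the generic plan(multiprocess) I can also pick the backend explicitly; something like this (the worker count below is just an arbitrary example):

library(future)

# Explicitly choose the multisession backend instead of the generic
# 'multiprocess' shortcut; the number of workers here is arbitrary.
plan(multisession, workers = availableCores() - 1)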
THE CHALLENGE: My idea is for the handling of the HTML files to be done in parallel and for the writes to SQLite to be serialized, since SQLite does not allow simultaneous writes. I believe this will lead to performance gains, especially when the HTML parsing is more complex and slow to complete.
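To make the idea concrete, this is roughly the pattern I have in mind (just an untested sketch: parse_html_file() and the html/ directory are hypothetical stand-ins for my real parser and data, and the target table would need matching columns):

library(furrr)
library(DBI)
library(RSQLite)

plan(multisession)

# Hypothetical parser: reads one HTML file and returns a one-row data frame.
parse_html_file <- function(path) {
  # ...the real parsing (e.g. with xml2/rvest) would go here...
  data.frame(arquivo = path, conteudo = "exemplo", stringsAsFactors = FALSE)
}

arquivos <- list.files("html/", full.names = TRUE)

# Parse in parallel: no database access happens inside the workers.
resultados <- furrr::future_map_dfr(arquivos, parse_html_file, .progress = TRUE)

# Then write everything serially from the main session, over a single connection.
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")
DBI::dbWriteTable(con, "tabela_teste", resultados, append = TRUE)
DBI::dbDisconnect(con)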
Some possible paths to think about:
1) Do I have to configure the RSQLite::SQLite() parameters to solve the problem? (see the sketch after this list)
2) Is it necessary to use some error-handling function such as purrr::safely around DBI::dbExecute so that it keeps retrying the query in SQLite until it succeeds? Or would purrr::safely have to be used around the SQLite connection in the drive_sqlite() function? (see the sketch after this list)
3) Is there a function in the DBI package that helps solve this problem?
4) Is there any kind of caching in SQLite so that it stores the received queries and executes them serialized? ( link )
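For paths 1 and 2, this is the kind of thing I have been experimenting with (an untested sketch; I am not sure whether the busy_timeout PRAGMA or the retry loop is the right approach):

library(DBI)
library(RSQLite)
library(purrr)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")

# Path 1: ask SQLite to wait up to 5 seconds for a lock to clear instead of
# failing immediately (I am not sure this alone avoids the error).
DBI::dbExecute(con, "PRAGMA busy_timeout = 5000")

# Path 2: wrap dbExecute with purrr::safely and retry a few times when the
# database is locked.
safe_execute <- purrr::safely(DBI::dbExecute)

insert_with_retry <- function(con, sql, params, tries = 5) {
  for (i in seq_len(tries)) {
    res <- safe_execute(con, sql, params = params)
    if (is.null(res$error)) return(res$result)
    Sys.sleep(0.5)  # wait a little before trying again
  }
  stop(res$error)
}

DBI::dbDisconnect(con)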
I've made several attempts and gone down different paths, but to no avail so far. These are some of the articles I've browsed on the internet: ( link1 , link4 ).
library(furrr)
library(RSQLite)
library(DBI)

# Opens a connection to the SQLite database file.
drive_sqlite <- function() {
  DBI::dbConnect(RSQLite::SQLite(), dbname = "bd_teste.db")
}

# Create the test table.
con <- drive_sqlite()
DBI::dbExecute(con, "CREATE TABLE tabela_teste (coluna1 TEXT NOT NULL,
                                                coluna2 TEXT NOT NULL,
                                                coluna3 TEXT NOT NULL,
                                                coluna4 TEXT NOT NULL);")
DBI::dbDisconnect(con)

# Toy data standing in for the parsed HTML results.
tabela_dados <- tibble::tibble(coluna1 = c(1:3000),
                               coluna2 = c(1:3000),
                               coluna3 = c(1:3000),
                               coluna4 = c(1:3000))

# Inserts one row; each call opens and closes its own connection.
funcao <- function(coluna1, coluna2, coluna3, coluna4) {
  con <- drive_sqlite()
  DBI::dbExecute(con, "INSERT INTO tabela_teste VALUES(:coluna1,
                                                       :coluna2,
                                                       :coluna3,
                                                       :coluna4);",
                 params = list(coluna1 = as.character(coluna1),
                               coluna2 = as.character(coluna2),
                               coluna3 = as.character(coluna3),
                               coluna4 = as.character(coluna4)))
  DBI::dbDisconnect(con)
}

#plan(sequential)
plan(multiprocess)
#plan(multisession)

furrr::future_pmap(tabela_dados, funcao, .progress = TRUE)