Search for words in almost 1 million files


I'm looking for methods or ideas that can help me solve my problem. I have a folder/file structure that my program generates; so far so good. The problem is that this folder already contains more than 900 thousand files. Each of these files is very small, about 1 KB, with a header and a text body.

Currently the search used by the software is basic: it literally opens the files one by one and searches for the word in each. But imagine the delay of that search... on an SSD, searching for the word saude made me wait more than 8 minutes.

I did some testing to see if reducing the number of files would help, and noticed that it would help only a little; the search would still take minutes.

The current idea (without using any database) is to index manually, with an external process responsible for it: words of 3 or more characters, indexed per folder, reducing the search from millions of files to a few thousand. Even so, the search could still take a few seconds in some cases.
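To make that idea concrete, here is a minimal sketch of such a manual index in C++, kept in memory for simplicity (in practice the external process would serialize one index file per folder). The directory layout, names and the in-memory map are illustrative assumptions, not the actual implementation:

```cpp
// Sketch of an inverted index: maps each word of 3+ characters to the
// files that contain it, so a search becomes a hash lookup instead of
// opening every file. Assumes plain-text files under a root folder.
#include <cctype>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

namespace fs = std::filesystem;

std::unordered_map<std::string, std::vector<std::string>>
build_index(const fs::path& root) {
    std::unordered_map<std::string, std::vector<std::string>> index;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file()) continue;
        std::ifstream in(entry.path());
        std::unordered_set<std::string> seen;  // each word counted once per file
        std::string word;
        char c;
        auto flush = [&] {
            if (word.size() >= 3 && seen.insert(word).second)
                index[word].push_back(entry.path().string());
            word.clear();
        };
        while (in.get(c)) {
            if (std::isalnum(static_cast<unsigned char>(c)))
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            else
                flush();
        }
        flush();  // last word of the file
    }
    return index;
}

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: indexer <root-folder> <word>\n";
        return 1;
    }
    auto index = build_index(argv[1]);
    for (const auto& file : index[argv[2]])  // O(1) lookup, no file scanning
        std::cout << file << '\n';
}
```

The point of the sketch is only the trade-off: building the index touches every file once, but each subsequent search no longer has to.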

I've also been thinking about how Windows indexing of file content works.


I made some small tests of what actually takes time in the search:

Time to open the files + read: 465.948. Found: 2921
Looking for: 264.318. Found: 2921

Time to open the files + read: 788.992. Found: 2921
Looking for: 599.093. Found: 2921

Time to open the files + read: 834.300. Found: 2921
Looking for: 572.496. Found: 2921

Time to open the files + read: 709.464. Found: 2921
Looking for: 539.053. Found: 2921

Time to open the files + read: 857.443. Found: 2921
Looking for: 761.121. Found: 2921

Time to open the files + read: 909.440. Found: 2921
Looking for: 602.000. Found: 2921

Time to open the files + read: 865.306. Found: 2921
Looking for: 499.046. Found: 2921

The test was done on only 1000 files. The first value of each pair includes the time to open and read the file; the second is only the time spent searching for the word (in my test I used strstr).
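For reference, a measurement loop roughly equivalent to the one described could look like the sketch below. The file list, the search word and the millisecond unit are assumptions; only the split between "open + read" and the strstr call itself comes from the description above:

```cpp
// Sketch of the timing test: measure open + read separately from strstr.
#include <chrono>
#include <cstring>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> files = {/* ~1000 file paths (assumption) */};
    const char* needle = "saude";

    using clock = std::chrono::steady_clock;
    double read_ms = 0, search_ms = 0;
    int found = 0;

    for (const auto& path : files) {
        auto t0 = clock::now();
        std::ifstream in(path, std::ios::binary);
        std::ostringstream ss;
        ss << in.rdbuf();                 // open + read the whole file
        std::string content = ss.str();
        auto t1 = clock::now();

        if (std::strstr(content.c_str(), needle) != nullptr)  // the actual search
            ++found;
        auto t2 = clock::now();

        read_ms   += std::chrono::duration<double, std::milli>(t1 - t0).count();
        search_ms += std::chrono::duration<double, std::milli>(t2 - t1).count();
    }

    std::cout << "Time to open the files + read: " << read_ms
              << ". Found: " << found << '\n';
    std::cout << "Looking for: " << search_ms
              << ". Found: " << found << '\n';
}
```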

Is there any method to make this faster without using a database? I don't know if that would solve the case, since there are on average about 200 characters per file, which means millions of characters to search through. If it is not possible without a database, what would be the general approach? Can a database handle this volume of data well?

asked by anonymous 19.12.2018 / 20:12

1 answer


@Kevin Kouketsu, the best technique for this scenario is a full-text search engine. There is a relational database that has this feature: PostgreSQL. There are also Solr and ElasticSearch, which provide full-text search and, in particular, the real-time indexing you mentioned above; both of them are based on the Lucene project.
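As a rough illustration of the PostgreSQL route, assuming the file contents were loaded into a hypothetical documents(path, body) table, a full-text query could be issued from C++ via libpq like this (the connection string, table and column names are assumptions):

```cpp
// Illustrative only: PostgreSQL full-text search from C++ with libpq.
// A GIN index on to_tsvector('simple', body) is what keeps this fast.
#include <libpq-fe.h>
#include <iostream>

int main() {
    PGconn* conn = PQconnectdb("dbname=docs");  // hypothetical database
    if (PQstatus(conn) != CONNECTION_OK) {
        std::cerr << "connection failed: " << PQerrorMessage(conn);
        PQfinish(conn);
        return 1;
    }

    const char* word = "saude";
    const char* params[1] = {word};
    // @@ matches the document's tsvector against the parsed search query.
    PGresult* res = PQexecParams(conn,
        "SELECT path FROM documents "
        "WHERE to_tsvector('simple', body) @@ plainto_tsquery('simple', $1)",
        1, nullptr, params, nullptr, nullptr, 0);

    if (PQresultStatus(res) == PGRES_TUPLES_OK) {
        for (int i = 0; i < PQntuples(res); ++i)
            std::cout << PQgetvalue(res, i, 0) << '\n';
    } else {
        std::cerr << "query failed: " << PQerrorMessage(conn);
    }

    PQclear(res);
    PQfinish(conn);
}
```

Solr and ElasticSearch expose the same kind of indexed search over HTTP instead of SQL, which may fit better if you do not want to manage a relational schema.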

19.12.2018 / 20:49