My script that searches one file for strings taken from another is taking too long

My script loads file 1 (named by comando1) and stores the ID column of every row in a list (the second field, index 1 in the code). It then searches for each of those strings in file 2 (named by comando2). The problem is that file 1 has about 3,000,000,000 lines and file 2 has about 25,000,000,000 lines, so I need a program that is as fast as possible.

Example of file 1:

sp|Q8EES8|Y2290_SHEON   G0B6XZL01A4P7W_6    67.77   121 39  0   128 248 1   121 3e-06    188

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01ARMHK_3    57.58   132 56  0   169 300 3   134 1e-06    180

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01B0A9N_1    47.37   152 72  3   124 269 1   150 5e-06    150

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01AS06A_2    51.40   107 52  0   173 279 46  152 9e-03    136

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01BI3RW_5    41.10   146 85  1   50  194 10  155 3e-03    129

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01DQILJ_4    45.95   111 60  0   176 286 46  156 1e-02    117

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01ATWAG_1    35.26   156 95  2   19  173 2   152 5e-02    110

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01AYTV2_4    37.88   132 80  2   83  212 32  163 7e-02    102

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01C7I53_6    44.12   102 56  1   112 212 1   102 2e-02   92.4

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01B9TOA_5    42.98   114 64  1   4   117 41  153 1e-01   86.3

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01DQILJ_5    54.93   71  29  2   129 199 1   68  5e-01   84.7

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01E13OT_2    38.10   105 65  0   1   105 6   110 6e-01   84.0

sp|Q8EES8|Y2290_SHEON   G0B6XZL01EGX3B_4    33.56   149 91  4   46  189 1   146 4e-01   79.0

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01EEMHX_3    40.48   84  50  0   173 256 31  114 7e-01   78.6

sp|Q8EES8|Y2290_SHEON   G0B6XZL01BDBAI_3    52.83   53  25  0   241 293 2   54  3e-01   74.7

sp|Q8EES8|Y2290_SHEON   G0B6XZL01ETJ9Y_6    51.67   60  29  0   242 301 1   60  3e-01   75.5

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01ARVZB_6    43.04   79  44  1   216 293 8   86  9e-01   70.5

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01EFORR_1    54.55   55  25  0   219 273 3   57  1e-01   66.6

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01DWDKC_1    47.27   55  29  0   219 273 5   59  1e-01   66.6

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01AL4M3_1    47.27   55  29  0   219 273 5   59  3e-01   65.5

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01B16CL_2    45.83   48  26  0   111 158 66  113 2e-03   57.4

sp|Q8EES8|Y2290_SHEON   G0B6XZL01D8VWQ_6    37.18   78  49  0   169 246 6   83  4e-03   55.5

sp|Q8EES8|Y2290_SHEON   G0B6XZL01D8VWQ_5    61.11   36  14  0   176 211 13  48  5e-03   55.5

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01EN153_2    39.13   69  40  2   104 171 88  155 1e-02   55.5

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01DJ7AX_5    32.65   98  65  1   4   100 96  193 2e-04   55.5

sp|Q8EES8|Y2290_SHEON   GRQ41VZ01AG6GR_1    27.04   159 94  6   5   147 19  171 3e-04   53.9

Example of file 2:

">G0B6XZL01A4P7W_6 length=363 xy=0346_3114 region=1 run=R_2011_04_01_12_14_13_
LVIDTRNEYEVEIGTFAGAVNPHTNSFREFPDWVEQNLDPKKHKKVAXFCTGGIRCEKST
SLLVSRGFEDVWHLKGGILNYLEQTPEEDTRWEGECFVFDSRVAVNHQLEKGSYDQCFAC
R"

">GRQ41VZ01ARMHK_3 length=428 xy=0197_2054 region=1 run=R_2010_11_26_14_54_57_
DKKKRKLQVFCTGGIRCEKASSLMKKEGFENVYHLKGGILKYFESVNEDDSLWSGECFVF
DDRVSVDQNLEKGSYDMCHGCRMPITINDKKTDKYIRGVACPSCFDKTTEEQKNRYMSRQ
KQVDLAKKKKYKNILGPKKRSY"

">GRQ41VZ01B0A9N_1 length=466 xy=0706_2169 region=1 run=R_2010_11_26_14_54_57_
DPDTLVIDTRNSYETAIGSFEGAIDPSTESFRDFPQWAESTLRPLIEEKGSKRIAMFCTG
GIRCEKASSYLQQQGFGEVHHLRGGILKYFEQVPEAESRWQGECFVLINGWR*TTGWNLE
STAFATPAACRCQPSNANCRATSKGVQCVHVRGSLX"

">GRQ41VZ01BWIB5_5 length=457 xy=0663_0835 region=1 run=R_2010_11_26_14_54_57_
RHPHIKDKVPQ*MFHPELFDALSESFVELSLFVSVHLVFRRSSTCLRVLDIFCFSRSECL
YPCLP*NIK*P*IIIVIPKRNNITTPN*SINVDEAMALLVE*PQ*VLTDPH*RVQEQRNQ
QTY**FQHFDYAFLWHHRSSPDTLHE*HNALVX"

">GRQ41VZ01BWIB5_3 length=457 xy=0663_0835 region=1 run=R_2010_11_26_14_54_57_
TSALCHS*RVSGDDR*CHKKA*SKCWNY*YVCWFLCSWTL**GSVKTYWGYSTSSAMASS
TFIDQLGVVMLFLFGITMIIQGHFIFHGKHGYKHSEREKQKMSKTRKQVEDLLKTK*TLT
KRESSTKDSESASNNSGWNIY*GTLSFI*G*R"

">GRQ41VZ01BDBOS_6 length=518 xy=0444_2778 region=1 run=R_2010_11_26_14_54_57_
VVNLVDT*KTLMLYLKAQRNYKAS*KNSCIVW*SYFI*SSY*NRRLS*CRFSSIYLGGRY
*RRVQKYRYVI**TIQDEKIAYATYCYNKNR*A*R*HWFILCS*LFTR*EK*YSYN***P
KYL*RIYGIHR*T*YLGFY*VG*K**RRNKLSYCQRIY*YRCYAIILLKHTIX"

Code:

palavras=[]
paragrafo=[]
palavraslinha=[]
x=[]
z=[]
i=0
j=0
k=[]        
q=[]        
w=[]        
e=['>']
t=int(0)
f=0
p= int (0)
fasta='.fasta'
print('All files must be in the same directory. Just type the file name; the ".fasta" extension is added automatically.')
comando1=input('Enter the name of the parameter file: ')
comando2=input('Enter the name of the search file: ')
comando3=input('Enter the name of the file to be created: ')
paragrafo_fim=[]
ref_arquivo1 = open(comando2+fasta,"r")
linha = ref_arquivo1.read()
with open(comando1+fasta) as f:
    for line in f:
        x.append(line.split())
        z=x[i][1]
        k.append(z)

        i+=1
    x=[]
    z=[]
while linha:
    palavras.append(linha.split('>'))
    while t<len(k):
        w.append(k[t])
        while p<len(palavras[0]):
            if w[t] in palavras[0][p]:
                arquivofim = open(comando3+fasta,'a')
                arquivofim.write(e[0]+palavras[0][p])
                del palavras[0][p]
                p+=1 
                #print (w[t])
                arquivofim.close()
            else:
                p+=1
        print(t)
        t= t+1
        p=0
    linha = ref_arquivo1.readline()
print('Processing finished')
ref_arquivo1.close()
    
asked by anonymous 18.11.2018 / 15:12

1 answer

From what I understand, you are filtering. If so, the problem is that you are loading far too much into memory unnecessarily. Python is not the best tool for this, so consider an ETL tool such as Kettle.
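To illustrate the memory point, here is a minimal sketch of collecting only the IDs from file 1 while reading it line by line. It assumes the ID is the second whitespace-separated field, as in the sample rows of the question; the function name load_keys and the shortened sample rows are hypothetical. Storing the IDs in a set removes duplicates and makes each later lookup roughly constant time, whereas a list has to be scanned on every lookup.

```python
def load_keys(lines):
    """Collect the second whitespace-separated field of each line into a set."""
    keys = set()
    for line in lines:
        fields = line.split(maxsplit=2)
        # Skip blank or malformed lines that have no second field.
        if len(fields) > 1:
            keys.add(fields[1])
    return keys


# Shortened rows in the style of file 1 (assumption: the real rows have more columns).
sample = [
    "sp|Q8EES8|Y2290_SHEON   G0B6XZL01A4P7W_6    67.77   121",
    "sp|Q8EES8|Y2290_SHEON   GRQ41VZ01ARMHK_3    57.58   132",
    "sp|Q8EES8|Y2290_SHEON   G0B6XZL01A4P7W_6    67.77   121",  # duplicate ID
]
keys = load_keys(sample)
print(len(keys))
print("GRQ41VZ01ARMHK_3" in keys)
```

Because load_keys accepts any iterable of lines, an open file object can be passed directly, so file 1 never has to be read into memory as a whole.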

If I understood correctly, a cleaned-up version of your code would look like this:

fasta='.fasta'
print('All files must be in the same directory. Just type the file name; the ".fasta" extension is added automatically.')
comando1=input('Enter the name of the parameter file: ')
comando2=input('Enter the name of the search file: ')
comando3=input('Enter the name of the file to be created: ')


novo_f = open(comando3 + fasta, 'a')

# build a set of the IDs to keep (set lookups are O(1) on average, list lookups are O(n))
chaves = set()
with open(comando1+fasta) as f:
    for linha in f:
        chaves.add(linha.split(maxsplit=2)[1])

salvar = False
with open(comando2+fasta,"r") as f:
    for linha in f:
        # check whether this line starts a new block of lines
        if linha.startswith('"'):
            # check whether the ID is among the wanted keys, slicing off
            # everything else, which is not needed. If it is, start writing
            if linha[2:linha.find(' ')] in chaves:
                salvar = True

        if salvar:
            novo_f.write(linha)

            # end of the block of lines
            if linha.endswith('"\n') or linha.endswith('"'):
                salvar = False

novo_f.close()
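The same streaming idea can be written as a reusable function, which makes it easy to try on a small in-memory sample before running it on the real files. This is only a sketch under assumptions: a record is taken to start at a line beginning with ">" (optionally preceded by a double quote, as in the sample of file 2) and to run until the next such line. The name filter_records and the IDs A_1 and B_2 are hypothetical.

```python
import io


def filter_records(records, keys, out):
    """Copy to `out` every record whose header ID is in `keys`.

    Returns the number of records written. Only one line is held in
    memory at a time.
    """
    keep = False
    written = 0
    for line in records:
        # Tolerate an optional leading double quote before the '>' marker.
        header = line.lstrip('"')
        if header.startswith(">"):
            parts = header[1:].split(maxsplit=1)
            # The ID is the text between '>' and the first whitespace.
            keep = bool(parts) and parts[0] in keys
            if keep:
                written += 1
        if keep:
            out.write(line)
    return written


# Tiny in-memory sample in the style of file 2 (hypothetical IDs).
sample = io.StringIO('">A_1 length=3\nAAA"\n\n">B_2 length=3\nCCC"\n')
out = io.StringIO()
count = filter_records(sample, {"B_2"}, out)
print(count)
print(out.getvalue())
```

With real data, records and out would be file objects opened with open(), and keys would be the set of IDs read from file 1.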
    
22.11.2018 / 17:17