My script loads file 1 (comando1) and stores the second column of every row in a list. It then searches file 2 (comando2) for each of those strings and writes every matching record to a new file. The problem is that file 1 has about 3,000,000,000 lines and file 2 has about 25,000,000,000 lines, so I need a program that is as fast as possible.
Example of file 1:
sp|Q8EES8|Y2290_SHEON G0B6XZL01A4P7W_6 67.77 121 39 0 128 248 1 121 3e-06 188
sp|Q8EES8|Y2290_SHEON GRQ41VZ01ARMHK_3 57.58 132 56 0 169 300 3 134 1e-06 180
sp|Q8EES8|Y2290_SHEON GRQ41VZ01B0A9N_1 47.37 152 72 3 124 269 1 150 5e-06 150
sp|Q8EES8|Y2290_SHEON GRQ41VZ01AS06A_2 51.40 107 52 0 173 279 46 152 9e-03 136
sp|Q8EES8|Y2290_SHEON GRQ41VZ01BI3RW_5 41.10 146 85 1 50 194 10 155 3e-03 129
sp|Q8EES8|Y2290_SHEON GRQ41VZ01DQILJ_4 45.95 111 60 0 176 286 46 156 1e-02 117
sp|Q8EES8|Y2290_SHEON GRQ41VZ01ATWAG_1 35.26 156 95 2 19 173 2 152 5e-02 110
sp|Q8EES8|Y2290_SHEON GRQ41VZ01AYTV2_4 37.88 132 80 2 83 212 32 163 7e-02 102
sp|Q8EES8|Y2290_SHEON GRQ41VZ01C7I53_6 44.12 102 56 1 112 212 1 102 2e-02 92.4
sp|Q8EES8|Y2290_SHEON GRQ41VZ01B9TOA_5 42.98 114 64 1 4 117 41 153 1e-01 86.3
sp|Q8EES8|Y2290_SHEON GRQ41VZ01DQILJ_5 54.93 71 29 2 129 199 1 68 5e-01 84.7
sp|Q8EES8|Y2290_SHEON GRQ41VZ01E13OT_2 38.10 105 65 0 1 105 6 110 6e-01 84.0
sp|Q8EES8|Y2290_SHEON G0B6XZL01EGX3B_4 33.56 149 91 4 46 189 1 146 4e-01 79.0
sp|Q8EES8|Y2290_SHEON GRQ41VZ01EEMHX_3 40.48 84 50 0 173 256 31 114 7e-01 78.6
sp|Q8EES8|Y2290_SHEON G0B6XZL01BDBAI_3 52.83 53 25 0 241 293 2 54 3e-01 74.7
sp|Q8EES8|Y2290_SHEON G0B6XZL01ETJ9Y_6 51.67 60 29 0 242 301 1 60 3e-01 75.5
sp|Q8EES8|Y2290_SHEON GRQ41VZ01ARVZB_6 43.04 79 44 1 216 293 8 86 9e-01 70.5
sp|Q8EES8|Y2290_SHEON GRQ41VZ01EFORR_1 54.55 55 25 0 219 273 3 57 1e-01 66.6
sp|Q8EES8|Y2290_SHEON GRQ41VZ01DWDKC_1 47.27 55 29 0 219 273 5 59 1e-01 66.6
sp|Q8EES8|Y2290_SHEON GRQ41VZ01AL4M3_1 47.27 55 29 0 219 273 5 59 3e-01 65.5
sp|Q8EES8|Y2290_SHEON GRQ41VZ01B16CL_2 45.83 48 26 0 111 158 66 113 2e-03 57.4
sp|Q8EES8|Y2290_SHEON G0B6XZL01D8VWQ_6 37.18 78 49 0 169 246 6 83 4e-03 55.5
sp|Q8EES8|Y2290_SHEON G0B6XZL01D8VWQ_5 61.11 36 14 0 176 211 13 48 5e-03 55.5
sp|Q8EES8|Y2290_SHEON GRQ41VZ01EN153_2 39.13 69 40 2 104 171 88 155 1e-02 55.5
sp|Q8EES8|Y2290_SHEON GRQ41VZ01DJ7AX_5 32.65 98 65 1 4 100 96 193 2e-04 55.5
sp|Q8EES8|Y2290_SHEON GRQ41VZ01AG6GR_1 27.04 159 94 6 5 147 19 171 3e-04 53.9
Example of file 2:
">G0B6XZL01A4P7W_6 length=363 xy=0346_3114 region=1 run=R_2011_04_01_12_14_13_
LVIDTRNEYEVEIGTFAGAVNPHTNSFREFPDWVEQNLDPKKHKKVAXFCTGGIRCEKST
SLLVSRGFEDVWHLKGGILNYLEQTPEEDTRWEGECFVFDSRVAVNHQLEKGSYDQCFAC
R"
">GRQ41VZ01ARMHK_3 length=428 xy=0197_2054 region=1 run=R_2010_11_26_14_54_57_
DKKKRKLQVFCTGGIRCEKASSLMKKEGFENVYHLKGGILKYFESVNEDDSLWSGECFVF
DDRVSVDQNLEKGSYDMCHGCRMPITINDKKTDKYIRGVACPSCFDKTTEEQKNRYMSRQ
KQVDLAKKKKYKNILGPKKRSY"
">GRQ41VZ01B0A9N_1 length=466 xy=0706_2169 region=1 run=R_2010_11_26_14_54_57_
DPDTLVIDTRNSYETAIGSFEGAIDPSTESFRDFPQWAESTLRPLIEEKGSKRIAMFCTG
GIRCEKASSYLQQQGFGEVHHLRGGILKYFEQVPEAESRWQGECFVLINGWR*TTGWNLE
STAFATPAACRCQPSNANCRATSKGVQCVHVRGSLX"
">GRQ41VZ01BWIB5_5 length=457 xy=0663_0835 region=1 run=R_2010_11_26_14_54_57_
RHPHIKDKVPQ*MFHPELFDALSESFVELSLFVSVHLVFRRSSTCLRVLDIFCFSRSECL
YPCLP*NIK*P*IIIVIPKRNNITTPN*SINVDEAMALLVE*PQ*VLTDPH*RVQEQRNQ
QTY**FQHFDYAFLWHHRSSPDTLHE*HNALVX"
">GRQ41VZ01BWIB5_3 length=457 xy=0663_0835 region=1 run=R_2010_11_26_14_54_57_
TSALCHS*RVSGDDR*CHKKA*SKCWNY*YVCWFLCSWTL**GSVKTYWGYSTSSAMASS
TFIDQLGVVMLFLFGITMIIQGHFIFHGKHGYKHSEREKQKMSKTRKQVEDLLKTK*TLT
KRESSTKDSESASNNSGWNIY*GTLSFI*G*R"
">GRQ41VZ01BDBOS_6 length=518 xy=0444_2778 region=1 run=R_2010_11_26_14_54_57_
VVNLVDT*KTLMLYLKAQRNYKAS*KNSCIVW*SYFI*SSY*NRRLS*CRFSSIYLGGRY
*RRVQKYRYVI**TIQDEKIAYATYCYNKNR*A*R*HWFILCS*LFTR*EK*YSYN***P
KYL*RIYGIHR*T*YLGFY*VG*K**RRNKLSYCQRIY*YRCYAIILLKHTIX"
Code:
palavras=[]
paragrafo=[]
palavraslinha=[]
x=[]
z=[]
i=0
j=0
k=[]
q=[]
w=[]
e=['>']
t=int(0)
f=0
p= int (0)
fasta='.fasta'
print('All files must be in the same directory. Just type the file name; '
      'the ".fasta" extension is added automatically.')
comando1=input('Enter the name of the parameter file: ')
comando2=input('Enter the name of the search file: ')
comando3=input('Enter the name of the file to be created: ')
paragrafo_fim=[]
ref_arquivo1 = open(comando2+fasta,"r")
linha = ref_arquivo1.read()
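# Collect the second column (the read ID) of every line of file 1 in k.
with open(comando1+fasta) as f:
    for line in f:
        x.append(line.split())
        z=x[i][1]
        k.append(z)
        i+=1
x=[]
z=[]
while linha:
    # linha holds the whole of file 2; split it into records at each '>'.
    palavras.append(linha.split('>'))
    while t<len(k):
        w.append(k[t])
        # Scan every remaining record for the current ID (substring match).
        while p<len(palavras[0]):
            if w[t] in palavras[0][p]:
                arquivofim = open(comando3+fasta,'a')
                arquivofim.write(e[0]+palavras[0][p])
                # After the deletion the next record moves into index p,
                # so p is not advanced here.
                del palavras[0][p]
                #print (w[t])
                arquivofim.close()
            else:
                p+=1
        print(t)
        t= t+1
        p=0
    # read() already consumed the file, so this returns '' and ends the loop.
    linha = ref_arquivo1.readline()
print('Processing finished')
ref_arquivo1.close()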
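The code above compares every ID from file 1 against every record of file 2, so its cost grows with the product of the two file sizes. One possible restructuring is sketched below, as an illustration under stated assumptions rather than a tested solution. It keeps the IDs in a set and streams file 2 line by line, so each record is checked once. The function names `load_ids` and `filter_fasta` are hypothetical. The sketch assumes that the unique IDs fit in memory (which may not hold for 3,000,000,000 lines), that the ID is the first token of each header line, and that an exact match on that token is acceptable in place of the substring test used above. It also overwrites the output file instead of appending to it.

```python
def load_ids(hits_path):
    """Collect the second whitespace-separated column of every line in a set."""
    ids = set()
    with open(hits_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) > 1:  # skip blank or one-column lines
                ids.add(fields[1])
    return ids


def filter_fasta(fasta_path, out_path, ids):
    """Stream a FASTA file and copy the records whose header ID is in ids.

    Returns the number of records written.
    """
    written = 0
    keep = False  # True while the current record should be copied
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Assumption: a leading double quote, as in the sample of
            # file 2, is not part of the header and can be ignored.
            header = line.lstrip('"')
            if header.startswith(">"):
                tokens = header[1:].split()
                # Exact set lookup on the first header token.
                keep = bool(tokens) and tokens[0] in ids
                if keep:
                    written += 1
            if keep:
                dst.write(line)
    return written
```

It could be called as `filter_fasta(comando2 + fasta, comando3 + fasta, load_ids(comando1 + fasta))`. Memory use depends only on the number of unique IDs, because file 2 is never held in memory as a whole.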