Reading .CAP files efficiently with Python

2

I have some .CAP files that came from capturing packets with tcpdump. When I try to open them with Wireshark, the machine becomes very slow, presumably because it tries to load everything into RAM.

I would like to write a program in Python to work more efficiently with the dumps. The first question is: what is the difference between .CAP and .PCAP?

I do not need to read the entire file at once. Suppose I want to read only the part of the .CAP file captured between 9:15 p.m. and 11:12 p.m., without loading the whole file into memory. How can I do this in Python? Keep in mind that the files are .CAP and not .PCAP.

The output of: "tcpdump -r /path/to/ficehiro.cap | less"

09:32:20.107281 IP iskcon.interactivedns.com.http > 192.168.91.34.47651: Flags [S.], seq 63
8820025, ack 2476676485, win 28960, options [mss 1380,sackOK,TS val 3245680284 ecr 42949413
64,nop,wscale 7], length 0
09:32:20.107308 IP 192.168.91.34.47651 > iskcon.interactivedns.com.http: Flags [.], ack 1, 
win 229, options [nop,nop,TS val 4294941466 ecr 3245680284], length 0
09:32:20.107357 IP 192.168.91.34.47651 > iskcon.interactivedns.com.http: Flags [P.], seq 1:
181, ack 1, win 229, options [nop,nop,TS val 4294941466 ecr 3245680284], length 180: HTTP: 
GET / HTTP/1.1
09:32:20.144075 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 831563414:831564782, ack 387706135, win 75, options [nop,nop,TS val 499391566 
ecr 4294941090], length 1368: HTTP
09:32:20.144094 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 1368, win 816, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.144368 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 1368:2736, ack 1, win 75, options [nop,nop,TS val 499391566 ecr 4294941090], l
ength 1368: HTTP
09:32:20.144376 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 2736, win 838, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.145197 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 2736:4104, ack 1, win 75, options [nop,nop,TS val 499391566 ecr 4294941090], l
ength 1368: HTTP
09:32:20.145204 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 4104, win 861, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.145214 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ength 1368: HTTP
09:32:20.145218 IP 192.168.91.34.47570 > ec2-52-73-252-184.compute-1.amazonaws.com.http: Fl
ags [.], ack 5472, win 883, options [nop,nop,TS val 4294941475 ecr 499391566], length 0
09:32:20.148032 IP ec2-52-73-252-184.compute-1.amazonaws.com.http > 192.168.91.34.47570: Fl
ags [.], seq 5472:6840, ack 1, win 75, options [nop,nop,TS val 499391566 ecr 4294941090],

[Screenshot: memory consumption of Wireshark when opening a 1 GB CAP]

    
asked by anonymous 01.03.2017 / 20:55

2 answers

2

I've only seen this question now. The efficient way is to read the file in chunks. Python does not let you manipulate memory pointers directly, but the file object returned by the built-in open function (presumably implemented in C) keeps track of a read position internally, and that is all we need. With it you can read a fixed number of bytes at a time, piece by piece, without ever loading the whole file. See how it is done:

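# Read the capture in fixed-size chunks instead of loading it all at once.
# Binary mode ("rb") is needed because a capture file is not text.
with open("capture21dez2016.pcap", "rb") as f:
    pcap = f.read(4096)
    while pcap:

        # process each chunk here

        pcap = f.read(4096)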

The example reads the file 4096 bytes at a time until it reaches the end. This keeps memory use constant, which is very useful when you have to walk through giant files. Each call to f.read() continues from the position where the previous call stopped. Note that a chunk boundary can fall in the middle of a packet, so the processing step has to cope with data that spans two chunks.

You can also start reading a file from a given position with seek, which is very useful when you need to begin at a specific byte. An example:

arquivo.txt

A
B
C
D
E
F
G
H
I
J

Each line break in this file takes 2 bytes, assuming Windows-style line endings (\r\n: one byte for \r and one for \n). With Unix-style line endings a line break is a single \n byte and the offsets below change. Now, what if I want to read arquivo.txt starting at offset 3?

>>> f = open('arquivo.txt', 'r+')
>>> f.seek(3)
>>> f.read(1)
'B'

seek(3) moves the file position to offset 3, and read(1) reads 1 byte from that position. Offsets are counted from 0, so offset 3 is the fourth byte of the file.

Why is that byte the letter B? Because the line break takes 2 bytes, the first line looks like this:

A\r\n = 3 bytes

So the first line occupies offsets 0 to 2 and the letter B is the fourth byte, at offset 3. That is exactly what we did: we moved to offset 3 and read the next 1 byte.

What if I want the letter F ?

By the same reasoning, counting characters and line breaks, the letter F is the 16th byte, at offset 15 (five preceding lines of 3 bytes each). To position the file there and read it:

>>> f.seek(15)
>>> f.read(1)
'F'

For a text file on a single line, with no line breaks:

arquivo2.txt

KLMNOPQRS

If I want the fifth-byte letter:

>>> f = open('arquivo2.txt', 'r+')
>>> f.seek(4)
>>> f.read(1)
'O'

What if I want the whole file from the fifth byte?

>>> f.seek(4)
>>> f.read()
'OPQRS'

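What if I want to position relative to the end of the file? Pass 2 as the second argument, seek(X, 2), and the offset X is counted from the end of the file. In Python 3 this needs the file opened in binary mode ('rb'), because text mode rejects nonzero end-relative seeks. In that case seek also returns the new position and read returns bytes.

>>> f = open('arquivo2.txt', 'rb')
>>> f.seek(-6, 2)
3
>>> f.read(1)
b'N'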

With these concepts you will be able to manipulate and walk efficiently inside giant files ...
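As a compact recap of the positioning modes, here is a minimal sketch. It uses io.BytesIO as a stand-in for a file opened with 'rb' (an assumption made only so the example is self-contained) and the os.SEEK_* constants, which are the named equivalents of 0, 1 and 2. It also shows seeking relative to the current position, which the examples above do not cover.

```python
import io
import os

# Stand-in for open('arquivo2.txt', 'rb'); the content matches the example above.
data = io.BytesIO(b"KLMNOPQRS")

data.seek(4, os.SEEK_SET)    # absolute: offset 4 from the start of the file
fifth = data.read(1)         # reads b"O"; the position is now 5

data.seek(2, os.SEEK_CUR)    # relative: skip 2 bytes forward from position 5
after_skip = data.read(1)    # reads b"R" (offset 7)

data.seek(-6, os.SEEK_END)   # from the end: 6 bytes before the end, offset 3
from_end = data.read(1)      # reads b"N"

position = data.tell()       # current offset after the last read
```

Seeking relative to the current position is handy for skipping over a block whose length you have just read from a header.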

Now all you have to do is read the file in chunks, or start from a given position, and check which records fall within the desired time range. Once you have passed the end of the range, leave the loop with a break. That way you will most likely not have to walk through the entire file, unless the data you want is at the very end. You can also devise some trick to decide whether it is better to start reading from the beginning or from the end of the file.
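To make the idea concrete for a capture file, here is a rough sketch. It assumes the .cap file is in the classic libpcap format (a 24-byte global header followed by records that each start with a 16-byte header), which is what tcpdump normally writes; it will not work for pcapng or other vendor formats. The function name packets_between and its parameters are hypothetical, and the early break assumes the packets are stored in chronological order.

```python
import struct

# Magic number as stored on disk -> (struct byte order, timestamp ticks per second)
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1_000_000),      # little-endian, microseconds
    b"\xa1\xb2\xc3\xd4": (">", 1_000_000),      # big-endian, microseconds
    b"\x4d\x3c\xb2\xa1": ("<", 1_000_000_000),  # little-endian, nanoseconds
    b"\xa1\xb2\x3c\x4d": (">", 1_000_000_000),  # big-endian, nanoseconds
}


def packets_between(path, start, end):
    """Yield (timestamp, raw_bytes) for packets with start <= timestamp <= end.

    start and end are Unix timestamps in seconds. Only one packet is held
    in memory at a time.
    """
    with open(path, "rb") as f:
        global_header = f.read(24)
        if global_header[:4] not in _MAGICS:
            raise ValueError("not a classic pcap file")
        order, ticks = _MAGICS[global_header[:4]]
        # Record header: ts_sec, ts_frac, captured length, original length
        record_header = struct.Struct(order + "IIII")
        while True:
            raw = f.read(record_header.size)
            if len(raw) < record_header.size:
                break  # end of file (or truncated record header)
            ts_sec, ts_frac, incl_len, _orig_len = record_header.unpack(raw)
            timestamp = ts_sec + ts_frac / ticks
            if timestamp > end:
                break  # assumes chronological order: nothing later can match
            if timestamp < start:
                f.seek(incl_len, 1)  # skip the payload without reading it
                continue
            yield timestamp, f.read(incl_len)
```

The bounds can be built with datetime, for example datetime(2017, 3, 1, 21, 15).timestamp() for the start (the date here is only an illustration), and the result consumed with a plain for loop over packets_between(path, start, end).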
05.03.2017 / 16:15
2

To make this easier I simulated a .cap file by creating a text file, cap1.cap, in which each line contains only the leading characters (the timestamp) of the output you posted. It looks like this:

09:32:20.107281
09:32:20.107357
09:32:21.144075
09:32:21.144094
09:32:21.144368
09:33:21.144376
09:34:21.145197
09:35:00.145204
09:36:20.145214
09:36:20.145218
09:37:20.148032

Then I wrote code that reads this file and prints only the lines between 09:32:21 and 09:35:00. I tested it and it behaved as expected; I believe that with some adaptations it will solve your problem. Code below.

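import datetime
import re

# Time window to extract (inclusive on both ends).
inicio = datetime.datetime.strptime('09:32:21', '%H:%M:%S').time()
fim = datetime.datetime.strptime('09:35:00', '%H:%M:%S').time()

# A line carries a timestamp when it begins with HH:MM:SS.
timestamp = re.compile(r'[0-9]{2}:[0-9]{2}:[0-9]{2}')

t = None  # time of the most recent timestamped line
with open('cap1.cap', 'r') as f:
    for line in f:  # the file is read lazily, one line at a time
        m = timestamp.match(line)
        if m is not None:
            t = datetime.datetime.strptime(m.group(), '%H:%M:%S').time()
        if t is None:
            continue  # no timestamp seen yet
        if t > fim:
            break  # lines are in chronological order, so we can stop here
        if t >= inicio:
            print(line, end='')  # the line already ends with a newline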

Result:

09:32:21.144075
09:32:21.144094
09:32:21.144368
09:33:21.144376
09:34:21.145197
09:35:00.145204
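One possible adaptation, sketched here under assumptions: instead of saving the text to a file first, the same filter can consume the output of tcpdump -r line by line. The helper names lines_between and tcpdump_lines are hypothetical, tcpdump is assumed to be on the PATH, and lines are assumed to be in chronological order within a single day. Wrapped continuation lines without a timestamp are treated as belonging to the previous timestamped line.

```python
import re
import subprocess
from datetime import time

_TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})")


def lines_between(lines, start, end):
    """Yield lines whose most recent HH:MM:SS prefix lies in [start, end]."""
    current = None  # time of the most recent timestamped line
    for line in lines:
        m = _TIMESTAMP.match(line)
        if m:
            current = time(*map(int, m.groups()))
        if current is None:
            continue  # no timestamp seen yet
        if current > end:
            break  # assumes chronological order
        if current >= start:
            yield line


def tcpdump_lines(path):
    """Stream the text output of `tcpdump -r path`, one line at a time."""
    with subprocess.Popen(
        ["tcpdump", "-r", path], stdout=subprocess.PIPE, text=True
    ) as proc:
        yield from proc.stdout
```

A call such as lines_between(tcpdump_lines('/path/to/file.cap'), time(9, 32, 21), time(9, 35, 0)) would then yield only the lines in that window, and the same function works on an open text file.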

    
03.03.2017 / 02:41