Format columns - select specific information

Dear users, I have a large file with the following columns:

chr10_46938     EXON=28/28      STRAND=-1       ENSP=ENSGALP00000004070 SIFT=tolerated(0.38) 
chr10_46966     EXON=28/28      STRAND=-1       DOMAINS=Low_complexity_(Seg):Seg        SIFT=tolerated(0.66)    ENSP=ENSGALP00000004070   
chr10_46987     EXON=28/28      STRAND=-1       SIFT=tolerated(0.93)    ENSP=ENSGALP00000004070
chr10_47071     ENSP=ENSGALP00000004070 SIFT=tolerated(0.97)    EXON=28/28      STRAND=-1
chr10_47164     EXON=28/28      STRAND=-1       DOMAINS=Low_complexity_(Seg):Seg        SIFT=tolerated(0.37)    ENSP=ENSGALP00000004070
chr10_47466     ENSP=ENSGALP00000004070 SIFT=tolerated(0.11)    STRAND=-1       EXON=28/28    DOMAINS=PROSITE_profiles:PS50196,Pfam_domain:SSF50729

I want to select only the first column and the SIFT=tolerated(..) field, but that field is not in a fixed column (it is not always column 2, for example). How can I select only this information? For example, I would like the following output:

chr10_46938     SIFT=tolerated(0.38)  
chr10_46966     SIFT=tolerated(0.66)   
chr10_46987     SIFT=tolerated(0.93)  
chr10_47071     SIFT=tolerated(0.97)  
chr10_47094     SIFT=tolerated(1)            
chr10_47164     SIFT=tolerated(0.37)    
chr10_47466     SIFT=tolerated(0.11)

Which command can I use on UNIX to get this list?

    
asked by anonymous 17.03.2015 / 15:14

2 answers

You can extract this information in a variety of ways, for example with awk, and also with the glorious Perl.

Here is an example using awk:

$ awk 'match($0, /SIFT=tolerated\([0-9.]+\)/) { print $1 "\t" substr($0, RSTART, RLENGTH) }' foo.txt

Where:

  • match : The awk function that searches for the pattern. Here it matches the string SIFT=tolerated followed by digits or a dot in parentheses, that is, SIFT=tolerated\([0-9.]+\). It returns the position (index) of the character where the matching substring begins, and it sets the variables RSTART and RLENGTH.
  • substr : Returns a substring. RSTART is the index at which the matched substring starts and RLENGTH is its length.
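To see what match leaves in RSTART and RLENGTH, here is a minimal sketch that runs the same pattern on a single line from the question. It assumes the fields are separated by tabs, and show_match is a hypothetical helper name used only for this illustration.

```shell
# One sample line from the question; tab separators are an assumption.
line=$(printf 'chr10_46966\tEXON=28/28\tSTRAND=-1\tDOMAINS=Low_complexity_(Seg):Seg\tSIFT=tolerated(0.66)\tENSP=ENSGALP00000004070')

# show_match (hypothetical helper): prints the start index, the length,
# and the matched text for the line passed as the first argument.
show_match() {
    printf '%s\n' "$1" | awk 'match($0, /SIFT=tolerated\([0-9.]+\)/) {
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    }'
}

show_match "$line"
```

The third value printed is the text that substr($0, RSTART, RLENGTH) cuts out of the line, which is exactly the part the answer prints next to the first column.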

Result:

$ awk 'match($0, /SIFT=tolerated\([0-9.]+\)/) { print $1 "\t" substr($0, RSTART, RLENGTH) }' foo.txt
chr10_46938     SIFT=tolerated(0.38)
chr10_46966     SIFT=tolerated(0.66)
chr10_46987     SIFT=tolerated(0.93)
chr10_47071     SIFT=tolerated(0.97)
chr10_47164     SIFT=tolerated(0.37)
chr10_47466     SIFT=tolerated(0.11)
$ 

On other systems the syntax may differ slightly, but it should be easy to adapt.
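If a particular awk handles match differently, one possible adaptation is to loop over the fields instead and print the first one that starts with SIFT=tolerated(. This is only a sketch, not the method from the answer above: extract_sift is a hypothetical helper name, it reads from standard input, and it assumes whitespace-separated fields.

```shell
# extract_sift (hypothetical helper): reads lines on stdin and prints the
# first column plus the first field that begins with "SIFT=tolerated(".
extract_sift() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SIFT=tolerated\(/) {
                print $1 "\t" $i
                break
            }
        }
    }'
}

# Two lines from the question, with the SIFT field in different positions.
printf 'chr10_46938\tEXON=28/28\tSTRAND=-1\tENSP=ENSGALP00000004070\tSIFT=tolerated(0.38)\nchr10_47071\tENSP=ENSGALP00000004070\tSIFT=tolerated(0.97)\tEXON=28/28\tSTRAND=-1\n' | extract_sift
```

To use it on the real file, redirect the file into the function, for example extract_sift < foo.txt.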

    
17.03.2015 / 17:53

@Qmechanic73: ... glorious Perl

perl -nE 'say "$1\t$2" if /^(\S+).*?\s(SIFT=\S+)/' foo.txt

And now sed, for variety:

sed -r 's!(\S+).*(SIFT=\S+).*!\1 \2!' foo.txt
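A substitution like this leaves lines without a SIFT field unchanged, so they still appear in the output. A hedged variant, assuming a sed that supports -E (GNU and BSD sed both do), prints only the lines where the substitution succeeded. The name sift_only and the second input line are made up for the illustration.

```shell
# sift_only (hypothetical helper): reads stdin and prints "first-column SIFT=..."
# only for lines that contain a SIFT field; -n plus the p flag drops the rest.
sift_only() {
    sed -nE 's!^([^[:space:]]+).*[[:space:]](SIFT=[^[:space:]]+).*!\1 \2!p'
}

# First line is from the question; the second (no SIFT field) is invented.
printf 'chr10_46987\tEXON=28/28\tSTRAND=-1\tSIFT=tolerated(0.93)\tENSP=ENSGALP00000004070\nchr10_99999\tEXON=1/2\tSTRAND=-1\n' | sift_only
```

The POSIX classes [[:space:]] and [^[:space:]] stand in for \s and \S, which are GNU extensions in sed.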
    
17.03.2015 / 19:15