Check the relationship of two objects in a list


I need to extract the total value of an agreement from plain text. I have hundreds of documents, each containing several monetary values. Usually the highest value is also the total value of the agreement, but not always.

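import re

def ata_values(text):
    # Capture amounts such as "$ 1.234,56" (dot as thousands separator, comma as decimal mark).
    padrao = re.findall(r'\$\s*(\d{1,3}(?:\.?\d{1,3})+(?:\,\d{2})?)', text)
    # Normalize to float notation: drop the thousands dots, turn the decimal comma into a dot.
    padrao = [float(p.replace('.', '').replace(',', '.')) for p in padrao]

    return padrao, max(padrao)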

This returns:

([2500.0, 833.33, 833.33, 833.34, 2500.0], 2500.0)
([1000.0, 800.0, 200.0, 1000.0], 1000.0)
([280.0, 14000.0, 21000.0], 21000.0)
([3000.0, 15000.0, 7000.0, 7000.0, 7000.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 750.0, 750.0, 750.0, 1083.33, 1200.0, 1600.0, 1616.67, 140.0], 15000.0)

In each tuple, the first element is the list of all values found and the second is the largest value in that list. In this example the first two lines are correct, but the last two are not: there, the second-highest value is the correct one. I noticed a pattern in most of the lists with this error: the list contains the total value and also a value that corresponds to 2% of the total value.

How can I check, before taking the maximum, whether each list contains a number X together with a number equal to 0.02 * X?

for x in padrao:
    for y in padrao:
        if x == y*0.02:
            return x
        else: 
            return max(padrao)
    
asked by anonymous 06.03.2018 / 12:56

1 answer

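# Sort in descending order so the largest value comes first.
padrao.sort(reverse=True)
maior = padrao[0]
# If 2% of the largest value is also in the list, fall back to the second largest.
if maior * 0.02 in padrao:
    maior = padrao[1]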
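
The check above only looks at the single largest value and compares floats exactly. A slightly more general sketch, assuming the total is the largest value X for which a value close to 0.02 * X also appears in the list, could look like this. The names `total_value`, `ratio`, and `tol` are hypothetical and not part of the original code, and the one-cent tolerance is an assumption about how the amounts are rounded.

```python
import math

def total_value(values, ratio=0.02, tol=0.01):
    """Return the largest value whose `ratio` share also appears in `values`.

    Falls back to max(values) when no such pair exists.
    """
    for x in sorted(values, reverse=True):
        # Compare with a tolerance, since the amounts are floats rounded to cents.
        if x > 0 and any(math.isclose(y, ratio * x, abs_tol=tol) for y in values):
            return x
    return max(values)
```

Applied to the four lists from the question, this returns 2500.0, 1000.0, 14000.0 and 7000.0. Like `max(padrao)` in the original function, it raises `ValueError` on an empty list.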

But, as I initially wrote in a comment, this approach is risky, especially if the documents are free text. Isn't there something more consistent to rely on, even if it is harder to capture with a single regular expression? For example, the word "total" appearing next to the number, or the number sitting in a specific section of the document, or always near the end of the document? In that case you would first isolate the section where the total value should appear and only then worry about picking out the number.
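
As a rough illustration of that idea, the sketch below looks for the first amount that follows the word "total" within a fixed number of characters. It reuses the amount pattern from the question. The function name `total_near_keyword`, the `keyword` default, and the 80-character `window` are all assumptions that would need tuning against the real documents.

```python
import re

# Same amount pattern as in the question: values such as "$ 1.234,56".
AMOUNT = r'\$\s*(\d{1,3}(?:\.?\d{1,3})+(?:,\d{2})?)'

def total_near_keyword(text, keyword="total", window=80):
    """Return the first amount within `window` characters after `keyword`, or None."""
    pattern = re.compile(
        re.escape(keyword) + r'.{0,%d}?' % window + AMOUNT,
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        return None
    # Convert "1.234,56" to 1234.56.
    return float(match.group(1).replace('.', '').replace(',', '.'))
```

When this returns `None`, the caller could fall back to a heuristic on the full list of values.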

    
06.03.2018 / 13:41