I need to extract the total value of an agreement from plain text. I have hundreds of documents, each containing several values, and I realized that the highest value is usually the total value of the agreement, but not always.
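import re

def ata_values(text):
    # Capture amounts written like "$ 1.234,56" (dot for thousands, comma for decimals).
    padrao = re.findall(r'\$\s*(\d{1,3}(?:\.?\d{1,3})+(?:\,\d{2})?)', text)
    # Drop the thousands separators, then turn the decimal comma into a dot.
    padrao = [p.replace('.', '') for p in padrao]
    padrao = [p.replace(',', '.') for p in padrao]
    padrao = [float(p) for p in padrao]
    return padrao, max(padrao)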
This returns:
([2500.0, 833.33, 833.33, 833.34, 2500.0], 2500.0)
([1000.0, 800.0, 200.0, 1000.0], 1000.0)
([280.0, 14000.0, 21000.0], 21000.0)
([3000.0, 15000.0, 7000.0, 7000.0, 7000.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 700.0, 750.0, 750.0, 750.0, 1083.33, 1200.0, 1600.0, 1616.67, 140.0], 15000.0)
The first element is padrao, the list of all values found, and the second, max(padrao), is the largest value in each list. In this example the first two lines are correct, but the last two are not: the second value is the correct one. I noticed that most lists with this error follow a pattern: the list contains the total value plus a value that corresponds to 2% of the total value.
How can I check, before taking the maximum, whether each list contains a number X together with a number equal to 0.02 * X?
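import math

def total_value(padrao):
    # Look for a pair (x, y) where x is 2% of y; y is then the total.
    for x in padrao:
        for y in padrao:
            # Compare with a tolerance, since the values are floats rounded to cents.
            if math.isclose(x, y * 0.02, abs_tol=0.01):
                return y
    # No 2% pair found in the whole list: fall back to the largest value.
    return max(padrao)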
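
For reference, here is a minimal sketch of the 2% check described above, written so that it does not rely on exact float equality. It is only one possible approach: the name `pick_total`, the `rate` and `tol_cents` parameters, and the choice to keep the largest matching X are my own assumptions, not something taken from the documents. It converts every value to integer cents and looks the expected 2% amount up in a set, allowing a one-cent rounding difference.

```python
def pick_total(values, rate=0.02, tol_cents=1):
    """Return the largest X in `values` that has a companion value of about rate * X.

    Falls back to max(values) when no such pair exists.
    (Hypothetical helper; the one-cent tolerance is an assumption.)
    """
    # Work in integer cents so the comparison does not depend on float rounding.
    cents = [round(v * 100) for v in values]
    present = set(cents)
    candidates = []
    for value, c in zip(values, cents):
        target = round(c * rate)  # expected companion value, in cents
        # Accept a companion within `tol_cents` of the expected amount.
        if c > 0 and any(target + d in present
                         for d in range(-tol_cents, tol_cents + 1)):
            candidates.append(value)
    # Prefer the largest value that has a 2% companion, else the plain maximum.
    return max(candidates) if candidates else max(values)


print(pick_total([2500.0, 833.33, 833.33, 833.34, 2500.0]))  # 2500.0
print(pick_total([1000.0, 800.0, 200.0, 1000.0]))            # 1000.0
print(pick_total([280.0, 14000.0, 21000.0]))                 # 14000.0
```

In the third list, 280.0 is 2% of 14000.0, so 14000.0 is returned instead of the maximum 21000.0. The first two lists contain no such pair, so the plain maximum is kept.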