@Priscilla's answer is sufficient and, in fact, the best choice for the vast majority of cases. However, if your crawler needs to handle monetary values in different formats, it may help to take into account the locale / language of the page being accessed. One way to do this is with the locale module from the standard library.
Here is some example code:
import re
import locale

#--------------------------------------------------
def extractMonetaryValue(text):
    # Currency symbol of the currently configured locale (e.g. 'R$' or '$')
    cs = locale.localeconv()['currency_symbol']
    # Symbol, optional spaces, then any run of digits, periods, and commas
    expr = '{}[ ]*[0-9.,]+'.format(re.escape(cs))
    m = re.search(expr, text)
    if m:
        s = m.group(0).replace(cs, '').replace(' ', '')
        return locale.atof(s)  # parsed according to the configured locale
    else:
        return 0.0
#--------------------------------------------------

s = 'Este teste testa um valor (por exemplo: R$ 560.200,40) expresso em Reais.'
locale.setlocale(locale.LC_ALL, 'ptb_bra') # 'pt_BR' if not on Windows
n = extractMonetaryValue(s)
print('Para "{}" o valor é: {}'.format(s, n))

s = 'This test tests a value (let us say U$ 482,128.33) given in US Dolars.'
locale.setlocale(locale.LC_ALL, 'enu_usa') # 'en_US' if not on Windows
n = extractMonetaryValue(s)
print('Para "{}" o valor é: {}'.format(s, n))
The main piece of this code is the extractMonetaryValue function. It receives any text and searches it for a substring that must contain the currency symbol of the configured country / language (followed by zero or more spaces) and then a number made up of digits, periods, and commas. To do so, it uses a deliberately permissive regular expression: it does not care whether the numeric "format" is correct, because the parsing is done afterwards by locale.atof (which raises ValueError if the string cannot be parsed as a number for the configured country / language).
The output of the above code is as follows:
Para "Este teste testa um valor (por exemplo: R$ 560.200,40) expresso em Reais." o valor é: 560200.4
Para "This test tests a value (let us say U$ 482,128.33) given in US Dolars." o valor é: 482128.33
Notice how both numbers printed at the end use a dot as the decimal separator (after all, they are values represented internally as float, which is the same regardless of the source text processed).
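To make the two-step idea described above (permissive regex first, strict parsing second) easier to try without depending on which locales are installed, here is a minimal sketch in which the currency symbol and the separators are passed in explicitly. The name extract_amount and its parameters are hypothetical, and the conversion only approximates what locale.atof does.

```python
import re

def extract_amount(text, symbol, thousands_sep, decimal_point):
    """Hypothetical helper: find `symbol` followed by a number and parse it."""
    # Permissive pattern: the escaped symbol, optional spaces, then any run
    # of digits, periods, and commas. Validation is left to float() below.
    pattern = re.escape(symbol) + r'[ ]*([0-9.,]+)'
    m = re.search(pattern, text)
    if m is None:
        return None
    raw = m.group(1)
    # Rough stand-in for locale.atof: drop the grouping separator, then
    # turn the decimal mark into a dot.
    normalized = raw.replace(thousands_sep, '').replace(decimal_point, '.')
    return float(normalized)  # raises ValueError for input such as '1.2.3'

print(extract_amount('Valor: R$ 560.200,40', 'R$', '.', ','))
print(extract_amount('Price: $ 482,128.33', '$', ',', '.'))
```

Unlike the original function, this sketch returns None instead of 0.0 when no value is found, so that a missing value can be told apart from a genuine zero. That is a choice made for the example only.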
P.S.:
To detect the default locale of the operating system, use locale.getdefaultlocale().
To detect the locale of a webpage, check whether it provides this information in the lang attribute of its html tag.
If it does not, you will need to try to infer the language. Luckily for you (wow! hehe), there is a Python port of Google's language detector.
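For the case where the page does declare its language, the check can be sketched with the standard library alone. The class LangFinder, the helper page_language, and the sample HTML below are made up for this illustration; a real crawler would probably reuse whatever HTML parser it already has.

```python
from html.parser import HTMLParser

class LangFinder(HTMLParser):
    """Hypothetical parser that remembers the lang attribute of <html>."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == 'html' and self.lang is None:
            self.lang = dict(attrs).get('lang')

def page_language(html_text):
    """Return the declared page language, or None if it is not declared."""
    finder = LangFinder()
    finder.feed(html_text)
    return finder.lang

sample = ('<!DOCTYPE html><html lang="pt-BR"><head></head>'
          '<body>R$ 10,00</body></html>')
print(page_language(sample))
```

The value uses a hyphen (pt-BR here), so it still has to be mapped to a locale name that your system accepts before it can be passed to locale.setlocale.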