Regex to capture dimensions of a product with unit of measure

4

I have a Python function that captures the dimensions of a product in LxCxA format, but I cannot make it work when the unit of measure appears between the values. The regex is this:

import re

def findDimensions(text):
    p = re.compile(r'(?P<l>\d+(\.\d+)?)\s*x\s*(?P<w>\d+(\.\d+)?)\s*x\s*(?P<h>\d+(\.\d+)?)')
    m = p.search(text)
    if m:
        return m.group("l"), m.group("w"), m.group("h")
    return None

It works for the 2 cases below:

23.6 x 34 x 17.1

14.5 x 55 x 22

But it does not work for this example:

14.5cmx55x22cm

I would like it to work when any number of spaces or letters appears in each group of values separated by x. I tried using \w*\W*, but that does not cover all cases, such as this one:

14.5 cm × 55 cm × 22 cm

Example on regex101: link

I welcome suggestions for a leaner expression that handles the examples shown.

    
asked by anonymous 29.06.2018 / 22:04

5 answers

2

Regex

With the regular expression ([\d,]+)[\s\D]* you can capture each individual value.

And with the regular expression ([\d,]+)[\s\D]*([\d,]+)[\s\D]*([\d,]+)[\s\D]* (see the demo) you can get all three dimensions.

Explanation

The single-value expression can be repeated three times to get each dimension in its own capture group.

  • 1st Capture Group ([\d,]+)

    • Matches a single character from the set between []
    • \d : Matches a digit between 0 and 9
    • , : Matches the comma character literally
    • + : Quantifier that matches from one to unlimited times, as many times as possible (greedy).
  • Followed by [\s\D]*
    • Matches a single character from the set between []
    • \s : Matches any whitespace character (equivalent to [\r\n\t\f\v ])
    • \D : Matches any character that is not a digit (equivalent to [^0-9])
    • * : Quantifier that matches zero to unlimited times, as many times as possible (greedy).
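As a minimal sketch of how the pieces explained above fit into the function from the question, the three-group pattern can be wrapped as follows. The name find_dimensions and the extra "." inside the number class are assumptions made for this illustration: they are not part of the answer, and the "." is only there so that inputs written with a decimal point, as in the question, are captured whole.

```python
import re

# Assumption: "." is added to the number class so "14.5" is captured whole,
# in addition to the "14,5" form used in this answer.
DIMENSIONS = re.compile(r"([\d.,]+)[\s\D]*([\d.,]+)[\s\D]*([\d.,]+)")


def find_dimensions(text):
    """Return the first three numbers in text as strings, or None."""
    match = DIMENSIONS.search(text)
    return match.groups() if match else None


print(find_dimensions("23.6 x 34 x 17.1"))
print(find_dimensions("14.5cmx55x22cm"))
print(find_dimensions("14.5 cm × 55 cm × 22 cm"))
```

Because [\s\D]* accepts any run of non-digits, the separator does not have to be x at all, so this version also accepts the × sign but will equally match three unrelated numbers in a sentence.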

Code Dimensions

Here is a sample Python implementation:

import re

regex_pattern= re.compile(r"([\d,]+)[\s\D]*([\d,]+)[\s\D]*([\d,]+)[\s\D]*")
regex_string="""23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm"""

matches = re.finditer(regex_pattern, regex_string)

for submatch in matches:
    if submatch:
        print("L: " + submatch.group(1) + " C: " + submatch.group(2) + " A: " + submatch.group(3))

Result:

L: 23,6 C: 34 A: 17,1
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22

Each Value Code

Or the example that captures each value separately:

import re

regex_pattern= re.compile(r"([\d,]+)[\s\D]*")
regex_string="""23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm"""

matches = re.finditer(regex_pattern, regex_string)

for submatch in matches:
    if submatch:
        print(submatch.groups())

Result

('23,6',)
('34',)
('17,1',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
    
29.06.2018 / 22:52
4

You can simplify by using (\s+)? to make the spaces optional. The regex cannot be made very simple, but in your case it would look like this:

(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?

Example online in RegEr: link

Explaining the regex

The first part of the regex would be this:

(\d+(,\d+)?)(\s+)?(cm)?
  • The (,\d+)? optionally matches a comma followed by the decimal digits

  • The (\s+)? looks for one or more spaces optionally

  • The (cm)? looks for the measurement optionally

After that, just put an x between repetitions of the expression. Of course you can do it in other ways, but the result would be almost the same; this version is repetitive but easier to understand.

If the goal is to fetch one entry at a time, then applying \b at the beginning and end should already be enough, for example:

\b(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?\b

Multiple values

If the input contains multiple entries, do it this way:

import re

expressao = r'(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?'

entrada = '''
23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm
'''

resultados = re.finditer(expressao, entrada)

for resultado in resultados:
    valores = resultado.groups()
    print("Primeiro:", valores[0])
    print("Segundo:", valores[6])
    print("Terceiro:", valores[12])
    print("\n")

Note that the numbers fall on every sixth group in the regex, with the x separators between them; that is, each match returns groups like:

('23,6', ',6', ' ', None, None, ' ', '34', None, ' ', None, None, ' ', '17,1', ',1', '\n', None)
('14,5', ',5', ' ', None, None, ' ', '55', None, ' ', None, None, ' ', '22', None, '\n', None)
('14,5', ',5', None, 'cm', ' ', ' ', '55', None, ' ', None, None, ' ', '22', None, None, 'cm')
('14,5', ',5', None, 'cm', None, None, '55', None, None, None, None, None, '22', None, None, 'cm')
('14,5', ',5', ' ', 'cm', None, None, '55', None, ' ', 'cm', None, None, '22', None, ' ', 'cm')

Then you will only use valores[0], valores[6] and valores[12]; example on repl.it: link
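The indexes 0, 6 and 12 are a side effect of every parenthesis being a capturing group. As a hedged alternative sketch (not part of the original answer), the helper groups can be written as non-capturing (?:...) groups so that groups() returns only the three numbers. The names NUMBER and PATTERN are hypothetical, and \s* is used in place of (\s+)? since both make the spaces optional.

```python
import re

# One number with an optional unit; (?:...) groups without capturing,
# so only the number itself ends up in groups().
NUMBER = r"(\d+(?:,\d+)?)\s*(?:cm)?"
PATTERN = re.compile(NUMBER + r"\s*x\s*" + NUMBER + r"\s*x\s*" + NUMBER)

entrada = """
23,6 x 34 x 17,1
14,5cmx55x22cm
14,5 cmx55 cmx22 cm
"""

for resultado in PATTERN.finditer(entrada):
    # Each match now yields exactly three values, in order.
    primeiro, segundo, terceiro = resultado.groups()
    print(primeiro, segundo, terceiro)
```

With this layout, adding or removing an optional part of the pattern does not shift the positions of the values you read.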

Using values for math operations

Note that the , does not make the value a "number" for Python, so if you need a mathematical operation, convert , to . like this:

float('1000,00001'.replace(',', '.'))

It should look something like this:

for resultado in resultados:
    valores = resultado.groups()

    primeiro = float(valores[0].replace(',', '.'))
    segundo = float(valores[6].replace(',', '.'))
    terceiro = float(valores[12].replace(',', '.'))

    print("Primeiro:", primeiro)
    print("Segundo:", segundo)
    print("Terceiro:", terceiro)
    print("Resultado:", primeiro * segundo * terceiro)
    print("\n")
    
29.06.2018 / 22:50
3

You can do this without using regex. Just "clean" the string by removing spaces and "cm", then split it into a list on "x":

text = "4,5cmx55x22cm"
text = text.replace('cm', '').replace(' ', '')
parts = text.split('x')
print(parts)  # ['4,5', '55', '22']

See on Ideone

Once the string is converted to a list, the values are separated by index and you can use them as you want. If you want the result in Lcm x Acm x Ccm format, you can convert the list back to a string by joining it with "cm x ":

text = "4,5cm x55x 22cm "
parts = text.replace('cm', '').replace(' ', '').split('x')
result = 'cm x '.join(parts) + "cm"
print(result)  # returns 4,5cm x 55cm x 22cm

Regex

(?P<l>[\d|,]+)(.*?)x(.*?)(?P<w>[\d|,]+)(.*?)x(.*?)(?P<h>[\d|,]+)(.*?)

The (.*?) allows for any characters, or none, between the number and the x. [\d|,]+ captures digits or commas (inside a character class the | is a literal pipe, so [\d,]+ would be enough). By naming the groups you can get each value by name.

Code:

import re

text = "4,5cm x55x 22cm "
regex = r"(?P<l>[\d|,]+)(.*?)x(.*?)(?P<w>[\d|,]+)(.*?)x(.*?)(?P<h>[\d|,]+)(.*?)"
resultado = re.match(regex, text)
print(resultado.groupdict()['l'])  # returns 4,5
print(resultado.groupdict()['w'])  # returns 55
print(resultado.groupdict()['h'])  # returns 22

See on Ideone

    
29.06.2018 / 22:21
3

Regex really can extract the three values "in a single line of code", but realize that this is an illusion. You are at a point where (?P<l>\d+(\,\d+)?)\s*x\s*(?P<w>\d+(\,\d+)?)\s*x\s*(?P<h>\d+(\,\d+)?) is too simple and has to become even trickier, and even someone who works with regexes every day has to read it far more carefully than they would read 4 or 5 lines of Python code that separate the values one step per line.

But, as you explicitly ask for regex, let's see:

The simplest approach, instead of repeating the regex logic 3 times, is to use the "findall" method of regexes in Python, which can extract all the numbers at once. So we can use:

In [19]: a = ["23,6 x 34 x 17,1", "14,5 x 55x 22", "14,5cmx55x22cm", "23  cmx 12.1cmx 14,36"]
In [20]: [re.findall(r"([\d,.]+)\s*?(?:cm)?", text) for text in a]
Out[20]: 
[['23,6', '34', '17,1'],
 ['14,5', '55', '22'],
 ['14,5', '55', '22'],
 ['23', '12.1', '14,36']]

What allows "cm" to be optional is the (?:cm)? part, although this expression does not even need it: it simply extracts every number, with or without "," or "." as the decimal separator.

It is a much simpler expression than your original one, and findall retrieves the 3 numbers if they are present. An "if" in Python can then ignore the data, or raise an exception, when there are not exactly 3 numbers.
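A minimal sketch of that "if", assuming the caller prefers an exception over silently ignoring bad input. The function name find_dimensions and the choice of ValueError are illustrative assumptions, not something the answer prescribes.

```python
import re


def find_dimensions(text):
    """Return exactly three numbers from text as strings, else raise."""
    numbers = re.findall(r"([\d,.]+)\s*?(?:cm)?", text)
    if len(numbers) != 3:
        # Anything other than three numbers is treated as invalid input.
        raise ValueError(f"expected 3 dimensions, found {len(numbers)}: {text!r}")
    return tuple(numbers)


print(find_dimensions("14,5cmx55x22cm"))
print(find_dimensions("23  cmx 12.1cmx 14,36"))
```

To ignore bad entries instead, the same length check could return None in place of raising.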

Keep in mind that regular expressions are literally a separate language from the host programming language. In this case the expression turned out fairly simple and reasonable to maintain, even though it ignores many corner cases. In plain Python, you could get the same result with:

In [21]: a = ["23,6 x 34 x 17,1", "14,5 x 55x 22", "14,5cmx55x22cm", "23  cmx 12.1cmx 14,36"]


In [22]: [[dimensao.replace("cm", "").strip()  for dimensao in dado.split("x")]   for dado in a]
Out[22]: 
[['23,6', '34', '17,1'],
 ['14,5', '55', '22'],
 ['14,5', '55', '22'],
 ['23', '12.1', '14,36']]

(Just as in the regex example, the outermost comprehension only iterates over all the example dimensions in "a".) That is, in this case you extract the numbers with a list comprehension and need no more than one line of code.

    
29.06.2018 / 23:01
3

You can use the following expression:

[^0-9,.]

It matches every character that is not a digit, period or comma, so that everything else can be replaced:

In a function:

import re

def findDimensions(text):
    s = re.sub('[^0-9,.]', ' ', text).replace(',', '.').split()
    return tuple([float(n) for n in s])

Testing:

import re

def findDimensions(text):
    s = re.sub('[^0-9,.]', ' ', text).replace(',', '.').split()
    return tuple([float(n) for n in s])

print(findDimensions("14,5 x 55 x 22,0"))
print(findDimensions("14,5 x 55cm x 22"))
print(findDimensions("14,5cm x 55 x 22cm"))
print(findDimensions("14,5cmx55x22cm"))
print(findDimensions("14,5 cmx55 cmx22 cm"))
print(findDimensions("14,5 cm x 55.0 x 22.0 cm"))

Output:

(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)

See the regular expression working on regex101.com.

See the test code on Ideone.com

    
29.06.2018 / 23:36