Replace string with regex in Python 3


I have code that replaces certain strings with an empty string:

    dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'
    dados = dados.replace('http://', '')
    dados = dados.replace('https://', '')


[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]

In this situation the result is as expected, but when I need to use a regex for the replacement I cannot get it to work (I have already tried several ways).

As you can see below, it only handles the first element and wipes out the rest of the dados variable:

    dados = re.sub(re.compile('(/.*)', re.MULTILINE), '', dados)



I understand what happened, but I wonder if there is a way to replace using regex, similar to the replace function.

The goal is to keep only the domain and remove everything else. For example, in a URL I consider the trailing path, such as /qweqwe, to be "garbage", because only the domain is important.

asked by anonymous 05.11.2018 / 00:40

2 answers


Your input string is JSON, so it is best to use the right tools to manipulate this data. You can parse it with the json module and then handle each URL with urllib.parse:

# -*- coding: utf-8 -*-

import re
import json
import urllib.parse

dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'

# convert the string to JSON data
jsondata = json.loads(dados)

# regex to check whether the URL has a protocol
r = re.compile(r"^(https?|ftp)://")

# replace only the Url field
for d in jsondata:
    url = d['Url']
    # if there is no protocol, add any one, just so the parsing is done correctly
    if not r.match(url):
        url = "http://" + url
    d['Url'] = urllib.parse.urlparse(url).netloc

# convert the JSON data back to a string
dados = json.dumps(jsondata, ensure_ascii=False)

The output is:

[{"Id": 12345, "Date": "2018-11-03T00:00:00", "Quality": "Goodão", "Name": "X", "Description": null, "Url": "", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12346, "Date": "2018-11-03T00:00:00", "Quality": "Good", "Name": "YYy", "Description": "Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "Url": "", "ParseUrl": "y beautiful", "Status": "Ativa", "Surveys": 0, "KeySearch": "y like", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12347, "Date": "2018-11-03T00:00:00", "Quçality": "Pending", "Name": "z Z", "Description": "Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur", "Url": "", "ParseUrl": null, "Status": "Ativa", "Surveys": 112, "KeySearch": "z plant", "QualityId": 4, "Type": "Agro"}, {"Id": 12335, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12332, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}]

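Note that the keys may come out in a different order than the input, because JSON defines an object as an unordered set of name/value pairs, so the order is not guaranteed. In practice, on Python 3.7 and later a dict preserves insertion order, so the keys keep the input order, as in the output above.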
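As a small illustration of the ordering remark, the sketch below round-trips a hypothetical two-key object (the values are made up, not taken from the question) and shows sort_keys as one way to get an order that does not depend on the input:

```python
import json

# Hypothetical two-key object, used only to illustrate key ordering.
raw = '{"Id": 1, "Url": "example.com"}'

# json.loads builds a dict, which preserves insertion order on Python 3.7+.
parsed = json.loads(raw)
round_trip = json.dumps(parsed)
print(round_trip)   # → {"Id": 1, "Url": "example.com"}

# sort_keys=True gives a deterministic order regardless of the input order.
sorted_out = json.dumps({"b": 1, "a": 2}, sort_keys=True)
print(sorted_out)   # → {"a": 2, "b": 1}
```

If the consumer of the JSON depends on key order, sorting explicitly is probably safer than relying on the input order.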

Other details:

I used a regex (using the module re ) to check if the URL does not have the protocol (the http:// at the beginning, for example). I used ^(https?|ftp):// , which means:

  • ^ : start of string
  • https? : text "http" or "https" ( s? indicates that the letter "s" is optional)
  • ftp : the text "ftp"
  • | : means "or". So (https?|ftp) matches "http", "https" or "ftp"
  • :// : the literal characters "://" that follow the protocol name
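The pieces of the regex above can be checked in isolation. This is a minimal sketch; ensure_protocol is a hypothetical helper name and the example.com URLs are placeholders, neither comes from the original answer:

```python
import re

# Same pattern as in the answer: an optional-"s" http, or ftp, then "://".
protocol = re.compile(r"^(https?|ftp)://")

def ensure_protocol(url):
    """Return url unchanged if it starts with a known protocol, else prefix http://."""
    if protocol.match(url):
        return url
    return "http://" + url

print(ensure_protocol("example.com/a"))         # → http://example.com/a
print(ensure_protocol("https://example.com/a")) # → https://example.com/a
print(ensure_protocol("ftp://example.com/a"))   # → ftp://example.com/a
```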
12.11.2018 / 13:13

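The problem is in your regex: '(/.*)' means "a slash and everything that comes after it". Because the . matches any character except a newline and your data is a single line, the match runs from the first slash to the end of the string.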
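To make the effect visible, here is a short sketch on a hypothetical single-line input (the example.com and example.org values are assumptions, not the data from the question). It also shows one possible narrower pattern that stops at the next double quote:

```python
import re

# Hypothetical single-line input (not the data from the question).
text = '{"Url":"example.com/abc"},{"Url":"example.org/def"}'

# '.' matches any character except a newline and '*' is greedy, so the match
# starts at the first '/' and swallows the rest of the line.
greedy = re.sub(r'/.*', '', text)
print(greedy)    # → {"Url":"example.com

# Stopping at the next double quote limits each match to a single value.
bounded = re.sub(r'/[^"]*', '', text)
print(bounded)   # → {"Url":"example.com"},{"Url":"example.org"}
```

Note that the bounded pattern still acts on every slash in the string, including any that appear in other fields, so it is only a rough substitute for parsing the data properly.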

I am not sure what you want to do. If you want to remove the http prefix, try this regex: r'https?://'
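For instance, a single re.sub call with that pattern covers both str.replace calls from the question. The input string below is a hypothetical sample:

```python
import re

# Hypothetical sample with one http and one https URL.
text = '"Url":"http://example.com/a","Other":"https://example.org/b"'

# 's?' makes the "s" optional, so both prefixes are removed in one pass.
cleaned = re.sub(r'https?://', '', text)
print(cleaned)   # → "Url":"example.com/a","Other":"example.org/b"
```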

EDIT: Now that you've set your goal, I believe the right tool is not regexp, but rather the url-specific functions that are in urllib.parse :

>>> import urllib.parse

>>> url = ''
>>> print(urllib.parse.urlparse(url))
ParseResult(scheme='https', netloc='', path='/sdfsfs', 
            params='', query='', fragment='')
>>> print(urllib.parse.urlparse(url).netloc)

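As a self-contained illustration, the sketch below uses a hypothetical URL (https://example.com/sdfsfs is an assumption, not the value from the session above). It also shows that urlparse only fills netloc when the URL has a scheme or starts with //:

```python
from urllib.parse import urlparse

# Hypothetical URL chosen for illustration.
url = "https://example.com/sdfsfs"
parts = urlparse(url)
print(parts.netloc)   # → example.com
print(parts.path)     # → /sdfsfs

# Without a scheme (or a leading //), everything is treated as the path.
bare = urlparse("example.com/sdfsfs")
print(repr(bare.netloc))   # → ''
print(bare.path)           # → example.com/sdfsfs
```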
For completeness, I'll leave here the full regexp for parsing URLs, which follows all the URL rules:

05.11.2018 / 00:44