Replace string with regex in Python 3

Question

2

I have code that replaces certain substrings with an empty string:

    dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br/qweqwe","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"https://www.y.com.br/sdfsfs","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"http://www.z.com.br/asdasdas","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br/","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'
    dados = dados.replace('http://', '')
    dados = dados.replace('https://', '')
    print(dados)

Result:

[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br/qweqwe","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"www.y.com.br/sdfsfs","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"www.z.com.br/asdasdas","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br/","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]

In this case the result is as expected, but when I try to do the replacement with a regex I cannot get it to work (I have already tried several ways).

As you can see below, it only keeps the first element, cut off at its first slash, and discards the rest of the dados variable:

dados = re.sub(re.compile('(/.*)', re.MULTILINE), '', dados)
print(dados)

Result:

[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br

I understand what happened, but I wonder whether there is a way to do the replacement with a regex that behaves like the replace function.

The goal is to keep only the domain and remove everything else. For example, for x.com.br/qweqwe , I consider /qweqwe to be "garbage", because only x.com.br matters.
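
For reference, the pattern (/.*) matches from the first slash to the end of the line, and the whole string is a single line, which is why everything after x.com.br disappears. A minimal regex-only sketch of the stated goal follows. It assumes that URLs appear only as string values of a "Url" key, that those values contain no escaped quotes, and it uses a shortened sample in place of the full dados string; the names url_pattern and resultado are illustrative only.

```python
import re

# Shortened sample in the same shape as `dados` (assumption: URLs only appear in "Url" fields).
dados = '[{"Id":1,"Url":"x.com.br/qweqwe"},{"Id":2,"Url":"https://www.y.com.br/sdfsfs"},{"Id":3,"Url":"www.j.com.br/"}]'

# Group 1 keeps the key and the opening quote; an optional http(s):// scheme is dropped.
# Group 2 is the host: everything up to the first "/" or the closing quote.
# The trailing [^"]* consumes the path but stops before the closing quote,
# so the match can never run into the next field.
url_pattern = re.compile(r'("Url":")(?:https?://)?([^/"]*)[^"]*')

resultado = url_pattern.sub(r'\1\2', dados)
print(resultado)
```

Bounding the match with negated character classes instead of .* is what keeps each replacement inside its own field. Entries where Url is null are left untouched, because they do not match the opening "Url":" prefix.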

python regex python-3.x replace

asked by anonymous 05.11.2018 / 00:40

2 answers



                    
        

         

score 3 · Accepted Answer

Your input string is JSON, so it is best to use the right tools to manipulate it. You can parse it with the json module and then process each URL with urllib.parse:

# -*- coding: utf-8 -*-

import re
import json
import urllib.parse

dados = '[{"Id":12345,"Date":"2018-11-03T00:00:00","Quality":"Goodão","Name":"X","Description":null,"Url":"x.com.br/qweqwe","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12346,"Date":"2018-11-03T00:00:00","Quality":"Good","Name":"YYy","Description":"Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","Url":"https://www.y.com.br/sdfsfs","ParseUrl":"y beautiful","Status":"Ativa","Surveys":0,"KeySearch":"y like","QualityId":3,"Type":"Tecnology"},{"Id":12347,"Date":"2018-11-03T00:00:00","Quçality":"Pending","Name":"z Z","Description":"Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur","Url":"http://www.z.com.br/asdasdas","ParseUrl":null,"Status":"Ativa","Surveys":112,"KeySearch":"z plant","QualityId":4,"Type":"Agro"},{"Id":12335,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"},{"Id":12332,"Date":"2018-11-03T00:00:00","Quality":"óéGood","Name":"J","Description":null,"Url":"www.j.com.br/","ParseUrl":"x-art","Status":"Ativa","Surveys":0,"KeySearch":"x Art","QualityId":3,"Type":"Tecnology"}]'

# convert the JSON string to Python objects (a list of dicts)
jsondata = json.loads(dados)

# regex to check whether the URL has a protocol
r = re.compile(r"^(https?|ftp)://")

# replace only the Url field
for d in jsondata:
    url = d['Url']
    # if there is no protocol, add any one, just so that the URL is parsed correctly
    if not r.match(url):
        url = "http://" + url
    d['Url'] = urllib.parse.urlparse(url).netloc

# convert the data back to a JSON string
dados = json.dumps(jsondata, ensure_ascii=False)
print(dados)

The output is:

[{"Id": 12345, "Date": "2018-11-03T00:00:00", "Quality": "Goodão", "Name": "X", "Description": null, "Url": "x.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12346, "Date": "2018-11-03T00:00:00", "Quality": "Good", "Name": "YYy", "Description": "Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "Url": "www.y.com.br", "ParseUrl": "y beautiful", "Status": "Ativa", "Surveys": 0, "KeySearch": "y like", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12347, "Date": "2018-11-03T00:00:00", "Quçality": "Pending", "Name": "z Z", "Description": "Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur", "Url": "www.z.com.br", "ParseUrl": null, "Status": "Ativa", "Surveys": 112, "KeySearch": "z plant", "QualityId": 4, "Type": "Agro"}, {"Id": 12335, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "www.j.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}, {"Id": 12332, "Date": "2018-11-03T00:00:00", "Quality": "óéGood", "Name": "J", "Description": null, "Url": "www.j.com.br", "ParseUrl": "x-art", "Status": "Ativa", "Surveys": 0, "KeySearch": "x Art", "QualityId": 3, "Type": "Tecnology"}]

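Note that the key order in the output can differ from the input, because JSON defines an object as an unordered set of name/value pairs, so the format itself does not guarantee order. In practice, on Python 3.7 and later json.loads builds regular dicts, which preserve insertion order, so the keys come back in the same order as the input, as in the output above. The only visible difference there is the space that json.dumps adds after each comma and colon.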
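
The round trip through json can be sketched on a small sample. This is only an illustration: the one-element string raw stands in for dados, and the names raw, items and compact are not part of the original answer.

```python
import json

# Small stand-in for `dados` (illustrative sample, not the original data).
raw = '[{"Id":12345,"Quality":"Goodão","Url":"x.com.br"}]'
items = json.loads(raw)

# On Python 3.7+ dicts keep insertion order, so keys come back in document order.
print(list(items[0]))

# ensure_ascii=False keeps "ã" as-is; the default would escape it as \u00e3.
print(json.dumps(items, ensure_ascii=False))
print(json.dumps(items))

# Compact separators drop the spaces json.dumps adds after "," and ":".
compact = json.dumps(items, ensure_ascii=False, separators=(",", ":"))
print(compact == raw)
```

Passing separators=(",", ":") is worth considering when the output should stay byte-for-byte close to the input, apart from the modified Url values.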

Other details:

I used a regex (via the re module) to check whether the URL already has a protocol (the http:// at the beginning, for example). The pattern is ^(https?|ftp):// , which means:

^ : start of the string
https? : the text "http" or "https" ( s? makes the letter "s" optional)
ftp : the text "ftp"
| : means "or", so (https?|ftp) matches "http", "https", or "ftp"
:// : the literal characters "://"
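
The reason for the protocol check can be seen in a short sketch: without a scheme, urlparse treats the whole string as a path and leaves netloc empty. The helper name host_of and the ftp sample URL are hypothetical, introduced only for this illustration.

```python
import re
from urllib.parse import urlparse

# Same pattern as in the answer: the URL starts with http://, https:// or ftp://
scheme_re = re.compile(r"^(https?|ftp)://")

def host_of(url):
    """Return the network location of `url` (hypothetical helper)."""
    # Without a scheme, urlparse puts everything in .path, so add a placeholder one.
    if not scheme_re.match(url):
        url = "http://" + url
    return urlparse(url).netloc

# No scheme: netloc is empty and the whole string ends up in the path.
print(repr(urlparse("x.com.br/qweqwe").netloc))
print(urlparse("x.com.br/qweqwe").path)

for sample in ["x.com.br/qweqwe", "https://www.y.com.br/sdfsfs",
               "www.j.com.br/", "ftp://files.example.com/a"]:
    print(sample, "->", host_of(sample))
```

Note that netloc also includes a port or user info when the URL has them; urlparse(url).hostname may be a better fit if only the host name is wanted.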


                                    
answered 12.11.2018 / 13:13