What makes join() so superior to other concatenation techniques?

7

It's common to read that join() is far superior to other string concatenation techniques in Python (such as + or +=). With that in mind, I have some questions:

  • What makes join() so fast?
  • Should I always use it, or are there cases where join() is not a good fit?
  • Are there methods faster than join() for string concatenation in Python?

asked by anonymous 17.09.2016 / 00:08

2 answers

8

Problem

There is a problem that one of the creators of this site (SO) calls Shlemiel the painter's algorithm.

A painter is painting the dividing line of a highway. He starts off very well, with high productivity. But each day he produces less, until the work becomes unfeasible. This happens because he keeps the paint can in a fixed place, so he paints a stretch of the line and has to walk back to the starting point to wet the brush. Each day he is farther from the can and spends more time walking than painting.

Imagine this pattern growing tens, hundreds, or thousands of times. It quickly becomes unfeasible.

This is often the case with concatenations of data collections, especially strings. As the data grows and no longer fits in the space that held the previous version, a new allocation has to be made to hold the full size of the new version. The old one, now garbage, has to be deallocated, and memory gets fragmented. All of this costs time. In some languages the situation is worse, because even a change that does not exceed the current size limit causes a new allocation, for a good reason. This is the case with Python.
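To put rough numbers on this, here is a minimal sketch of a cost model. It is not a measurement: it assumes the worst case, where every concatenation writes out a brand-new string, and the function names are hypothetical. Real interpreters may do better in some situations.

```python
def copies_with_repeated_concat(pieces):
    """Characters written if every concatenation builds a new string."""
    copied = 0
    current_len = 0
    for piece in pieces:
        current_len += len(piece)
        copied += current_len  # the whole new string is written out again
    return copied


def copies_with_single_allocation(pieces):
    """Characters written if the final buffer is allocated only once."""
    return sum(len(piece) for piece in pieces)


pieces = ["x"] * 1000
print(copies_with_repeated_concat(pieces))    # 500500
print(copies_with_single_allocation(pieces))  # 1000
```

Under this model the first approach grows quadratically with the number of pieces, while the second grows linearly.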

Solution

The right way is to figure out the final size, or at least an approximation of it, and allocate everything you need up front; then you can place the texts in this reserved area. Obviously this has to be done in a structure that allows the text to be changed, which is not the case for the string type: it is immutable, that is, any change generates a new object.

join() does exactly what I described. It works out the total size needed (by adding up the sizes of all the strings to be concatenated), allocates all the necessary space, and copies the texts into that space, which is not yet a string. At the end it turns this into a string. The cost is therefore proportional to the total size of the text, which is much lower than walking over everything again on each concatenation.
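The steps described above can be sketched by hand. This is only an illustration of the idea, not CPython's actual implementation: the name manual_join is hypothetical, and a bytearray stands in for the raw space that is not yet a string.

```python
def manual_join(pieces):
    """Concatenate strings using one preallocated buffer (illustrative only)."""
    encoded = [piece.encode("utf-8") for piece in pieces]
    total = sum(len(chunk) for chunk in encoded)  # pass 1: find the final size
    buffer = bytearray(total)                     # single allocation
    offset = 0
    for chunk in encoded:                         # pass 2: copy each piece into place
        buffer[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    return buffer.decode("utf-8")                 # only now does it become a str


print(manual_join(["abc", "def", "ghi"]))  # abcdefghi
```

In real code ''.join(pieces) is the tool to use; the sketch pays extra for encoding and decoding and exists only to make the two passes visible.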

Note that for a few concatenations, typically up to 4, plain concatenation may perform even better than join(). Of course, at such a small volume it hardly matters which one is faster.

Alternatives

Of course, join() is not the only way to do this. You can do it manually if you need something a little more complex than join() can handle, perhaps using a bytearray or a standard list, both of which are mutable. They help, but are not ideal: they may still need new allocations, although far fewer than one per change, and how many depends on the skill of the programmer.
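As a minimal sketch of these two manual approaches, the functions below build a comma-separated string of even numbers. The function names and the even-number filter are made up for illustration.

```python
def build_with_list(n):
    """Collect pieces in a mutable list, then join once at the end."""
    parts = []
    for i in range(n):
        if i % 2 == 0:  # arbitrary per-item logic decided while building
            parts.append(str(i))
    return ",".join(parts)


def build_with_bytearray(n):
    """Grow a mutable bytearray in place, then decode once at the end."""
    buf = bytearray()
    for i in range(n):
        if i % 2 == 0:
            if buf:  # add a separator before every piece except the first
                buf += b","
            buf += str(i).encode("ascii")
    return buf.decode("ascii")


print(build_with_list(6))       # 0,2,4
print(build_with_bytearray(6))  # 0,2,4
```

Both containers may reallocate as they grow, but they do so occasionally rather than on every change.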

The Python page also shows how to use %s to get similar results. The formatting happens inside a function that manipulates everything in another data structure, and only at the end is the final string generated.
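A small example of the %s approach, with made-up values, assuming only a handful of pieces whose positions are known in advance:

```python
name = "spam"
count = 3

# All pieces are substituted in one formatting call instead of chained + operations.
message = "%s was ordered %s times" % (name, count)
print(message)  # spam was ordered 3 times
```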

Some people like to use StringIO to take care of this.
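A minimal sketch of the StringIO approach, using io.StringIO from the standard library (the helper name build_with_stringio is hypothetical):

```python
import io


def build_with_stringio(pieces):
    """Write each piece to an in-memory text buffer and read it back once."""
    buffer = io.StringIO()
    for piece in pieces:
        buffer.write(piece)
    return buffer.getvalue()  # the final string is produced only here


print(build_with_stringio(["abc", "def", "ghi"]))  # abcdefghi
```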

I have answered this in more detail for other languages, such as Java and C#.

    
17.09.2016 / 00:18
5

From the article: Efficient String Concatenation

Method 1 (concatenation)

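# loop_count is the number of iterations; the value here is an assumed example
loop_count = 100000

def method1():
    out_str = ''
    for num in range(loop_count):  # range() in Python 3 (xrange() in Python 2)
        out_str += str(num)        # each += may build a brand-new string
    return out_str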

Method 4 (join)

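def method4():
    str_list = []
    for num in range(loop_count):  # loop_count: iteration count defined elsewhere
        str_list.append(str(num))  # collect the pieces in a mutable list
    return ''.join(str_list)       # build the final string in one step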

Method 4 (join) is significantly faster than concatenation.

This is because strings are immutable, that is, they cannot be changed. To "change" one, you need to create a new representation (a concatenation of the two) and then destroy the old strings. The join is faster because Python is able to optimize this process.
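If you want to check this on your own machine, here is a minimal timing sketch using the standard timeit module. The loop size and repeat count are arbitrary assumptions, and the function names are mine, not the article's. Results vary with the Python version and implementation; CPython in particular can sometimes extend a string in place on +=, which narrows the gap.

```python
import timeit

LOOP_COUNT = 10_000  # assumed size, adjust as needed


def concat_loop():
    out = ""
    for num in range(LOOP_COUNT):
        out += str(num)
    return out


def join_list():
    parts = []
    for num in range(LOOP_COUNT):
        parts.append(str(num))
    return "".join(parts)


# Both approaches must produce the same string before timing them.
assert concat_loop() == join_list()

for func in (concat_loop, join_list):
    seconds = timeit.timeit(func, number=5)
    print(f"{func.__name__}: {seconds:.4f}s")
```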

The text Python: string concatenation VS list join is also very interesting and digs into the source code of the CPython implementation to find the answer:

  

When using the join method, Python allocates memory for the final string only once; but if you concatenate multiple strings in succession, Python has to allocate new memory for each concatenation. Guess what's fastest? ;)

That is, this is identical in terms of performance:

final_str = 'abc' + 'def'
final_str = ''.join(['abc', 'def'])  # no difference in performance (join() takes a single iterable)

If you concatenate more than two strings, join will be faster:

final_str = 'abc' + 'def' + 'ghi'  # two successive operations are performed here
final_str = ''.join(['abc', 'def', 'ghi'])  # only one is performed here
    
17.09.2016 / 00:18