Python Is Slow

As a matter of intellectual honesty I feel it’s necessary to state unequivocally that Python is slow. Python is absoluetly, incredibly, and sometimes painfully slow. There are always a lot of “buts” and coping mechanisms that couch this simple fact.

  • But Numpy…
  • But Numba…
  • But PyPy…
  • But language X is even slower…
  • But it doesn’t matter most of the time…
  • But selecting a language based on speed is a form of premature optimization…
  • But you can still use Cython or write C plugins…
  • But Python is so good at…
  • But it depends on what you mean by “slow”…

But I say it’s okay. We all intuitively know what it means for Python to be a slow language. We still use and love it. There’s nothing to be scared of — it’s a morally neutral statement. It is both honest and totally satisfactory to state it up front.

Less Python Is Faster Python

By admitting that Python is slow, we learn the main way of writing efficient Python: execute as little code as possible. If there’s two ways to accomplish some task, it’s usually better to choose the path involves less Python code. Less code also means less code to read for others, less code is more maintainable, less code is more extensible, less code can be rewritten more easily. From this we get two guidelines on how to approach efficient Python programming:

  1. Learn idioms, the standard library, and esoterics. Learn succinct coding.

    Try to leverage Python’s powerful expressiveness as much as possible. If you need to reverse a string, just use s[::-1] rather than something more explicit and arguably less “magic” like ''.join(s[i-1] for i in range(len(s), 0, -1))1.

    Knowledge is power. If you want to select random characters from some alphabet a lot of beginners would do [random.choice('ABCDwxyz') for _ in range(10_000)] or similar, perhaps not realizing that there exists random.choices('ABCDwxyz', k=10_000) which is both simpler and faster.

    On encountering anything that involves a loop, have a moment to think if there’s some way to leverage Python’s built-in functionality or the standard library to do the same thing as a single expression, moving the loop to the C-side, or at least to the standard library where the code is likely to be better and faster than yours.

    A good way to gain knowledge like this is to actually look at what people do in code golf. Another option is just plain regular ad hoc scripts, the messier the better. Perhaps counter-intuitively, specialized or “dirty” code like this is much more densely packed with information and learning opportunities for the reader, because it’s unfiltered and has a high expression range. Highly developed generic and abstract code is, conversely, terrible for learning. All the code is a homogenized and consists of doc-strings and methods that simply defer to other methods, there’s hierarchies of plugins and backends and factories and proxies, and little of the code is actually interesting. While maybe good from an API perspective, polished, abstract code is a wasteland when it comes to learning opportunities.

  2. Do not worry about redundancy. Use the general solution.

    Intuitive efficiency does not translate into real efficiency in Python. For example it’s far better to execute highly redundant code on the native C side, like something that loops over a structure in several passes, than to do the loop even once on the Python side.

    If you have a string that consists of a billion words and numbers in some relative proportion, and you want to end up with a list of the numbers and then another list of the words (but uppercased), it’s far better to just uppercase the entire string in a single call, than it is to first do the separation and then to only uppercase the strings that need it. (In fact, this might hold even for C, depending on the proportion, and how the string is processed.)

    Similarly, it tends to be better to simply work with a single large matrix in Numpy and use masks to manipulate parts of it, than it is to work with a ton of individual small matrices where you programmatically select and update “only the ones that need updating,” as long as it leads to less code being executed in Python.

We might refocus these guideslines slightly, and consider what to do if we actually do need “speed” in Python. In that case:

  1. Use an external library that does what you want to do. Examples include cryptography, graphics, event handling, and the like.

  2. Or Numpy. Numpy is the answer to a lot of questions. In my opinion, Numpy ought to be built into Python itself, it ought to come with the standard library. If you have lists or arrays and you’re looping over these these to do some kind of filtering or computation, then in almost all cases you can replace the loops with Numpy indexing expression that translates the loops from the Python side to the C side.

Case Study

Here’s a little exploration I did based on AoC, day 3. The input is a single string that consists of n-length sequences of '0' and '1', each sequence separated by a newline. The output should be a list of numbers where each is the number these substrings that has a '1' character at the given position. Or, rephrased: you get a single string that can be divided up into lines where each line is a row in a matrix, and you want to count how many '1' characters there are in each column of this matrix.

Below is the code using various tricks and idioms.

See if you can predict which performs better in what circumstance. Results are at the bottom.

# The functions below take two arguments:
# `inp` (str) consists of rows of 0s and 1s separated by a newline, e.g. '1011\n1111\n1001\n1000\n'
# `n` (int) is the number of columns in each line/row. Some functions do not need/rely on this.
# Benchmarks are run on two sets:
# A "narrow" set with 10_000 lines and 8 columns, and a "wide" set with 2_000 lines, and 40 columns.

def py_naive(inp, n):
  """Arguably the most obvious and straightforward way to do it.

  Split the string into lines, then for each column, count the number of lines
  that has a '1' in that position.

  lines = inp.split()
  return [sum(line[k] == '1' for line in lines) for k in range(n)]

def py_zip_join(inp, _n):
  """The usual transpose-zip combined with a join.

  return [''.join(x).count('1') for x in zip(*inp.split())]

def py_zip_counter(inp, _n):
  """Like `py_zip_join()` but using a `Counter()` for the count on the
  resulting tuples.

  return [Counter(x)['1'] for x in zip(*inp.split())]

def py_clever(inp, n):
  """Under the assumption that the input format is set in stone, we can cleverly
  slice the string into substrings consisting of the desired columns.

  return [inp[k::n+1].count('1') for k in range(n)]

def ex_resplit(inp, n):
  """This is just to show the difference with `py_clever()` where we cannot assume
  uniformity of input, so we force a `.split()` and then join it up again to add
  artificial overhead.

  inp = ''.join(inp.split())
  return [inp[k::n].count('1') for k in range(n)]

def np_count(inp, n):
  """Elaborating on the `py_clever()` and under the same assumption (that the
  input format is absolute), we can use Numpy to accomplish the same sort of
  count-by-columns directly, though here without the need to create per-column

  mat = np.frombuffer(inp.encode(), 'u1').reshape(-1, n+1)
  return list((mat == ord('1')).sum(0)[:-1])

def np_split(inp, n):
  """This looks ugly but it's just an artificial example to show the case where we
  actually `split()` the incoming string.

  return list(np.array(inp.split(), f'S{n}')[...,None].view('S1').astype('i4').sum(0))

def np_itersplit(inp, n):
  """Fully splitting all the values before passing them off to numpy for the sum.

  mat = np.array(list(map(list, inp.split())), 'i4')
  return list(mat.sum(0))

def py_bigint(inp, n):
  """Over-engineered solution translating the entire string to a single integer
  and masking out the bits in question.

  Although useless here since we still end up converting back to strings, it's a
  generally useful trick in other areas.

  inp = inp.replace('\n', '')
  as_int = int(inp, 2)
  bit_1 = 2**(8 * len(inp)) // (2**n - 1)
  return [bin(as_int & (bit_1 << k)).count('1') for k in range(n)][::-1]

def py_regex(inp, n):
  """Another over-engineered solution using a regex search to count each of the
  patterns of a column having a '1'.

  Also useless here, but a useful trick.

  return [len(re.findall('^' + '.'*k + '1', inp, re.M)) for k in range(n)]

def py_enumcounter(inp, n):
  """Cycle a counter together with the string, counting all occurrence-combinations.

  Note that this counts everything, including `'0'` characters and newlines.

  cnts = Counter(zip(itt.cycle(range(n+1)), inp))
  return [cnts[(k,'1')] for k in range(n)]

def py_num_trick(inp, n):
  """A variant on the numerical tricks, converting the number to
  (pseudo-)base-65536 numbers which can be summed directly without the digits
  overflowing. Then a formatting trick to split the resulting number up by
  column sums.

  cols = sum(int(x, 16) for x in inp.replace('0','0000').replace('1','0001').splitlines())
  return [int(x, 16) for x in f'{cols:_x}'.split('_')]

Beauty as a Side-effect

Python is slow. But in writing efficient Python one starts to develop an intuition that succinct code is usually more efficient code. In my opinion, this is a very fortunate heuristic, one that speaks to the heart of what is beautiful about Python.

This heuristic does not hold for all languages. For most compiled languages, it tends to be the opposite. Performance-optimized Haskell tends to be uglier, more verbose (and more imperative) than succinct or “naive” Haskell. Performant languages like C also tend to blow up to some degree when they’re optimized to the point of striding loops, structs are split into arrays, and compiler hints are inserted everywhere. If you’ve ever seen C code that tries to convey likely and unlikely branches in if-statements you know what I mean.

Many dynamic languages that lack Python’s expressiveness or syntactic freedom also do not show this side-effect. In these languages one would tend to break down the abstractions and write as C-like code as possible, praying the interpreter has an easier time recognizing patterns and/or triggering its JIT compilation. One example is JavaScript2, which offers way less in terms of raw language power or a standard library, and where people still write C-style for-loops.

When you have a very expressive (but slow) dynamic language it leads to a sort of harmony where elegance and efficiency tend to align rather than diverge. Instead of being told to break down all the abstractions and write assembly-like low-level code (such as in performant unboxed Haskell or JIT-friendly JavaScript), you’re actively encouraged to go even higher, and try to pack as much power into every expression as possible.

Ruby would be an example — another slow, dynamic language — that appears to share the same harmony. I don’t know much about it, but I nonetheless expect succinct Ruby to be efficient Ruby.

Saying that Python’s slowness encourages a beautiful succinct aesthetics isn’t an excuse though. Of course I’d much rather have Python as-is be fast. I’m not arguing that slowness is a good thing3.

Although, for me, Python (and Ruby) presents a tenent that I wish was true for all languages: high-level code should not be slower than low level code. I don’t know why so many accept it as inviolable mathematical fact that high level code is necessarily slower than low level code. The relationship is at most =<, not <. Thus I also strongly believe that in natively compiled languages, all abstractions should be zero-cost. If they are not that is a failure of language design, albeit one we might have to live with temporarily.

Benchmark Results

Narrow lines:

------------------------------------------------ benchmark: 12 tests -------------------------------------------------
Name (time in us)            Mean                 Median                StdDev                   OPS            Rounds
test_np_count            201.1564 (1.0)         196.4490 (1.0)         16.4426 (1.0)      4,971.2569 (1.0)       12198
test_clever              376.4197 (1.87)        342.8305 (1.75)        78.1109 (4.75)     2,656.6094 (0.53)       7420
test_ex_resplit          641.5690 (3.19)        624.5760 (3.18)        59.4850 (3.62)     1,558.6789 (0.31)       3925
test_zip_join          2,776.8000 (13.80)     1,845.5670 (9.39)     2,572.5337 (156.46)     360.1268 (0.07)       1360
test_bigint_trick      3,262.8631 (16.22)     3,232.3440 (16.45)      229.3595 (13.95)      306.4793 (0.06)        847
test_zip_counter       4,419.2994 (21.97)     3,496.5080 (17.80)    2,598.6162 (158.04)     226.2802 (0.05)        693
test_naive             4,609.4335 (22.91)     4,600.3455 (23.42)      170.4481 (10.37)      216.9464 (0.04)        686
test_num_trick         4,639.5709 (23.06)     4,566.1065 (23.24)      227.4872 (13.84)      215.5372 (0.04)        646
test_np_split          5,373.7856 (26.71)     5,347.6110 (27.22)      135.4890 (8.24)       186.0886 (0.04)        561
test_enum_counter      7,680.5651 (38.18)     7,634.6120 (38.86)      192.3201 (11.70)      130.1988 (0.03)        379
test_regex             7,758.5097 (38.57)     7,698.3045 (39.19)      433.0626 (26.34)      128.8907 (0.03)        402
test_np_itersplit     11,649.3359 (57.91)    10,608.8820 (54.00)    2,946.2972 (179.19)      85.8418 (0.02)        280

Wider lines:

------------------------------------------------- benchmark: 12 tests -------------------------------------------------
Name (time in us)            Mean                 Median                StdDev                    OPS            Rounds
test_np_count             70.8971 (1.0)          68.6465 (1.0)         11.1233 (1.0)      14,104.9396 (1.0)       25304
test_clever              393.2690 (5.55)        358.8175 (5.23)        83.2271 (7.48)      2,542.7888 (0.18)       7230
test_ex_resplit          517.6856 (7.30)        483.3975 (7.04)        90.4284 (8.13)      1,931.6744 (0.14)       5256
test_zip_join          1,512.8239 (21.34)     1,332.8995 (19.42)    1,037.6603 (93.29)       661.0155 (0.05)       1876
test_zip_counter       2,997.4516 (42.28)     2,860.1275 (41.66)      874.2208 (78.59)       333.6167 (0.02)        276
test_num_trick         3,472.1816 (48.97)     3,397.6545 (49.49)      298.9875 (26.88)       288.0034 (0.02)        834
test_naive             4,335.9662 (61.16)     4,218.6585 (61.45)      446.8619 (40.17)       230.6291 (0.02)        598
test_np_split          4,717.6956 (66.54)     4,682.5300 (68.21)      169.6819 (15.25)       211.9679 (0.02)        677
test_bigint_trick      6,445.4938 (90.91)     6,368.7445 (92.78)      429.7515 (38.64)       155.1472 (0.01)        424
test_np_itersplit      6,954.7329 (98.10)     6,807.0015 (99.16)    1,018.6224 (91.58)       143.7870 (0.01)        422
test_enum_counter      7,935.9839 (111.94)    7,903.2715 (115.13)     272.1491 (24.47)       126.0083 (0.01)        382
test_regex            23,734.6893 (334.78)   23,165.2080 (337.46)   2,301.2013 (206.88)       42.1324 (0.00)        129

As expected np_count() shines, notably because it can exploit Python’s buffer views. But what I really want to highlight is how good py_clever() is. It’s simple, straightforward, and the fastest pure-Python variant. Compared with the naive way which performs about a full order of magnitude worse.

Next comes the zip-join, which is arguably the most idiomatic Python, and it’s still better than the “brute force” loops.

The extra-clever tricks of using numbers or regex don’t perform well here, but they are mostly included for illustrative purposes4. And np_itersplit shows what happens when using libraries in a bad way: doing all the splitting on the Python side circumvents half the point of using Numpy.

  1. The absolute worst is to write explicit C-like for-loops that constantly indexes and updates variables. It’s not more readable. It’s just bad.

  2. I’m actually talking about “naive” JavaScript here that doesn’t have any JIT compilation, but which stays a dynamic interpreted language. Due to JIT compilation, performance in JavaScript is a whole research field in and by itself.

  3. I also will never truly forgive Python for using a homemade ad hoc bigint implementation over something like GMP.

  4. Though py_num_trick() performs pretty well for what it is.