tadhg.com
tadhg.com
 

Python Flatten/Concatenate Comparisons

23:31 Fri 09 Apr 2010. Updated: 15:28 18 Jan 2011
[, ]

I’m going to compare seven different ways in Python to make one list out of several lists containing different numbers of elements, something that strikes me as a common but not necessarily everyday operation.

The philosophy of Python is that generally there should be one obvious and reasonable way to do things. I really like this philosophy—except when the Python way isn’t the way I prefer, of course.

I seem to instinctively like functional programming approaches, or at least an approach of breaking things down into small functions, each of which takes up a small amount of space in terms of code and which does just one thing. More and more I find myself thinking that if I can express succinctly (in English) what a block of code should do[1], the block shouldn’t be more than a couple of lines. I think this drive pushes me away from the Pythonic way of doing things.

So we have a list of lists. We want to make the list of lists into one list. [[1], [2, 3]] should become [1, 2, 3]. While trying to find the “right” way to do this trivial task, I became a little obsessed, hence this post and its seven approaches.

The first one is the one I suspect Python’s Benevolent Dictator For Life would want me (and everyone else) to use:

def flatten_for_loop(l):
    fl = []
    for i in l:
        for j in i:
            fl.append(j)
    return fl

It does the job, certainly. But at six lines, it just bugs me. My instinct tells me that “concatenate the lists” shouldn’t take that many lines. It’s possible that my instinct is simply off, but it’s pretty insistent.

This was my next idea:

from operator import add
def flatten_reduce_add(l):
    return reduce(add, l)

Yes, the much-maligned reduce. This does seem like a completely natural fit for it, however. I like this solution, but I don’t like having to import add, and I don’t like using reduce when in future it’ll need to be imported as well.

def flatten_lc(l):
    return [i for j in l for i in j]

While I love list comprehensions, this one just doesn’t sit well. Even though I know what it does, I have trouble parsing it for some reason—and that’s really a sign that I shouldn’t use it, since someone who doesn’t know what it does would have a significantly harder time.

Before discovering better ways, I went on to create my own function to do it:

def listadd(l, new=[]): return listadd(l, new + l.pop(0)) if l else new
def flatten_custfunc(l, new=[]):
    return listadd(l[:])

This modifies the list, necessitating the creation of a copy, and overall it’s pretty inefficient; I didn’t like it much, which is why I kept looking.

def flatten_sum(l):
    return sum(l, [])

Possibly my favorite. The problem is that since I didn’t know that you could use sum on non-numbers—by giving it a value to start off with, which it normally defaults to 0—I suspect a lot of other people wouldn’t know it either, so it might cause confusion. I’m on the fence on this one—after all, the name gives it away, and so its intent should be quite clear. But there are cases where a reader might think that it was going to get the sum of all the numbers in the lists.

from itertools import chain
def flatten_chain(l):
    return [i for i in chain(*l)]

Another favorite. This one seems quite clear to me (in part because the name helps), but again it requires an import.

def flatten_map_extend(l):
    new = []
    map(new.extend, l)
    return new

Better than the original for-loop version, but it rubs me the wrong way—probably because I’ve seen solutions like the ones laid out above and their concision is exerting a dark influence.

Naturally, the next thing to do was to check the performance of the various options.

So I ran each a million times against a number of different inputs. The first input was a list of tuples and dicts:

[
    [(1, {’type’: ’type1’, ’name’: ’name1’})],
    [(2, {’type’: ’type2’, ’name’: ’name2’})],
    [(3, {’type’: ’type3’, ’name’: ’name3’})],
    [(4, {’type’: ’type4’, ’name’: ’name4’})],
    [(9, {’type’: ’type9’, ’name’: ’name9’}), (10, {’type’: ’type9’, ’name’: ’name9’}), (11, {’type’: ’type9’, ’name’: ’name9’}), (12, {’type’: ’type9’, ’name’: ’name9’})],
    [(13, {’type’: ’type9’, ’name’: ’name9’})],
    [(23, {’type’: ’type9’, ’name’: ’name9’})]
]

And these are the times it took for each function to run 1000000 times with that input:

flatten_map_extend

9.0762510299682617

flatten_lc

9.6673529148101807

flatten_reduce_add

9.8464798927307129

flatten_sum

10.240524053573608

flatten_chain

10.286052942276001

flatten_for_loop

11.377233028411865

flatten_custfunc 16.330123901367188

Next were lists of numbers:

[
    [0],
    [1],
    [2],
    range(3,100),
    range(100,200),
    range(200,300),
    range(300,400),
    range(400,500),
    range(500,600),
    range(600,700),
    range(700,800),
    range(800,900),
    range(900,1000),
]
flatten_map_extend

34.196415901184082

flatten_sum

61.231782913208008

flatten_reduce_add

65.624041080474854

flatten_custfunc

75.283849000930786

flatten_lc

96.376346826553345

flatten_chain

126.78751182556152

flatten_for_loop 213.21602821350098

Finally, lists of strings:

[
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
    ["abcde"] * 10,
]
flatten_map_extend

11.517388820648193

flatten_reduce_add

15.214270114898682

flatten_sum

15.836833000183105

flatten_lc

20.376528024673462

flatten_chain

22.739732027053833

flatten_custfunc

23.072113990783691

flatten_for_loop 33.499195098876953

I’m running these tests on a 2.5 GHz Intel Core 2 Duo machine, 4GB RAM, OS X, Python 2.6.

I don’t know how informative they were, but using map and list.extend seems like the most performant option. Which is unfortunate from my point of view, since I don’t like the way that looks, but since I should remember to turn it into a function whenever I need it, I should be able to deal with that.

It must be said that neither performance nor code readability (for programmers other than myself) matter in the project I was working on when I decided to explore this question, but I do want to know the right way to do it; it appears that I found a reasonable answer, although I’m definitely going to be tempted to use sum instead some of the time…

[1] Of course, I mean a succinct, specific, and complete explanation, rather than e.g. “it should solve my problem”, as going down that track would lead one to think that all programs should be one line long. They shouldn’t. But the line between concision and obscurity is subjective, and I’m trying to fine-tune my subjective take on that particular question of style.

Leave a Reply