# Test if lists share any items in python

10.5k Views

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

``````In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
....:     for i in a:
....:         if i in b:
....:             return True
....:     return False
....:

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
....:     return len(set(a).intersection(set(b))) > 0
....:
``````
• 1
• Note that you cannot distinct `True` from `1` and `False` from `0`. `not set([1]).isdisjoint([True])` gets `True`, same with other solutions.
• The only optimizations I can think of is dropping `len(...) > 0` because `bool(set([]))` yields False. And of course if you kept your lists as sets to begin with you'd save set creation overhead.
• 313

Short answer: use `not set(a).isdisjoint(b)`, it's generally the fastest.

There are four common ways to test if two lists `a` and `b` share any items. The first option is to convert both to sets and check their intersection, as such:

``````bool(set(a) & set(b))
``````

Because sets are stored using a hash table in Python, searching them is `O(1)` (see here for more information about complexity of operators in Python). Theoretically, this is `O(n+m)` on average for `n` and `m` objects in lists `a` and `b`. But 1) it must first create sets out of the lists, which can take a non-negligible amount of time, and 2) it supposes that hashing collisions are sparse among your data.

The second way to do it is using a generator expression performing iteration on the lists, such as:

``````any(i in a for i in b)
``````

This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the `in` operator is always `O(n)` on lists (see here).

Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:

``````a = set(a); any(i in a for i in b)
``````

A fourth approach is to take advantage of the `isdisjoint()` method of the (frozen)sets (see here), for example:

``````not set(a).isdisjoint(b)
``````

If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:

``````from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974
``````

Here's a graph of the execution time for this example in function of list size:

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the `isdisjoint()` method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

``````>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668
``````

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.

Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):

High chance of sharing: elements are randomly taken from `[1, 2*len(a)]`. Low chance of sharing: elements are randomly taken from `[1, 1000*len(a)]`.

Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example `a` is much smaller, `isdisjoint()` is always faster:

Make sure that the `a` list is the smaller, otherwise the performance decreases. In this experiment, the `a` list size was set constant to `5`.

In summary:

• If the lists are very small (< 10 elements), `not set(a).isdisjoint(b)` is always the fastest.
• If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression `any(i in a for i in b)` is the fastest on large list sizes;
• Test the set intersection with `not set(a).isdisjoint(b)`, which is always faster than `bool(set(a) & set(b))`.
• The hybrid "iterate through list, test on set" `a = set(a); any(i in a for i in b)` is generally slower than other methods.
• The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.

In most cases, using the `isdisjoint()` method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.

• 1
• That's some useful data there, shows that big-O analysis isn't the be all and end all of reasoning about running time.
• what about the worst case scenario? `any` exits at the first non-False value. Using a list where the only matching value is at the end, we get this: `timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,-0,-1)]", number=1000) 13.739536046981812` `timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,-0,-1)]", number=1000) 0.08102107048034668` ... and it's with 1000 iterations only.
• 2
• Thanks @RobM for the information. I've updated my answer to reflect this and to take into account the other techniques proposed in this thread.
• It should be `not set(a).isdisjoint(b)` to test if two lists share a member. `set(a).isdisjoint(b)` returns `True` if the two lists do not share a member. The answer should be edited?
• Thanks for the heads up, @Guillochon, it's fixed.
``````def lists_overlap3(a, b):
return bool(set(a) & set(b))
``````

Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an `if` statement, just use `if set(a) & set(b):`

• This is worst-case O(n + m). However, the down side is that it creates a new set, and doesn't bail out when a common element is found early.
• 1
• I curious as to why this is `O(n + m)`. My guess would be that sets are implemented using hash-tables, and thus the `in` operator can work in `O(1)` time (except in degenerate cases). Is this correct? If so, given that hash tables have worst case lookup performance of `O(n)`, does this mean that in the unlike worse case it will have `O(n * m)` performance?
• 2
• @fmark: Theoretically, you are right. Practically, nobody cares; read the comments in Objects/dictobject.c in the CPython source (sets are just dicts with only keys, no values) and see if you can generate a list of keys that will cause O(n) lookup performance.
• 1
• Okay, thanks for clarifying, I was wondering if there was some magic going on :) . While I agree that practically I don't need to care, it is trivial to generate a list of keys that will cause `O(n)` lookup performance ;), see pastebin.com/Kn3kAW7u Just for lafs.
``````def lists_overlap(a, b):
sb = set(b)
return any(el in sb for el in a)
``````

This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to `any`'s short-circuiting.

E.g.:

``````lists_overlap([3,4,5], [1,2,3])
``````

will return True as soon as it gets to `3 in sb`

EDIT: Another variation (with thanks to Dave Kirby):

``````def lists_overlap(a, b):
sb = set(b)
return any(itertools.imap(sb.__contains__, a))
``````

This relies on `imap`'s iterator, which is implemented in C, rather than a generator comprehension. It also uses `sb.__contains__` as the mapping function. I don't know how much performance difference this makes. It will still short-circuit.

• 2
• Loops in intersection approach are all in C code; there's one loop in your approach that includes Python code. The big unknown is whether an empty intersection is likely or unlikely.
• 1
• You could also use `any(itertools.imap(sb.__contains__, a))` which should be faster still since it avoids using a lambda function.
• 1
• Thanks, @Dave. :) I agree removing the lambda is a win.

You could also use `any` with list comprehension:

``````any([item in a for item in b])
``````
• 2
• You could, but the time is O(n * m) whereas the time for the set intersection approach is O(n + m). You could also do it WITHOUT list comprehension (lose the `[]`) and it would run faster and use less memory, but the time would still be O(n * m).
• 2
• While your big O analysis is true, I suspect that for small values of n and m the time it takes to build the underlying hashtables will come into play. Big O ignores the time it takes to compute the hashes.
• Building a "hashtable" is amortised O(n).
• 2
• I get that but the constant you're throwing away is pretty big. It doesn't matter for large values of n, but it does for small ones.

In python 2.6 or later you can do:

``````return not frozenset(a).isdisjoint(frozenset(b))
``````
• 2
• It seems like one doesn't have to supply a set or frozenset as the first argument. I tried with a string and it worked (i.e.: any iterable will do).

You can use the any built in function /w a generator expression:

``````def list_overlap(a,b):
return any(i for i in a if i in b)
``````

As John and Lie have pointed out this gives incorrect results when for every i shared by the two lists bool(i) == False. It should be:

``````return any(i in b for i in a)
``````
• 2
• Amplifying Lie Ryan's comment: will give wrong result for any item x that's in the intersection where `bool(x)` is False. In Lie Ryan's example, x is 0. Only fix is `any(True for i in a if i in b)` which is better written as the already seen `any(i in b for i in a)`.
• 1
• Correction: will give wrong result when all items `x` in the intersection are such that `bool(x)` is `False`.

This question is pretty old, but I noticed that while people were arguing sets vs. lists, that no one thought of using them together. Following Soravux's example,

Worst case for lists:

``````>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234
``````

And the best case for lists:

``````>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707
``````

So even faster than iterating through two lists is iterating though a list to see if it's in a set, which makes sense since checking if a number is in a set takes constant time while checking by iterating through a list takes time proportional to the length of the list.

Thus, my conclusion is that iterate through a list, and check if it's in a set.

• Using the `isdisjoint()` method on a (frozen)set as indicated by @Toughy is even better: `timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)` => 0.00913715362548828

if you don't care what the overlapping element might be, you can simply check the `len` of the combined list vs. the lists combined as a set. If there are overlapping elements, the set will be shorter:

`len(set(a+b+c))==len(a+b+c)` returns True, if there is no overlap.

• If the first value overlaps it will still convert the entire list to a set, no matter how large.

I'll throw another one in with a functional programming style:

``````any(map(lambda x: x in a, b))
``````

Explanation:

``````map(lambda x: x in a, b)
``````

returns a list of booleans where elements of `b` are found in `a`. That list is then passed to `any`, which simply returns `True` if any elements are `True`.