Is Unique Problem
Is Unique: Implement an algorithm to determine if a string has all unique characters.
There are a few ways to solve this problem.
1. If we assume an ASCII string then we know the alphabet has 128 characters (or 256 for extended ASCII). In this case we can use a buffer of length 128 to indicate whether or not the character at index i
is in the string. The second time we see this character we can immediatly return false.
1
2
3
4
5
6
7
8
9
10
11
def is_unique(s1):
if len(s1) > 128: # string cannot be unique if its length is greater than the total alphabet size (128 for ASCII string)
return False
lst = [None] * 128
for char in s1:
if lst[ord(char)]:
return False
lst[ord(char)] = 1
return True
The time complexity for this is O(n)
where n is the length of the string. The space complexity is O(1)
because we use a fixed size buffer.
2. We can take the original string and remove all duplicates (if any). If the length of the set after removing duplicates is equal to the length of the original string, then the string must be unique.
1
2
def is_unique(s1):
return len(set(s1)) == len(s1)
The time complexity for this is O(n + n + n)
which is just O(n)
. The len()
function is O(n)
because it scans through the list and set(s1)
also takes O(n)
because it scans through every character in the string and adds it to a hash table. Adding to a hash table is usually O(1)
(unless we have collisions, in which case it is O(n)
).
3. If we sort the input string then all characters in the alphabet that are also in the string will be placed next to each other. Based on this fact, we can traverse the sorted string and check neighboring characters. If any two neighboring characters are different then we know the string is not unique.
1
2
3
4
5
6
def is_unique(s1):
s1_sorted = sorted(s1) # returns new list with sorted string
for i in range(len(s1) - 1):
if s1_sorted[i] == s1_sorted[i+1]:
return False
return True
The time complexity in this case is O(nlogn + n)
which is just O(nlogn)
(where n
is the length of s1
). The O(nlogn)
comes from the sorted()
function call which uses Timsort. The O(n)
comes from the fact that we iterate through s1
. Space complexity is O(n)
because the length of the returned (sorted) list s1_sorted
depends on the size of s1
which we call n
.
4. And, finally, my favorite solution: we assume the string only contains characters ‘a’ through ‘z’. Note:
- Python integer is 4 bytes = 32 bits
- English alphabet size is 26 characters
- Thus, we can fit each character into the integer representaion of 32 bits. Namely, each bit position 0-25 indicates a character. If the value at a bit position
i
is 0 then we say the character is not present, otherwise, if the value is 1, then we say it is present in the input string.
We can get the bit position of a character by using ord()
which returns the Unicode code for a character. Since ord('z') - ord('a')
equals 25 we know we have more than enough space using an integer which is 32 bits.
Our bit vector is an integer called checker
which we will set to 0: checker = 0
.
So, we get the bit position of some character (say ‘j’) by doing ord('j') - ord('a')
which is 9
. Next, we need to switch bit 9 in our temporary bit vector from 0
to 1
. Do this with bit shifting 1 nine times to the left: 1 << 9
.
To check if bit 9 was already 1 in our bit vector checker
we simply perform a bitwise AND with the temporary bit vector. If the result is > 0 then we know that bit position 9 was already switched to a 1 because:
10010 & 00010 = 00010
which is greater than 0.
If this is true, then return False
. Else switch bit position 9 in checker
to 1.
Do this with a bitwise OR (denoted as ‘|’ in Python):
10000 OR 00010 = 10010
Repeat this process for every character in the input string.
1
2
3
4
5
6
7
8
def is_unique(s1):
checker = 0
for char in s1:
val = ord(char) - ord('a')
if (checker & (1 << val) > 0):
return False
checker = checker | (1 << val)
return True
See another explanation of approach #4 here.