DSA
Introduction
Data structures provide an organised format for data storage, management and access. Most of the following data structures are used in four basic ways:
- Access - provided a location, return the item at that location, or alternatively, return all the items
- Search - look for a specific item or value, and it’s location
- Insert - add an item into the collection i.e. at the start, end or in between
- Delete - remove an item from the collection
Transformations like sorting and filtering can also be performed. The following are some common and basic data structures:
Arrays
General Arrays
Arrays are collections of items, typically stored in contiguous memory:
| Index | 0 | 1 | 2 | 3 | 4 | 5 |
|-------|---|---|---|---|---|---| | Value | 1 | 2 | 3 | 4 | 5 | 6 |
Efficiencies:
- Access: O(1)
- Search: O(n)
- Insertion: O(n)
- Deletion: O(n)
Unordered Array
Elements are stored in no particular order.
| Index | 0 | 1 | 2 | 3 | 4 | 5 |
|-------|---|---|---|---|---|---| | Value | 7 | 2 | 9 | 4 | 5 | 1 |
- Insertion: O(1) (add to the end)
- Deletion: O(n) (need to shift elements)
- Search: O(n) (linear search)
Ordered (Sorted) Array
Elements are stored in a specific order, typically ascending or descending.
| Index | 0 | 1 | 2 | 3 | 4 | 5 |
|-------|---|---|---|---|---|---| | Value | 1 | 2 | 4 | 5 | 7 | 9 |
- Insertion: O(n) (find position and shift elements)
- Deletion: O(n) (shift elements after deletion)
- Search: O(\log n) (binary search)
Circular Array
The last element is conceptually followed by the first element, forming a circle.
┌───┐
┌───┤ 5 ├───┐
│ 4 │ │ 1 │
├───┤ ├───┤
│ 3 │ │ 2 │ └───┴───┴───┘
- Useful for queue implementations
- Efficient use of fixed-size array
- Operations wrap around the end of the array
Jagged Array (Array of Arrays)
An array where each element is an array, possibly of different lengths.
| Index | Array |
|-------|----------------------|[1, 2, 3] |
| 0 | [4, 5] |
| 1 | [6, 7, 8, 9] |
| 2 | [10] | | 3 |
- Flexible structure for 2D data with varying row lengths
- Efficient for sparse matrices
- Access: O(1) for both dimensions
Sparse Array
An array where most elements have the same value (usually zero).
Logical view:
| 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
Actual storage (index-value pairs):
| 3:3 | 6:1 | 9:2 |
- Space-efficient for sparse data
- Slower access time compared to standard arrays
Sets
Array-based Sets
Array based sets are collections of items, stored in contiguous memory, that do not allow duplicate item to be inserted, and may or may not be sorted:
Unsorted:
| Index | 0 | 1 | 2 | 3 | 4 |
|-------|---|---|---|---|---| | Value | 5 | 2 | 8 | 1 | 3 |
Sorted:
| Index | 0 | 1 | 2 | 3 | 4 |
|-------|---|---|---|---|---| | Value | 1 | 2 | 3 | 5 | 8 |
Every insert requires a search first to determine the value to be inserted does not already exist.
Access: O(1)
Insertion:
- Unsorted: O(1) (append to end)
- Sorted: O(n) (find correct position and shift elements, to maintain order)
Deletion: O(n) (need to shift elements to fill the gap)
Search:
- Unsorted: O(n)
- Sorted: O(\log n) (using binary search)
Hash Table-based Sets
Hash table-based sets use a hash function (Section 1.11) to map elements to indices in an array.
Hash function: h(x) = x \mod 7
i.e. every input is divided by 7 and the remainder is the output of the hash function (output between 0 and 6)
Index | Bucket |
---|---|
0 | 7 |
1 | 1, 8 |
2 | 2 |
3 | 3 |
4 | |
5 | 5 |
6 |
- Collision - in this example, there is a collision at index 1, with values 1 and 8 returning the same output and therefore being assigned to the same bucket.
- Empty bucket - important to monitor (or be aware) of the distribution of data which may indicate the hashing function may need to be altered if there are many empty buckets.
Access: N/A (sets don’t typically support direct access)
Insertion: \theta(1) O(n)
Deletion: \theta(1) O(n)
Search: \theta(1) O(n)
Tree-based Sets
Tree-based sets store elements in a self-balancing binary search tree (BST). A binary search tree is a data structure composed of nodes. Each node has a key, and each key in the left subtree of a node is less than the node’s key, while each key in the right subtree is greater. Standard operations (search, insert, delete) on a BST generally have action time complexity proportional to the height of the tree.
The height of a tree is the length of the longest path from the root node to a leaf node. In a balanced tree, the height is O(\log n). If the tree becomes unbalanced (e.g., all nodes added to one side), the height can become O(n).
A self-balancing BST is a BST that automatically maintains its height after insertions or deletions, which keeps the height as small as possible, ensuring search/insertion/deletion can be performed in O(\log n) time complexity, where n is the number of nodes in the tree. After every insertion or deletion, the tree checks if it has become unbalanced. If it has, the tree undergoes a series of rotations or restructuring operations to restore balance. The two most common implementations are Red-Black Trees and AVL Trees.
4
/ \
2 6
/ \ / \ 1 3 5 7
General characteristics: - Access: N/A (sets don’t typically support direct access) - Insertion: O(\log n) - Deletion: O(\log n) - Search: O(\log n)
Red-Black Trees
Red-Black trees are BSTs with one extra bit of storage per node: its color, which can be either red or black. By constraining the way nodes can be colored, Red-Black trees ensure that no path from the root to a leaf is more than twice as long as any other path.
Properties: 1. Every node is either red or black. 2. The root is black. 3. Every leaf (NIL) is black. 4. If a node is red, then both its children are black i.e. no two reds in a row 5. For each node, all simple paths from the node to descendant leaves contain the same number of black nodes.
Visualization:
4B
/ \
2R 6R
/ \ / \ 1B 3B 5B 7B
(B = Black, R = Red)
Red-Black trees provide faster insertion and removal operations than AVL trees as they require fewer rotations on average. However, AVL trees provide faster lookups as they are more strictly balanced.
AVL Trees
AVL trees (named after Adelson-Velsky and Landis) are self-balancing binary search trees. In an AVL tree, the heights of the two child subtrees of any node differ by at most one. If, after an insertion or deletion, the tree becomes unbalanced (i.e., the height difference becomes greater than one), rotations are performed to restore balance.
Properties: 1. The heights of the left and right subtrees of every node differ by at most one. 2. Every subtree is an AVL tree.
Balance Factor = Height(Left Subtree) - Height(Right Subtree)
For any node in an AVL tree, the balance factor must be -1, 0, or 1.
Visualization:
4 (0)
/ \
2 (-1) 6 (0)
/ \ / \ 1 3 5 7
(Numbers in parentheses represent balance factors)
AVL trees maintain a stricter balance than Red-Black trees, leading to faster lookups but slower insertion and deletion due to more frequent rotations.
B-Trees
B-Trees are generalisations of self-balancing trees that can have more than two children. B-Trees can be used in databases and file systems because they are work well with systems that read and write large blocks of data.
Splay-Trees
A splay tree is a self-balancing BST that performs splaying, a process that moves a given node to the root of the tree by performing a series of tree rotations. Splay trees keep recently accessed elements near the top, making them efficient for specific workload styles.
Summary
- Balance:
- AVL trees are more rigidly balanced
- Red-Black trees may have longer paths (up to 2 times the shortest path)
- Operations:
- Lookups tend to be faster in AVL trees
- Insertions and deletions tend to be faster in Red-Black trees
- Usage:
- AVL trees are preferred for lookup-intensive applications
- Red-Black trees are preferred for insertion/deletion-intensive applications
Both tree types guarantee O(\log n) time complexity for basic operations, making them efficient choices for set and map implementations.
4. Bit Vector Sets
Bit vector sets use a bit array to represent the presence or absence of elements.
Visualization:
Universe: {0, 1, 2, 3, 4, 5, 6, 7}
Set: {1, 2, 4, 7}
Bit Vector:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
- Access: O(1)
- Insertion: O(1)
- Deletion: O(1)
- Search: O(1)
Sets Summary
- Array-based sets are simple but inefficient for large, dynamic sets.
- Hash table-based sets offer great average-case performance.
- Tree-based sets provide balanced performance and maintain order.
- Bit vector sets are highly efficient but limited in applicability.
Choosing between them depends on the expected set size, frequency of operations, memory constraints, and need for maintaining order.
Linked List
A linked list consists of nodes where each node contains a data field and a reference to the next node in the sequence.
[3] -> [7] -> [1] -> [9] -> [4] -> null
Access: O(n) Search: O(n) Insertion: O(1) Deletion: O(1)
Stack
A stack follows the Last In First Out (LIFO) principle.
| 4 | <- Top
| 9 |
| 1 |
| 7 | | 3 |
Push: O(1) Pop: O(1) Peek: O(1)
Queue
A queue follows the First In First Out (FIFO) principle.
[3] -> [7] -> [1] -> [9] -> [4] <- Rear Front ->
Enqueue: O(1) Dequeue: O(1) Front: O(1)
Tree
A tree is a hierarchical structure with a root node and child nodes.
1
/ \
2 3
/ \ \ 4 5 6
Access: O(\log n) (for balanced trees) Search: O(\log n) (for balanced trees) Insertion: O(\log n) (for balanced trees) Deletion: O(\log n) (for balanced trees)
Graph
A graph consists of vertices connected by edges.
A --- B
| / |
| / |
| / |
C --- D
Traversal: O(V + E) where V is the number of vertices and E is the number of edges
Hash Table
A hash table stores key-value pairs and uses a hash function to compute an index into an array of buckets.
Key | Hash | Index | Value |
---|---|---|---|
“apple” | hash(“apple”) | 2 | 5 |
“banana” | hash(“banana”) | 4 | 8 |
“cherry” | hash(“cherry”) | 1 | 3 |
Access: \theta(1), O(n)
Search: \theta(1), O(n)
Insertion: \theta(1), O(n)
Deletion: \theta(1), O(n)
Hash Functions
Division Hashing
h(x) = x \mod n
Take the modulus of some input value by some number, n, where n is usually a prime number (Link).
Multiplication Hashing
The multiplication method involves multiplying the input value x by a constant A (where 0 < A < 1), taking the fractional part of the result, and then multiplying by the size of the hash table m:
h(x) = \left\lfloor m \times (x \times A \mod 1) \right\rfloor
Let’s choose m = 10 and A = \frac{\sqrt{5} - 1}{2} \approx 0.6180339887
For x = 123:
- Multiply: 123 \times 0.6180339887 = 76.01818
- Take fractional part: 0.01818
- Multiply by m: 10 \times 0.01818 = 0.1818
- Floor the result: \lfloor 0.1818 \rfloor = 0
So, h(123) = 0
3. Mid-Square Method
The mid-square method involves squaring the key and taking the middle digits:
- Square the key
- Extract a fixed number of digits from the middle
- Use these digits as the hash value
For key x = 3791 and a 4-digit hash:
- Square: 3791^2 = 14371681
- Extract middle 4 digits: 3716
So, h(3791) = 3716
4. Folding Method
The folding method involves dividing the key into equal-sized parts (except possibly the last part) and combining these parts using addition or XOR:
- Divide the key into equal-sized parts
- Sum or XOR these parts
- Take the result modulo the table size
For key x = 12345678 and 2-digit parts:
- Divide: 12|34|56|78
- Sum: 12 + 34 + 56 + 78 = 180
- If table size is 100: 180 \mod 100 = 80
So, h(12345678) = 80
References
[^1] J. Wengrow, 2017, A Common-Sense Guide to Data Structures and Algorithms