Algorithms and OOD (CSC 207 2014F) : EBoards

# CSC207.01 2014F, Class 45: Implementing Dictionaries with Hash Tables

Overview

• Preliminaries.
• Upcoming Work.
• Extra Credit.
• Questions.
• An introduction to hash tables.
• Hash functions.
• An exercise.
• Handling collisions.
• Hashing in Java.
• Handling removal.

## Preliminaries

• New lab partners!
• PSA!

### Upcoming Work

• Homework: Phase 2 of project (Class design, Algorithm design, I/O subsystem) due Tuesday night. (Yes, a few of you need extra time. If that's the case, let me know.)
• Reading for Monday: Hash Tables

### Extra Credit

• CS Table today: One year after Healthcare.gov
• Learning from Alums, next Tuesday: Alex Cohn '11 (in person)
• CS Extras next Tuesday: Summer Opportunities in CS

#### Peer Support

• Ajuna's Radio show Mondays at 8pm. Listen to African music.
• Charlie's Friday Night "War in Animated Film" ExCo. (Not this week)
• Copenhagen, 7:30 p.m. on the 21st and 22nd, afternoon of the 23rd

#### Miscellaneous

• VP Student Affairs Candidate Open Sessions
• Today, 4:15, JRC 209

## An introduction to hash tables

• We've been implementing the dictionary ADT
• Something that stores key/value pairs. You can access values by key.
• Important operations: put(key,value), lookup(key) (called set and get in the analysis below)
• Might also be useful to have remove(key)
• Use cases: Word Dictionaries, Translation Dictionaries, Information about students to use in making snarky comments
• Implementations
• Associative array (or pair of arrays)
• O(n) get if the pairs are stored in arbitrary order (linear search), or O(log n) get if they are kept sorted by key (binary search)
• O(n) set if it's not full and in arbitrary order, because we have to check whether the key is already there. Also O(n) if it's full, because we have to build a bigger array.
• O(n) set if it's not full and in sorted order, because we may have to shift many values to make room
• Linked list of key/value pairs (association list)
• O(n) get and set because you may have to look at everything
• Skip lists
• O(log n) get because at each level you look at a few elements and throw away half. But the analysis is hard and relies on probabilities, so the bound is expected O(log n).
• Binary search trees
• Add is O(log n) and get is O(log n) in a balanced search tree because we cut the tree in half each time.
• Add is O(n) and get is O(n) in a badly unbalanced search tree
• In general, add is O(depth) and get is O(depth)
• Can we do better? Can we do O(1) set and get? (Or at least expected O(1) set and get?)
• What techniques do we know for implementing data structures?
• Arrays
• Fill left-to-right
• Sometimes you leave gaps, or rearrange
• Linked structures: linear, forward-pointing
• Linked structures: more sophisticated (e.g., trees, doubly-linked lists)
• Arrays seem like a better idea, since indexing into an array takes constant time
• Great idea in hash tables: Convert our keys to integers in a restricted range [0 ... capacity) in constant time
• Criterion: Two different keys are unlikely to convert to the same integer
• Set: `stuff[convert(key)] = new KVPair(key,value);`
• Problem to address later: What happens if two keys convert to the same integer?
• Get: `return stuff[convert(key)];`
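The set and get one-liners above can be put together in a minimal sketch. Everything beyond `stuff`, `convert`, and `KVPair` is an assumption made for illustration: the class name `NaiveTable`, the use of `String` keys and values, and especially the body of `convert`, which is a hypothetical stand-in that looks only at the first character. The sketch ignores collisions on purpose.

```java
/**
 * A deliberately naive sketch of the "convert the key to an index" idea.
 * It ignores collisions: a second key that converts to the same index
 * silently overwrites the first.
 */
class NaiveTable {
    /** A key/value pair. String fields are an assumption for this sketch. */
    static class KVPair {
        final String key;
        final String value;

        KVPair(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final KVPair[] stuff;

    NaiveTable(int capacity) {
        this.stuff = new KVPair[capacity];
    }

    /**
     * Hypothetical stand-in for convert: maps a key to [0 .. capacity).
     * Assumes a nonempty key. It uses only the first character, so it
     * is a poor hash function in practice.
     */
    private int convert(String key) {
        return key.charAt(0) % stuff.length;
    }

    /** Constant-time set. */
    void set(String key, String value) {
        stuff[convert(key)] = new KVPair(key, value);
    }

    /** Constant-time get; null if the cell is empty or holds another key. */
    String get(String key) {
        KVPair pair = stuff[convert(key)];
        if (pair == null || !pair.key.equals(key)) {
            return null;
        }
        return pair.value;
    }
}
```

With this `convert`, setting "apple" and then "avocado" leaves only "avocado" in the table, which is exactly the problem to address later.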

## Hash functions

• Convert a key to a nonnegative integer (not necessarily within the table's range)
• Set: `stuff[convert(key) % stuff.length] = new KVPair(key,value);`
• Get: `return stuff[convert(key) % stuff.length];`
• If we have lots of extra room (e.g., a capacity of about 2 * entries) and a good hash function, we're unlikely to have too many collisions
• Writing good hash functions is hard
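One detail matters when the index computation above is written in Java: `hashCode()` may return a negative integer, and `%` keeps the sign of its left operand, so `key.hashCode() % stuff.length` can be negative. A minimal sketch of one way around this, assuming Java 8 or later for `Math.floorMod`; the class name `HashIndex` and method name `indexFor` are made up for illustration.

```java
class HashIndex {
    /**
     * Maps any key to an index in [0 .. capacity).
     * Math.floorMod keeps the result nonnegative even when
     * hashCode() is negative.
     */
    static int indexFor(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        // An Integer's hashCode is its own value, which makes the
        // difference between % and floorMod easy to see.
        System.out.println(-7 % 10);                        // prints -7
        System.out.println(indexFor(Integer.valueOf(-7), 10)); // prints 3
    }
}
```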

## An exercise

```
A: 1   F: 6   K: 11  P: 16  U: 21  Z: 26
B: 2   G: 7   L: 12  Q: 17  V: 22
C: 3   H: 8   M: 13  R: 18  W: 23
D: 4   I: 9   N: 14  S: 19  X: 24
E: 5   J: 10  O: 15  T: 20  Y: 25
```

Add the numbers for the first three letters of your first name and then mod by 40 (about twice the number of students in the class).

• Sample hash table removed to protect student privacy.
• Observation: The first eight or so names distributed well, and then we started seeing collisions.
• But we never got more than three names in any cell (22 names in all: 20 students plus mentor plus faculty member).
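The exercise's hash function can be sketched in a few lines. The names below are made up (they are not the class's names), the class and method names are hypothetical, and the handling of names shorter than three letters (use whatever letters there are) is an assumption, since the exercise does not say.

```java
class NameHash {
    /** Number of cells in the exercise's table. */
    static final int CELLS = 40;

    /** A = 1, B = 2, ..., Z = 26, ignoring case. Assumes an ASCII letter. */
    static int letterValue(char c) {
        return Character.toUpperCase(c) - 'A' + 1;
    }

    /**
     * Sum of the values of the first three letters, mod 40.
     * Assumption: a name with fewer than three letters uses all of them.
     */
    static int hash(String name) {
        int sum = 0;
        int letters = Math.min(3, name.length());
        for (int i = 0; i < letters; i++) {
            sum += letterValue(name.charAt(i));
        }
        return sum % CELLS;
    }

    public static void main(String[] args) {
        String[] names = { "Sam", "Zoe", "Al" }; // made-up names
        for (String name : names) {
            System.out.println(name + " -> " + hash(name));
        }
        // Sam -> 33   (19 + 1 + 13)
        // Zoe -> 6    (26 + 15 + 5 = 46, and 46 mod 40 = 6)
        // Al -> 13    (1 + 12)
    }
}
```

Because the largest possible sum is 78, many cells can be reached two ways, which fits the observation that collisions appeared after the first several names.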

## Handling collisions

• Two approaches:
• Put lots of values in the same array cell.
• Look for another cell.
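Both approaches can be sketched briefly. The choices here are assumptions rather than the course's implementation: the first class keeps a list of pairs in each cell, and the second steps forward one cell at a time (linear probing) as one simple way to "look for another cell". The class names are hypothetical, keys and values are `String`s to keep things short, and removal is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Approach 1: each cell holds every pair whose key lands there. */
class ChainedTable {
    private final List<List<String[]>> cells = new ArrayList<List<String[]>>();

    ChainedTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            cells.add(new ArrayList<String[]>());
        }
    }

    /** The list of {key, value} pairs in the key's cell. */
    private List<String[]> bucket(String key) {
        return cells.get(Math.floorMod(key.hashCode(), cells.size()));
    }

    void set(String key, String value) {
        for (String[] pair : bucket(key)) {
            if (pair[0].equals(key)) {   // key already present: replace
                pair[1] = value;
                return;
            }
        }
        bucket(key).add(new String[] { key, value });
    }

    String get(String key) {
        for (String[] pair : bucket(key)) {
            if (pair[0].equals(key)) {
                return pair[1];
            }
        }
        return null;
    }
}

/** Approach 2: on a collision, try the next cell, wrapping around. */
class ProbingTable {
    private final String[] keys;
    private final String[] values;

    ProbingTable(int capacity) {
        keys = new String[capacity];
        values = new String[capacity];
    }

    /**
     * Returns the cell holding key, or the first empty cell on its
     * probe path, or -1 if every cell holds some other key.
     */
    private int find(String key) {
        int start = Math.floorMod(key.hashCode(), keys.length);
        for (int step = 0; step < keys.length; step++) {
            int i = (start + step) % keys.length;
            if (keys[i] == null || keys[i].equals(key)) {
                return i;
            }
        }
        return -1;
    }

    void set(String key, String value) {
        int i = find(key);
        if (i < 0) {
            throw new IllegalStateException("table is full");
        }
        keys[i] = key;
        values[i] = value;
    }

    String get(String key) {
        int i = find(key);
        return (i < 0 || keys[i] == null) ? null : values[i];
    }
}
```

The probing sketch relies on cells never being emptied: once a key can be removed, a lookup that stops at the first empty cell may miss keys stored past it.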