---
title: "Slide Notes: 13-22"
author: "David J. Hunter"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: true
    df_print: paged
    theme: spacelab
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
This page contains the content of all the slides for the material that we covered in Chapters 4-6 of [Algorithms](http://jeffe.cs.illinois.edu/teaching/algorithms/book/Algorithms-JeffE.pdf), by Jeff Erickson. If you notice any errors, please [create a new issue](https://github.com/djhunter/algorithms/issues).
# Greedy Algorithms
## The Greedy Choice
- Problem: Find the optimal (e.g., cheapest/best) long-term sequence of choices.
- A *greedy algorithm* makes the cheapest/best short-term choice at each stage of the process.
We saw some examples where the greedy choice *doesn't* work:
- Change-making in Nadirian currency.
- Candy Swap Saga
## Greedy vs. Dynamic Programming
- Both solve *optimization* problems.
- Problems have *recursive substructure.*
- Make a choice, then optimize the remainder.
- There is one **key difference:**
- In the dynamic programming solutions, we had to try all the different choices at each stage.
- Churro cutting: Try all the different cuts, then optimize the rest.
- Candy swap: Try swapping and not swapping, then optimize the rest.
- A greedy algorithm just makes the optimal *short-term* choice.
- Cut off the most expensive piece.
- Never swap for a different candy.
## Isn't this easy?
- Writing greedy algorithms is easy. (And they're usually fast.)
- The hard part is **proving that they actually give you an optimal solution.**
# Scheduling
## Maximize the number of events
- You have a list of potential events to schedule (e.g., Fringe Festival skits.)
- Each event has a starting time $S[i]$ and a finish time $F[i]$.
- You only have one stage.
- The organizer of the festival only cares about *maximizing the number* of events.
- Which events should you choose?
## Table Groups
```{r, echo=FALSE}
library(knitr)
roster <- c("Ethan", "Talia", "Drake", "Jack", "Andrew", "Blake", "Jordan", "Graham", "Kevin", "Logan", "Claire", "Bri", "Trevor", "James", "Kristen", "Levi", "Grace", "John", "Isaac", "Josiah", "Nathan")
set.seed(2122021)
n <- length(roster)
ngps <- 7
maxingp <- ceiling(n/ngps)
# just make random groups
groups <- matrix(c(roster[sample(n)],
                   rep("",(maxingp - (n %% maxingp)) %% maxingp)),
                 ncol=maxingp, byrow=FALSE)
rownames(groups) <- paste0("Table #", 1:nrow(groups))
kable(t(groups))
```
## How many can you schedule?
```{r, echo=FALSE}
S <- c(5,1,6,1,4,9 ,2,4)
F <- c(7,5,9,6,7,10,4,6)
rbind(S,F)
```
Find the maximum number of events for this input. (Draw a picture?)
# Greedy Scheduling Algorithm
## First: Recursive Structure?
```{javascript, eval=FALSE}
// There are n events, numbered 1..n
// S[1..n] is a list of starting times
// F[1..n] is a corresponding list of finishing times
MaxNumEvents(X) // the maximum number of events in X ⊆ {1..n} that can be scheduled
    if X = ∅ then
        return 0
    Choose any element k in X
    B <- the set of events in X ending BEFORE S[k]
    A <- the set of events in X starting AFTER F[k]
    return the maximum of:
        MaxNumEvents(X \ {k})                   // do not use event number k
        MaxNumEvents(A) + 1 + MaxNumEvents(B)   // use event k
```
It works, by induction.
## Greedy Scheduler
```{javascript, eval=FALSE}
// There are n events, numbered 1..n
// S[1..n] is a list of starting times
// F[1..n] is a corresponding list of finishing times
MaxNumEvents(X) // the maximum number of events in X ⊆ {1..n} that can be scheduled
    if X = ∅ then
        return 0
    Let k in X be the event that ends first   // GREEDY CHOICE
    A <- the set of events in X starting AFTER F[k]
    return MaxNumEvents(A) + 1
```
- Summary: Choose the event $k$ that *ends first*, discard events that conflict with $k$, and recurse.
- How do we know that this works?
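
Here is a minimal R sketch of this greedy scheduler (not code from the book), run on the input from the earlier exercise. One assumption: an event that starts exactly when another finishes is treated as compatible, which matches the worked example in the next section.

```{r}
greedy_schedule <- function(S, F, X = seq_along(S)) {
  if (length(X) == 0) return(integer(0))
  k <- X[which.min(F[X])]          # greedy choice: the event in X that ends first
  A <- X[S[X] >= F[k]]             # keep only events that start after k finishes
  c(k, greedy_schedule(S, F, A))   # schedule k, then recurse on A
}

chosen <- greedy_schedule(S = c(5, 1, 6, 1, 4, 9, 2, 4),
                          F = c(7, 5, 9, 6, 7, 10, 4, 6))
chosen            # the events chosen, in the order they are picked
length(chosen)    # the value MaxNumEvents returns
```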
# Greedy Proof of Correctness
## Greedy choice property: Idea of proof
To prove that a greedy algorithm works, you need to show that it has the *greedy choice property*.
- Suppose you have a non-greedy (optimal) solution.
- Show that the greedy solution is just as good.
- Find the first place where they differ, and show that you could have used the greedy choice without making the solution less optimal.
## Group Exercise: Find another solution
Suppose the sequence of events continues, but we don't have all the values:
```
S <- [5, 1, 6, 1, 4, 9, 2, 4 ... ]
F <- [7, 5, 9, 6, 7,10, 4, 6, ... ]
```
- Your greedy solution contains 17 events, starting with: 7, 8, 3, 6, ...
- Buford has found another 17-event solution: 7, 8, 13, 11, ...
> Claim: If you replace the 13 in Buford's solution with a 3, you get yet another 17-event solution.
*Write a couple sentences* to justify this claim.
## Greedy Scheduler Correctness
```{javascript, eval=FALSE}
// There are n events, numbered 1..n
// S[1..n] is a list of starting times
// F[1..n] is a corresponding list of finishing times
MaxNumEvents(X) // the maximum number of events in X ⊆ {1..n} that can be scheduled
    if X = ∅ then
        return 0
    Let k in X be the event that ends first   // GREEDY CHOICE
    A <- the set of events in X starting AFTER F[k]
    return MaxNumEvents(A) + 1
```
- Suppose that `MaxNumEvents({1..n})` returns $m$.
- Let $g_1, g_2, \ldots, g_m$ be the events chosen by this greedy algorithm.
- Suppose that $g_1, g_2, \ldots, g_{j-1}, \mathbf{c_j}, c_{j+1}, \ldots, c_m$ is another sequence of compatible events.
- Same length $m$, but different choice for event number $j$.
Now apply the same reasoning as in the Buford example.
## Proof of correctness
**Proof.** Suppose that `MaxNumEvents({1..n})` returns $m$. Let $g_1, g_2, \ldots, g_m$ be the events chosen by this greedy algorithm. Suppose that
$$g_1, g_2, \ldots, g_{j-1}, \mathbf{c_j}, c_{j+1}, \ldots, c_{m'}$$
is another sequence of compatible events, with $m' \geq m$.
Since event $g_j$ is part of the first solution, it is compatible with $g_1, g_2, \ldots, g_{j-1}$; in particular, it starts after event $g_{j-1}$ finishes. Since event $g_j$ was chosen by the greedy algorithm, *it must end no later than* any other event in $X$ that is compatible with $g_1, g_2, \ldots, g_{j-1}$. In particular, it ends no later than $c_j$, and therefore before event $c_{j+1}$ starts. Therefore, we can replace $c_j$ with $g_j$ and obtain another solution:
$$g_1, g_2, \ldots, g_{j-1}, \mathbf{g_j}, c_{j+1}, \ldots, c_{m'}$$
Continuing in this way, all of the $c_i$'s can be replaced with $g_i$'s. Therefore $m = m'$, and the greedy solution is optimal.
# Refining the algorithm
## Greedy Scheduler Time
```{javascript, eval=FALSE}
// There are n events, numbered 1..n
// S[1..n] is a list of starting times
// F[1..n] is a corresponding list of finishing times
MaxNumEvents(X) // the maximum number of events in X ⊆ {1..n} that can be scheduled
    if X = ∅ then
        return 0
    Let k in X be the event that ends first   // GREEDY CHOICE
    A <- the set of events in X starting AFTER F[k]
    return MaxNumEvents(A) + 1
```
- Non-recursive work: $O(n)$
- Making the greedy choice requires scanning $X$, which is $O(n)$.
- Forming $A$ is also $O(n)$
- The function uses *tail recursion*, so it could be written as a loop.
- The number of remaining events decreases by at least 1 each time, so this "loop" runs $O(n)$ times. **Total time:** $O(n^2)$
## Improve using an efficient sort
```{javascript, eval=FALSE}
MaxNumEvents(S[1..n], F[1..n]):
    sort F and permute S to match    // O(n log(n))
    count <- 1
    X[count] <- 1                    // Because now event 1 has first finish time.
    for i <- 2 to n                  // Scan events in order of finish time.
        if S[i] > F[X[count]]        // Is event i compatible?
            count <- count + 1       // If so, it is the
            X[count] <- i            // greedy choice.
    return count
```
Time:
- $O(n \log n)$ to sort (e.g., use MergeSort)
- After that finishes, there's just a linear time for-loop: $O(n)$.
- **Total time:** $O(n \log n)$
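
As a sketch, the same idea in R (again treating back-to-back events as compatible, so the comparison uses `>=` rather than the strict `>` in the pseudocode above):

```{r}
max_num_events <- function(S, F) {
  ord <- order(F)                 # sort events by finish time: O(n log n)
  S <- S[ord]; F <- F[ord]
  count <- 1
  last_finish <- F[1]             # event 1 now has the first finish time
  for (i in seq_along(S)[-1]) {   # scan events in order of finish time
    if (S[i] >= last_finish) {    # compatible? then it is the greedy choice
      count <- count + 1
      last_finish <- F[i]
    }
  }
  count
}

max_num_events(S = c(5, 1, 6, 1, 4, 9, 2, 4),
               F = c(7, 5, 9, 6, 7, 10, 4, 6))
```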
# Greedy Algorithms: Exchange Arguments
# Assignment Comments, Sample Solutions
## Greedy algorithms that don't work
$b)$ Choose the course $x$ that starts first, discard all classes that conflict with $x$, and recurse.
```
S = [1, 2, 3]
F = [9, 3, 4]
```
- Course 1 gets chosen; other two are discarded.
$d)$ Choose the course $x$ with shortest duration, discard all classes that conflict with $x$, and recurse.
```
S = [1, 3, 4]
F = [4, 5, 9]
```
- Course 2 gets chosen; other two are discarded.
## Exchange Arguments: The General Pattern
- Assume that there is an optimal solution that is different from the greedy
solution.
- Find the "first" difference between the two solutions.
- Argue that we can exchange the optimal choice for the greedy choice without making the solution worse (although the exchange might not make it better).
Notice the "scare quotes" around "first".
## Draw a picture first
$c)$ Choose the course $x$ that *starts last*, discard all classes that conflict with $x$, and recurse.
- Compare *another optimal solution* to greedy solution.
- Find the "first" difference.
- *Exchange,* without making it worse.
## Use your own words
Supposing GREEDYSCHEDULE returns $m$, let $g_1, g_2,\ldots,g_m$ be the classes the algorithm chose (in order of being chosen, so in *reverse chronological order*). Suppose also that $g_1,g_2,\ldots, g_{j-1}, c_j, c_{j+1},\ldots, c_{m'}$ is also a compatible list of classes with $m' \geq m$.
Since $g_j$ is part of the first solution, it must end before $g_{j−1}$ and since it was chosen by the algorithm it must start after all the classes excluding $g_1, g_2,\ldots, g_{j-1}$, or specifically it must start after $c_{j+1}$. This means $g_j$ can replace $c_j$, giving another solution: $g_1,g_2,\ldots,g_{j−1},g_j,c_{j+1},\ldots,c_{m'}$.
This can continue, replacing all $c_i$’s with $g_i$’s so that $m=m'$, which proves that the greedy solution is optimal.
## Less confusing? Events in chronological order
Suppose this greedy algorithm returns $m$. Let $g_1, g_2, \ldots, g_m$ be the events chosen by this greedy algorithm.
Suppose that $c_1,c_2, \ldots, c_{j−1},c_j, g_{j+1}, \ldots, g_{m'}$ is another sequence of compatible events, with $m'\geq m$.
Since event $g_j$ is part of the first solution, it must end before $g_{j+1}$. Given the nature of the greedy algorithm, event $g_j$ starts at the same time or after all other events ending before $g_{j+1}$. In particular, it must start after or at the same time as $c_j$. Therefore, we can replace $c_j$ with $g_j$ and obtain another solution $c_1,c_2, \ldots, c_{j−1},g_j,g_{j+1},\ldots, g_{m'}$. Continuing this way, all of the $c_i$'s can be replaced with $g_i$'s. Therefore $m=m'$, and the greedy solution is optimal.
# Process Scheduling
## Scheduling Tasks in Sequence
- Problem: schedule tasks on a single processor, without swapping.
- *minimize* the average completion time of all the tasks.
- Example: tasks $a_1$ and $a_2$ take $t_1=2$ seconds and $t_2=3$ seconds to complete.
- *average completion time:*
- Do $a_1$ before $a_2$: Task $a_1$ would finish at time 2, and $a_2$ would finish at time 5, for an average completion time of $(2+5)/2 = 3.5$.
- Do $a_2$ before $a_1$: The average completion time $= (3+5)/2 = 4$
Given tasks $a_1, a_2, \ldots, a_n$ with running times $t_1, t_2, \ldots t_n$, describe an algorithm that will determine the ordering of these tasks that minimizes the average completion time.
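
As a quick check of the arithmetic above, here is a small R helper (a sketch, not part of the assignment) that computes the average completion time of a given ordering; the task executed $k$-th finishes at the $k$-th partial sum of the running times.

```{r}
avg_completion <- function(times, ord) {
  mean(cumsum(times[ord]))   # finish time of the k-th task executed = cumulative sum
}

avg_completion(times = c(2, 3), ord = c(1, 2))   # a1 then a2: (2 + 5)/2 = 3.5
avg_completion(times = c(2, 3), ord = c(2, 1))   # a2 then a1: (3 + 5)/2 = 4
```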
## Table Groups
```{r, echo=FALSE}
library(knitr)
roster <- c("Ethan", "Talia", "Drake", "Jack", "Andrew", "Blake", "Jordan", "Graham", "Kevin", "Logan", "Claire", "Bri", "Trevor", "James", "Kristen", "Levi", "Grace", "John", "Isaac", "Josiah", "Nathan")
set.seed(2152021)
n <- length(roster)
ngps <- 7
maxingp <- ceiling(n/ngps)
# just make random groups
groups <- matrix(c(roster[sample(n)],
                   rep("",(maxingp - (n %% maxingp)) %% maxingp)),
                 ncol=maxingp, byrow=FALSE)
rownames(groups) <- paste0("Table #", 1:nrow(groups))
kable(t(groups))
```
## Greedy Process Scheduling
1. Let `T[1..n]` be an array such that $T[i]$ is the time it takes to do task $a_i$. Describe a greedy algorithm that returns a sequence $g_1, g_2, \ldots g_n$ such that executing the tasks in the order $a_{g_1}, a_{g_2}, \ldots a_{g_n}$ has the minimum possible average completion time.
2. Prove your algorithm is correct:
- Suppose that $c_1, c_2, \ldots, c_n$ is another ordering of these tasks that has minimal average completion time.
- Let $c_j$ be the first index in this solution that differs from your greedy solution.
- Argue that you can replace $c_j$ with $g_j$, and the solution will still be optimal.
# Binary Character Codes
# Assignment Comments, Sample Solutions
## Interval Stabbing
## Easiest Greedy Choice
Choose the interval with the *smallest right endpoint*, stab at this right endpoint, eliminate the intervals that have been stabbed, and recurse.
(picture)
## Exchange Argument
- Let $g_1, g_2,\ldots,g_m$ be the stabs made by the greedy algorithm (in order from left to right), and suppose $g_1,g_2,\ldots, g_{j-1}, c_j, c_{j+1},\ldots, c_{m'}$ is another set of stabs (in order) that stab all the intervals, with $m' \leq m$.
- $c_j$ must stab the interval $I$ with the leftmost right endpoint, after eliminating those stabbed by $g_1,g_2,\ldots, g_{j-1}$.
- Moving $c_j$ to $g_j$ (the right endpoint of $I$) stabs all of the same intervals, because no other intervals end before interval $I$ does.
- In this way, all of the $c_i$'s can be replaced with $g_i$'s, so $m = m'$.
## Recursive Implementation
```
Algorithm(L[1..n], R[1..n])
    if n = 0                     // empty list
        return 0
    else
        min = infinity
        for i <- 1 to n
            if R[i] < min
                min = R[i]       // finding stabpoint
        newL = L[1..n]
        newR = R[1..n]
        for j <- 1 to n
            if min >= L[j] && min <= R[j]   // finding if stabpoint overlaps with each brick
                newL <- newL \ L[j]         // takes away that brick from array
                newR <- newR \ R[j]
        return 1 + Algorithm(newL, newR)
```
Time: $O(n^2)$ (linear work inside tail recursion)
## Iterative Implementation
```
Stabby(R[1..n], L[1..n]):
    counter <- 0
    MergeSort R and permute L to match
    i <- 1
    while i <= n
        stab <- R[i]
        counter++
        while i <= n and L[i] <= stab   // then it got stabbed
            i++                         // keep checking and skipping what gets stabbed
    return counter
```
Time: $O(n \log n)$ for the sort; after that, the loops take only $O(n)$ time total, since every iteration increments $i$ and the loops stop once $i$ exceeds $n$.
**Note:** The second while loop won't remove intervals that have later end times than some unstabbed interval. But these will get removed eventually; at the very latest, at the last stab. Such an interval would never have the leftmost right endpoint, and so would never determine the location of a stab. So it won't affect the greedy choices, or the final value of `counter`.
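
For reference, here is a runnable R sketch of the same iterative algorithm (not one of the submitted solutions), with an explicit bound check on the inner loop:

```{r}
stab_count <- function(L, R) {
  ord <- order(R)                    # sort intervals by right endpoint
  L <- L[ord]; R <- R[ord]
  n <- length(L)
  counter <- 0
  i <- 1
  while (i <= n) {
    stab <- R[i]                     # greedy stab: leftmost remaining right endpoint
    counter <- counter + 1
    while (i <= n && L[i] <= stab)   # skip every interval this stab hits
      i <- i + 1
  }
  counter
}

stab_count(L = c(1, 2, 6, 7), R = c(3, 5, 8, 9))   # needs two stabs (at 3 and at 8)
```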
# Encoding Characters as Binary Strings
## Codes
- A **binary code** is a list of *code words*, which are binary strings.
- Given an *alphabet* of characters, a **binary character code** is a binary code where each character corresponds to a code word.
- e.g., ASCII, UTF-8
Example:
```
character | A | E | I | O | U | Y |
code word | 000 | 001 | 010 | 011 | 100 | 101 |
```
- Given a character string, can *encode* it into binary.
- $\mathtt{YO} \mapsto \mathtt{101011}$
- Given a binary string, can *decode* it to a character string.
- $\mathtt{011100010} \mapsto \mathtt{OUI}$
## Compression
- Goal: Minimize number of bits needed to encode.
- Idea: Use code words of different lengths.
- Problem: How to parse a binary string so you know where the code words start?
Example:
```
character | A | E | I | O | U | Y |
code word | 00 | 01 | 10 | 11 | 100 | 101 |
```
1. Find two different decodings of `10110010`.
## Table Groups
```{r, echo=FALSE}
library(knitr)
roster <- c("Ethan", "Talia", "Drake", "Jack", "Andrew", "Blake", "Jordan", "Graham", "Kevin", "Logan", "Claire", "Bri", "Trevor", "James", "Kristen", "Levi", "Grace", "John", "Isaac", "Josiah", "Nathan")
set.seed(2172021)
n <- length(roster)
ngps <- 7
maxingp <- ceiling(n/ngps)
# just make random groups
groups <- matrix(c(roster[sample(n)],
                   rep("",(maxingp - (n %% maxingp)) %% maxingp)),
                 ncol=maxingp, byrow=FALSE)
rownames(groups) <- paste0("Table #", 1:nrow(groups))
kable(t(groups))
```
# Prefix Codes
## Unambiguous encoding
- In a **prefix code**, no code word can start with another code word.
- i.e., no code word can be a *prefix* for some other code word.
Example:
```
character | A | E | I | O | U | Y |
code word | 10 | 111 | 001 | 000 | 110 | 01 |
```
Now there's only one way to decode anything:
$$
\mathtt{101100110} \mapsto \mathtt{AUYA}
$$
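
Decoding a prefix code is itself a greedy scan: at each step, exactly one code word can match the front of the remaining bit string. A minimal R sketch using the code above (it assumes the input really is a valid encoding):

```{r}
code <- c(A = "10", E = "111", I = "001", O = "000", U = "110", Y = "01")

decode <- function(bits, code) {
  out <- character(0)
  while (nchar(bits) > 0) {
    hit <- which(startsWith(bits, code))           # exactly one match for a prefix code
    out <- c(out, names(code)[hit])
    bits <- substring(bits, nchar(code[hit]) + 1)  # drop the matched code word
  }
  paste(out, collapse = "")
}

decode("101100110", code)
```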
## Prefix Codes and Trees
- A binary tree whose *leaves are the characters* determines a prefix code.
- Left edges correspond to 0s. Right edges correspond to 1s.
```
character | A | E | I | O | U | Y |
code word | 10 | 111 | 001 | 000 | 110 | 01 |
             #
         0/     \1
         #       #
       0/ \1   0/ \1
       #   Y   A   #
     0/ \1       0/ \1
     O   I       U   E
```
# Optimal Prefix Codes
## Optimize the number of bits in a message
Suppose we want to encode the following message:
```{r, echo=FALSE}
mess <- c(rep("E", 80), rep("A", 50), rep("O", 30), rep("I", 30), rep("U", 20), rep("Y", 20))
cat(paste(sample(mess, length(mess)), collapse=""))
```
This message contains 80 Es, 50 As, 30 Os, 30 Is, 20 Us, and 20 Ys.
```
character | A | E | I | O | U | Y |
frequency | 50 | 80 | 30 | 30 | 20 | 20 |
```
We want to construct a prefix code that uses the fewest number of bits for the whole message.
## Can you do better?
The message contains 80 Es, 50 As, 30 Os, 30 Is, 20 Us, and 20 Ys.
```
character | A | E | I | O | U | Y |
frequency | 50 | 80 | 30 | 30 | 20 | 20 |
```
Encode it using the following prefix code:
```
character | A | E | I | O | U | Y |
code word | 10 | 111 | 001 | 000 | 110 | 01 |
```
This requires `50*2 + 80*3 + 30*3 + 30*3 + 20*3 + 20*2` = `r 50*2 + 80*3 + 30*3 + 30*3 + 20*3 + 20*2` bits.
2. Why?
3. Propose a prefix code that is able to encode the message using fewer bits. How many bits can you save?
# Huffman Encoding
## Huffman's Greedy Algorithm
**Given:** A list of characters `C[1..n]` and their frequencies `F[1..n]` in the message we want to encode.
**Goal:** Construct a binary tree for a prefix code that uses the fewest number of bits to encode the message.
**Algorithm:** Start with a "forest" of leaves: every character is a leaf.
- Find the two trees in this forest with smallest total frequency.
- Join these two trees under a common root.
- Recurse, until the forest has just one tree.
## Huffman's Algorithm: Example
```
character | A | E | I | O | U | Y |
frequency | 50 | 80 | 30 | 30 | 20 | 20 |
```
Start with a "forest" of leaves: every character is a leaf:
```
A:50 E:80 I:30 O:30 U:20 Y:20
```
Find the two trees in this forest with smallest total frequency. Join these two trees under a common root:
```
A:50  E:80  I:30  O:30   #:40
                         / \
                        U   Y
```
Recurse:
```
A:50  E:80   #:60   #:40
             / \    / \
            I   O  U   Y
```
## Huffman's Algorithm: Example, continued
```
character | A | E | I | O | U | Y |
frequency | 50 | 80 | 30 | 30 | 20 | 20 |
```
```
A:50  E:80   #:60   #:40
             / \    / \
            I   O  U   Y
```
Recurse:
```
E:80   #:60    #:90
       / \     / \
      I   O   A   #
                 / \
                U   Y
```
Recurse:
```
  #:140       #:90
  / \         / \
 E   #       A   #
    / \         / \
   I   O       U   Y
```
## Huffman's Algorithm: Optimal Code?
```
  #:140       #:90
  / \         / \
 E   #       A   #
    / \         / \
   I   O       U   Y
```
One final recursion:
```
        #
      /   \
    #       #
   / \     / \
  E   #   A   #
     / \     / \
    I   O   U   Y
```
Optimal code?
```
character | A | E | I | O | U | Y |
code word | 10 | 00 | 010 | 011 | 110 | 111 |
```
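
As a quick check, the total number of bits this code uses for the message (frequencies 50, 80, 30, 30, 20, 20; code word lengths 2, 2, 3, 3, 3, 3):

```{r}
sum(c(50, 80, 30, 30, 20, 20) * c(2, 2, 3, 3, 3, 3))
```

This is fewer than the 620 bits used by the earlier prefix code.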
## Sizes of code words?
4. Apply Huffman's algorithm to produce a prefix code, given the following frequencies. Does this algorithm minimize the length of the longest code word?
```
character | A | E | I | O | U | Y |
frequency | 50 | 100 | 30 | 30 | 20 | 20 |
```
5. Find an assignment of frequencies to characters in order to produce the widest possible assortment of code word lengths using Huffman's algorithm. Is it possible to produce a code where all the code words have different lengths?
```
character | A | E | I | O | U | Y |
frequency | | | | | | |
```
# Huffman Codes
# Assignment Comments, Sample Solutions
## Prefix Codes
```
| letter | b | l | a | c | k | h | w | s |
| codeword | 01 | 11 | 0010 | 0001 | 0011 | 0000 | 010 | 011 |
```
Since 01 (the code word for b) is a prefix of the code words for w (010) and s (011), this is not a prefix code.
```
      #
    0/ \1
    #   e
  0/ \1
  a   #
    0/ \1
    #   g
  0/ \1
  m   n
```
- 10110110100000101 ---> `eggman`
- This is not an optimal code: the letter `g` occurs most often, but it does not get the shortest code word. Swapping `g` and `e` in the binary tree would encode the message in fewer bits.
## Huffman's algorithm
Notice that the process is not uniquely determined, because sometimes there are multiple options with the same frequencies.
```
a:100  b:20  c:30  d:20  e:150  f:10  g:20  h:40  i:110

a:100  b:20  c:30  e:150   #:30  g:20  h:40  i:110
                           / \
                          d   f

a:100  c:30  e:150   #:30   #:40  h:40  i:110
                     / \    / \
                    d   f  b   g

a:100  e:150   #:60     #:40  h:40  i:110
               / \      / \
              c   #    b   g
                 / \
                d   f

a:100  e:150   #:60     #:80   i:110
               / \      / \
              c   #    h   #
                 / \      / \
                d   f    b   g

a:100  e:150      #:140        i:110
                 /     \
                #       #
               / \     / \
              c   #   h   #
                 / \     / \
                d   f   b   g

e:150  #:210      #:140
        / \      /     \
       a   i    #       #
               / \     / \
              c   #   h   #
                 / \     / \
                d   f   b   g

#:210     #:290
 / \      /   \
a   i    e     #
             /   \
           #       #
          / \     / \
         c   #   h   #
            / \     / \
           d   f   b   g

        #:500
       /     \
      #       #
     / \     / \
    a   i   e   #
              /   \
            #       #
           / \     / \
          c   #   h   #
             / \     / \
            d   f   b   g
```
## Internal nodes
You can think of internal nodes as being "combo" characters.
```
          aifbcdghe
         /0        1\
       ai            fbcdghe
      /0 1\         /0      1\
      a   i     fbcdgh        e
               /0     1\
             fbc       dgh
            /0  1\    /0  1\
           fb    c   dg     h
          /0 1\     /0 1\
          f   b     d   g
```
## Priority Queues
- **Max Heap:** For every node, the value of its parent must be greater than its own value. To fix this tree, we must swap 16 and 17.
```
     24
    /  \
  21    17
  / \   /
 9  18 16
```
- `ExtractMax`, `ExtractMin`, and `Insert` are all $O(\log n)$.
- `Peek` is $O(1)$, because you can just read the root value and not change anything.
## {data-background="https://media.giphy.com/media/T3fwN6Pbm3ZPa/giphy.gif" data-background-size="contain"}
## Building a Priority Queue
- Adding an element to the heap takes $O(\log n)$, and we do this for $n$ elements, so the total is $O(n\log n)$.
- This is an upper bound, but it is *not tight*.
- There are at most $\left\lceil n/2^{h+1} \right\rceil$ nodes of height $h$.
$$
\sum_{h=0}^{\lfloor \log_2 n \rfloor} \left\lceil \frac{n}{2^{h+1}}\right\rceil O(h) = O\left( n \sum_{h=0}^{\lfloor \log_2 n \rfloor}\frac{h}{2^{h+1}} \right)= O\left( n \sum_{h=0}^{\infty}\frac{h}{2^{h+1}} \right) = O(n)
$$
- So it turns out that $O(n)$ is a **tight** upper bound on the time needed to build a priority queue.
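
A quick numerical sanity check that the series in this bound converges (it sums to 1):

```{r}
h <- 0:50
sum(h / 2^(h + 1))   # partial sums approach 1
```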
# Huffman Algorithm Correctness
## Part 1: Exchange Argument
Given a list of characters `C[1..n]` and their frequencies `F[1..n]` in the message we want to encode.
- Let $T_G$ be a tree produced by Huffman's greedy algorithm.
- Suppose $T$ is some other tree giving an optimal prefix code.
- Let $x$ and $y$ be the characters with smallest frequency.
- Let $a$ and $b$ be the leaves of the lowest subtree of $T$.
- If $a$ and $b$ are $x$ and $y$, we're done.
- Otherwise, swap $a$ with $x$ and $b$ with $y$, and you get a code that is at least as efficient.
**Upshot:** The first greedy choice is optimal.
## Part 2: Structural Inductive Argument (sketch)
- Let $C'$ be the character set $C$ with the least frequent characters $x$ and $y$ replaced with the character $x\!\!\!y$.
- Suppose (inductive hypothesis) $T'$ is an optimal tree for $C'$ that was built by Huffman's algorithm.
- $x\!\!\!y$ is a leaf of $T'$.
- Replace this leaf with: $\begin{array}{ccc} & /\backslash & \ \\ & x \;\;\; y& \end{array}$
- cost of $T$ = cost of $T'$ + frequencies of $x$ and $y$. (details in book)
- Since $T'$ had minimal cost, so must $T$.
## Full Binary Trees
- Usual recursive definition:
- A *full binary tree* is made from two full binary trees joined under new root.
- Alternative recursive definition:
- A *full binary tree* is made by replacing a leaf of a full binary tree with a two-leaf subtree.
Either way, a full binary tree with $n$ leaves has $2n-1$ nodes: a single leaf is a full binary tree with one node, and replacing a leaf with a two-leaf subtree adds one leaf and two nodes. (picture)
# Huffman Algorithm Implementation
```{javascript, eval=FALSE}
// indices 1 .. n are the leaves (corresponding to C[1..n])
// index 2n-1 is the root
// P[i] = parent of node i
// L[i] = left child of node i, R[i] = right child
BuildHuffman(F[1..n]):
    for i <- 1 to n
        L[i] <- 0; R[i] <- 0
        Insert(i, F[i])           // put indices in priority queue, frequency = priority
    for i <- n + 1 to 2n − 1      // ??typo in book?? (Book starts this loop at n, not n+1.)
        x <- ExtractMin()         // extract smallest trees in forest
        y <- ExtractMin()         // from the priority queue
        F[i] <- F[x] + F[y]       // new internal node
        Insert(i, F[i])           // put new node into priority queue
        L[i] <- x; P[x] <- i      // update children
        R[i] <- y; P[y] <- i      // and parents
    P[2n − 1] <- 0                // root has no parent
```
## Table Groups
```{r, echo=FALSE}
library(knitr)
roster <- c("Ethan", "Talia", "Drake", "Jack", "Andrew", "Blake", "Jordan", "Graham", "Kevin", "Logan", "Claire", "Bri", "Trevor", "James", "Kristen", "Levi", "Grace", "John", "Isaac", "Josiah", "Nathan")
set.seed(2192021)
n <- length(roster)
ngps <- 7
maxingp <- ceiling(n/ngps)
# just make random groups
groups <- matrix(c(roster[sample(n)],
                   rep("",(maxingp - (n %% maxingp)) %% maxingp)),
                 ncol=maxingp, byrow=FALSE)
rownames(groups) <- paste0("Table #", 1:nrow(groups))
kable(t(groups))
```
## Explore Huffman's Algorithm
1. Find the running time of this implementation of Huffman's Algorithm.
2. Construct the arrays F, P, L and R for the following table of frequencies. In other words, `F[1..6] = [50, 80, 30, 30, 20, 20]`.
```
character | A | E | I | O | U | Y |
frequency | 50 | 80 | 30 | 30 | 20 | 20 |
```
3. Am I right that there's a typo in the book in the limits on the second for-loop??
# Graphs: Introduction
# Assignment Comments, Sample Solutions
## Huffman Codes
Use R!
```{r}
( b1 <- sum(c(100,20,30,20,150,10,20,40,110)*c(2,5,4,5,2,5,5,4,2)) )
( fixed_length <- ceiling(log2(9)) )
b1/(fixed_length*500)
```
## Optimal trees must be full
```
               #
           0/     \1
          #         #
       0/   \1        \1
      #       #         #
    0/ \1   0/ \1     0/ \1
    a   b   c   d     e   f
| letter | a | b | c | d | e | f |
| code word | 000 | 001 | 010 | 011 | 110 | 111 |
```
- This code is not optimal, because the second bit of `e` and `f` is wasted. You can remove the second bit in the encoding of `e` and `f` without suffering any tradeoffs anywhere else.
- More generally, an optimal binary tree is going to be one where there are no wasted bits, or in other words *every parent node has exactly two child nodes*.
## Greedy Rock Climbing?
> Start with a node furthest away from the root, then travel upwards. If the length of the path is equal to $k$, remove the highest node in this path and all the nodes below it. Add 1 to the count, then recurse to the next furthest node.
See Figure 4.6. Does this algorithm produce more paths?
## Start at top?
> Draw a path from the root of the tree to a node k levels down from the root whose subtree has minimal height. Recurse on any remaining subtrees whose heights are greater than or equal to k.
## Informal Exchange Argument
> Let $G$ be the set of all paths chosen by the greedy algorithm, and suppose $T$ is a different set of paths chosen by a different algorithm. Each path in $T$ that differs from $G$ can be shifted down the tree until it reaches 1) the lowest available leaf in its subtree (available meaning there are no taken nodes between the leaf and the current path in question) or 2) another path. After this process has been completed with each path, the paths of $T$ will all be replaced by greedy choices. So $G$ must have at least as many paths as $T$.
# Graphs: Terminology
## A graph is a set of pairs
- The *vertex set* $V$ of a graph could be anything.
- websites, countries, numbers, game states, etc.
- The edge set $E$ is a set of pairs of vertices.
- If the pairs are *ordered pairs* $(u,v)$, the graph is a *directed* graph.
- If the pairs are *unordered pairs* $\{u,v\}$, the graph is *undirected*.
## Terms to know
Section 5.2 of [E] defines the following terms:
- simple graphs, multigraphs
- neighbor, adjacent
- degree, predecessor, successor, in-degree, out-degree
- subgraph, walk, directed walk, path, closed walk
- reachable, connected, strongly connected, components
- cycle, acyclic
- forest, tree, spanning tree/forest, dag
If you ever encounter an unfamiliar term about graphs, you can probably find the definition in Section 5.2.
# Graphs: Representation
## Graphical representation
- Vertices are *dots*, undirected edges are *lines*, directed edges are *arrows*.
- There are lots of different ways to draw the same graph.
```{r, message=FALSE, echo=FALSE, fig.height=3.5}
library(gridExtra)
library(igraph)
library(ggraph)
pg <- make_graph("Petersen")
g1 <- ggraph(pg, layout="igraph", algorithm="nicely") + geom_node_point() + geom_edge_link()
g2 <- ggraph(pg, layout="igraph", algorithm="nicely") + geom_node_point() + geom_edge_link()
g3 <- ggraph(pg, layout="igraph", algorithm="nicely") + geom_node_point() + geom_edge_link()
g4 <- ggraph(pg, layout="igraph", algorithm="nicely") + geom_node_point() + geom_edge_link()
grid.arrange(g1,g2,g3,g4, nrow=1)
```
## Adjacency Lists
<div class="column-left">
```{r, echo=FALSE, fig.height=5, fig.width=5, fig.align='center'}
set.seed(67423)
par(mar=c(0,0,0,0)+.1)
plot(pg, edge.width=3, edge.color="black")
```
</div>
<div class="column-right">
```{r, echo = FALSE}
pgatt <- as_adj_list(pg)
lapply(pgatt, as.vector)
```
</div>
## Adjacency List Definition
Abuse of notation: $V$ is the number of vertices.
- `A[1..V]` is an *array* of *lists*.
- Undirected: `A[i]` is a list of neighbors of vertex `i`
- Directed: `A[i]` is a list of out-neighbors of vertex `i`
- Can implement as a linked list or hash table.
See Figure 5.9 of [E].
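
A minimal base-R sketch of this representation for a small, made-up undirected graph (the vertex numbers and edges here are just for illustration):

```{r}
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))   # each row is an edge {u, v}
V <- 4                       # abuse of notation: V = number of vertices
A <- vector("list", V)       # A[[i]] will hold the neighbors of vertex i
for (e in seq_len(nrow(edges))) {
  u <- edges[e, 1]; v <- edges[e, 2]
  A[[u]] <- c(A[[u]], v)     # undirected: record the edge in both lists
  A[[v]] <- c(A[[v]], u)
}
A
```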
## Table Groups