-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathhuffman_encoder.hpp
177 lines (156 loc) · 6.36 KB
/
huffman_encoder.hpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
#pragma once
#include "fly/coders/coder.hpp"
#include "fly/coders/huffman/types.hpp"
#include <array>
#include <istream>
#include <memory>
namespace fly {
class BitStreamWriter;
} // namespace fly
namespace fly::coders {
class CoderConfig;
/**
* Implementation of the Encoder interface for Huffman coding. Forms length-limted, canonical
* Huffman codes to encode symbols.
*
* @author Timothy Flynn ([email protected])
* @version July 7, 2019
*/
class HuffmanEncoder : public BinaryEncoder
{
public:
/**
* Constructor.
*
* @param config Reference to coder configuration.
*/
explicit HuffmanEncoder(std::shared_ptr<CoderConfig> const &config) noexcept;
protected:
/**
* Huffman encode a stream.
*
* If the input stream is large, in order to limit memory usage, the stream is encoded in
* chunks. Each chunk is treated as its own input stream, and the encoding sequence is repeated
* for each chunk.
*
* The first bytes of the output stream are reserved as a header. Currently, the header
* contains: the incurred BitStream header, the version of the Huffman coder used to encode the
* stream, the maximum chunk length used to to split large streams (in kilobytes), and the
* maximum allowed Huffman code length:
*
* | 8 bits | 8 bits | 16 bits | 8 bits |
* --------------------------------------------------------------------
* | BitStream header | Version | Chunk length (KB) | Max code length |
*
* The sequence to encode a stream is:
*
* 1. Create a Huffman tree from the input stream.
* 2. Generate standard Huffman codes from the Huffman tree.
* 3. Length-limit the standard Huffman codes.
* 4. Convert the length-limited codes to canonical Huffman codes.
* 5. Encode the canonical codes.
* 6. Encode the input stream using the canonical codes.
*
* This sequence involves iterating over the entire input stream twice (to create the Huffman
* tree and to encode the stream).
*
* The coder does not assume the Huffman codes are retained between calls. Thus, the codes are
* encoded before the input stream (step 5) so that they may be learned during decoding.
*
* Length-limiting is performed on the generated Huffman codes to improve decoder performance.
* Worst-case, a Huffman code could have the same length as the maximum number of symbols.
* Limiting the length of Huffman codes awards a significant decoder performance improvement,
* while only incurring a small cost in compression ratio.
*
* Canonical form is used for its property of generally being describable in fewer bits than
* standard form. When in canonical form, the Huffman codes are sorted by code length. With this
* sorting, the count of the number of symbols for each code length is computed: (N0, N1, N2,
* ..., Nn), where N<n> is the number of symbols of code length <n>. Call the length of this
* list NN.
*
* The encoding of canonical Huffman codes then becomes:
*
* NN,N0,N1,N2,...,Nn,S0,S1,S2,...,Sn
*
* Where S<n> is all symbols of code length <n>.
*
* Encoding the input stream (step 6) consists of reading each symbol from the input stream and
* outputting that symbol's canonical Huffman code.
*
* @param decoded Stream holding the contents to encode.
* @param encoded Stream to store the encoded contents.
*
* @return True if the input stream was successfully encoded.
*/
bool encode_binary(std::istream &decoded, fly::BitStreamWriter &encoded) override;
private:
/**
* Read the stream into a buffer, up to a static maximum size, storing bytes in the chunk
* buffer.
*
* @param decoded Stream holding the contents to encode.
*
* @return The number of bytes that were read.
*/
std::uint32_t read_stream(std::istream &decoded) const;
/**
* Create a Huffman tree from the current chunk buffer.
*
* @param chunk_size The number of bytes the chunk buffer holds.
*/
void create_tree(std::uint32_t chunk_size);
/**
* Create a list of Huffman codes from the generated Huffman tree. The list of codes will be in
* canonical form.
*/
void create_codes();
/**
* Insert a new Huffman code into the list of already sorted codes.
*
* @param code The Huffman code to insert.
*/
void insert_code(HuffmanCode &&code);
/**
* Length-limit the generated Huffman codes to a static maximum size, using a method described
* in Charles Bloom's blog, which is based around the Kraft-McMillan inequality:
*
* https://cbloomrants.blogspot.com/2010/07/07-03-10-length-limitted-huffman-codes.html
*/
void limit_code_lengths();
/**
* Convert the generated list of standard Huffman codes into canonical form. It is assumed that
* the codes are already sorted in accordance with canonical form.
*/
void convert_to_canonical_form();
/**
* Encode the header to the output stream.
*
* @param encoded Stream to store the encoded header.
*/
void encode_header(fly::BitStreamWriter &encoded) const;
/**
* Encode the generated Huffman codes to the output stream.
*
* @param encoded Stream to store the encoded codes.
*/
void encode_codes(fly::BitStreamWriter &encoded) const;
/**
* Encode symbols from the current chunk buffer with the generated list of Huffman codes. The
* list of codes is effectively destroyed as its elements are moved to a map for faster lookups.
*
* @param chunk_size The number of bytes the chunk buffer holds.
* @param encoded Stream to store the encoded symbols.
*/
void encode_symbols(std::uint32_t chunk_size, fly::BitStreamWriter &encoded);
// Configuration.
std::uint32_t const m_chunk_size;
length_type const m_max_code_length;
std::unique_ptr<symbol_type[]> m_chunk_buffer;
// Sized to fit 8-bit ASCII symbols.
std::array<HuffmanCode, 1 << 8> m_huffman_codes;
std::uint16_t m_huffman_codes_size;
// Sized to fit a complete Huffman tree. With 8-bit symbols, a complete tree
// will have a height of 9, and 2^9 - 1 = 511 nodes (round to 512).
std::array<HuffmanNode, 1 << 9> m_huffman_tree;
};
} // namespace fly::coders