Commit 3c90969

Update documentation for 0.6.0.

Parent: d48636c

14 files changed: +450, -201 lines

NEWS.md (+29)

```diff
@@ -1,5 +1,34 @@
 # Parkour News – history of user-visible changes
 
+## 0.6.0 / ?
+
+### Breaking changes
+
+- Deprecate direct invocation of source-shaping functions.
+- Normalize shuffle & sink type/schema arguments to vectors of such.
+- `TextInputFormat` dseq defaults to `:vals` source shape.
+- `AvroKeyInputFormat` dseq defaults to `:keys` source shape.
+- `AvroKeyOutputFormat` dsink defaults to `:keys` sink shape.
+
+### Other changes
+
+- Allow shorthand partition shuffle to specify only key class.
+- Add `dseq/input-paths` for determining dseq input paths.
+- Support direct Avro input via Hadoop filesystem paths.
+- Add `cser` namespace; de/serialize vars as task arguments.
+- Add distributed values (dvals) and documentation.
+- Modify file dsinks to allow implicit transient output paths.
+- Allow csteps to specify default source/sink shapes.
+- Allow in-memory dseqs to specify default source shape.
+- Wait for Hadoop 1.x FS cleanup hook to complete on exit.
+- Add `fexecute` function to job graph API.
+- Use combiner as reducer when reducer not later specified.
+- Extend `reducers` namespace of reducer-based helpers.
+- Add `toolbox` namespace of common task functions.
+- Make tuple sources `r/fold`-able via `map-combine`.
+- Allow `pg/input` to handle a vector of `:input` nodes.
+- Load task-side the same namespaces loaded locally.
+
 ## 0.5.4 / 2014-02-08
 
 - Ensure job-failure clean-up runs only once.
```
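Two of these changes surface directly in the updated word-count example later in this commit: shuffle and sink type arguments are uniformly vectors, and file dsinks may omit an explicit output path to receive an implicit transient one. A minimal before/after sketch, using only calls that appear in this commit (`wc-path` is the illustrative path variable from the old example):

```clj
;; 0.5.x: a seqfile dsink took a vector of key/value classes plus an
;; explicit output path.
(seqf/dsink [Text LongWritable] wc-path)

;; 0.6.0: the type argument is still a vector; omitting the path gives
;; the dsink an implicit transient output path.
(seqf/dsink [Text LongWritable])
```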

README.md (+20, -23)

````diff
@@ -20,7 +20,7 @@ Parkour is available on Clojars. Add this `:dependency` to your Leiningen
 `project.clj`:
 
 ```clj
-[com.damballa/parkour "0.5.4"]
+[com.damballa/parkour "0.6.0"]
 ```
 
 ## Usage
@@ -29,32 +29,20 @@ The [Parkour introduction][intro] contains an overview of the key concepts, but
 here is the classic “word count” example, in Parkour:
 
 ```clj
-(defn mapper
-  {::mr/source-as :vals}
-  [input]
-  (->> input
+(defn word-count-m
+  [coll]
+  (->> coll
     (r/mapcat #(str/split % #"\s+"))
     (r/map #(-> [% 1]))))
 
-(defn reducer
-  {::mr/source-as :keyvalgroups}
-  [input]
-  (r/map (fn [[word counts]]
-           [word (r/reduce + 0 counts)])
-         input))
-
 (defn word-count
-  [conf workdir lines]
-  (let [wc-path (fs/path workdir "word-count")
-        wc-dsink (seqf/dsink [Text LongWritable] wc-path)]
-    (-> (pg/input lines)
-        (pg/map #'mapper)
-        (pg/partition [Text LongWritable])
-        (pg/combine #'reducer)
-        (pg/reduce #'reducer)
-        (pg/output wc-dsink)
-        (pg/execute conf "word-count")
-        first)))
+  [conf lines]
+  (-> (pg/input lines)
+      (pg/map #'word-count-m)
+      (pg/partition [Text LongWritable])
+      (pg/combine #'ptb/keyvalgroups-r #'+)
+      (pg/output (seqf/dsink [Text LongWritable]))
+      (pg/fexecute conf `word-count)))
 ```
@@ -73,6 +61,12 @@ Parkour’s documentation is divided into a number of separate sections:
   Parkour uses to run your code in MapReduce jobs.
 - [Serialization][serialization] – How Parkour integrates Clojure with Hadoop
   serialization mechanisms.
+- [Unified I/O][unified-io] – Unified collection-like local and distributed I/O
+  via Parkour dseqs and dsinks.
+- [Distributed values][dvals] – Parkour’s value-oriented interface to the Hadoop
+  distributed cache.
+- [Multiple I/O][multi-io] – Configuring multiple inputs and/or outputs for
+  single Hadoop MapReduce jobs.
 - [Reducers vs seqs][reducers-vs-seqs] – Why Parkour’s default idiom uses
   reducers, and when to use seqs instead.
 - [Testing][testing] – Patterns for testing Parkour MapReduce jobs.
@@ -102,6 +96,9 @@ Hickey, and is distributed under the Eclipse Public License v1.0.
 [repl]: https://github.com/damballa/parkour/blob/master/doc/repl.md
 [mr-detailed]: https://github.com/damballa/parkour/blob/master/doc/mr-detailed.md
 [serialization]: https://github.com/damballa/parkour/blob/master/doc/serialization.md
+[unified-io]: https://github.com/damballa/parkour/blob/master/doc/unified-io.md
+[dvals]: https://github.com/damballa/parkour/blob/master/doc/dvals.md
+[multi-io]: https://github.com/damballa/parkour/blob/master/doc/multi-io.md
 [reducers-vs-seqs]: https://github.com/damballa/parkour/blob/master/doc/reducers-vs-seqs.md
 [testing]: https://github.com/damballa/parkour/blob/master/doc/testing.md
 [deployment]: https://github.com/damballa/parkour/blob/master/doc/deployment.md
````
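A hypothetical REPL session for the updated example might look like the sketch below; `conf/ig` (a default configuration from `parkour.conf`) and the input path are assumptions for illustration:

```clj
(require '[parkour.conf :as conf]
         '[parkour.io.text :as text])

;; Build a text dseq over a local file, run the word-count job graph,
;; and realize the resulting dseq locally as a word -> count map.
(->> (text/dseq "input.txt")
     (word-count (conf/ig))
     (into {}))
```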

TODO (-28)

```diff
@@ -1,16 +1,8 @@
 * Documentation
 
-** Multiple input/output documentation
 ** Examples
 *** Reduce-side join
 
-* More execution options
-
-Parkour currently inherits Hadoop’s default job-failure behavior when a job’s
-output directory already exists. It would be nice to support other behaviors,
-such as skipping existing, overwriting existing, or a make-like model where
-outputs are overwritten when their inputs have changed.
-
 * Integration for writing other Hadoop classes
 
 ** Input formats
@@ -35,12 +27,6 @@ The ClassLoader approach may not however be possible. It’s not clear if Hadoop
 allows tasks to configure an alternative ClassLoader prior to task
 initialization.
 
-* Make dseqs locally foldable
-
-Right now dseqs may be locally =reduce=-d, but there’s no reason they shouldn’t
-be =fold=-able. The fold implementation should be able to use fork-join to run
-the fold in parallel across the input splits.
-
 * Support for EMR
 
 Probably a separate project. Suggested by ztellman, potentially a Leiningen
@@ -62,11 +48,6 @@ record-reader (and context) for the original input split, allowing the same data
 to be subsequently re-processed. Not certain this is a useful feature, but is
 an interesting idea.
 
-* Extended configuration serialization
-
-Provide an interface which ultimately allows objects to ship themselves via the
-distributed cache when they are serialized into a configuration.
-
 * n-record dseq
 
 A dseq which acts like NLineInputFormat, but wraps an arbitrary existing
@@ -77,12 +58,3 @@ records across <m> mappers).
 
 A dseq which distributes input records via the job configuration. Should
 probably use extended configuration serialization.
-
-* dseq input path access
-
-Provide a multimethod for accessing a dseq's input paths.
-
-* Provide mapreduce utility namespace
-
-Provide a namespace of common mapreduce operations, such as a =first-p=
-partitioner, a =sum-r= reducer, etc.
```

doc/dvals.md (+6, -6)

```diff
@@ -1,4 +1,4 @@
-# dvals
+# Distributed values
 
 Parkour distributed values (dvals) provide a value-oriented interface for using
 the Hadoop distributed cache in Parkour MapReduce applications.
@@ -53,11 +53,11 @@ the value-oriented API provided by dvals.
 
 Parkour dvals are Clojure reference types similar to delays, but which capture a
 function-var plus arguments as explicitly separate and EDN-serializable values.
-These values form an executable and serializable “recipe” for the dval’s value,
-allowing complete and compact serialization of the dval even if its computed
-value supports neither. Dvals may be passed as arguments to MapReduce tasks, in
-which case they deserialize task-side as delays over evaluation of their
-recipes.
+These component values form an executable and serializable “recipe” for the
+dval’s value, allowing complete and compact serialization of the dval even if
+its computed value supports neither. Dvals may be passed as arguments to
+MapReduce tasks, in which case they deserialize task-side as delays over
+evaluation of their recipes.
 
 The `parkour.io.dval` namespace provides two base functions for creating dvals:
```
doc/intro.md (+59, -60)

````diff
@@ -15,13 +15,14 @@ example:
 ...
 :dependencies [...
                [org.codehaus.jsr166-mirror/jsr166y "1.7.0"]
-               [com.damballa/parkour "0.5.4"]
+               [com.damballa/parkour "0.6.0"]
                ...]
 
 :profiles {...
            :provided
            {:dependencies
-            [[org.apache.hadoop/hadoop-core "1.2.1"]]}
+            [[org.apache.hadoop/hadoop-client "2.4.1"]
+             [org.apache.hadoop/hadoop-common "2.4.1"]]}
            ...}
 ...)
 ```
@@ -35,7 +36,7 @@ following are the key ideas Parkour introduces or re-purposes.
 
 ### MapReduce via `reducers` (and lazy seqs)
 
-The Clojure 1.5 `clojure.core.reducers` standard library namespace narrows the
+The Clojure >=1.5 `clojure.core.reducers` standard library namespace narrows the
 idea of a “collection” to “something which may be `reduce`d.” This abstraction
 allows the sequences of key-value tuples processed in Hadoop MapReduce tasks to
 be represented as collections. MapReduce tasks become functions over
@@ -67,7 +68,10 @@ Hadoop and Java Hadoop libraries typically contain a number of static methods
 for configuring Hadoop `Job` objects, handling such tasks as setting input &
 output paths, serialization formats, etc. Parkour codifies these as
 _configuration steps_, or _csteps_: functions which accept a `Job` object as
-their single argument and modify that `Job` to apply some configuration.
+their single argument and modify that `Job` to apply some configuration. The
+crucial difference is that csteps are themselves first-class values, allowing
+dynamic assembly of job configurations from the composition of opaque
+configuration elements.
 
 In practice, Parkour configuration steps are implemented via a protocol
 (`parkour.cstep/ConfigStep`) and associated public application function
@@ -84,8 +88,9 @@ In practice, Parkour configuration steps are implemented via a protocol
 A Parkour distributed sequence configures a job for input from a particular
 location and input format, reifying a function calling the underlying Hadoop
 `Job#setInputFormatClass` etc methods. In addition to `ConfigStep`, dseqs also
-implement the core Clojure `CollReduce` protocol, allowing any Hadoop job input
-source to also be treated as a local reducible collection.
+implement the core Clojure `CollReduce` and `CollFold` protocols, allowing any
+Hadoop job input source to also be treated as a local reducible and foldable
+collection.
 
 #### Distributed sinks (dsinks)
 
@@ -121,46 +126,35 @@ allows adding arbitrary configuration steps to a job node in any stage.
 Here’s the complete classic “word count” example, written using Parkour:
 
 ```clj
-(ns parkour.examples.word-count
+(ns parkour.example.word-count
   (:require [clojure.string :as str]
             [clojure.core.reducers :as r]
             [parkour (conf :as conf) (fs :as fs) (mapreduce :as mr)
-             ,       (graph :as pg) (tool :as tool)]
+             ,       (graph :as pg) (toolbox :as ptb) (tool :as tool)]
             [parkour.io (text :as text) (seqf :as seqf)])
   (:import [org.apache.hadoop.io Text LongWritable]))
 
-(defn mapper
-  {::mr/source-as :vals}
-  [input]
-  (->> input
+(defn word-count-m
+  [coll]
+  (->> coll
     (r/mapcat #(str/split % #"\s+"))
     (r/map #(-> [% 1]))))
 
-(defn reducer
-  {::mr/source-as :keyvalgroups}
-  [input]
-  (r/map (fn [[word counts]]
-           [word (r/reduce + 0 counts)])
-         input))
-
 (defn word-count
-  [conf workdir lines]
-  (let [wc-path (fs/path workdir "word-count")
-        wc-dsink (seqf/dsink [Text LongWritable] wc-path)]
-    (-> (pg/input lines)
-        (pg/map #'mapper)
-        (pg/partition [Text LongWritable])
-        (pg/combine #'reducer)
-        (pg/reduce #'reducer)
-        (pg/output wc-dsink)
-        (pg/execute conf "word-count")
-        first)))
+  [conf lines]
+  (-> (pg/input lines)
+      (pg/map #'word-count-m)
+      (pg/partition [Text LongWritable])
+      (pg/combine #'ptb/keyvalgroups-r #'+)
+      (pg/output (seqf/dsink [Text LongWritable]))
+      (pg/fexecute conf `word-count)))
 
 (defn tool
-  [conf & args]
-  (let [[workdir & inpaths] args
-        lines (apply text/dseq inpaths)]
-    (->> (word-count conf workdir lines) (into {}) prn)))
+  [conf & inpaths]
+  (->> (apply text/dseq inpaths)
+       (word-count conf)
+       (into {})
+       (prn)))
 
 (defn -main
   [& args] (System/exit (tool/run tool args)))
@@ -170,8 +164,8 @@ Let’s walk through some important features of this example.
 
 ### Task vars & adapters
 
-The remote task vars (the arguments to the `map`, `combine`, and `reduce` calls)
-have complete control over execution of their associated tasks. The underlying
+The remote task vars (the arguments to the `map` and `combine` calls) have
+complete control over execution of their associated tasks. The underlying
 interface Parkour exposes models the Hadoop `Mapper` and `Reducer` classes as
 higher-order function, with construction/configuration invoking the initial
 function, then task-execution invoking the function returned by the former.
@@ -187,23 +181,28 @@ Parkour-Hadoop interface.
 ### Inputs as collections
 
 In the default Hadoop Java interface, Hadoop calls a user-supplied method for
-each input tuple. Parkour instead calls the task function with the entire set
-of local input tuples as a single reducible collection, and expects a reducible
-output collection as the result.
-
-The input collections are directly reducible as vectors of key/value pairs, but
-the `parkour.mapreduce` namespace contains functions to efficiently reshape the
-task-inputs, including `vals` to access just the input values and (reduce-side)
-`keyvalgroups` to access grouping keys and grouped sub-collections of values.
-This model also allows access to more esoteric shapes generally not considered
-available from the raw Java interface, such as `keykeygroups`. These functions
-may be invoked directly, or passed to the `collfn` adapter via `::mr/source-as`
-metadata as in the example.
-
-Parkour also defaults to emitting the result collection as key/value pairs, but
-`pakour.mapreduce` contains a `sink-as` function (and supports `collfn` adapter
-`::mr/sink-as` metadata) for specifying alternative shapes for task output. The
-`sink` function allows explicit sinking to context objects or other sinks.
+each input tuple, which is then expected to write zero or more output tuples.
+Parkour instead calls task functions with the entire set of local input tuples
+as a single reducible collection, and expects a reducible output collection as
+the result.
+
+In the underlying Hadoop interfaces, all input and output happens in terms of
+key/value pairs. Parkour’s input collections are directly reducible as vectors
+of those key/value pairs, but also support efficient “reshaping” to iterate over
+just the relevant data for a particular task. These shapes include `vals` to
+access just the input values and (reduce-side) `keyvalgroups` to access grouping
+keys and grouped sub-collections of values. The `collfn` adapter allows
+specifying these shapes as keywords via `::mr/source-as` metadata on task vars.
+
+Parkour usually defaults to emitting the result collection as key/value pairs,
+but the `collfn` adapter supports `::mr/sink-as` metadata for specifying
+alternative shapes for task output.
+
+Additionally, dseqs and dsinks may supply different default input and output
+shapes. The text dseq used in the word count example specifies `vals` as the
+default, allowing unadorned task vars to receive the input collection as a
+collection of the input text lines (versus the underlying Hadoop tuples of file
+offset and text line).
 
 ### Automatic wrapping & unwrapping
 
@@ -222,10 +221,10 @@ with Hadoop’s serialization containers.
 
 ### Results
 
-The return value of the `execute` function is a vector of dseqs for the job
-graph leaf node results. These dseqs may be consumed locally as in the example,
-or used as inputs for additional jobs. When locally `reduce`d, dseqs yield
-key-value vectors of the `unwrap`ed values of the objects produced by the
-backing Hadoop input format. The `parkour.io.dseq/source-for` function can
-provide direct access to the raw wrapper objects, as well as allowing dseqs to
-be realized as lazy sequences instead of reducers.
+The return value of the `fexecute` function is a dseq for the job graph leaf
+node result. That dseq may be consumed locally as in the example, or used as
+the input for additional jobs. When locally `reduce`d, dseqs yield key-value
+vectors of the `unwrap`ed values of the objects produced by the backing Hadoop
+input format. The `parkour.io.dseq/source-for` function can provide direct
+access to the raw wrapper objects, as well as allowing dseqs to be realized as
+lazy sequences instead of reducers.
````
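Since the revised intro stresses that csteps are first-class, one-argument functions over Hadoop `Job` objects, a minimal hand-rolled cstep might look like the following sketch (the property name is made up for illustration):

```clj
(import 'org.apache.hadoop.mapreduce.Job)

;; Any function of one Job argument that applies some configuration can
;; serve as a cstep; here we set a hypothetical property.
(defn example-cstep
  [^Job job]
  (.set (.getConfiguration job) "example.property" "value"))
```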

doc/motivation.md (+2, -2)

```diff
@@ -69,7 +69,7 @@ code re-writing).
 Parkour pushes most composition of computation back to the language layer, as
 explicit composition of Clojure functions within MapReduce task functions. Task
 functions act on the portion of a distributed collection available within an
-individual task. The prevents Parkour from providing explicit cross-task
+individual task. This prevents Parkour from providing explicit cross-task
 operations, but allows task functions to call any Clojure collection function,
 not just the subset of methods provided by a distributed collection type. Users
 must manually divide computations into tasks, but those tasks may combine into
@@ -79,7 +79,7 @@ build.
 ### Cascalog
 
 [Cascalog][cascalog] is the elephant in the room. Why Parkour when Cascalog
-exists? And especially when Cascalog 2 is right around the corner?
+exists?
 
 Cascalog and Cascading are both excellent pieces of engineering, but introduce
 significant complexity. Fundamentally, Cascalog is not an integration layer for
```
