Commit 3c90969

Update documentation for 0.6.0.

Parent: d48636c

14 files changed: +450, -201 lines

NEWS.md (+29)

```diff
@@ -1,5 +1,34 @@
 # Parkour News – history of user-visible changes
 
+## 0.6.0 / ?
+
+### Breaking changes
+
+- Deprecate direct invocation of source-shaping functions.
+- Normalize shuffle & sink type/schema arguments to vectors of such.
+- `TextInputFormat` dseq defaults to `:vals` source shape.
+- `AvroKeyInputFormat` dseq defaults to `:keys` source shape.
+- `AvroKeyOutputFormat` dsink defaults to `:keys` sink shape.
+
+### Other changes
+
+- Allow shorthand partition shuffle to specify only key class.
+- Add `dseq/input-paths` for determining dseq input paths.
+- Support direct Avro input via Hadoop filesystem paths.
+- Add `cser` namespace; de/serialize vars as task arguments.
+- Add distributed values (dvals) and documentation.
+- Modify file dsinks to allow implicit transient output paths.
+- Allow csteps to specify default source/sink shapes.
+- Allow in-memory dseqs to specify default source shape.
+- Wait for Hadoop 1.x FS cleanup hook to complete on exit.
+- Add `fexecute` function to job graph API.
+- Use combiner as reducer when reducer not later specified.
+- Extend `reducers` namespace of reducer-based helpers.
+- Add `toolbox` namespace of common task functions.
+- Make tuple sources `r/fold`-able via `map-combine`.
+- Allow `pg/input` to handle a vector of `:input` nodes.
+- Load task-side the same namespaces loaded locally.
+
 ## 0.5.4 / 2014-02-08
 
 - Ensure job-failure clean-up runs only once.
```
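Two of these changes surface directly in the updated word-count example later in this commit: shuffle and sink type arguments are uniformly vectors, and file dsinks may omit an explicit output path to receive an implicit transient one. A minimal before/after sketch, using only calls that appear in this commit (`wc-path` is the illustrative path variable from the old example):

```clj
;; 0.5.x: a seqfile dsink took a vector of key/value classes plus an
;; explicit output path.
(seqf/dsink [Text LongWritable] wc-path)

;; 0.6.0: the type argument is still a vector; omitting the path gives
;; the dsink an implicit transient output path.
(seqf/dsink [Text LongWritable])
```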

README.md (+20, -23)

````diff
@@ -20,7 +20,7 @@ Parkour is available on Clojars. Add this `:dependency` to your Leiningen
 `project.clj`:
 
 ```clj
-[com.damballa/parkour "0.5.4"]
+[com.damballa/parkour "0.6.0"]
 ```
 
 ## Usage
@@ -29,32 +29,20 @@ The [Parkour introduction][intro] contains an overview of the key concepts, but
 here is the classic “word count” example, in Parkour:
 
 ```clj
-(defn mapper
-  {::mr/source-as :vals}
-  [input]
-  (->> input
+(defn word-count-m
+  [coll]
+  (->> coll
     (r/mapcat #(str/split % #"\s+"))
     (r/map #(-> [% 1]))))
 
-(defn reducer
-  {::mr/source-as :keyvalgroups}
-  [input]
-  (r/map (fn [[word counts]]
-           [word (r/reduce + 0 counts)])
-         input))
-
 (defn word-count
-  [conf workdir lines]
-  (let [wc-path (fs/path workdir "word-count")
-        wc-dsink (seqf/dsink [Text LongWritable] wc-path)]
-    (-> (pg/input lines)
-        (pg/map #'mapper)
-        (pg/partition [Text LongWritable])
-        (pg/combine #'reducer)
-        (pg/reduce #'reducer)
-        (pg/output wc-dsink)
-        (pg/execute conf "word-count")
-        first)))
+  [conf lines]
+  (-> (pg/input lines)
+      (pg/map #'word-count-m)
+      (pg/partition [Text LongWritable])
+      (pg/combine #'ptb/keyvalgroups-r #'+)
+      (pg/output (seqf/dsink [Text LongWritable]))
+      (pg/fexecute conf `word-count)))
 ```
@@ -73,6 +61,12 @@ Parkour’s documentation is divided into a number of separate sections:
   Parkour uses to run your code in MapReduce jobs.
 - [Serialization][serialization] – How Parkour integrates Clojure with Hadoop
   serialization mechanisms.
+- [Unified I/O][unified-io] – Unified collection-like local and distributed I/O
+  via Parkour dseqs and dsinks.
+- [Distributed values][dvals] – Parkour’s value-oriented interface to the Hadoop
+  distributed cache.
+- [Multiple I/O][multi-io] – Configuring multiple inputs and/or outputs for
+  single Hadoop MapReduce jobs.
 - [Reducers vs seqs][reducers-vs-seqs] – Why Parkour’s default idiom uses
   reducers, and when to use seqs instead.
 - [Testing][testing] – Patterns for testing Parkour MapReduce jobs.
@@ -102,6 +96,9 @@ Hickey, and is distributed under the Eclipse Public License v1.0.
 [repl]: https://github.com/damballa/parkour/blob/master/doc/repl.md
 [mr-detailed]: https://github.com/damballa/parkour/blob/master/doc/mr-detailed.md
 [serialization]: https://github.com/damballa/parkour/blob/master/doc/serialization.md
+[unified-io]: https://github.com/damballa/parkour/blob/master/doc/unified-io.md
+[dvals]: https://github.com/damballa/parkour/blob/master/doc/dvals.md
+[multi-io]: https://github.com/damballa/parkour/blob/master/doc/multi-io.md
 [reducers-vs-seqs]: https://github.com/damballa/parkour/blob/master/doc/reducers-vs-seqs.md
 [testing]: https://github.com/damballa/parkour/blob/master/doc/testing.md
 [deployment]: https://github.com/damballa/parkour/blob/master/doc/deployment.md
````
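A hypothetical REPL session for the updated example might look like the sketch below; `conf/ig` (a default configuration from `parkour.conf`) and the input path are assumptions for illustration:

```clj
(require '[parkour.conf :as conf]
         '[parkour.io.text :as text])

;; Build a text dseq over a local file, run the word-count job graph,
;; and realize the resulting dseq locally as a word -> count map.
(->> (text/dseq "input.txt")
     (word-count (conf/ig))
     (into {}))
```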

TODO (-28)

```diff
@@ -1,16 +1,8 @@
 * Documentation
 
-** Multiple input/output documentation
 ** Examples
 *** Reduce-side join
 
-* More execution options
-
-Parkour currently inherits Hadoop’s default job-failure behavior when a job’s
-output directory already exists. It would be nice to support other behaviors,
-such as skipping existing, overwriting existing, or a make-like model where
-outputs are overwritten when their inputs have changed.
-
 * Integration for writing other Hadoop classes
 
 ** Input formats
@@ -35,12 +27,6 @@ The ClassLoader approach may not however be possible. It’s not clear if Hadoop
 allows tasks to configure an alternative ClassLoader prior to task
 initialization.
 
-* Make dseqs locally foldable
-
-Right now dseqs may be locally =reduce=-d, but there’s no reason they shouldn’t
-be =fold=-able. The fold implementation should be able to use fork-join to run
-the fold in parallel across the input splits.
-
 * Support for EMR
 
 Probably a separate project. Suggested by ztellman, potentially a Leiningen
@@ -62,11 +48,6 @@ record-reader (and context) for the original input split, allowing the same data
 to be subsequently re-processed. Not certain this is a useful feature, but is
 an interesting idea.
 
-* Extended configuration serialization
-
-Provide an interface which ultimately allows objects to ship themselves via the
-distributed cache when they are serialized into a configuration.
-
 * n-record dseq
 
 A dseq which acts like NLineInputFormat, but wraps an arbitrary existing
@@ -77,12 +58,3 @@ records across <m> mappers).
 
 A dseq which distributes input records via the job configuration. Should
 probably use extended configuration serialization.
-
-* dseq input path access
-
-Provide a multimethod for accessing a dseq's input paths.
-
-* Provide mapreduce utility namespace
-
-Provide a namespace of common mapreduce operations, such as a =first-p=
-partitioner, a =sum-r= reducer, etc.
```

doc/dvals.md (+6, -6)

```diff
@@ -1,4 +1,4 @@
-# dvals
+# Distributed values
 
 Parkour distributed values (dvals) provide a value-oriented interface for using
 the Hadoop distributed cache in Parkour MapReduce applications.
@@ -53,11 +53,11 @@ the value-oriented API provided by dvals.
 
 Parkour dvals are Clojure reference types similar to delays, but which capture a
 function-var plus arguments as explicitly separate and EDN-serializable values.
-These values form an executable and serializable “recipe” for the dval’s value,
-allowing complete and compact serialization of the dval even if its computed
-value supports neither. Dvals may be passed as arguments to MapReduce tasks, in
-which case they deserialize task-side as delays over evaluation of their
-recipes.
+These component values form an executable and serializable “recipe” for the
+dval’s value, allowing complete and compact serialization of the dval even if
+its computed value supports neither. Dvals may be passed as arguments to
+MapReduce tasks, in which case they deserialize task-side as delays over
+evaluation of their recipes.
 
 The `parkour.io.dval` namespace provides two base functions for creating dvals:
```
doc/intro.md (+59, -60)

````diff
@@ -15,13 +15,14 @@ example:
 ...
 :dependencies [...
                [org.codehaus.jsr166-mirror/jsr166y "1.7.0"]
-               [com.damballa/parkour "0.5.4"]
+               [com.damballa/parkour "0.6.0"]
                ...]
 
 :profiles {...
            :provided
            {:dependencies
-            [[org.apache.hadoop/hadoop-core "1.2.1"]]}
+            [[org.apache.hadoop/hadoop-client "2.4.1"]
+             [org.apache.hadoop/hadoop-common "2.4.1"]]}
            ...}
 ...)
 ```
@@ -35,7 +36,7 @@ following are the key ideas Parkour introduces or re-purposes.
 
 ### MapReduce via `reducers` (and lazy seqs)
 
-The Clojure 1.5 `clojure.core.reducers` standard library namespace narrows the
+The Clojure >=1.5 `clojure.core.reducers` standard library namespace narrows the
 idea of a “collection” to “something which may be `reduce`d.” This abstraction
 allows the sequences of key-value tuples processed in Hadoop MapReduce tasks to
 be represented as collections. MapReduce tasks become functions over
@@ -67,7 +68,10 @@ Hadoop and Java Hadoop libraries typically contain a number of static methods
 for configuring Hadoop `Job` objects, handling such tasks as setting input &
 output paths, serialization formats, etc. Parkour codifies these as
 _configuration steps_, or _csteps_: functions which accept a `Job` object as
-their single argument and modify that `Job` to apply some configuration.
+their single argument and modify that `Job` to apply some configuration. The
+crucial difference is that csteps are themselves first-class values, allowing
+dynamic assembly of job configurations from the composition of opaque
+configuration elements.
 
 In practice, Parkour configuration steps are implemented via a protocol
 (`parkour.cstep/ConfigStep`) and associated public application function
@@ -84,8 +88,9 @@ In practice, Parkour configuration steps are implemented via a protocol
 A Parkour distributed sequence configures a job for input from a particular
 location and input format, reifying a function calling the underlying Hadoop
 `Job#setInputFormatClass` etc methods. In addition to `ConfigStep`, dseqs also
-implement the core Clojure `CollReduce` protocol, allowing any Hadoop job input
-source to also be treated as a local reducible collection.
+implement the core Clojure `CollReduce` and `CollFold` protocols, allowing any
+Hadoop job input source to also be treated as a local reducible and foldable
+collection.
 
 #### Distributed sinks (dsinks)
 
@@ -121,46 +126,35 @@ allows adding arbitrary configuration steps to a job node in any stage.
 Here’s the complete classic “word count” example, written using Parkour:
 
 ```clj
-(ns parkour.examples.word-count
+(ns parkour.example.word-count
   (:require [clojure.string :as str]
             [clojure.core.reducers :as r]
             [parkour (conf :as conf) (fs :as fs) (mapreduce :as mr)
-             ,       (graph :as pg) (tool :as tool)]
+             ,       (graph :as pg) (toolbox :as ptb) (tool :as tool)]
             [parkour.io (text :as text) (seqf :as seqf)])
   (:import [org.apache.hadoop.io Text LongWritable]))
 
-(defn mapper
-  {::mr/source-as :vals}
-  [input]
-  (->> input
+(defn word-count-m
+  [coll]
+  (->> coll
     (r/mapcat #(str/split % #"\s+"))
     (r/map #(-> [% 1]))))
 
-(defn reducer
-  {::mr/source-as :keyvalgroups}
-  [input]
-  (r/map (fn [[word counts]]
-           [word (r/reduce + 0 counts)])
-         input))
-
 (defn word-count
-  [conf workdir lines]
-  (let [wc-path (fs/path workdir "word-count")
-        wc-dsink (seqf/dsink [Text LongWritable] wc-path)]
-    (-> (pg/input lines)
-        (pg/map #'mapper)
-        (pg/partition [Text LongWritable])
-        (pg/combine #'reducer)
-        (pg/reduce #'reducer)
-        (pg/output wc-dsink)
-        (pg/execute conf "word-count")
-        first)))
+  [conf lines]
+  (-> (pg/input lines)
+      (pg/map #'word-count-m)
+      (pg/partition [Text LongWritable])
+      (pg/combine #'ptb/keyvalgroups-r #'+)
+      (pg/output (seqf/dsink [Text LongWritable]))
+      (pg/fexecute conf `word-count)))
 
 (defn tool
-  [conf & args]
-  (let [[workdir & inpaths] args
-        lines (apply text/dseq inpaths)]
-    (->> (word-count conf workdir lines) (into {}) prn)))
+  [conf & inpaths]
+  (->> (apply text/dseq inpaths)
+       (word-count conf)
+       (into {})
+       (prn)))
 
 (defn -main
   [& args] (System/exit (tool/run tool args)))
@@ -170,8 +164,8 @@ Let’s walk through some important features of this example.
 
 ### Task vars & adapters
 
-The remote task vars (the arguments to the `map`, `combine`, and `reduce` calls)
-have complete control over execution of their associated tasks. The underlying
+The remote task vars (the arguments to the `map` and `combine` calls) have
+complete control over execution of their associated tasks. The underlying
 interface Parkour exposes models the Hadoop `Mapper` and `Reducer` classes as
 higher-order function, with construction/configuration invoking the initial
 function, then task-execution invoking the function returned by the former.
@@ -187,23 +181,28 @@ Parkour-Hadoop interface.
 ### Inputs as collections
 
 In the default Hadoop Java interface, Hadoop calls a user-supplied method for
-each input tuple. Parkour instead calls the task function with the entire set
-of local input tuples as a single reducible collection, and expects a reducible
-output collection as the result.
-
-The input collections are directly reducible as vectors of key/value pairs, but
-the `parkour.mapreduce` namespace contains functions to efficiently reshape the
-task-inputs, including `vals` to access just the input values and (reduce-side)
-`keyvalgroups` to access grouping keys and grouped sub-collections of values.
-This model also allows access to more esoteric shapes generally not considered
-available from the raw Java interface, such as `keykeygroups`. These functions
-may be invoked directly, or passed to the `collfn` adapter via `::mr/source-as`
-metadata as in the example.
-
-Parkour also defaults to emitting the result collection as key/value pairs, but
-`pakour.mapreduce` contains a `sink-as` function (and supports `collfn` adapter
-`::mr/sink-as` metadata) for specifying alternative shapes for task output. The
-`sink` function allows explicit sinking to context objects or other sinks.
+each input tuple, which is then expected to write zero or more output tuples.
+Parkour instead calls task functions with the entire set of local input tuples
+as a single reducible collection, and expects a reducible output collection as
+the result.
+
+In the underlying Hadoop interfaces, all input and output happens in terms of
+key/value pairs. Parkour’s input collections are directly reducible as vectors
+of those key/value pairs, but also support efficient “reshaping” to iterate over
+just the relevant data for a particular task. These shapes include `vals` to
+access just the input values and (reduce-side) `keyvalgroups` to access grouping
+keys and grouped sub-collections of values. The `collfn` adapter allows
+specifying these shapes as keywords via `::mr/source-as` metadata on task vars.
+
+Parkour usually defaults to emitting the result collection as key/value pairs,
+but the `collfn` adapter supports `::mr/sink-as` metadata for specifying
+alternative shapes for task output.
+
+Additionally, dseqs and dsinks may supply different default input and output
+shapes. The text dseq used in the word count example specifies `vals` as the
+default, allowing unadorned task vars to receive the input collection as a
+collection of the input text lines (versus the underlying Hadoop tuples of file
+offset and text line).
 
 ### Automatic wrapping & unwrapping
 
@@ -222,10 +221,10 @@ with Hadoop’s serialization containers.
 
 ### Results
 
-The return value of the `execute` function is a vector of dseqs for the job
-graph leaf node results. These dseqs may be consumed locally as in the example,
-or used as inputs for additional jobs. When locally `reduce`d, dseqs yield
-key-value vectors of the `unwrap`ed values of the objects produced by the
-backing Hadoop input format. The `parkour.io.dseq/source-for` function can
-provide direct access to the raw wrapper objects, as well as allowing dseqs to
-be realized as lazy sequences instead of reducers.
+The return value of the `fexecute` function is a dseq for the job graph leaf
+node result. That dseq may be consumed locally as in the example, or used as
+the input for additional jobs. When locally `reduce`d, dseqs yield key-value
+vectors of the `unwrap`ed values of the objects produced by the backing Hadoop
+input format. The `parkour.io.dseq/source-for` function can provide direct
+access to the raw wrapper objects, as well as allowing dseqs to be realized as
+lazy sequences instead of reducers.
````
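Since the revised intro stresses that csteps are first-class, one-argument functions over Hadoop `Job` objects, a minimal hand-rolled cstep might look like the following sketch (the property name is made up for illustration):

```clj
(import 'org.apache.hadoop.mapreduce.Job)

;; Any function of one Job argument that applies some configuration can
;; serve as a cstep; here we set a hypothetical property.
(defn example-cstep
  [^Job job]
  (.set (.getConfiguration job) "example.property" "value"))
```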

doc/motivation.md (+2, -2)

```diff
@@ -69,7 +69,7 @@ code re-writing).
 Parkour pushes most composition of computation back to the language layer, as
 explicit composition of Clojure functions within MapReduce task functions. Task
 functions act on the portion of a distributed collection available within an
-individual task. The prevents Parkour from providing explicit cross-task
+individual task. This prevents Parkour from providing explicit cross-task
 operations, but allows task functions to call any Clojure collection function,
 not just the subset of methods provided by a distributed collection type. Users
 must manually divide computations into tasks, but those tasks may combine into
@@ -79,7 +79,7 @@ build.
 ### Cascalog
 
 [Cascalog][cascalog] is the elephant in the room. Why Parkour when Cascalog
-exists? And especially when Cascalog 2 is right around the corner?
+exists?
 
 Cascalog and Cascading are both excellent pieces of engineering, but introduce
 significant complexity. Fundamentally, Cascalog is not an integration layer for
```
