diff --git a/docs/docset.yml b/docs/docset.yml
index fb4ebee39..c104c444a 100644
--- a/docs/docset.yml
+++ b/docs/docset.yml
@@ -6,6 +6,7 @@ toc:
   - toc: reference
   - toc: release-notes
 subs:
+  version: "9.0.0"
   es: "Elasticsearch"
   esh: "ES-Hadoop"
   esh-full: "Elasticsearch for Apache Hadoop"
diff --git a/docs/reference/apache-spark-support.md b/docs/reference/apache-spark-support.md
index 74c96cbde..b09952a8b 100644
--- a/docs/reference/apache-spark-support.md
+++ b/docs/reference/apache-spark-support.md
@@ -510,7 +510,11 @@ In case where the results from {{es}} need to be in JSON format (typically to be
 
 
 #### Type conversion [spark-type-conversion]
 
 ::::{important}
-When dealing with multi-value/array fields, please see [this](/reference/mapping-types.md#mapping-multi-values) section and in particular [these](/reference/configuration.md#cfg-field-info) configuration options. IMPORTANT: If automatic index creation is used, please review [this](/reference/mapping-types.md#auto-mapping-type-loss) section for more information.
+When dealing with multi-value/array fields, please see [this](/reference/mapping-types.md#mapping-multi-values) section and in particular [these](/reference/configuration.md#cfg-field-info) configuration options.
+::::
+
+::::{important}
+If automatic index creation is used, please review [this](/reference/mapping-types.md#auto-mapping-type-loss) section for more information.
 ::::
@@ -562,7 +566,7 @@ Added in 5.0.
 ::::
 
 
-[TBC: FANCY QUOTE]
+% [TBC: FANCY QUOTE]
 Spark Streaming is an extension on top of the core Spark functionality that allows near real time processing of stream data. Spark Streaming works around the idea of `DStream`s, or *Discretized Streams*. `DStreams` operate by collecting newly arrived records into a small `RDD` and executing it. This repeats every few seconds with a new `RDD` in a process called *microbatching*. The `DStream` api includes many of the same processing operations as the `RDD` api, plus a few other streaming specific methods.
 
 elasticsearch-hadoop provides native integration with Spark Streaming as of version 5.0. When using the elasticsearch-hadoop Spark Streaming support, {{es}} can be targeted as an output location to index data into from a Spark Streaming job in the same way that one might persist the results from an `RDD`. Though, unlike `RDD`s, you are unable to read data out of {{es}} using a `DStream` due to the continuous nature of it.
@@ -1074,7 +1078,7 @@ Added in 2.1.
 ::::
 
 
-[TBC: FANCY QUOTE]
+% [TBC: FANCY QUOTE]
 On top of the core Spark support, elasticsearch-hadoop also provides integration with Spark SQL. In other words, {{es}} becomes a *native* source for Spark SQL so that data can be indexed and queried from Spark SQL *transparently*.
 
 ::::{important}
@@ -1210,7 +1214,7 @@ val df = sql.load( <1>
 ```
 
 1. `SQLContext` *experimental* `load` method for arbitrary data sources
-2. path or resource to load - in this case the index/type in {es}
+2. path or resource to load - in this case the index/type in {{es}}
 3. the data source provider - `org.elasticsearch.spark.sql`
 
 
@@ -1225,7 +1229,7 @@ val df = sql.read <1>
 ```
 
 1. `SQLContext` *experimental* `read` method for arbitrary data sources
 2. the data source provider - `org.elasticsearch.spark.sql`
-3. path or resource to load - in this case the index/type in {es}
+3. path or resource to load - in this case the index/type in {{es}}
 
 In Spark 1.5, this can be further simplified to:
@@ -1441,8 +1445,8 @@ println(people.schema.treeString) <4>
 ```
 
 1. Spark SQL Scala imports
 2. elasticsearch-hadoop SQL Scala imports
-3. create a `DataFrame` backed by the `spark/people` index in {es}
-4. the `DataFrame` associated schema discovered from {es}
+3. create a `DataFrame` backed by the `spark/people` index in {{es}}
+4. the `DataFrame` associated schema discovered from {{es}}
 5. notice how the `age` field was transformed into a `Long` when using the default {{es}} mapping as discussed in the [*Mapping and Types*](/reference/mapping-types.md) chapter.
 
@@ -1506,7 +1510,11 @@ DataFrame people = JavaEsSparkSQL.esDF(sql, "spark/people", "?q=Smith"); <1>
 
 
 #### Spark SQL Type conversion [spark-sql-type-conversion]
 
 ::::{important}
-When dealing with multi-value/array fields, please see [this](/reference/mapping-types.md#mapping-multi-values) section and in particular [these](/reference/configuration.md#cfg-field-info) configuration options. IMPORTANT: If automatic index creation is used, please review [this](/reference/mapping-types.md#auto-mapping-type-loss) section for more information.
+When dealing with multi-value/array fields, please see [this](/reference/mapping-types.md#mapping-multi-values) section and in particular [these](/reference/configuration.md#cfg-field-info) configuration options.
+::::
+
+::::{important}
+If automatic index creation is used, please review [this](/reference/mapping-types.md#auto-mapping-type-loss) section for more information.
 ::::
@@ -1547,7 +1555,7 @@ Added in 6.0.
 ::::
 
 
-[TBC: FANCY QUOTE]
+% [TBC: FANCY QUOTE]
 Released as an experimental feature in Spark 2.0, Spark Structured Streaming provides a unified streaming and batch interface built into the Spark SQL integration. As of elasticsearch-hadoop 6.0, we provide native functionality to index streaming data into {{es}}.
 
 ::::{important}
@@ -1601,7 +1609,7 @@ people.writeStream
 3. Instead of calling `read`, call `readStream` to get instance of `DataStreamReader`
 4. Read a directory of text files continuously and convert them into `Person` objects
 5. Provide a location to save the offsets and commit logs for the streaming query
-6. Start the stream using the `"es"` format to index the contents of the `Dataset` continuously to {es}
+6. Start the stream using the `"es"` format to index the contents of the `Dataset` continuously to {{es}}
 
 
 ::::{warning}
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
index f19cd9eb0..4d4acac21 100644
--- a/docs/reference/configuration.md
+++ b/docs/reference/configuration.md
@@ -730,7 +730,11 @@ Added in 2.2.
 : Whether the use the system Socks proxy properties (namely `socksProxyHost` and `socksProxyHost`) or not
 
 ::::{note}
-elasticsearch-hadoop allows proxy settings to be applied only to its connection using the setting above. Take extra care when there is already a JVM-wide proxy setting (typically through system properties) to avoid unexpected behavior. IMPORTANT: The semantics of these properties are described in the JVM [docs](http://docs.oracle.com/javase/8/docs/api/java/net/doc-files/net-properties.md#Proxies). In some cases, setting up the JVM property `java.net.useSystemProxies` to `true` works better then setting these properties manually.
+elasticsearch-hadoop allows proxy settings to be applied only to its connection using the setting above. Take extra care when there is already a JVM-wide proxy setting (typically through system properties) to avoid unexpected behavior.
+::::
+
+::::{important}
+The semantics of these properties are described in the JVM [docs](http://docs.oracle.com/javase/8/docs/api/java/net/doc-files/net-properties.md#Proxies). In some cases, setting the JVM property `java.net.useSystemProxies` to `true` works better than setting these properties manually.
 ::::
 
 
diff --git a/docs/reference/error-handlers.md b/docs/reference/error-handlers.md
index 94456b8de..c82caff1f 100644
--- a/docs/reference/error-handlers.md
+++ b/docs/reference/error-handlers.md
@@ -28,7 +28,7 @@ Elasticsearch for Apache Hadoop provides an API to handle document level errors
 
 * The raw JSON bulk entry that was tried
 * Error message
 * HTTP status code for the document
-* Number of times that the current document has been sent to {es}
+* Number of times that the current document has been sent to {{es}}
 
 There are a few default error handlers provided by the connector:
@@ -622,7 +622,9 @@ Elasticsearch for Apache Hadoop provides an API to handle document level deseria
 
 * The raw JSON search result that was tried
 * Exception encountered
 
-Note: Deserialization Error Handlers only allow handling of errors that occur when parsing documents from scroll responses. It may be possible that a search result can be successfully read, but is still malformed, thus causing an exception when it is used in a completely different part of the framework. This Error Handler is called from the top of the most reasonable place to handle exceptions in the scroll reading process, but this does not encapsulate all logic for each integration.
+::::{note}
+Deserialization Error Handlers only allow handling of errors that occur when parsing documents from scroll responses. It may be possible that a search result can be successfully read, but is still malformed, thus causing an exception when it is used in a completely different part of the framework. This Error Handler is called from the top of the most reasonable place to handle exceptions in the scroll reading process, but this does not encapsulate all logic for each integration.
+::::
 
 There are a few default error handlers provided by the connector:
diff --git a/docs/reference/installation.md b/docs/reference/installation.md
index c71ef5cc7..8bffeedfc 100644
--- a/docs/reference/installation.md
+++ b/docs/reference/installation.md
@@ -7,11 +7,11 @@ navigation_title: Installation
 
 elasticsearch-hadoop binaries can be obtained either by downloading them from the [elastic.co](http://elastic.co) site as a ZIP (containing project jars, sources and documentation) or by using any [Maven](http://maven.apache.org/)-compatible tool with the following dependency:
 
-```xml
+```xml subs=true
 <dependency>
   <groupId>org.elasticsearch</groupId>
   <artifactId>elasticsearch-hadoop</artifactId>
-  <version>9.0.0-beta1</version>
+  <version>{{version}}</version>
 </dependency>
 ```
@@ -24,33 +24,33 @@ elasticsearch-hadoop binary is suitable for Hadoop 2.x (also known as YARN) envi
 
 In addition to the *uber* jar, elasticsearch-hadoop provides minimalistic jars for each integration, tailored for those who use just *one* module (in all other situations the `uber` jar is recommended); the jars are smaller in size and use a dedicated pom, covering only the needed dependencies. These are available under the same `groupId`, using an `artifactId` with the pattern `elasticsearch-hadoop-{{integration}}`:
 
-```xml
+```xml subs=true
 <dependency>
   <groupId>org.elasticsearch</groupId>
   <artifactId>elasticsearch-hadoop-mr</artifactId> <1>
-  <version>9.0.0-beta1</version>
+  <version>{{version}}</version>
 </dependency>
 ```
 
 1. *mr* artifact
 
 
-```xml
+```xml subs=true
 <dependency>
   <groupId>org.elasticsearch</groupId>
   <artifactId>elasticsearch-hadoop-hive</artifactId> <1>
-  <version>9.0.0-beta1</version>
+  <version>{{version}}</version>
 </dependency>
 ```
 
 1. *hive* artifact
 
 
-```xml
+```xml subs=true
 <dependency>
   <groupId>org.elasticsearch</groupId>
   <artifactId>elasticsearch-spark-30_2.12</artifactId> <1>
-  <version>9.0.0-beta1</version>
+  <version>{{version}}</version>
 </dependency>
 ```
 
diff --git a/docs/reference/kerberos.md b/docs/reference/kerberos.md
index fc30789d5..bfea696d5 100644
--- a/docs/reference/kerberos.md
+++ b/docs/reference/kerberos.md
@@ -145,7 +145,7 @@ if (!job.waitForCompletion(true)) { <3>
 ```
 
 1. Creating a new job instance
-2. EsMapReduceUtil obtains job delegation tokens for {es}
+2. EsMapReduceUtil obtains job delegation tokens for {{es}}
 3. Submit the job to the cluster
 
diff --git a/docs/reference/runtime-options.md b/docs/reference/runtime-options.md
index 61aaba5e5..097123c91 100644
--- a/docs/reference/runtime-options.md
+++ b/docs/reference/runtime-options.md
@@ -15,7 +15,7 @@ Unfortunately, these settings need to be setup **manually** **before** the job /
 
 ## Speculative execution [_speculative_execution]
 
-[TBC: FANCY QUOTE]
+% [TBC: FANCY QUOTE]
 In other words, speculative execution is an **optimization**, enabled by default, that allows Hadoop to create duplicates tasks of those which it considers hanged or slowed down. When doing data crunching or reading resources, having duplicate tasks is harmless and means at most a waste of computation resources; however when writing data to an external store, this can cause data corruption through duplicates or unnecessary updates.
 
 Since the *speculative execution* behavior can be triggered by external factors (such as network or CPU load which in turn cause false positive) even in stable environments (virtualized clusters are particularly prone to this) and has a direct impact on data, elasticsearch-hadoop disables this optimization for data safety. Please check your library setting and disable this feature. If you encounter more data then expected, double and triple check this setting.
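As a companion to the speculative-execution guidance touched by the last hunk, a minimal sketch of how the feature is typically switched off before writing to {{es}} — assuming a Hadoop 2.x MapReduce job and the stock `mapreduce.*.speculative` property names; the class and job name below are illustrative only and not part of this diff:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableSpeculation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hadoop 2.x property names; older releases expose the same switches as
        // mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        // "es-write-job" is a placeholder name; wire up the ES-Hadoop output format
        // and the rest of the job configuration as usual before submitting.
        Job job = Job.getInstance(conf, "es-write-job");
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

For Spark workloads, the analogous switch is the `spark.speculation` property, which is disabled by default.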