Commit 64c421a

Merge pull request #264 from JohnSnowLabs/162-release-candidate
162 release candidate
2 parents 4cfae9d + 0a55b3f commit 64c421a

File tree

9 files changed: 92 additions & 48 deletions

CHANGELOG

Lines changed: 44 additions & 0 deletions
@@ -1,3 +1,47 @@
+========
+1.6.2
+========
+---------------
+Overview
+---------------
+In this release, we focused on reviewing our streaming performance by measuring the number of sentences processed per second through a LightPipeline.
+We sped up the Norvig Spell Checker by more than 300% by disabling DoubleVariants and improving the algorithm's ordering of operations. It is now reported to be capable of 42K sentences per second.
+The Symmetric Delete Spell Checker is also more performant, although it has been reported to process 2K sentences per second.
+NerCRF has been reported to process 300 sentences per second, while NerDL runs roughly twice as fast (about 700 sentences per second).
+Vivekn Sentiment Analysis was improved and is now capable of processing 100K sentences per second (previously it was below 500).
+Finally, SentenceDetector performance was improved by 40%, from ~30K rows processed per second to ~40K. However, abbreviation processing is now enabled by default, which brings the final speed down to 22K rows per second: a net slowdown, but with better accuracy.
+Again, thanks to the community for helping with feedback. We welcome everyone to ask questions or give feedback in our Slack channel, or to report issues on GitHub.
+
+---------------
+Enhancements
+---------------
+* OCR now features kernel segmentation, which significantly improves image-based PDF processing
+* Vivekn Sentiment Analysis prediction performance improved through better data structures
+* Both the Norvig and Symmetric Delete spell checkers now have improved performance
+* SentenceDetector accuracy improved by better handling of abbreviations. UseAbbreviations is now also turned ON by default
+* SentenceDetector performance improved significantly through better preloading of rules
+
+---------------
+Bug fixes
+---------------
+* Fixed NerDL not training correctly (broken since 1.6.0). Pretrained models are not affected
+* Fixed NerConverter not properly handling multiple sentences per row (after using SentenceDetector), which caused an unhandled exception in some scenarios
+* TensorFlow sessions now all support allow_soft_placement, allowing GPU-based graphs to work both with and without a GPU
+* Norvig Spell Checker: fixed a missing step in the algorithm that checks for additional variants. This may improve accuracy
+* Norvig Spell Checker: DoubleVariants is now disabled by default. It was not improving accuracy significantly and was hurting performance badly
+
+---------------
+Developer API
+---------------
+* New FeatureSet allows HashSet params
+
+---------------
+Models
+---------------
+* The Vivekn Sentiment Pipeline no longer includes a Spell Checker
+* Fixed the Vivekn Sentiment pretrained model, improving accuracy
+
+
 ========
 1.6.1
 ========
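
The Overview above quotes throughput in sentences per second measured through a LightPipeline, which annotates plain strings on the driver without going through a DataFrame. The snippet below is a minimal Scala sketch of how such a measurement could be taken; the pipeline path and the synthetic test sentences are illustrative placeholders, not artifacts of this release.

```scala
import com.johnsnowlabs.nlp.LightPipeline
import org.apache.spark.ml.PipelineModel

// Load an already-fitted Spark ML pipeline (the path is a hypothetical example).
val pipelineModel = PipelineModel.load("/tmp/fitted_spellcheck_pipeline")

// Synthetic workload: 10,000 copies of a short sentence (illustrative only).
val sentences = Array.fill(10000)("Sometimes a sentense has a speling mistake in it.")

// LightPipeline annotates arrays of raw strings directly, avoiding DataFrame overhead.
val light = new LightPipeline(pipelineModel)

val start = System.nanoTime()
val annotations = light.annotate(sentences)
val elapsedSeconds = (System.nanoTime() - start) / 1e9

println(f"${sentences.length / elapsedSeconds}%.0f sentences/second " +
  f"(${sentences.length} sentences in $elapsedSeconds%.2f s)")
```

Because LightPipeline runs on the driver, this kind of measurement is most meaningful for the small-batch, low-latency scenario the release notes describe.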

README.md

Lines changed: 20 additions & 20 deletions
@@ -14,18 +14,18 @@ Questions? Feedback? Request access sending an email to [email protected]
 
 This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .
 
-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.1` to you spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.2` to your spark command
 
 ```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.6.1
+spark-shell --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.6.1
+spark-submit --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ## Jupyter Notebook
@@ -35,23 +35,23 @@ export SPARK_HOME=/path/to/your/spark/folder
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ## Apache Zeppelin
 This way will work for both Scala and Python
 ```
-export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.1"
+export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.2"
 ```
 Alternatively, add the following Maven Coordinates to the interpreter's library list
 ```
-com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.1
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.2
 ```
 
 ## Python without explicit Spark installation
 If you installed pyspark through pip, you can now install sparknlp through pip
 ```
-pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.1
+pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.2
 ```
 Then you'll have to create a SparkSession manually, for example:
 ```
@@ -84,11 +84,11 @@ sparknlp {
 
 ## Pre-compiled Spark-NLP and Spark-NLP-OCR
 You may download fat-jar from here:
-[Spark-NLP 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.1.jar)
+[Spark-NLP 1.6.2 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.2.jar)
 or non-fat from here
-[Spark-NLP 1.6.1 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.1/spark-nlp_2.11-1.6.1.jar)
+[Spark-NLP 1.6.2 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.2/spark-nlp_2.11-1.6.2.jar)
 Spark-NLP-OCR Module (Requires native Tesseract 4.x+ for image based OCR. Does not require Spark-NLP to work but highly suggested)
-[Spark-NLP-OCR 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.1.jar)
+[Spark-NLP-OCR 1.6.2 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.2.jar)
 
 ## Maven central
 
@@ -100,19 +100,19 @@ Our package is deployed to maven central. In order to add this package as a depe
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.11</artifactId>
-    <version>1.6.1</version>
+    <version>1.6.2</version>
 </dependency>
 ```
 
 #### SBT
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.1"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.2"
 ```
 
 If you are using `scala 2.11`
 
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.2"
 ```
 
 ## Using the jar manually
@@ -133,17 +133,17 @@ The preferred way to use the library when running spark programs is using the `-
 
 If you have troubles using pretrained() models in your environment, here a list to various models (only valid for latest versions).
 If there is any older than current version of a model, it means they still work for current versions.
-### Updated for 1.6.1
+### Updated for 1.6.2
 ### Pipelines
 * [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.6.1_2_1533856444797.zip)
-* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.6.1_2_1533856478690.zip)
-* [Vivekn Sentiment Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_vivekn_en_1.6.1_2_1533942424443.zip)
+* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.6.2_2_1534781366259.zip)
+* [Vivekn Sentiment Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_vivekn_en_1.6.2_2_1534781342094.zip)
 
 ### Models
 * [PerceptronModel (POS)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_fast_en_1.6.1_2_1533853928168.zip)
-* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.6.1_2_1533942419063.zip)
-* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.6.1_2_1533854712643.zip)
-* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.6.1_2_1533854544551.zip)
+* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.6.2_2_1534781337758.zip)
+* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.6.2_2_1534781178138.zip)
+* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.6.2_2_1534781328404.zip)
 * [AssertionDLModel (Assertion Status)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/as_fast_dl_en_1.6.1_2_1533855787457.zip)
 * [NerCRFModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_fast_en_1.6.1_2_1533854463219.zip)
 * [LemmatizerModel (Lemmatizer)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fast_en_1.6.1_2_1533854538211.zip)
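
The hunks above bump the pip package and the `--packages` / Maven coordinates, and the README then asks users to create a SparkSession manually (the example itself falls outside these hunks). As a hedged illustration, a Scala session that resolves the new artifact at launch could look like the sketch below; only the coordinate com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.2 comes from the diff, every other setting is an arbitrary example.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: start a session that pulls the 1.6.2 artifact at launch.
// The app name, master, and memory value are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("spark-nlp-quickstart")
  .master("local[*]")                    // local mode for a quick smoke test
  .config("spark.driver.memory", "4g")   // example value, tune for your data
  .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.2")
  .getOrCreate()
```

Setting spark.jars.packages in the builder configures the same property that --packages sets on the command line, as shown in the spark-shell, pyspark, and spark-submit examples earlier in this README diff.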

build.sbt

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ name := "spark-nlp"
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "1.6.1"
+version := "1.6.2"
 
 scalaVersion in ThisBuild := scalaVer
 
@@ -138,7 +138,7 @@ assemblyMergeStrategy in assembly := {
 lazy val ocr = (project in file("ocr"))
   .settings(
     name := "spark-nlp-ocr",
-    version := "1.6.1",
+    version := "1.6.2",
     libraryDependencies ++= ocrDependencies ++
       analyticsDependencies ++
       testDependencies,

docs/index.html

Lines changed: 2 additions & 2 deletions
@@ -78,8 +78,8 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
 </p>
 <a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:[email protected]?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
 <b/><p/><p/>
-<p><span class="label label-warning">2018 Aug 9th - Update!</span> 1.6.1 Released! Fixed S3-based clusters support, new CHUNK type annotation and more!
-Learn changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/CHANGELOG">HERE</a> and check out for updated documentation below</p>
+<p><span class="label label-warning">2018 Aug 20th - Update!</span> 1.6.2 Released! Annotation performance revisited! Check our changelog
+Learn changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/CHANGELOG">HERE</a> and check out for updated documentation below</p>
 </div>
 <div id="cards-wrapper" class="cards-wrapper row">
 <div class="item item-green col-md-4 col-sm-6 col-xs-6">

docs/notebooks.html

Lines changed: 9 additions & 9 deletions
@@ -103,7 +103,7 @@ <h4 id="scala-vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis<
 Since we are dealing with small amounts of data, we put in practice LightPipelines.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 </section>
@@ -135,7 +135,7 @@ <h4 id="vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
 better Sentiment Analysis accuracy
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -157,7 +157,7 @@ <h4 id="sentiment-notebook" class="section-block"> Rule-based Sentiment Analysis
 Each of these sentences will be used for giving a score to text
 </p>
 </p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -177,7 +177,7 @@ <h4 id="crfner-notebook" class="section-block"> CRF Named Entity Recognition</h4
 approach to use the same pipeline for tagging external resources.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -196,7 +196,7 @@ <h4 id="dlner-notebook" class="section-block"> CNN Deep Learning NER</h4>
 and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -211,7 +211,7 @@ <h4 id="text-notebook" class="section-block"> Simple Text Matching</h4>
 This annotator is an AnnotatorModel and does not require training.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -226,7 +226,7 @@ <h4 id="assertion-notebook" class="section-block"> Assertion Status with LogReg<
 dataset will return the appropriate result.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -241,7 +241,7 @@ <h4 id="dlassertion-notebook" class="section-block"> Deep Learning Assertion Sta
 graphs may be redesigned if needed.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -260,7 +260,7 @@ <h4 id="downloader-notebook" class="section-block"> Retrieving Pretrained models
 Such components may then be injected seamlessly into further pipelines, and so on.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.2/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 </section>
