Commit fd64bb9

Merge pull request #102 from JohnSnowLabs/annotators-train-from-fit
Annotators train from fit
2 parents 68de3f4 + 1d40fef commit fd64bb9

82 files changed: +1660 −1462 lines


CHANGELOG

Lines changed: 22 additions & 0 deletions
@@ -1,3 +1,25 @@
+========
+1.4.0
+========
+---------------
+New features
+---------------
+
+* ExternalResource helpers used to represent external data information. Such information includes the format,
+delimiters and how to read it.
+* SpellChecker, ViveknSentiment and POS Perceptron can now train from the dataset passed to fit().
+This is more Spark-like, as it should always have been. New params included as required.
+
+---------------
+Enhancements
+---------------
+
+* ResourceHelper now has an improved SourceStream class which allows for more consistent HDFS/Filesystem reading by using
+more of the Hadoop APIs.
+* application.conf is a global setting and can be overridden.
+* PySpark API improved by creating AnnotatorApproach and AnnotatorModel classes.
+* EntityMatcher now uses recursive Pipelines.
+
 ========
 1.3.0
 ========
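As a sketch of what the train-from-fit change looks like in practice, the cell from `python/example/crf-ner/ner.ipynb` further down this page reduces to the following PySpark fragment. Only the `PerceptronApproach` parameters visible in this commit are used; `documentAssembler`, `tokenizer` and `trainingData` are assumed to be defined as in that notebook.

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import PerceptronApproach  # python/ package shipped with this repo

# The POS tagger no longer takes setCorpusPath(); it learns from whatever
# DataFrame is handed to fit(), like any other Spark ML estimator.
posTagger = PerceptronApproach() \
    .setIterations(5) \
    .setInputCols(["token", "document"]) \
    .setOutputCol("pos")

# documentAssembler, tokenizer and trainingData assumed set up as in the notebook.
pipeline = Pipeline(stages=[documentAssembler, tokenizer, posTagger])
model = pipeline.fit(trainingData)  # training now happens here
```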

README.md

Lines changed: 7 additions & 7 deletions
@@ -10,18 +10,18 @@ Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for use
 
 This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .
 
-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.3.0` to your spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.4.0` to your spark command
 
 ```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.3.0
+spark-shell --packages JohnSnowLabs:spark-nlp:1.4.0
 ```
 
 ```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.3.0
+pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
 ```
 
 ```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.3.0
+spark-submit --packages JohnSnowLabs:spark-nlp:1.4.0
 ```
 
 If you want to use an old version, check the spark-packages website to see all the releases.

@@ -36,19 +36,19 @@ Our package is deployed to maven central. In order to add this package as a depe
 <dependency>
 <groupId>com.johnsnowlabs.nlp</groupId>
 <artifactId>spark-nlp_2.11</artifactId>
-<version>1.3.0</version>
+<version>1.4.0</version>
 </dependency>
 ```
 
 #### SBT
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.3.0"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.4.0"
 ```
 
 If you are using `scala 2.11`
 
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.3.0"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.4.0"
 ```
 
 ## Using the jar manually
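From PySpark code, the same package coordinate can also be set when building the session. This is a sketch using Spark's standard `spark.jars.packages` config rather than the command-line flag:

```python
from pyspark.sql import SparkSession

# Equivalent of `pyspark --packages JohnSnowLabs:spark-nlp:1.4.0`;
# Spark resolves the coordinate from the spark-packages repository.
spark = SparkSession.builder \
    .appName("spark-nlp-1.4.0") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.4.0") \
    .getOrCreate()
```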

build.sbt

Lines changed: 4 additions & 1 deletion
@@ -7,7 +7,7 @@ name := "spark-nlp"
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "1.3.0"
+version := "1.4.0"
 
 scalaVersion := scalaVer
 
@@ -110,6 +110,9 @@ testOptions in Test += Tests.Argument("-oF")
 /** Disables tests in assembly */
 test in assembly := {}
 
+/** Publish test artifact **/
+publishArtifact in Test := true
+
 /** Copies the assembled jar to the pyspark/lib dir **/
 lazy val copyAssembledJar = taskKey[Unit]("Copy assembled jar to pyspark/lib")
 

docs/quickstart.html

Lines changed: 5 additions & 5 deletions
@@ -95,16 +95,16 @@ <h2 class="section-title">Requirements</h2>
 depending on your desired use case:
 </p>
 </p>
-<pre><code class="language-python">spark-shell --packages JohnSnowLabs:spark-nlp:1.3.0
-pyspark --packages JohnSnowLabs:spark-nlp:1.3.0
-spark-submit --packages JohnSnowLabs:spark-nlp:1.3.0
+<pre><code class="language-python">spark-shell --packages JohnSnowLabs:spark-nlp:1.4.0
+pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
+spark-submit --packages JohnSnowLabs:spark-nlp:1.4.0
 </code></pre>
 <p>
 Another way to use the library is by appending jar file into spark classpath,
 which can be downloaded
-<a href="http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.3.0/spark-nlp_2.11-1.3.0.jar">here</a>
+<a href="http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.4.0/spark-nlp_2.11-1.4.0.jar">here</a>
 then, run spark-shell or spark-submit with appropriate <b>--jars
-/path/to/spark-nlp_2.11-1.3.0.jar</b> to use the library in spark.
+/path/to/spark-nlp_2.11-1.4.0.jar</b> to use the library in spark.
 </p>
 <p>
 For further alternatives and documentation check out our README page in <a href="https://github.com/JohnSnowLabs/spark-nlp">GitHub</a>.
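For the manual-jar route from PySpark, the downloaded jar can likewise be attached through Spark's standard `spark.jars` config; the local path below is illustrative:

```python
from pyspark.sql import SparkSession

# Same effect as passing --jars on the command line.
spark = SparkSession.builder \
    .config("spark.jars", "/path/to/spark-nlp_2.11-1.4.0.jar") \
    .getOrCreate()
```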

python/example/crf-ner/ner.ipynb

Lines changed: 6 additions & 8 deletions
@@ -8,7 +8,8 @@
 },
 "outputs": [],
 "source": [
-" import sys\n",
+"import os\n",
+"import sys\n",
 "sys.path.append('../../')\n",
 "\n",
 "from pyspark.sql import SparkSession\n",
@@ -91,9 +92,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "\n",
@@ -110,7 +109,6 @@
 " .setOutputCol(\"token\")\n",
 "\n",
 "posTagger = PerceptronApproach()\\\n",
-" .setCorpusPath(\"anc-pos-corpus/\")\\\n",
 " .setIterations(5)\\\n",
 " .setInputCols([\"token\", \"document\"])\\\n",
 " .setOutputCol(\"pos\")\n",
@@ -123,8 +121,8 @@
 " .setMinEpochs(1)\\\n",
 " .setMaxEpochs(20)\\\n",
 " .setLossEps(1e-3)\\\n",
-" .setDicts([\"ner-corpus/dict.txt\"])\\\n",
-" .setDatasetPath(\"eng.train\")\\\n",
+" .setExternalFeatures(\"file://\" + os.getcwd() + \"/../../../src/main/resources/ner-corpus/dict.txt\")\\\n",
+" .setExternalDataset(\"file://\" + os.getcwd() + \"/eng.train\")\\\n",
 " .setL2(1)\\\n",
 " .setC0(1250000)\\\n",
 " .setRandomSeed(0)\\\n",
@@ -154,7 +152,7 @@
 "#Load the input data to be annotated\n",
 "data = spark. \\\n",
 " read. \\\n",
-" parquet(\"../../../src/test/resources/sentiment.parquet\"). \\\n",
+" parquet(\"file://\" + os.getcwd() + \"/../../../src/test/resources/sentiment.parquet\"). \\\n",
 " limit(1000)\n",
 "data.cache()\n",
 "data.count()\n",

python/example/entities-extractor/extractor.ipynb

Lines changed: 17 additions & 10 deletions
@@ -8,6 +8,7 @@
 },
 "outputs": [],
 "source": [
+"import os\n",
 "import sys\n",
 "sys.path.append('../../')\n",
 "\n",
@@ -40,9 +41,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "import time\n",
@@ -60,13 +59,14 @@
 " .setOutputCol(\"token\")\n",
 "\n",
 "extractor = EntityExtractor()\\\n",
-" .setEntitiesPath(\"entities.txt\")\\\n",
+" .setEntities(\"file://\" + os.getcwd() + \"/entities.txt\")\\\n",
 " .setInputCols([\"token\", \"sentence\"])\\\n",
 " .setOutputCol(\"entites\")\n",
 "\n",
 "finisher = Finisher() \\\n",
 " .setInputCols([\"entites\"]) \\\n",
-" .setIncludeKeys(True)\n",
+" .setIncludeKeys(False) \\\n",
+" .setCleanAnnotations(True)\n",
 "\n",
 "pipeline = Pipeline(\n",
 " stages = [\n",
@@ -87,11 +87,11 @@
 "#Load the input data to be annotated\n",
 "data = spark. \\\n",
 " read. \\\n",
-" parquet(\"../../../src/test/resources/sentiment.parquet\"). \\\n",
+" parquet(\"file://\" + os.getcwd() + \"/../../../src/test/resources/sentiment.parquet\"). \\\n",
 " limit(1000)\n",
 "data.cache()\n",
 "data.count()\n",
-"data.show()"
+"data.show(20)"
 ]
 },
 {
@@ -120,9 +120,16 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
+"outputs": [],
+"source": [
+"extracted.select(\"finished_entites\")"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
 "outputs": [],
 "source": [
 "pipeline.write().overwrite().save(\"./extractor_pipeline\")\n",
