6.0.4 #14615

DevinTDHa · 2025-06-30T10:50:44Z

DevinTDHa
Jun 30, 2025
Maintainer

📢 Spark NLP 6.0.4: MiniLMEmbeddings, DataFrame Optimization, and Enhanced PDF Processing

We are excited to announce the release of Spark NLP 6.0.4! This version brings advancements in text embeddings with the introduction of the MiniLM family, Spark DataFrame optimizations, and enhanced PDF document parsing. Upgrade to 6.0.4 to leverage these cutting-edge features and expand your NLP capabilities at scale.

Stay updated with our latest examples and tutorials by visiting the Spark NLP blog on Medium!

🔥 Highlights

- Introducing MiniLMEmbeddings: Support for the efficient and powerful MiniLMEmbeddings models, providing state-of-the-art text representations.
- New DataFrameOptimizer: A new DataFrameOptimizer transformer to streamline and optimize Spark DataFrame operations, offering configurable repartitioning, caching, and persistence options.
- Advanced PDF Reader Features: Enhancements to the PDF Reader with extractCoordinates for spatial metadata, normalizeLigatures for improved text consistency, and a new exception column for enhanced fault tolerance.

🚀 New Features & Enhancements

Advanced Text Embeddings

This release introduces a new family of efficient text embedding models:

MiniLMEmbeddings: Adds the MiniLMEmbeddings annotator, which uses MiniLM models to generate sentence embeddings. These models are designed to perform well while being significantly smaller and faster than larger embedding models, making them a good fit for NLP tasks that need compact text representations. (Link to notebook)

Spark DataFrame Optimization

DataFrameOptimizer: Introducing the new DataFrameOptimizer transformer, designed to enhance the performance and manageability of Spark DataFrames within your NLP pipelines. (Link to notebook)
- Configurable Repartitioning: Allows for automatic repartitioning of DataFrames, ensuring optimal data distribution for downstream processing.
- Optional Caching: Supports DataFrame caching (doCache) to significantly speed up iterative computations.
- Persistent Output: Adds robust support for persisting DataFrames to disk in various formats (csv, json, parquet) with custom writer options via outputOptions.
- Schema Preservation: Efficiently preserves the original DataFrame schema, making it a seamless utility for complex Spark NLP pipelines.

Enhanced PDF Document Processing

The PDF Reader and PdfToText transformer have been significantly improved for more comprehensive and fault-tolerant document parsing. (Link to notebook)

- Spatial Metadata Extraction (extractCoordinates): A new configurable parameter extractCoordinates in PdfToText and the PDF Reader. When enabled, this outputs detailed spatial metadata (text position and dimensions) for each character in the PDF.
- Ligature Normalization (normalizeLigatures): When extractCoordinates is enabled, the normalizeLigatures option ensures that ligature characters (e.g., ﬁ, ﬂ, œ) are automatically normalized to their decomposed forms (fi, fl, oe).
- Fault Tolerance with Exception Column: A new exception output column has been introduced to capture and log any processing errors encountered while handling individual PDF documents.
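
To illustrate what ligature normalization does to extracted text, here is a minimal standalone Python sketch. It is not Spark NLP's implementation: the function name `normalize_ligatures` and the `EXTRA_LIGATURES` table are hypothetical, and the sketch only assumes the mapping described above (ﬁ to fi, ﬂ to fl, œ to oe).

```python
import unicodedata

# Hypothetical helper table: "œ"/"Œ" have no Unicode compatibility
# decomposition, so NFKC leaves them untouched and they need an explicit entry.
EXTRA_LIGATURES = {"\u0153": "oe", "\u0152": "OE"}


def normalize_ligatures(text: str) -> str:
    """Replace ligature characters with their decomposed letter sequences."""
    pieces = []
    for ch in text:
        if ch in EXTRA_LIGATURES:
            pieces.append(EXTRA_LIGATURES[ch])
        elif "\ufb00" <= ch <= "\ufb06":
            # Latin presentation-form ligatures (ff, fi, fl, ffi, ffl, st)
            # decompose under NFKC compatibility normalization.
            pieces.append(unicodedata.normalize("NFKC", ch))
        else:
            pieces.append(ch)
    return "".join(pieces)


# "\ufb01" is the fi ligature, "\ufb02" the fl ligature, "\u0153" is oe.
print(normalize_ligatures("\ufb01nal \ufb02ow \u0153uvre"))  # prints: final flow oeuvre
```

Only the ligature characters are rewritten; applying NFKC to the whole string would also change unrelated characters such as superscripts and full-width forms, which is why the sketch normalizes character by character.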

❤️ Community Support

- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- JohnSnowLabs official Medium
- YouTube: Spark NLP video tutorials

⚙️ Installation

Python

#PyPI
pip install spark-nlp==6.0.4

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.4

Apple Silicon

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.4

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.4
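
The coordinates in the commands above all follow one pattern: group ID, artifact name with the Scala suffix, and version. As a convenience, the Python sketch below builds that string for a given hardware variant. The helper `package_coordinate` and the variant keys (`cpu`, `gpu`, `silicon`, `aarch64`) are hypothetical names chosen for illustration; only the artifact names and the version come from this release.

```python
# Values copied from the --packages commands in this release announcement.
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_SUFFIX = "_2.12"

# Hypothetical variant keys mapped to the published artifact names.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(variant: str = "cpu", version: str = "6.0.4") -> str:
    """Return the Maven coordinate string for the requested variant."""
    if variant not in ARTIFACTS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(ARTIFACTS)}"
        )
    return f"{GROUP_ID}:{ARTIFACTS[variant]}{SCALA_SUFFIX}:{version}"


print(package_coordinate("gpu"))
```

The resulting string can be passed to `--packages` on the command line or set as the `spark.jars.packages` configuration value when building a Spark session.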

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.4.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.4.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.4.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.4.jar

What's Changed

Sparknlp 282 Introducing MiniLMEmbeddings #14610 by @prabod
[SPARKNLP-1086] Introducing DataFrameOptimizer #14607 by @danilojsl
[SPARKNLP-1161] Adding features to PDF Reader #14596 by @danilojsl

Full Changelog: 6.0.3...6.0.4

This discussion was created from the release 6.0.4.
