Packaging tools jar with the python package #1634

Merged: 13 commits, May 6, 2025
Changes from 5 commits
37 changes: 32 additions & 5 deletions user_tools/build.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2023-2024, NVIDIA CORPORATION.
# Copyright (c) 2023-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -29,6 +29,7 @@ WORK_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]:-$0}"; )" &> /dev/null && pw

# Define resource directory
RESOURCE_DIR="src/spark_rapids_pytools/resources"
TOOLS_RESOURCE_FOLDER="tools-resources"
PREPACKAGED_FOLDER="csp-resources"

# Constants and variables of core module
@@ -37,6 +38,8 @@ TOOLS_JAR_FILE=""


# Function to run mvn command to build the tools jar
# This function skips the test cases, builds the jar file, and picks only
# the main jar file (without sources/javadoc/tests).
Collaborator

We need to pass arguments to the build_jar function.
These should probably come from build.sh so that different jobs can produce different builds.
For example, in #1639 I added a profile-id named release. We should pass that argument down to the mvn command in order to create the specific build we need.

Collaborator Author

Thanks @amahussein. I am evaluating the final behavior of this script against the internal build jobs. Once the design is finalized, I will update build.sh to take the appropriate parameters.
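
As a rough illustration of the request above, the sketch below shows one way a profile-id could be threaded into a Maven invocation. It is only a sketch: the actual mvn command in build.sh is not visible in this diff, so the `clean package` goals and the helper name `maven_build_command` are assumptions. `-DskipTests` and `-P` are standard Maven options.

```python
from typing import List, Optional


def maven_build_command(profile_id: Optional[str] = None,
                        skip_tests: bool = True) -> List[str]:
    """Assemble a Maven command line (hypothetical helper, not from the PR)."""
    # The goals are assumed; build.sh may use different ones.
    cmd = ["mvn", "clean", "package"]
    if skip_tests:
        # Standard Maven property that skips running the tests.
        cmd.append("-DskipTests")
    if profile_id:
        # -P activates the named Maven profile, e.g. the release profile from #1639.
        cmd.extend(["-P", profile_id])
    return cmd


print(" ".join(maven_build_command(profile_id="release")))
```

With no profile passed, the command is unchanged from the default build, so existing callers of the script would not be affected.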

build_jar_from_source() {
# store the current directory
local curr_dir
@@ -62,11 +65,22 @@ build_jar_from_source() {
cd "$curr_dir" || exit
}

# Function to run the dependency downloader script for fat mode
# Function to run the dependency downloader script for non-fat/fat mode
# prepackage_mgr.py file downloads the dependencies for the csp-related resources
# in case of fat mode.
# In case of non-fat mode, it just copies the tools jar into the tools-resources folder
# --fetch_all_csp=True toggles the fat/non-fat mode for the script
Collaborator

The name fetch_all_csp is not a good fit for that option. If this flag indicates fat mode, why not use something more descriptive, like fat_mode_enabled?

Collaborator Author

@sayedbilalbari sayedbilalbari Apr 15, 2025

I will update the comments to make this clearer. The previous behavior was:

  • if fat was passed: build the jar, download all dependencies, compress them, and add the archive to the wheel
  • otherwise: just create the whl from the source files

./build.sh fat and ./build.sh triggered these two cases, respectively.

To keep the same calling convention, we still call either ./build.sh fat or ./build.sh. The difference now is that fat mode packs all the CSP-related dependencies by compressing them into a tgz file and including it in the resources.
The fetch_all_csp flag tells the internal prepackage_mgr whether to include only the jar or the jar plus the CSP resources, hence --fetch_all_csp as the parameter.

Let me know if this still seems wrong and I will update it.

Collaborator

I did not understand the new behavior.
For both build options (fat and non-fat), does the tools jar go into tools-resources, or does it depend on the build mode?

Collaborator Author

@sayedbilalbari sayedbilalbari Apr 16, 2025

For both build modes, the tools jar goes into the resources.
The difference in fat mode is that all CSP dependencies are also packed alongside the jar.
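
To make the two modes concrete, here is a minimal sketch of what ends up under the resources directory in each case, based on the description in this thread. The helper name `packaged_resources` is hypothetical; the folder and archive names are the ones used in the diff.

```python
from typing import List


def packaged_resources(fat_mode: bool) -> List[str]:
    """List the resource entries expected in the wheel for a build mode.

    Hypothetical helper for illustration; it mirrors the behavior described
    in this thread rather than code from the PR.
    """
    # The tools jar is packaged in both modes.
    entries = ["tools-resources"]
    if fat_mode:
        # Fat mode additionally ships the CSP dependencies as an archive.
        entries.append("csp-resources.tgz")
    return entries


print(packaged_resources(fat_mode=False))
print(packaged_resources(fat_mode=True))
```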

download_web_dependencies() {
local res_dir="$1"
local is_fat_mode="$2"
local web_downloader_script="$res_dir/dev/prepackage_mgr.py"
python "$web_downloader_script" run --resource_dir="$res_dir" --tools_jar="$TOOLS_JAR_FILE"
if [ "$is_fat_mode" = "true" ]; then
echo "Downloading dependencies for fat mode"
python "$web_downloader_script" run --resource_dir="$res_dir" --tools_jar="$TOOLS_JAR_FILE" --fetch_all_csp=True
else
echo "Downloading dependencies for non-fat mode"
python "$web_downloader_script" run --resource_dir="$res_dir" --tools_jar="$TOOLS_JAR_FILE" --fetch_all_csp=False
fi
if [ $? -ne 0 ]; then
echo "Dependency download failed. Exiting"
exit 1
@@ -76,6 +90,8 @@ download_web_dependencies() {
# Function to remove dependencies from the fat directory
remove_web_dependencies() {
local res_dir="$1"
# remove tools jar
rm -rf "${res_dir:?}"/"$TOOLS_RESOURCE_FOLDER"
# remove folder recursively
rm -rf "${res_dir:?}"/"$PREPACKAGED_FOLDER"
# remove compressed file in case archive-mode was enabled
@@ -90,12 +106,23 @@ pre_build() {

# Build process
build() {
# Delete any pre-existing tools-resources and csp-resources folders, plus the compressed csp-resources archive
remove_web_dependencies "$RESOURCE_DIR"
# Build the tools jar from source
build_jar_from_source
if [ "$build_mode" = "fat" ]; then
echo "Building in fat mode"
build_jar_from_source
download_web_dependencies "$RESOURCE_DIR"
# This will download the dependencies and create the csp-resources
# and copy the dependencies into the csp-resources folder
# Tools resources are copied into the tools-resources folder
download_web_dependencies "$RESOURCE_DIR" "true"
else
echo "Building in non-fat mode"
# This will just copy the tools jar built from source into the tools-resources folder
download_web_dependencies "$RESOURCE_DIR" "false"
fi
# Builds the python wheel file
# Look into the pyproject.toml file for the build system requirements
python -m build --wheel
}
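
The way build() selects the prepackage_mgr.py arguments can be sketched in Python as follows. This is an illustration under assumptions, not part of the PR: the helper name `prepackage_command` is hypothetical, while the script path and the `--resource_dir`, `--tools_jar`, and `--fetch_all_csp` flags are taken from the diff above.

```python
from typing import List


def prepackage_command(resource_dir: str, tools_jar: str, fat_mode: bool) -> List[str]:
    """Build the argument list that build.sh passes to prepackage_mgr.py.

    Hypothetical helper; it only assembles the arguments and runs nothing.
    """
    script = f"{resource_dir}/dev/prepackage_mgr.py"
    return [
        "python", script, "run",
        f"--resource_dir={resource_dir}",
        f"--tools_jar={tools_jar}",
        # Formats as True or False, matching the two branches in build.sh.
        f"--fetch_all_csp={fat_mode}",
    ]


# The jar path below is a placeholder for illustration.
cmd = prepackage_command("src/spark_rapids_pytools/resources", "/tmp/tools.jar", fat_mode=True)
print(" ".join(cmd))
```

Because both modes differ only in the last argument, the two near-identical `python ...` lines in download_web_dependencies could share one invocation with the flag value substituted.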

2 changes: 1 addition & 1 deletion user_tools/pyproject.toml
@@ -102,7 +102,7 @@ where = ["src"]
[tool.setuptools.dynamic]
version = {attr = "spark_rapids_pytools.__version__"}
[tool.setuptools.package-data]
"*"= ["*.json", "*.yaml", "*.ms", "*.sh", "*.tgz", "*.properties"]
"*"= ["*.json", "*.yaml", "*.ms", "*.sh", "*.tgz", "*.properties","*.jar"]
[tool.poetry]
repository = "https://github.com/NVIDIA/spark-rapids-tools/tree/main"
[project.optional-dependencies]
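
The effect of adding "*.jar" to the package-data patterns can be checked with a small sketch. It approximates the matching with `fnmatch.fnmatchcase`, which is an assumption made for illustration; setuptools applies its own glob handling. The jar file name used here is a made-up example.

```python
from fnmatch import fnmatchcase

# Patterns copied from the updated [tool.setuptools.package-data] entry.
PACKAGE_DATA_PATTERNS = ["*.json", "*.yaml", "*.ms", "*.sh", "*.tgz", "*.properties", "*.jar"]


def is_packaged(file_name: str) -> bool:
    """Return True if the file name matches any package-data pattern."""
    return any(fnmatchcase(file_name, pattern) for pattern in PACKAGE_DATA_PATTERNS)


# Hypothetical file names for illustration.
print(is_packaged("example-tools.jar"))
print(is_packaged("notes.txt"))
```

Without the "*.jar" entry, a jar copied into tools-resources would not be included in the wheel.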
1 change: 0 additions & 1 deletion user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py
@@ -427,7 +427,6 @@ class RapidsJarTool(RapidsTool):

def _process_jar_arg(self):
# TODO: use the StorageLib to download the jar file
jar_path = ''
tools_jar_url = self.wrapper_options.get('toolsJar')
try:
if tools_jar_url is None:
48 changes: 32 additions & 16 deletions user_tools/src/spark_rapids_pytools/rapids/tool_ctxt.py
@@ -45,6 +45,9 @@ class ToolContext(YAMLPropertiesContainer):
Utils.resource_path('csp-resources.tgz'),
Utils.resource_path('csp-resources')
]
tools_resource_path: ClassVar[List[str]] = [
Utils.resource_path('tools-resources')
]

@classmethod
def are_resources_prepackaged(cls) -> bool:
@@ -87,6 +90,9 @@ def _init_fields(self):
def get_deploy_mode(self) -> Any:
return self.platform_opts.get('deployMode')

def use_local_tools_jar(self) -> bool:
return self.get_ctxt('useLocalToolsJar')

def is_fat_wheel_mode(self) -> bool:
return self.get_ctxt('fatWheelModeEnabled')

@@ -142,9 +148,9 @@ def set_local_workdir(self, parent: str):
self.logger.info('Dependencies are generated locally in local disk as: %s', dep_folder)
self.logger.info('Local output folder is set as: %s', exec_root_dir)

def _identify_fat_wheel_jar(self, resource_files: List[str]) -> None:
def _identify_tools_wheel_jar(self, resource_files: List[str]) -> None:
"""
Identifies the tools JAR file from resource files in fat wheel mode and sets its name in the context.
Identifies the tools JAR file from resource files and sets its name in the context.
:param resource_files: List of resource files to search for the tools JAR file.
:raises AssertionError: If the number of matching files is not exactly one.
"""
@@ -155,15 +161,28 @@ def _identify_fat_wheel_jar(self, resource_files: List[str]) -> None:
(f'Expected exactly one tools JAR file, found {len(matched_files)}. '
'Rebuild the wheel package with the correct tools JAR file.')
# set the tools JAR file name in the context
self.set_ctxt('fatWheelModeJarFileName', FSUtil.get_resource_name(matched_files[0]))
self.set_ctxt('useLocalToolsJar', True)
self.set_ctxt('toolsJarFileName', FSUtil.get_resource_name(matched_files[0]))
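
The "exactly one matching jar" check can be sketched as a standalone function. The regular expression below is an assumption: the actual pattern used by `_identify_tools_wheel_jar` is not visible in this diff. The artifact prefix `rapids-4-spark-tools_2.12` comes from the Maven URL in prepackage_mgr.py, the excluded suffixes and the helper name `find_tools_jar` are hypothetical, and the version numbers are made up.

```python
import os
import re
from typing import List

# Assumed pattern; the real one is not shown in this diff.
TOOLS_JAR_PATTERN = re.compile(r"rapids-4-spark-tools_\d+\.\d+-[\w.\-]+\.jar$")
# Secondary artifacts that should not be treated as the tools jar (assumed list).
EXCLUDED_SUFFIXES = ("-sources.jar", "-javadoc.jar", "-tests.jar")


def find_tools_jar(resource_files: List[str]) -> str:
    """Return the file name of the single tools jar among resource_files."""
    matched = [
        path for path in resource_files
        if TOOLS_JAR_PATTERN.search(path) and not path.endswith(EXCLUDED_SUFFIXES)
    ]
    # Same invariant as the method above: anything but one match is an error.
    assert len(matched) == 1, (
        f"Expected exactly one tools JAR file, found {len(matched)}. "
        "Rebuild the wheel package with the correct tools JAR file."
    )
    return os.path.basename(matched[0])


files = [
    "tools-resources/rapids-4-spark-tools_2.12-25.02.0.jar",
    "tools-resources/rapids-4-spark-tools_2.12-25.02.0-sources.jar",
    "tools-resources/readme.txt",
]
print(find_tools_jar(files))
```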

def load_prepackaged_resources(self):
"""
Checks for the tools jar and adds it to context
Checks if the packaging includes the CSP dependencies. If so, it moves the dependencies
into the tmp folder. This allows the tool to pick the resources from cache folder.
"""
for tools_related_files in self.tools_resource_path:
# This function uses a regex based comparison to identify the tools jar file
# from the tools-resources directory. The jar is pre-packed in the wheel file
# and moved to the work directory when user runs the tool.
self.logger.info('Checking for tools related files in %s', tools_related_files)
if os.path.exists(tools_related_files):
FSUtil.copy_resource(tools_related_files, self.get_cache_folder())
self._identify_tools_wheel_jar(FSUtil.get_all_files(tools_related_files))

if not self.are_resources_prepackaged():
self.logger.info('No prepackaged resources found.')
return

self.set_ctxt('fatWheelModeEnabled', True)
self.logger.info(Utils.gen_str_header('Fat Wheel Mode Is Enabled',
ruler='_', line_width=50))
@@ -173,12 +192,10 @@ def load_prepackaged_resources(self):
if os.path.isdir(res_path):
# this is a directory, copy all the contents to the tmp
FSUtil.copy_resource(res_path, self.get_cache_folder())
self._identify_fat_wheel_jar(FSUtil.get_all_files(res_path))
else:
# this is an archived file
with tarfile.open(res_path, mode='r:*') as tar_file:
tar_file.extractall(self.get_cache_folder())
self._identify_fat_wheel_jar(tar_file.getnames())
tar_file.close()
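
The directory-versus-archive branch above can be sketched with the standard library alone. This is a simplified stand-in under assumptions: `stage_resource` is a hypothetical name, and `shutil.copytree` is used in place of `FSUtil.copy_resource`, whose exact copy semantics are not shown in this diff.

```python
import os
import shutil
import tarfile


def stage_resource(res_path: str, cache_folder: str) -> None:
    """Copy a resource directory, or extract a resource archive, into cache_folder."""
    if os.path.isdir(res_path):
        # Directory: copy it into the cache under the same folder name (assumed layout).
        dest = os.path.join(cache_folder, os.path.basename(res_path))
        shutil.copytree(res_path, dest, dirs_exist_ok=True)
    else:
        # Archive: mode "r:*" lets tarfile detect the compression transparently.
        with tarfile.open(res_path, mode="r:*") as tar_file:
            tar_file.extractall(cache_folder)
```

The `with` statement closes the archive when the block exits, so no explicit `close()` call is needed in this form.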

def get_output_folder(self) -> str:
@@ -193,11 +210,10 @@ def get_local_work_dir(self) -> str:
return self.get_local('depFolder')

def get_rapids_jar_url(self) -> str:
self.logger.info('Fetching the Rapids Jar URL')
# get the version from the package, instead of the yaml file
# jar_version = self.get_value('sparkRapids', 'version')
if self.is_fat_wheel_mode():
return self._get_tools_jar_in_fat_wheel_mode()
self.logger.info('Fetching the Rapids Jar URL from local context')
if self.use_local_tools_jar():
return self._get_tools_jar_from_local()
self.logger.info('Tools JAR not found in local context. Downloading from Maven.')
mvn_base_url = self.get_value('sparkRapids', 'mvnUrl')
jar_version = Utilities.get_latest_mvn_jar_from_metadata(mvn_base_url)
rapids_url = self.get_value('sparkRapids', 'repoUrl').format(mvn_base_url, jar_version, jar_version)
@@ -237,22 +253,22 @@ def get_platform_name(self) -> str:
"""
return CspEnv.pretty_print(self.platform.type_id)

def _get_tools_jar_in_fat_wheel_mode(self) -> str:
def _get_tools_jar_from_local(self) -> str:
"""
Extracts the tools JAR file from the context and returns its path from the cache folder.
"""
jar_filename = self.get_ctxt('fatWheelModeJarFileName')
jar_filename = self.get_ctxt('toolsJarFileName')
if jar_filename is None:
raise ValueError(
'In Fat Mode. Tools JAR file name not found in context. '
'Rebuild the wheel package or re-run without fat wheel mode.'
'Tools JAR file name not found in context. '
'Make sure the tools JAR is included in the package.'
)
# construct the path to the tools JAR file in the cache folder
jar_filepath = FSUtil.build_path(self.get_cache_folder(), jar_filename)
if not FSUtil.resource_exists(jar_filepath):
raise FileNotFoundError(
f'In Fat Mode. Tools JAR not found in cache folder: {jar_filepath}. '
'Rebuild the wheel package or re-run without fat wheel mode.'
f'Tools JAR not found in cache folder: {jar_filepath}. '
'Rebuild the wheel package.'
)
self.logger.info('Using jar from wheel file %s', jar_filepath)
return jar_filepath
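
The lookup order introduced here (packaged jar first, Maven as the fallback) can be sketched as a plain function. This is a simplified model under assumptions: the context is a dict rather than a ToolContext, and `resolve_tools_jar` and `remote_url_provider` are hypothetical names. The context keys are the ones set in the diff.

```python
import os
from typing import Callable, Dict


def resolve_tools_jar(ctxt: Dict[str, object], cache_folder: str,
                      remote_url_provider: Callable[[], str]) -> str:
    """Return the packaged jar path if one was registered, else a remote URL."""
    if ctxt.get("useLocalToolsJar"):
        jar_filename = ctxt.get("toolsJarFileName")
        if jar_filename is None:
            raise ValueError("Tools JAR file name not found in context.")
        jar_path = os.path.join(cache_folder, str(jar_filename))
        if not os.path.exists(jar_path):
            raise FileNotFoundError(f"Tools JAR not found in cache folder: {jar_path}.")
        return jar_path
    # No packaged jar was registered: defer to the remote lookup (Maven in the PR).
    return remote_url_provider()
```

Passing the remote lookup as a callable means no network call is attempted when the packaged jar is present.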
62 changes: 41 additions & 21 deletions user_tools/src/spark_rapids_pytools/resources/dev/prepackage_mgr.py
@@ -1,4 +1,4 @@
# Copyright (c) 2023-2024, NVIDIA CORPORATION.
# Copyright (c) 2023-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -36,7 +36,8 @@
'_supported_platforms': [csp.value for csp in CspEnv if csp != CspEnv.NONE],
'_configs_suffix': '-configs.json',
'_mvn_base_url': 'https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12',
'_folder_name': 'csp-resources'
'_folder_name': 'csp-resources',
'_tools_folder_name': 'tools-resources'
}


@@ -70,17 +71,19 @@ def __init__(self,
resource_dir: str,
dest_dir: str = None,
tools_jar: str = None,
archive_enabled: bool = True):
archive_enabled: bool = True,
fetch_all_csp: bool = False):
for field_name in prepackage_conf:
setattr(self, field_name, prepackage_conf.get(field_name))
self.resource_dir = resource_dir
self.dest_dir = dest_dir
self.tools_jar = tools_jar
self.archive_enabled = archive_enabled
# process the arguments for default values
self.fetch_all_csp = fetch_all_csp
print(f'Resource directory is: {self.resource_dir}')
print(f'tools_jar = {tools_jar}')
self.resource_dir = FSUtil.get_abs_path(self.resource_dir)
self.tools_resources_dir = FSUtil.build_full_path(self.resource_dir, self._tools_folder_name) # pylint: disable=no-member
if self.dest_dir is None:
self.dest_dir = FSUtil.build_full_path(self.resource_dir, self._folder_name) # pylint: disable=no-member
else:
@@ -94,6 +97,8 @@ def _get_spark_rapids_jar_url(self) -> str:
def _fetch_resources(self) -> dict:
"""
Fetches the resource information from configuration files for each supported platform.
If the tools jar is passed explicitly, it is set as a dependency. Otherwise, it is built from source
and added as a dependency.
Returns a dictionary of resource details.
"""
resource_uris = {}
@@ -110,32 +115,39 @@ def _fetch_resources(self) -> dict:
jar_file_name = FSUtil.get_resource_name(tools_jar_url)
resource_uris[tools_jar_url] = {
'depItem': RuntimeDependency(name=jar_file_name, uri=tools_jar_url),
'prettyName': jar_file_name
'prettyName': jar_file_name,
'isToolsResource': True
}

for platform in self._supported_platforms: # pylint: disable=no-member
config_file = FSUtil.build_full_path(self.resource_dir,
f'{platform}{self._configs_suffix}') # pylint: disable=no-member
platform_conf = JSONPropertiesContainer(config_file)
dependency_list = RapidsTool.get_rapids_tools_dependencies('LOCAL', platform_conf)
for dependency in dependency_list:
if dependency.uri:
uri_str = str(dependency.uri)
pretty_name = FSUtil.get_resource_name(uri_str)
resource_uris[uri_str] = {
'depItem': dependency,
'prettyName': pretty_name
}
if self.fetch_all_csp:
for platform in self._supported_platforms: # pylint: disable=no-member
config_file = FSUtil.build_full_path(self.resource_dir,
f'{platform}{self._configs_suffix}') # pylint: disable=no-member
platform_conf = JSONPropertiesContainer(config_file)
dependency_list = RapidsTool.get_rapids_tools_dependencies('LOCAL', platform_conf)
for dependency in dependency_list:
if dependency.uri:
uri_str = str(dependency.uri)
pretty_name = FSUtil.get_resource_name(uri_str)
resource_uris[uri_str] = {
'depItem': dependency,
'prettyName': pretty_name,
'isToolsResource': False
}
else:
print('Skipping fetching all CSP resources')
return resource_uris
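
A stripped-down model of the resource map, and of how the `isToolsResource` marker later selects the destination folder, is sketched below. It is illustrative only: the `depItem` entries are omitted, `resource_name` stands in for `FSUtil.get_resource_name`, and the function names and URIs are hypothetical.

```python
from typing import Dict, List


def resource_name(uri: str) -> str:
    """Return the last path segment of a URI."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def build_resource_map(tools_jar_url: str, csp_uris: List[str],
                       fetch_all_csp: bool) -> Dict[str, dict]:
    """Map each URI to its metadata; CSP entries are added only in fat mode."""
    resources = {
        tools_jar_url: {"prettyName": resource_name(tools_jar_url), "isToolsResource": True}
    }
    if fetch_all_csp:
        for uri in csp_uris:
            resources[uri] = {"prettyName": resource_name(uri), "isToolsResource": False}
    return resources


def dest_folder_for(res_info: dict, tools_dir: str, csp_dir: str) -> str:
    """Route the tools jar to tools-resources and everything else to csp-resources."""
    return tools_dir if res_info.get("isToolsResource") else csp_dir


# Placeholder URIs for illustration.
resources = build_resource_map("file:///tmp/tools.jar",
                               ["https://example.com/deps/dep.tgz"], fetch_all_csp=True)
for info in resources.values():
    print(info["prettyName"], dest_folder_for(info, "tools-resources", "csp-resources"))
```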

def _download_resources(self, resource_uris: dict):
download_tasks = []
for res_uri, res_info in resource_uris.items():
resource_name = res_info.get('prettyName')
is_tools_resource = res_info.get('isToolsResource')
dest_folder = self.tools_resources_dir if is_tools_resource else self.dest_dir
print(f'Creating download task: {resource_name}')
# Every download task forces a fresh download
download_tasks.append(DownloadTask(src_url=res_uri, # pylint: disable=no-value-for-parameter)
dest_folder=self.dest_dir,
dest_folder=dest_folder,
configs={'forceDownload': True}))
# Begin downloading the resources
download_results = DownloadManager(download_tasks, max_workers=12).submit()
@@ -161,11 +173,19 @@ def _compress_resources(self) -> Optional[str]:
def run(self):
"""
Main method to fetch and download dependencies.
This function goes through the following steps:
1. Fetches the resources from the configuration files.
2. Downloads the resources to the specified directory.
3. Optionally compresses the resources into a tar file.
4. Cleans up the resources if compression is enabled.
"""
resources_to_download = self._fetch_resources()
self._download_resources(resources_to_download)
output_res = self._compress_resources()
print(f'CSP-prepackaged resources stored as {output_res}')
if self.fetch_all_csp:
output_res = self._compress_resources()
print(f'CSP-prepackaged resources stored as {output_res}')
else:
print('Packaged with tools resources only.')


if __name__ == '__main__':