Commit 91b2a6b

ZFS Interface for Accelerators (Z.I.A.)
The ZIO pipeline has been modified to allow for external, alternative implementations of existing operations to be used. The original ZFS functions remain in the code as fallback in case the external implementation fails.

Definitions:
    Accelerator - an entity (usually hardware) that is intended to accelerate operations
    Offloader   - synonym of accelerator; used interchangeably
    Data Processing Unit Services Module (DPUSM)
                - https://github.com/hpc/dpusm
                - defines a "provider API" for accelerator vendors to set up
                - defines a "user API" for accelerator consumers to call
                - maintains list of providers and coordinates interactions between providers and consumers.
    Provider    - a DPUSM wrapper for an accelerator's API
    Offload     - moving data from ZFS/memory to the accelerator
    Onload      - the opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM, which is then used to acquire handles to providers.

Using ZFS with Z.I.A.:
    1. Build and start the DPUSM
    2. Implement, build, and register a provider with the DPUSM
    3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
    4. Rebuild and start ZFS
    5. Create a zpool
    6. Select the provider
           zpool set zia_provider=<provider name> <zpool>
    7. Select operations to offload
           zpool set zia_<property>=on <zpool>

The operations that have been modified are:
    - compression
        - non-raw-writes only
    - decompression
    - checksum
        - not handling embedded checksums
        - checksum compute and checksum error call the same function
    - raidz
        - generation
        - reconstruction
    - vdev_file
        - open
        - write
        - close
    - vdev_disk
        - open
        - invalidate
        - write
        - flush
        - close

Successful operations do not bring data back into memory after they complete, allowing subsequent offloader operations to reuse the data. This results in only one data movement per ZIO, at the beginning of a pipeline, which is necessary for getting data from ZFS to the accelerator.

When errors occur and the offloaded data is still accessible, the offloaded data will be onloaded (or dropped if it still matches the in-memory copy) for that ZIO pipeline stage and processed with ZFS. This will cause thrashing if a later operation offloads data. This should not happen often, as constant errors (resulting in data movement) are not expected to be the norm.

Unrecoverable errors such as hardware failures will trigger pipeline restarts (if necessary) in order to complete the original ZIO using the software path.

The modifications to ZFS can be thought of as two sets of changes:
    - The ZIO write pipeline
        - compression, checksum, RAIDZ generation, and write
        - Each stage starts by offloading data that was not previously offloaded
            - This allows for ZIOs to be offloaded at any point in the pipeline
    - Resilver
        - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and RAIDZ generation), and write
        - Because the core of resilver is vdev_raidz_io_done, data is only offloaded once at the beginning of vdev_raidz_io_done
            - Errors cause data to be onloaded, but will not re-offload in subsequent steps within resilver
            - Write is a separate ZIO pipeline stage, so it will attempt to offload data

The zio_decompress function has been modified to allow for offloading, but the ZIO read pipeline as a whole has not, so it is not part of the above list.

An example provider implementation can be found in module/zia-software-provider
    - The provider's "hardware" is actually software - data is "offloaded" to memory not owned by ZFS
    - Calls ZFS functions in order to not reimplement operations
    - Has kernel module parameters that can be used to trigger ZIA_ACCELERATOR_DOWN states for testing pipeline restarts.

abd_t, raidz_row_t, and vdev_t have each been given an additional "void *<prefix>_zia_handle" member. These opaque handles point to data that is located on an offloader. abds are still allocated, but their payloads are expected to diverge from the offloaded copy as operations are run.

Encryption and deduplication are disabled for zpools with Z.I.A. operations enabled.

Aggregation is disabled for offloaded abds.

RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <[email protected]>
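The fallback behaviour described in the commit message (try the accelerator, and on error drop or onload the offloaded copy and redo the stage in software) can be sketched roughly as follows. This is only an illustration: every `sketch_*` name, the status codes, and the toy additive checksum are assumptions made for this example and are not taken from the Z.I.A. or DPUSM sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the Z.I.A. definitions. */
typedef enum { SKETCH_OK, SKETCH_ERROR } sketch_status_t;

/* Hypothetical buffer: in-memory payload plus an opaque offloader handle. */
typedef struct sketch_buf {
	const unsigned char *data;
	size_t len;
	void *offload_handle;	/* NULL when nothing is offloaded */
} sketch_buf_t;

/* Hypothetical accelerated operation supplied by a provider. */
typedef sketch_status_t (*sketch_accel_fn)(sketch_buf_t *, unsigned *);

/* Software path: a toy additive checksum standing in for the ZFS code. */
static unsigned
sketch_sw_checksum(const sketch_buf_t *buf)
{
	unsigned sum = 0;

	for (size_t i = 0; i < buf->len; i++)
		sum += buf->data[i];
	return (sum);
}

/* A provider that works: records a handle, then computes the result. */
static sketch_status_t
sketch_accel_ok(sketch_buf_t *buf, unsigned *out)
{
	buf->offload_handle = buf;	/* any non-NULL token will do here */
	*out = sketch_sw_checksum(buf);
	return (SKETCH_OK);
}

/* A provider that always fails, to exercise the fallback. */
static sketch_status_t
sketch_accel_down(sketch_buf_t *buf, unsigned *out)
{
	(void) buf;
	(void) out;
	return (SKETCH_ERROR);
}

/*
 * One pipeline stage: try the accelerator first. On failure, drop the
 * offloaded copy and redo the work in software. *fell_back reports
 * which path produced the result.
 */
static unsigned
sketch_checksum_stage(sketch_buf_t *buf, sketch_accel_fn accel,
    int *fell_back)
{
	unsigned out = 0;

	assert(buf != NULL && fell_back != NULL);
	if (accel != NULL && accel(buf, &out) == SKETCH_OK) {
		*fell_back = 0;
		return (out);
	}
	buf->offload_handle = NULL;
	*fell_back = 1;
	return (sketch_sw_checksum(buf));
}
```

Calling `sketch_checksum_stage()` with `sketch_accel_down` yields the same checksum as with `sketch_accel_ok`, only via the software path; in this toy model that is the property the real fallback is meant to preserve.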
1 parent 4d469ac commit 91b2a6b

57 files changed, +5469 -72 lines changed

Makefile.am

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ dist_noinst_DATA += module/os/linux/spl/THIRDPARTYLICENSE.gplv2
 dist_noinst_DATA += module/os/linux/spl/THIRDPARTYLICENSE.gplv2.descrip
 dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.cityhash
 dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.cityhash.descrip
+dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.zia
+dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.zia.descrip
 
 @CODE_COVERAGE_RULES@
 

config/Rules.am

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ AM_CPPFLAGS += -DPKGDATADIR=\"$(pkgdatadir)\"
 AM_CPPFLAGS += $(DEBUG_CPPFLAGS)
 AM_CPPFLAGS += $(CODE_COVERAGE_CPPFLAGS)
 AM_CPPFLAGS += -DTEXT_DOMAIN=\"zfs-@ac_system_l@-user\"
+AM_CPPFLAGS += $(ZIA_CPPFLAGS)
 
 if ASAN_ENABLED
 AM_CPPFLAGS += -DZFS_ASAN_ENABLED

config/zfs-build.m4

Lines changed: 8 additions & 1 deletion
@@ -263,6 +263,8 @@ AC_DEFUN([ZFS_AC_CONFIG], [
 	AC_SUBST(TEST_JOBS)
 	])
 
+	ZFS_AC_ZIA
+
 	ZFS_INIT_SYSV=
 	ZFS_INIT_SYSTEMD=
 	ZFS_WANT_MODULES_LOAD_D=
@@ -294,7 +296,8 @@ AC_DEFUN([ZFS_AC_CONFIG], [
 	    [test "x$qatsrc" != x ])
 	AM_CONDITIONAL([WANT_DEVNAME2DEVID], [test "x$user_libudev" = xyes ])
 	AM_CONDITIONAL([WANT_MMAP_LIBAIO], [test "x$user_libaio" = xyes ])
-	AM_CONDITIONAL([PAM_ZFS_ENABLED], [test "x$enable_pam" = xyes])
+	AM_CONDITIONAL([PAM_ZFS_ENABLED], [test "x$enable_pam" = xyes ])
+	AM_CONDITIONAL([ZIA_ENABLED], [test "x$enable_zia" = xyes ])
 ])
 
 dnl #
@@ -342,6 +345,10 @@ AC_DEFUN([ZFS_AC_RPM], [
 		RPM_DEFINE_COMMON=${RPM_DEFINE_COMMON}' --define "__strip /bin/true"'
 	])
 
+	AS_IF([test "x$enable_zia" = xyes], [
+		RPM_DEFINE_COMMON=${RPM_DEFINE_COMMON}' --define "$(WITH_ZIA) 1" --define "DPUSM_ROOT $(DPUSM_ROOT)"'
+	])
+
 	RPM_DEFINE_UTIL=' --define "_initconfdir $(initconfdir)"'
 
 	dnl # Make the next three RPM_DEFINE_UTIL additions conditional, since

config/zia.m4

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
dnl # Adds --with-zia=PATH to configuration options
2+
dnl # The path provided should point to the DPUSM
3+
dnl # root and contain Module.symvers.
4+
AC_DEFUN([ZFS_AC_ZIA], [
5+
AC_ARG_WITH([zia],
6+
AS_HELP_STRING([--with-zia=PATH],
7+
[Path to Data Processing Services Module]),
8+
[
9+
DPUSM_ROOT="$withval"
10+
AS_IF([test "x$DPUSM_ROOT" != "xno"],
11+
[enable_zia=yes],
12+
[enable_zia=no])
13+
],
14+
[enable_zia=no]
15+
)
16+
17+
AS_IF([test "x$enable_zia" == "xyes"],
18+
AS_IF([! test -d "$DPUSM_ROOT"],
19+
[AC_MSG_ERROR([--with-zia=PATH requires the DPUSM root directory])]
20+
)
21+
22+
DPUSM_SYMBOLS="$DPUSM_ROOT/Module.symvers"
23+
24+
AS_IF([test -r $DPUSM_SYMBOLS],
25+
[
26+
AC_MSG_RESULT([$DPUSM_SYMBOLS])
27+
ZIA_CPPFLAGS="-DZIA=1 -I$DPUSM_ROOT/include"
28+
KERNEL_ZIA_CPPFLAGS="-DZIA=1 -I$DPUSM_ROOT/include"
29+
WITH_ZIA="_with_zia"
30+
31+
AC_SUBST(WITH_ZIA)
32+
AC_SUBST(KERNEL_ZIA_CPPFLAGS)
33+
AC_SUBST(ZIA_CPPFLAGS)
34+
AC_SUBST(DPUSM_SYMBOLS)
35+
AC_SUBST(DPUSM_ROOT)
36+
],
37+
[
38+
AC_MSG_ERROR([
39+
*** Failed to find Module.symvers in:
40+
$DPUSM_SYMBOLS
41+
])
42+
]
43+
)
44+
)
45+
])
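The macro above adds `-DZIA=1` to both the userspace and kernel preprocessor flags when `--with-zia` points at a usable DPUSM tree. A minimal sketch of how C code could key off that define is shown below; the `sketch_*` names and the two-valued path enum are hypothetical and only illustrate the compile-time guard, not how the commit actually structures its `#ifdef` blocks.

```c
#include <assert.h>

/* Hypothetical labels for the two code paths in this sketch. */
typedef enum { SKETCH_PATH_SOFTWARE, SKETCH_PATH_OFFLOAD } sketch_path_t;

/* Returns 1 when built with -DZIA=1 (as ZIA_CPPFLAGS would add), else 0. */
static int
sketch_zia_compiled_in(void)
{
#ifdef ZIA
	return (1);
#else
	return (0);
#endif
}

/*
 * Pick a path: offload only if support was compiled in AND a provider
 * has been selected at run time; otherwise stay on the software path.
 */
static sketch_path_t
sketch_choose_path(int provider_selected)
{
	assert(provider_selected == 0 || provider_selected == 1);
	if (sketch_zia_compiled_in() && provider_selected)
		return (SKETCH_PATH_OFFLOAD);
	return (SKETCH_PATH_SOFTWARE);
}
```

Built without `-DZIA=1`, `sketch_choose_path()` always returns `SKETCH_PATH_SOFTWARE`, which mirrors the idea that a ZFS build configured without `--with-zia` behaves as before.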

include/Makefile.am

Lines changed: 3 additions & 0 deletions
@@ -143,6 +143,9 @@ COMMON_H = \
 	sys/zfs_vfsops.h \
 	sys/zfs_vnops.h \
 	sys/zfs_znode.h \
+	sys/zia.h \
+	sys/zia_cddl.h \
+	sys/zia_private.h \
 	sys/zil.h \
 	sys/zil_impl.h \
 	sys/zio.h \

include/sys/abd.h

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@ typedef struct abd {
 			list_t abd_gang_chain;
 		} abd_gang;
 	} abd_u;
+	void *abd_zia_handle;
 } abd_t;
 
 typedef int abd_iter_func_t(void *buf, size_t len, void *priv);

include/sys/fs/zfs.h

Lines changed: 13 additions & 0 deletions
@@ -262,6 +262,19 @@ typedef enum {
 	ZPOOL_PROP_DEDUP_TABLE_SIZE,
 	ZPOOL_PROP_DEDUP_TABLE_QUOTA,
 	ZPOOL_PROP_DEDUPCACHED,
+	ZPOOL_PROP_ZIA_AVAILABLE,
+	ZPOOL_PROP_ZIA_PROVIDER,
+	ZPOOL_PROP_ZIA_COMPRESS,
+	ZPOOL_PROP_ZIA_DECOMPRESS,
+	ZPOOL_PROP_ZIA_CHECKSUM,
+	ZPOOL_PROP_ZIA_RAIDZ1_GEN,
+	ZPOOL_PROP_ZIA_RAIDZ2_GEN,
+	ZPOOL_PROP_ZIA_RAIDZ3_GEN,
+	ZPOOL_PROP_ZIA_RAIDZ1_REC,
+	ZPOOL_PROP_ZIA_RAIDZ2_REC,
+	ZPOOL_PROP_ZIA_RAIDZ3_REC,
+	ZPOOL_PROP_ZIA_FILE_WRITE,
+	ZPOOL_PROP_ZIA_DISK_WRITE,
 	ZPOOL_NUM_PROPS
 } zpool_prop_t;
 
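The new pool properties are per-operation on/off switches, with separate RAIDZ entries for each parity level. One way to picture how such switches could be consulted is the sketch below, which selects the RAIDZ generation switch by parity count. The flag bits and `sketch_*` helpers are invented for this illustration; the layout of the real `zia_props_t` is not shown in this part of the diff and is not assumed here.

```c
#include <assert.h>

/* Hypothetical flag bits mirroring a few of the new on/off properties. */
enum {
	SKETCH_ZIA_COMPRESS   = 1 << 0,
	SKETCH_ZIA_CHECKSUM   = 1 << 1,
	SKETCH_ZIA_RAIDZ1_GEN = 1 << 2,
	SKETCH_ZIA_RAIDZ2_GEN = 1 << 3,
	SKETCH_ZIA_RAIDZ3_GEN = 1 << 4
};

/* Map a RAIDZ parity count (1..3) to the matching generation flag. */
static unsigned
sketch_raidz_gen_flag(int nparity)
{
	assert(nparity >= 1 && nparity <= 3);
	switch (nparity) {
	case 1:
		return (SKETCH_ZIA_RAIDZ1_GEN);
	case 2:
		return (SKETCH_ZIA_RAIDZ2_GEN);
	default:
		return (SKETCH_ZIA_RAIDZ3_GEN);
	}
}

/* Nonzero if parity generation for this RAIDZ level is switched on. */
static int
sketch_should_offload_raidz_gen(unsigned enabled, int nparity)
{
	return ((enabled & sketch_raidz_gen_flag(nparity)) != 0);
}
```

With only the checksum and RAIDZ2 generation bits set, the helper reports that a RAIDZ2 row may be offloaded while RAIDZ1 and RAIDZ3 rows stay on the software path.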

include/sys/spa_impl.h

Lines changed: 3 additions & 0 deletions
@@ -52,6 +52,7 @@
 #include <sys/zfeature.h>
 #include <sys/zthr.h>
 #include <sys/dsl_deadlist.h>
+#include <sys/zia.h>
 #include <zfeature_common.h>
 
 #ifdef __cplusplus
@@ -479,6 +480,8 @@ struct spa {
 	 */
 	spa_config_lock_t spa_config_lock[SCL_LOCKS]; /* config changes */
 	zfs_refcount_t spa_refcount; /* number of opens */
+
+	zia_props_t spa_zia_props;
 };
 
 extern char *spa_config_path;

include/sys/vdev_disk.h

Lines changed: 8 additions & 0 deletions
@@ -42,5 +42,13 @@
 
 #ifdef _KERNEL
 #include <sys/vdev.h>
+
+#ifdef __linux__
+int __vdev_classic_physio(struct block_device *bdev, zio_t *zio,
+    size_t io_size, uint64_t io_offset, int rw, int flags);
+int vdev_disk_io_flush(struct block_device *bdev, zio_t *zio);
+void vdev_disk_error(zio_t *zio);
+#endif /* __linux__ */
+
 #endif /* _KERNEL */
 #endif /* _SYS_VDEV_DISK_H */

include/sys/vdev_file.h

Lines changed: 4 additions & 0 deletions
@@ -40,6 +40,10 @@ typedef struct vdev_file {
 extern void vdev_file_init(void);
 extern void vdev_file_fini(void);
 
+#ifdef __linux__
+extern mode_t vdev_file_open_mode(spa_mode_t spa_mode);
+#endif
+
 #ifdef __cplusplus
 }
 #endif

include/sys/vdev_impl.h

Lines changed: 2 additions & 0 deletions
@@ -467,6 +467,8 @@ struct vdev {
 	uint64_t vdev_io_t;
 	uint64_t vdev_slow_io_n;
 	uint64_t vdev_slow_io_t;
+
+	void *vdev_zia_handle;
 };
 
 #define VDEV_PAD_SIZE (8 << 10)

include/sys/vdev_raidz.h

Lines changed: 5 additions & 0 deletions
@@ -169,6 +169,11 @@ extern int vdev_raidz_load(vdev_t *);
 #define RAIDZ_EXPAND_PAUSE_SCRATCH_POST_REFLOW_1 6
 #define RAIDZ_EXPAND_PAUSE_SCRATCH_POST_REFLOW_2 7
 
+void vdev_raidz_generate_parity_p(struct raidz_row *);
+void vdev_raidz_generate_parity_pq(struct raidz_row *);
+void vdev_raidz_generate_parity_pqr(struct raidz_row *);
+void vdev_raidz_reconstruct_general(struct raidz_row *, int *, int);
+
 #ifdef __cplusplus
 }
 #endif
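This header change exposes the existing parity generation and reconstruction routines outside their source file, which fits the commit message's note that the example provider calls ZFS functions rather than reimplementing them. For readers unfamiliar with what the single-parity (P) case computes, here is a minimal standalone sketch of XOR parity and single-column rebuild. It is not the ZFS implementation: it ignores `raidz_row_t` entirely, assumes all columns have equal length, and every `sketch_*` name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* P parity: byte-wise XOR across all data columns (equal lengths assumed). */
static void
sketch_xor_parity(uint8_t *parity, const uint8_t *const *cols,
    size_t ncols, size_t len)
{
	assert(parity != NULL && cols != NULL);
	for (size_t i = 0; i < len; i++) {
		uint8_t p = 0;

		for (size_t c = 0; c < ncols; c++)
			p ^= cols[c][i];
		parity[i] = p;
	}
}

/*
 * Rebuild one missing data column: XOR the parity with every surviving
 * column. cols[skip] is never read, so it may hold stale data.
 */
static void
sketch_xor_rebuild(uint8_t *missing, const uint8_t *parity,
    const uint8_t *const *cols, size_t ncols, size_t skip, size_t len)
{
	assert(missing != NULL && parity != NULL && skip < ncols);
	for (size_t i = 0; i < len; i++) {
		uint8_t v = parity[i];

		for (size_t c = 0; c < ncols; c++) {
			if (c != skip)
				v ^= cols[c][i];
		}
		missing[i] = v;
	}
}
```

Because XOR is its own inverse, the same loop shape serves both generation and single-column reconstruction; the double- and triple-parity cases need Galois-field arithmetic and are not covered by this sketch.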

include/sys/vdev_raidz_impl.h

Lines changed: 1 addition & 0 deletions
@@ -136,6 +136,7 @@ typedef struct raidz_row {
 	uint64_t rr_offset; /* Logical offset for *_io_verify() */
 	uint64_t rr_size; /* Physical size for *_io_verify() */
 #endif
+	void *rr_zia_handle;
 	raidz_col_t rr_col[]; /* Flexible array of I/O columns */
 } raidz_row_t;
 

include/sys/zap_impl.h

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ typedef struct mzap_phys {
 	uint64_t mz_salt;
 	uint64_t mz_normflags;
 	uint64_t mz_pad[5];
-	mzap_ent_phys_t mz_chunk[1];
+	mzap_ent_phys_t mz_chunk[];
 	/* actually variable size depending on block size */
 } mzap_phys_t;
 
