Accelerate STM32 QSPI read/write operations by DMA #456

wdx04 · 2025-04-27T12:49:48Z

Summary of changes

Added STM32H7 MDMA support to stm32_dma_utils.
Faster QSPI read/write speed using DMA.
The acceleration ratio is between 5x and 10x, depending on the microcontroller chip.
No modification is needed on user code.
This is suitable for filesystem access on QSPI Flash chips, or for driving some LCDs with QSPI interface.
Tested on at least one board of the following STM32 Families: F4, F7, L4, L4+, L5, G4, WB, U5, H7 with QSPI, H7 with OSPI.

Impact of changes

No.

Migration actions required

No.

Documentation

The QSPI Flash uses one DMA channel on the microcontroller.
There's no confict with existing SPI DMA links.

Some sequential read performance data:
(NUCLEO boards are tested with an external W25Q128FV module)

Board	Flash Chip	QSPI Frequency	Read Speed
DISCO-F746NG	N25Q128A	108MHz	29.09MB/s
DISCO-F469NI	N25Q128A	90MHz	24.32MB/s
DISCO-L476VG	N25Q128A	40MHz	9.43MB/s
Custom H723ZG	W25Q64JV	91.6MHz	42.78MB/s
Custom U575RI	W25Q64JV	80MHz	37.74MB/s
NUCLEO-G474RE	W25Q128FV	85MHz	13.72MB/s
WB5MMG-DK	S25FL128	32MHz	3.78MB/s
L4S5I-IOT01A	MX25R64	60MHz	12.03MB/s
NUCLEO-L552ZE	W25Q128FV	55MHz	11.13MB/s
NUCLEO-H432ZI2	W25Q128FV	120MHz	43.72MB/s
NUCLEO-F413ZH	W25Q128FV	100MHz	13.51MB/s

Pull request type

[X] Patch update (Bug fix / Target update / Docs update / Test update / Refactor)
[] Feature update (New feature / Functionality change / New API)
[] Major update (Breaking change E.g. Return code change / API behaviour change)

Test results

[] No Tests required for this change (E.g docs only update)
[] Covered by existing mbed-os tests (Greentea or Unittest)
[X] Tests / results supplied as part of this PR

All boards are tested by filling the flash with "55 AA" pattern and then read out.
Discovery boards and custom boards are additionally tested by creating a FAT filesystem on the QSPI Flash and writing a 3.1MB binary file, then read it out and check the CRC.

Support all STM32 Families with QSPI/OSPI HAL API

multiplemonomials

Nice work on this! Really appreciate this getting added! Just a few comments here and there.

multiplemonomials · 2025-04-30T16:08:58Z

targets/TARGET_STM/TARGET_STM32H7/stm_dma_info.h

+/// Mapping from QSPI/OSPI index to DMA link info
+#if defined(OCTOSPI1)
+static const DMALinkInfo OSPIDMALinks[] = {
+        {4, 0, MDMA_REQUEST_OCTOSPI1_FIFO_TH},


DMA 4 means MDMA, right? Could we define this as a constant in stm_dma_utils.h?

OK. I think I can define the constant as MDMA_IDX.
stm_dma_ip_v1.h seems to be a better place to put the constant in, because MDMA is only available in STM32H7 and STM32H7 is treated as a special case of DMA IP v1.

multiplemonomials · 2025-05-05T15:48:36Z

targets/TARGET_STM/qspi_api.c

+            }
+            else {
+                // wait until transfer complete or timeout
+                while(obj->handle.State == HAL_OSPI_STATE_BUSY_TX);


I wonder, is it possible to do this wait in a non spinlocking way, so other threads can execute? Like, you could create a semaphore, give it from the transfer complete callback, and wait on it here

multiplemonomials · 2025-05-05T15:54:56Z

targets/TARGET_STM/qspi_api.c

+                    obj->handle.State = HAL_OSPI_STATE_READY;
+                }
+    #if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U)
+                SCB_InvalidateDCache_by_Addr((uint32_t*)data, *length);


You have to be careful with this. If the data buffer is not cache aligned, this can actually be a destructive oprtation, because it will clobber the values of any other variables within the cache blocks in question. This actually happened to me during testing of DMA SPI, and it caused a HardFault.

The only way to do this truly safely is for the data buffer to start on a cache line and have a size that's an integer number of cache lines. You'd probably have to allocate a new buffer on the heap, recieve into it, and then copy into the user supplied buffer. Or, alternately, you could do this at the QSPI class layer by changing the read operation to use a CacheAlignedBuffer class.

I cleaned the cache before DMA reception, so it would be safe if other variables within the cache block are not modified by other threads or interrupt handlers during DMA reception. Yes it's not 100% safe, but allocating additional memory or changing the public interface of QSPI class also have their own drawbacks. I'll try to split the read buffer into cache-aligned parts and the cache-unaligned parts, and use the non-DMA read function on the unaligned parts.

yeah... it's just a little risky in my book the way it is now because you can't prove that another thread or ISR isn't using memory in those blocks while this code is running

multiplemonomials · 2025-05-05T16:04:13Z

targets/TARGET_STM/stm_dma_utils.c

+        dmaHandle->Init.Endianness = MDMA_LITTLE_ENDIANNESS_PRESERVE;
+
+        dmaHandle->Init.SourceInc = periphInc ? MDMA_SRC_INC_BYTE: MDMA_SRC_INC_DISABLE;
+        dmaHandle->Init.DestinationInc = memInc ? MDMA_DEST_INC_BYTE: MDMA_DEST_INC_DISABLE;


do we need to consider periphInc or the mem/periph alignment here?

multiplemonomials · 2025-05-05T16:07:49Z

targets/TARGET_STM/qspi_api.c

+        dmaLink = &OSPIDMALinks[0];
+#endif
+        // Initialize DMA channel
+        DMAHandlePointer dmaHandle = stm_init_dma_link(dmaLink, DMA_PERIPH_TO_MEMORY, false, true, 1, 1);


Hmm... This DMA is actually used for mem to periph and periph to mem transfers... does that mean the direction argument is meaningless for MDMA? I feel like this may need a little refactoring later on.

MDMA does not distinguish between memory and peripherals. The source and destination addresses can be both memory and peripheral addresses.

wdx04 added 2 commits April 27, 2025 15:38

Accelerate QSPI read/write operations by DMA

44a1de2

Support all STM32 Families with QSPI/OSPI HAL API

fix stm_dma_utils compliation errors for MCUs with shared IRQs

8fad66e

multiplemonomials reviewed May 5, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accelerate STM32 QSPI read/write operations by DMA #456

Accelerate STM32 QSPI read/write operations by DMA #456

wdx04 commented Apr 27, 2025

multiplemonomials left a comment

multiplemonomials Apr 30, 2025

wdx04 May 6, 2025

multiplemonomials May 5, 2025

multiplemonomials May 5, 2025

wdx04 May 6, 2025

multiplemonomials May 6, 2025

multiplemonomials May 5, 2025

multiplemonomials May 5, 2025

wdx04 May 6, 2025

Accelerate STM32 QSPI read/write operations by DMA #456

Are you sure you want to change the base?

Accelerate STM32 QSPI read/write operations by DMA #456

Conversation

wdx04 commented Apr 27, 2025

Summary of changes

Impact of changes

Migration actions required

Documentation

Pull request type

Test results

multiplemonomials left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment