Skip to content

Accelerate STM32 QSPI read/write operations by DMA #456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

wdx04
Copy link

@wdx04 wdx04 commented Apr 27, 2025

Summary of changes

Added STM32H7 MDMA support to stm32_dma_utils.
Faster QSPI read/write speed using DMA.
The acceleration ratio is between 5x and 10x, depending on the microcontroller chip.
No modification is needed on user code.
This is suitable for filesystem access on QSPI Flash chips, or for driving some LCDs with QSPI interface.
Tested on at least one board of the following STM32 Families: F4, F7, L4, L4+, L5, G4, WB, U5, H7 with QSPI, H7 with OSPI.

Impact of changes

No.

Migration actions required

No.

Documentation

The QSPI Flash uses one DMA channel on the microcontroller.
There's no confict with existing SPI DMA links.

Some sequential read performance data:
(NUCLEO boards are tested with an external W25Q128FV module)

Board Flash Chip QSPI Frequency Read Speed
DISCO-F746NG N25Q128A 108MHz 29.09MB/s
DISCO-F469NI N25Q128A 90MHz 24.32MB/s
DISCO-L476VG N25Q128A 40MHz 9.43MB/s
Custom H723ZG W25Q64JV 91.6MHz 42.78MB/s
Custom U575RI W25Q64JV 80MHz 37.74MB/s
NUCLEO-G474RE W25Q128FV 85MHz 13.72MB/s
WB5MMG-DK S25FL128 32MHz 3.78MB/s
L4S5I-IOT01A MX25R64 60MHz 12.03MB/s
NUCLEO-L552ZE W25Q128FV 55MHz 11.13MB/s
NUCLEO-H432ZI2 W25Q128FV 120MHz 43.72MB/s
NUCLEO-F413ZH W25Q128FV 100MHz 13.51MB/s

Pull request type

[X] Patch update (Bug fix / Target update / Docs update / Test update / Refactor)
[] Feature update (New feature / Functionality change / New API)
[] Major update (Breaking change E.g. Return code change / API behaviour change)

Test results

[] No Tests required for this change (E.g docs only update)
[] Covered by existing mbed-os tests (Greentea or Unittest)
[X] Tests / results supplied as part of this PR

All boards are tested by filling the flash with "55 AA" pattern and then read out.
Discovery boards and custom boards are additionally tested by creating a FAT filesystem on the QSPI Flash and writing a 3.1MB binary file, then read it out and check the CRC.


Copy link
Collaborator

@multiplemonomials multiplemonomials left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this! Really appreciate this getting added! Just a few comments here and there.

/// Mapping from QSPI/OSPI index to DMA link info
#if defined(OCTOSPI1)
static const DMALinkInfo OSPIDMALinks[] = {
{4, 0, MDMA_REQUEST_OCTOSPI1_FIFO_TH},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DMA 4 means MDMA, right? Could we define this as a constant in stm_dma_utils.h?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I think I can define the constant as MDMA_IDX.
stm_dma_ip_v1.h seems to be a better place to put the constant in, because MDMA is only available in STM32H7 and STM32H7 is treated as a special case of DMA IP v1.

}
else {
// wait until transfer complete or timeout
while(obj->handle.State == HAL_OSPI_STATE_BUSY_TX);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, is it possible to do this wait in a non spinlocking way, so other threads can execute? Like, you could create a semaphore, give it from the transfer complete callback, and wait on it here

obj->handle.State = HAL_OSPI_STATE_READY;
}
#if defined (__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U)
SCB_InvalidateDCache_by_Addr((uint32_t*)data, *length);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to be careful with this. If the data buffer is not cache aligned, this can actually be a destructive oprtation, because it will clobber the values of any other variables within the cache blocks in question. This actually happened to me during testing of DMA SPI, and it caused a HardFault.

The only way to do this truly safely is for the data buffer to start on a cache line and have a size that's an integer number of cache lines. You'd probably have to allocate a new buffer on the heap, recieve into it, and then copy into the user supplied buffer. Or, alternately, you could do this at the QSPI class layer by changing the read operation to use a CacheAlignedBuffer class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned the cache before DMA reception, so it would be safe if other variables within the cache block are not modified by other threads or interrupt handlers during DMA reception. Yes it's not 100% safe, but allocating additional memory or changing the public interface of QSPI class also have their own drawbacks. I'll try to split the read buffer into cache-aligned parts and the cache-unaligned parts, and use the non-DMA read function on the unaligned parts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... it's just a little risky in my book the way it is now because you can't prove that another thread or ISR isn't using memory in those blocks while this code is running

dmaHandle->Init.Endianness = MDMA_LITTLE_ENDIANNESS_PRESERVE;

dmaHandle->Init.SourceInc = periphInc ? MDMA_SRC_INC_BYTE: MDMA_SRC_INC_DISABLE;
dmaHandle->Init.DestinationInc = memInc ? MDMA_DEST_INC_BYTE: MDMA_DEST_INC_DISABLE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to consider periphInc or the mem/periph alignment here?

dmaLink = &OSPIDMALinks[0];
#endif
// Initialize DMA channel
DMAHandlePointer dmaHandle = stm_init_dma_link(dmaLink, DMA_PERIPH_TO_MEMORY, false, true, 1, 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... This DMA is actually used for mem to periph and periph to mem transfers... does that mean the direction argument is meaningless for MDMA? I feel like this may need a little refactoring later on.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MDMA does not distinguish between memory and peripherals. The source and destination addresses can be both memory and peripheral addresses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants