Skip to content

i.cca: fix indexing, buffer size issues and matrix computation #5948

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jayneel-shah18
Copy link
Contributor

@jayneel-shah18 jayneel-shah18 commented Jun 23, 2025

This PR resolves a segmentation fault and memory misalignment in the i.cca module. (#5947)

Changes Made:

  • Corrected all loops and array accesses to use zero-based indexing.
  • Replaced incorrect memory allocations sized by nclass with allocations correctly sized by bands.
  • Fixed off-by-one output raster naming to preserve the original <prefix>.1, <prefix>.2, etc.
  • Ensured symmetric covariance matrices by correctly including diagonal elements in the computation.

Testing:

The changes were tested using the same reproduction steps outlined in the linked issue. The module now runs successfully and produces canonical component outputs without error. This fix brings i.cca in line with standard C indexing practices, preventing both runtime crashes and silent memory corruption.

Fix Matrix Calculation and Numeric Stability

Summary of new changes

  • Correct computation of between-class scatter matrix B
    • The original implementation subtracted μ̄ * μ̄ᵀ in each loop iteration, which caused B to be mis-scaled.
    • This is replaced with the canonical formula, which matches theoretical expectations (validated with synthetic data).

B = (1 / (nclass - 1)) * [ ∑ nᵢ * μᵢ * μᵢᵀ - N * μ̄ * μ̄ᵀ ]

  • Normalize canonical vectors

    • Added normalization step so each row of the transformation matrix q has unit norm.
    • This improves interpretability and stability.
  • Switch outputs to DCELL

    • Output rasters now use DCELL precision instead of CELL.
    • Prevents rounding errors and loss of detail in canonical projections.

I will be adding the regression testsuite for this module in a separate PR once the changes are reviewed.

@petrasovaa
Copy link
Contributor

Have you seen #3239?

@github-actions github-actions bot added C Related code is in C module imagery labels Jun 23, 2025
@jayneel-shah18
Copy link
Contributor Author

jayneel-shah18 commented Jun 24, 2025

I reviewed #3239 in depth. The patch I have proposed in this PR addresses the same root issues: segmentation faults and incorrect array indexing in i.cca.

As @marisn rightly pointed out, while the module now executes without crashing, correctness of the output remains uncertain. To investigate this, I constructed a synthetic test case based on the LASDOC methodology as suggested there to verify the mathematical part of the canonical transformation.

Test Setup

I generated synthetic rasters with two classes, each having 2D Gaussian distributions with distinct means and covariance matrices:

mean1 = [2, 3],    cov1 = [[1,  0.5],  [0.5,  1]]
mean2 = [4, 1],    cov2 = [[1, -0.5],  [-0.5, 1]]

The signature file generated by i.gensig closely matched these intended statistics:

band1  band2
Class 1: mean = [2.08307, 3.11709], cov = [[0.823, 0.300], [0.300, 0.775]]
Class 2: mean = [3.91068, 1.13281], cov = [[1.099, -0.592], [-0.592, 0.959]]

These values are acceptably close for testing purposes.

Expected vs Actual CCA Output

Using LASDOC's canonical projection direction, the expected separation along the first canonical axis should have absolute difference between the means nearly equal to 2.8.

However, the actual results from the patched i.cca module were:

  • Mean of cca.1 in class 1 (masked): 3.73
  • Mean of cca.1 in class 2 (masked): 3.19
  • Difference = 0.54, significantly less than expected

These results suggest that while the implementation runs, the transformation may not be projecting the data along the true canonical direction. The cause could lie in incorrect eigenvector handling, normalization, or covariance preprocessing.

I will now dive into the matrix computation steps other than main.c to find where the numerical behavior diverges from LASDOC expectations.

@echoix
Copy link
Member

echoix commented Jun 24, 2025

How big of an error/deviation would an off by one issue cause in the same synthetic data, before concluding if it is a close enough result?

@jayneel-shah18
Copy link
Contributor Author

Yes, the off-by-one bug was critical, but even after fixing it, the numerical deviation suggests that other computational inaccuracies remain. I tried fixing a precision issues in transform.c, after correcting the data type from CELL to FCELL during the transformation step, the delta improved slightly to ~0.57.

@petrasovaa petrasovaa requested a review from marisn June 24, 2025 13:50
Copy link
Contributor

@marisn marisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should do a calculation on paper based on the original description and then create a test case based on this manually validated data. Take look at some other modules to see examples (e.g. r.kappa, r.mapcalc) – the dataset does not have to be large one (e.g. 5x5 raster should be fine) to make it easy to calculate by hand. This should be enough to validate that the algorithm is implemented correctly. Then the next step can be tracking down numerical stability issues.

A test case to merge this PR is a must!

@jayneel-shah18
Copy link
Contributor Author

I tried the numerical example and while debugging I found out that in the original between() function, the computation of the between-class scatter matrix B was implemented as:

// Compute weighted overall mean
for (i = 0; i < nclass; i++)
    for (j = 0; j < bands; j++)
        newvec[j] += nsamp[i] * mu[i][j];

// Compute outer product of overall mean
for (i = 0; i < bands; i++)
    for (j = 0; j < bands; j++)
        tmp1[i][j] = (newvec[i] * newvec[j]) / samptot;

// Compute sum of (n_i * mu_i * mu_i^T) - overall_mean_outer
for (k = 0; k < nclass; k++) {
    product(mu[k], nsamp[k], tmp0, bands);
    for (i = 0; i < bands; i++)
        for (j = 0; j < bands; j++)
            tmp2[i][j] += tmp0[i][j] - tmp1[i][j];  
}

The subtraction of tmp1 (i.e., the outer product of the global mean μ̄·μ̄ᵀ) was performed within the loop over each class. This results in subtracting the same μ̄·μ̄ᵀ term multiple times , once per class , which violates the formula:

Correct formula:
$B = \frac{1}{n_{\text{classes}} - 1} \left[ \sum_{i} n_i \mu_i \mu_i^T - N \bar{\mu} \bar{\mu}^T \right]$


I tried to fix the logic which now separates the classwise accumulation and the global mean subtraction:

// Step 1: Compute weighted global mean
for (j = 0; j < bands; j++) {
    newvec[j] = 0.0;
    for (i = 0; i < nclass; i++)
        newvec[j] += nsamp[i] * mu[i][j];
    newvec[j] /= samptot;
}

// Step 2: Compute N · (μ̄ · μ̄ᵀ)
for (i = 0; i < bands; i++)
    for (j = 0; j < bands; j++)
        tmp1[i][j] = samptot * newvec[i] * newvec[j]; 

// Step 3: Accumulate ∑ n_i · μ_i · μ_iᵀ
for (k = 0; k < nclass; k++) {
    product(mu[k], nsamp[k], tmp0, bands);
    for (i = 0; i < bands; i++)
        for (j = 0; j < bands; j++)
            tmp2[i][j] += tmp0[i][j]; 
}

// Step 4: Subtract global mean outer product once
for (i = 0; i < bands; i++)
    for (j = 0; j < bands; j++)
        tmp2[i][j] -= tmp1[i][j];

// Step 5: Normalize
for (i = 0; i < bands; i++)
    for (j = 0; j < bands; j++)
        p[i][j] = tmp2[i][j] / (nclass - 1);

With this fix:

  • The B matrix now matches the canonical result:

    B = [[ 50, -50 ],
         [-50,  50 ]]
    
  • Canonical vector q[0] = [0.7071, -0.7071]

  • Projection difference = 2.8284 as expected

I will add this change to the PR once it is verified. Thank you!

@jayneel-shah18 jayneel-shah18 changed the title i.cca: fix indexing and buffer size issues i.cca: fix indexing, buffer size issues and matrix computation Jun 27, 2025
@petrasovaa petrasovaa requested a review from marisn June 27, 2025 13:46
@petrasovaa
Copy link
Contributor

This looks good to me, but if somebody else more proficient in linear algebra wants to review this, that would be better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C Related code is in C imagery module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants