Commit 7d1187d

adienes authored and NHDaly committed
[docs] add performance tip concerning overly-fused broadcast loops (#49228)
Cautionary and explanatory documentation, written in the wake of https://discourse.julialang.org/t/unexpected-broadcasting-behavior-involving-eachrow/96781/88
1 parent 2638891 commit 7d1187d

2 files changed: 29 additions, 0 deletions

doc/src/manual/functions.md

Lines changed: 2 additions & 0 deletions
@@ -1162,6 +1162,8 @@ julia> 1:5 .|> [x->x^2, inv, x->2*x, -, isodd]
```
true
```

All functions in the fused broadcast are always called for every element of the result. Thus `X .+ σ .* randn.()` will add a mask of independent and identically sampled random values to each element of the array `X`, but `X .+ σ .* randn()` will add the *same* random sample to each element. In cases where the fused computation is constant along one or more axes of the broadcast iteration, it may be possible to leverage a space-time tradeoff and allocate intermediate values to reduce the number of computations. See more at [performance tips](@ref man-performance-unfuse).
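
The difference between the two spellings can be checked with a small sketch. The names `X` and `σ` follow the text; the array size and the value of `σ` are arbitrary choices made here for illustration.

```julia
X = zeros(3, 4)   # illustrative size (assumption)
σ = 0.5           # illustrative scale (assumption)

# `randn.()` is part of the fused loop, so it is called once per element of the result.
A = X .+ σ .* randn.()

# `randn()` is an ordinary call, evaluated once before the broadcast runs.
B = X .+ σ .* randn()

allunique(A)        # each entry received its own sample
all(==(B[1]), B)    # every entry received the same sample
```

Because the samples are drawn from a continuous distribution, the entries of `A` are distinct with probability one, while `B` is constant by construction.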

## Further Reading

We should mention here that this is far from a complete picture of defining functions. Julia has

doc/src/manual/performance-tips.md

Lines changed: 27 additions & 0 deletions
@@ -1095,6 +1095,33 @@ a new temporary array and executes in a separate loop. In this example
convenient to sprinkle some dots in your expressions than to
define a separate function for each vectorized operation.

## [Fewer dots: Unfuse certain intermediate broadcasts](@id man-performance-unfuse)

The dot loop fusion mentioned above enables concise and idiomatic code to express highly performant operations. However, it is important to remember that the fused operation will be computed at every iteration of the broadcast. This means that in some situations, particularly in the presence of composed or multidimensional broadcasts, an expression with dot calls may be computing a function more times than intended. As an example, say we want to build a random matrix whose rows have Euclidean norm one. We might write something like the following:
```
julia> x = rand(1000, 1000);

julia> d = sum(abs2, x; dims=2);

julia> @time x ./= sqrt.(d);
  0.002049 seconds (4 allocations: 96 bytes)
```
This will work. However, this expression will actually recompute `sqrt(d[i])` for *every* element in the row `x[i, :]`, meaning that many more square roots are computed than necessary. To see precisely over which indices the broadcast will iterate, we can call `Broadcast.combine_axes` on the arguments of the fused expression. This will return a tuple of ranges whose entries correspond to the axes of iteration; the product of lengths of these ranges will be the total number of calls to the fused operation.
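
For the example above, that check might look like the following sketch. `Base.Broadcast.combine_axes` is the function named in the text; the variable names `axs`, `ncalls`, and `nneeded` are introduced here for illustration, and `x` is regenerated so that the snippet is self-contained.

```julia
x = rand(1000, 1000)
d = sum(abs2, x; dims=2)          # 1000×1 matrix of squared row norms

# Axes over which the fused loop `x ./ sqrt.(d)` iterates.
axs = Base.Broadcast.combine_axes(x, d)

# Total number of calls to the fused operation, each of which includes a `sqrt`.
ncalls = prod(length, axs)

# Number of distinct square roots the computation actually needs.
nneeded = length(d)
```

Here `ncalls` is 1_000_000 while `nneeded` is 1000, so the fused loop evaluates each square root 1000 times.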

It follows that when some components of the broadcast expression are constant along an axis, like the `sqrt` along the second dimension in the preceding example, there is potential for a performance improvement by forcibly "unfusing" those components, i.e. allocating the result of the broadcasted operation in advance and reusing the cached value along its constant axis. Possible approaches include using temporary variables, wrapping components of a dot expression in `identity`, or using an equivalent intrinsically vectorized (but non-fused) function.
```
julia> @time let s = sqrt.(d); x ./= s end;
  0.000809 seconds (5 allocations: 8.031 KiB)

julia> @time x ./= identity(sqrt.(d));
  0.000608 seconds (5 allocations: 8.031 KiB)

julia> @time x ./= map(sqrt, d);
  0.000611 seconds (4 allocations: 8.016 KiB)
```

Any of these options yields approximately a three-fold speedup at the cost of an allocation; for large broadcastables this speedup can be asymptotically very large.
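
The reduction in work can also be observed without timing, by counting calls. The following is a minimal sketch: `counter` and `counted_sqrt` are hypothetical helpers introduced only for this illustration, and the matrix size is arbitrary.

```julia
x = rand(100, 50)
d = sum(abs2, x; dims=2)                       # 100×1 matrix of squared row norms

counter = Ref(0)                               # hypothetical call counter
counted_sqrt(v) = (counter[] += 1; sqrt(v))    # `sqrt` that records each call

# Fused: `counted_sqrt` runs once per element of the 100×50 result.
y_fused = x ./ counted_sqrt.(d)
fused_calls = counter[]

# Unfused: materialize the 100 square roots first, then broadcast the division.
counter[] = 0
s = counted_sqrt.(d)
y_unfused = x ./ s
unfused_calls = counter[]
```

With these sizes the fused form makes 5000 calls and the unfused form makes 100, while both produce the same matrix.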

## [Consider using views for slices](@id man-performance-views)

In Julia, an array "slice" expression like `array[1:5, :]` creates
