Skip to content

Conversation

@dtellenbach
Copy link
Member

For tail-calls we want to re-use the caller stack-frame and potentially
need to copy stack arguments.

For large stack arguments, such as by-val structs, this can lead to
overwriting incoming stack arguments when preparing outgoing ones by
copying them. E.g., in cases like

    %"struct.s1" = type { [19 x i32] }

    define void @f0(ptr byval(%"struct.s1") %0, ptr %1) {
    tail call  void @f1(ptr %1, ptr byval(%"struct.s1") %0)
    ret void
    }

    declare  void @f1(ptr, ptr)

that swap arguments, the last bytes of %0 are on the stack, followed by
%1. To prepare the outgoing arguments, %0 needs to be copied and %1
needs to be loaded into r0. However, currently the copy of %0
overwrites the location of %1, resulting in loading garbage into r0.

We fix that by forcing the load to the pointer stack argument to happen
before the copy.

For tail-calls we want to re-use the caller stack-frame and potentially
need to copy stack arguments.

For large stack arguments, such as by-val structs, this can lead to
overwriting incoming stack arguments when preparing outgoing ones by
copying them. E.g., in cases like

        %"struct.s1" = type { [19 x i32] }

        define void @f0(ptr byval(%"struct.s1") %0, ptr %1) {
        tail call  void @F1(ptr %1, ptr byval(%"struct.s1") %0)
        ret void
        }

        declare  void @F1(ptr, ptr)

that swap arguments, the last bytes of %0 are on the stack, followed by
%1. To prepare the outgoing arguments, %0 needs to be copied and %1
needs to be loaded into r0. However, currently the copy of %0
overwrites the location of %1, resulting in loading garbage into r0.

We fix that by forcing the load to the pointer stack argument to happen
before the copy.
@llvmbot
Copy link
Member

llvmbot commented Nov 5, 2025

@llvm/pr-subscribers-backend-arm

Author: David Tellenbach (dtellenbach)

Changes

For tail-calls we want to re-use the caller stack-frame and potentially
need to copy stack arguments.

For large stack arguments, such as by-val structs, this can lead to
overwriting incoming stack arguments when preparing outgoing ones by
copying them. E.g., in cases like

    %"struct.s1" = type { [19 x i32] }

    define void @<!-- -->f0(ptr byval(%"struct.s1") %0, ptr %1) {
    tail call  void @<!-- -->f1(ptr %1, ptr byval(%"struct.s1") %0)
    ret void
    }

    declare  void @<!-- -->f1(ptr, ptr)

that swap arguments, the last bytes of %0 are on the stack, followed by
%1. To prepare the outgoing arguments, %0 needs to be copied and %1
needs to be loaded into r0. However, currently the copy of %0
overwrites the location of %1, resulting in loading garbage into r0.

We fix that by forcing the load to the pointer stack argument to happen
before the copy.


Full diff: https://github.com/llvm/llvm-project/pull/166492.diff

2 Files Affected:

  • (modified) llvm/lib/Target/ARM/ARMISelLowering.cpp (+36-1)
  • (added) llvm/test/CodeGen/ARM/byval_struct_copy_tailcall.ll (+69)
diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 6b0653457cbaf..46ab962e05bbd 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -2510,9 +2510,44 @@ ARMTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
 
     if (isTailCall && VA.isMemLoc() && !AfterFormalArgLoads) {
       Chain = DAG.getStackArgumentTokenFactor(Chain);
-      if (ByValTempChain)
+      if (ByValTempChain) {
+        // In case of large byval copies, re-using the stackframe for tail-calls
+        // can lead to overwriting incoming arguments on the stack. Force
+        // loading these stack arguments before the copy to avoid that.
+        SmallVector<SDValue, 8> IncomingLoad;
+        for (unsigned I = 0; I < OutVals.size(); ++I) {
+          if (Outs[I].Flags.isByVal())
+            continue;
+
+          SDValue OutVal = OutVals[I];
+          LoadSDNode *OutLN = dyn_cast_or_null<LoadSDNode>(OutVal);
+          if (!OutLN)
+            continue;
+
+          FrameIndexSDNode *FIN =
+              dyn_cast_or_null<FrameIndexSDNode>(OutLN->getBasePtr());
+          if (!FIN)
+            continue;
+
+          if (!MFI.isFixedObjectIndex(FIN->getIndex()))
+            continue;
+
+          for (const CCValAssign &VA : ArgLocs) {
+            if (VA.isMemLoc())
+              IncomingLoad.push_back(OutVal.getValue(1));
+          }
+        }
+
+        // Update the chain to force loads for potentially clobbered argument
+        // loads to happen before the byval copy.
+        if (!IncomingLoad.empty()) {
+          IncomingLoad.push_back(Chain);
+          Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, IncomingLoad);
+        }
+
         Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Chain,
                             ByValTempChain);
+      }
       AfterFormalArgLoads = true;
     }
 
diff --git a/llvm/test/CodeGen/ARM/byval_struct_copy_tailcall.ll b/llvm/test/CodeGen/ARM/byval_struct_copy_tailcall.ll
new file mode 100644
index 0000000000000..50c676c425ce7
--- /dev/null
+++ b/llvm/test/CodeGen/ARM/byval_struct_copy_tailcall.ll
@@ -0,0 +1,69 @@
+; RUN: llc -mtriple thumbv7em-apple-darwin -o - < %s | FileCheck %s
+
+%"struct.s1" = type { [19 x i32] }
+
+define void @f0(ptr byval(%"struct.s1") %0, ptr %1) #1 {
+; CHECK-LABEL: _f0:                                    @ @f0
+; CHECK-NEXT:  @ %bb.0:
+; CHECK-NEXT:  	sub	sp, #16
+; CHECK-NEXT:  	push	{r4, lr}
+; CHECK-NEXT:  	sub	sp, #76
+; CHECK-NEXT:  	add.w	r9, sp, #84
+; CHECK-NEXT:  	stm.w	r9, {r0, r1, r2, r3}
+; CHECK-NEXT:  	mov	r0, sp
+; CHECK-NEXT:  	add	r1, sp, #84
+; CHECK-NEXT:  	movs	r2, #76
+; CHECK-NEXT:  	mov	r3, r0
+; CHECK-NEXT:  LBB0_1:                                 @ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:  	ldr	r4, [r1], #4
+; CHECK-NEXT:  	subs	r2, #4
+; CHECK-NEXT:  	str	r4, [r3], #4
+; CHECK-NEXT:  	bne	LBB0_1
+; CHECK-NEXT:  @ %bb.2:
+; CHECK-NEXT:  	add.w	r1, r0, #12
+; CHECK-NEXT:  	add	r2, sp, #100
+; CHECK-NEXT:  	ldr	r0, [sp, #160]
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldr	r3, [r1], #4
+; CHECK-NEXT:  	str	r3, [r2], #4
+; CHECK-NEXT:  	ldm.w	sp, {r1, r2, r3}
+; CHECK-NEXT:  	add	sp, #76
+; CHECK-NEXT:  	pop.w	{r4, lr}
+; CHECK-NEXT:  	add	sp, #16
+; CHECK-NEXT:  	b.w	_f1
+  tail call  void @f1(ptr %1, ptr byval(%"struct.s1") %0)
+  ret void
+}
+
+declare void @f1(ptr, ptr)
+
+attributes #1 = { nounwind "frame-pointes"="non-leaf" }

@dtellenbach dtellenbach requested a review from Prabhuk November 5, 2025 02:20
@dtellenbach
Copy link
Member Author

Seems to be happening since

commit 376d7b27fa3de4f72c2a3cec4f941c1ca3f1d7f2
Author: Oliver Stannard <[oliver.stannard@arm.com](mailto:oliver.stannard@arm.com)>
Date:   Thu Sep 26 16:34:03 2024 +0100

    [ARM] Optimise byval arguments in tail-calls

    We don't need to copy byval arguments to tail calls via a temporary, if
    we can prove that we are not copying from the outgoing argument area.
    This patch does this when the source if the argument is one of:
    * Memory in the local stack frame, which can't be used for tail-call
      arguments.
    * A global variable.

    We can also avoid doing the copy completely if the source and
    destination are the same memory location, which is the case when the
    caller and callee have the same signature, and pass some arguments
    through unmodified.

@ostannard

; CHECK-NEXT: @ %bb.2:
; CHECK-NEXT: add.w r1, r0, #12
; CHECK-NEXT: add r2, sp, #100
; CHECK-NEXT: ldr r0, [sp, #160]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the load that had happened after the copy and got overwritten.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants