
Enable generation of llvm.lifetime.start/.end intrinsics (#3034)

* Enable generation of llvm.lifetime.start/.end intrinsics.

- Remove the HLSL change from CGDecl.cpp::EmitLifetimeStart() that disabled
  generation of lifetime markers in the front end.
- Enable generation of lifetime intrinsics when inlining functions
  (PassManagerBuilder.cpp).
- Each of these covers a different set of situations that can lead to
  inefficient code without lifetime intrinsics (see examples below):
  - Assume a struct is created inside a loop, but some or all of its
    fields are only initialized conditionally before the struct is used.
    If the alloca of that struct does not receive lifetime intrinsics
    before being lowered to SSA, its definition is effectively hoisted
    out of the loop, which changes the original semantics: since the
    initialization is conditional, the correct SSA form for this code
    requires a phi node in the loop header that carries the value of the
    struct field across iterations, because the compiler no longer knows
    that the field cannot be initialized in a different iteration than
    the one in which it is used.
  - If the lifetime of an alloca spans an entire function, the alloca
    does not need lifetime intrinsics. However, when that function is
    inlined, the alloca's lifetime suddenly spans the entire caller,
    causing issues similar to those described above.
- For backwards compatibility, replace lifetime.start/.end intrinsics
  with a store of undef in DxilPreparePasses.cpp, or, for validator
  version < 1.6, with a store of 0 (storing undef is disallowed there).
  This is slightly inconvenient but achieves the same goal as the
  lifetime intrinsics. Zero initialization is in fact the current manual
  workaround for developers who hit one of the issues above.
- Allow lifetime intrinsics to pass DXIL validation.
- Allow undef stores to pass DXIL validation.
- Allow bitcast to i8* to pass DXIL validation.
- Make various places in the code aware of lifetime intrinsics and their
  related bitcasts to i8*.
- Adjust ScalarReplAggregatesHLSL so it generates new intrinsics for
  each element once a structure is broken up. Also make sure that lifetime
  intrinsics are removed when replacing one pointer by another upon seeing
  a memcpy. This is required to prevent a pointer accidentally
  "inheriting" wrong lifetimes.
- Adjust PromoteMemoryToRegister to treat an existing lifetime.start
  intrinsic as a definition.
- Since lifetime intrinsics require a cleanup, the logic in
  CGStmt.cpp:EmitBreakStmt() had to be changed: EmitHLSLCondBreak() now
  returns the generated BranchInst. That branch is then passed into
  EmitBranchThroughCleanup(), which uses it instead of creating a new one.
  This way, the cleanup is generated correctly and the wave handling also
  still works as intended.
- Adjust a number of tests that now behave slightly differently.
  memcpy_preuser.hlsl exhibited exactly the situation explained above
  and relied on the definition of "oStruct" being hoisted out of the
  loop to produce the desired IR. And entry_memcpy() in
  cbuf_memcpy_replace.hlsl required an explicit initialization: with
  lifetime intrinsics, the original code correctly collapsed to returning
  undef. Without lifetime intrinsics, the compiler could not prove this.
  With proper initialization, the test now has the intended effect, even
  though the collapse to undef could be a desirable test for lifetime
  intrinsics.
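The version gating described above can be summarized in a minimal C++ sketch. Note that `chooseLowering` and the enum are hypothetical names used for illustration only, not part of the dxc codebase; the sketch mirrors the policy described in the bullets, not the actual implementation.

```cpp
// Hypothetical helper mirroring the lowering policy described above;
// all names are illustrative, not dxc API.
enum class LifetimeLowering { KeepIntrinsic, StoreUndef, StoreZero };

LifetimeLowering chooseLowering(unsigned DxilMinor, unsigned ValMajor,
                                unsigned ValMinor, bool ForceZero) {
  // DXIL 1.6+ (SM 6.6) supports the intrinsics natively.
  if (DxilMinor >= 6 && !ForceZero)
    return LifetimeLowering::KeepIntrinsic;
  // Validator versions before 1.6 disallow undef stores, so fall back
  // to storing 0; the zero store can also be forced via compile option.
  if (ForceZero || ValMajor < 1 || (ValMajor == 1 && ValMinor < 6))
    return LifetimeLowering::StoreZero;
  return LifetimeLowering::StoreUndef;
}
```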

Example 1:

Original code:
for( ;; ) {
  func();
  MyStruct s;
  if( c ) {
    s.x = ...;
    ... = s.x;
  }
  ... = s.x;
}

Without lifetime intrinsics, this is equivalent to:
MyStruct s;
for( ;; ) {
  func();
  if( c ) {
    s.x = ...;
    ... = s.x;
  }
  ... = s.x;
}

After SROA, we now have a value live across the function call, which will cause a spill:
for( ;; ) {
  x_p = phi( undef, x_p2 );
  func();
  if( c ) {
    x1 = ...;
    ... = x1;
  }
  x_p2 = phi( x_p, x1 );
  ... = x_p2;
}
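For contrast, a sketch of the same loop when lifetime intrinsics are present, in the same pseudocode notation (the intrinsic calls are notation only): since lifetime.start is treated as a definition, the value of s is known to be dead at the loop header, so the cross-iteration phi disappears and only the local if-join phi remains:

for( ;; ) {
  lifetime.start( s );
  func();
  if( c ) {
    x1 = ...;
    ... = x1;
  }
  x2 = phi( undef, x1 );
  ... = x2;
  lifetime.end( s );
}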

Example 2:

void consume(in Data data);
void expensiveComputation();

bool produce(out Data data) {
    if (condition) {
        data = ...; // <-- conditional assignment of out-qualified parameter
        return true;
    }
    return false; // <-- out-qualified parameter left uninitialized
}
void foo(int N) {
    for (int i=0; i<N; ++i) {
        Data data;
        bool valid = produce(data); // <-- generates a phi to prior iteration's value when inlined. There should be none
        if (valid)
            consume(data);
        expensiveComputation(); // <-- said phi is alive here, inflating register pressure
    }
}
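When produce() is inlined with lifetime markers enabled, the inlined alloca is bracketed so that data cannot appear live across iterations. An illustrative sketch (the intrinsic calls are notation only):

void foo(int N) {
    for (int i=0; i<N; ++i) {
        lifetime.start( data );
        bool valid = false;
        if (condition) {
            data = ...;
            valid = true;
        }
        if (valid)
            consume(data);
        expensiveComputation(); // <-- no phi for 'data' is live here
        lifetime.end( data );
    }
}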

* Implement lifetime intrinsic execution test.

- Test SM 6.0, 6.3, and 6.5. The 6.5 test behaves exactly the same way
  as the 6.3 one; it is meant as a placeholder for 6.6.
- Test validator versions 1.5 and 1.6.
- Abstract a few things in the ExecutionTest infrastructure to enable
  better code sharing, e.g. for lib compilation.

* Make memcpy replacement conservative by removing the lifetimes of both src and dst. Add a regression test for this case.

* Allow forcing the replacement of lifetime intrinsics with zeroinitializer stores via a compile option.

* Fix a regression where lifetimes resulted in code that was not cleaned up properly.

- Add SROA and Jump Threading passes as early in the optimization
  pipeline as possible without interfering with lowering. These two are
  required to fully remove redundant code due to insertion of cleanup blocks
  for lifetimes. Previously, SROA ran much too late, and Jump Threading
  was disabled everywhere.
- A side effect of this is that we can now have unstructured control
  flow more often. This also breaks one test that was originally written
  when a part of SimplifyCFG that could also create unstructured control
  flow was disabled. That part is still disabled, but jump threading has
  the same effect. I don't know why unstructured control flow is a
  problem for the optimization pipeline.
- Add a regression test that requires the two phases to be cleaned up properly.
- Disable the simplifycfg test, which now fails even though simplifycfg
  still does what it should.

* Disable lifetime intrinsics for SM < 6.6, add flag to enable explicitly.

- Add missing default value for unrelated option StructurizeLoopExitsForUnroll.
- Re-enable simplify cfg test disabled in a previous commit.
rkarrenberg 4 years ago
parent
commit
eaa7f95d07
53 changed files with 1426 additions and 93 deletions
  1. 3 0
      include/dxc/DXIL/DxilModule.h
  2. 5 2
      include/dxc/HLSL/HLModule.h
  3. 2 0
      include/dxc/Support/HLSLOptions.h
  4. 4 0
      include/dxc/Support/HLSLOptions.td
  5. 2 1
      include/llvm/Transforms/IPO/PassManagerBuilder.h
  6. 9 0
      lib/DXIL/DxilModule.cpp
  7. 3 0
      lib/DxcSupport/HLSLOptions.cpp
  8. 28 5
      lib/HLSL/DxilCondenseResources.cpp
  9. 1 0
      lib/HLSL/DxilGenerationPass.cpp
  10. 104 5
      lib/HLSL/DxilPreparePasses.cpp
  11. 16 2
      lib/HLSL/DxilValidation.cpp
  12. 5 2
      lib/HLSL/HLLegalizeParameter.cpp
  13. 22 0
      lib/HLSL/HLMatrixLowerPass.cpp
  14. 5 0
      lib/HLSL/HLModule.cpp
  15. 9 0
      lib/HLSL/HLOperationLower.cpp
  16. 5 0
      lib/HLSL/HLUtil.cpp
  17. 12 3
      lib/Transforms/IPO/PassManagerBuilder.cpp
  18. 10 0
      lib/Transforms/Scalar/DxilConditionalMem2Reg.cpp
  19. 13 5
      lib/Transforms/Scalar/DxilLoopUnroll.cpp
  20. 10 5
      lib/Transforms/Scalar/HoistConstantArray.cpp
  21. 5 0
      lib/Transforms/Scalar/LowerTypePasses.cpp
  22. 68 2
      lib/Transforms/Scalar/ScalarReplAggregatesHLSL.cpp
  23. 66 9
      lib/Transforms/Utils/PromoteMemoryToRegister.cpp
  24. 4 0
      tools/clang/include/clang/Frontend/CodeGenOptions.h
  25. 1 0
      tools/clang/lib/CodeGen/BackendUtil.cpp
  26. 4 2
      tools/clang/lib/CodeGen/CGCleanup.cpp
  27. 3 4
      tools/clang/lib/CodeGen/CGDecl.cpp
  28. 3 2
      tools/clang/lib/CodeGen/CGExpr.cpp
  29. 6 4
      tools/clang/lib/CodeGen/CGExprCXX.cpp
  30. 27 6
      tools/clang/lib/CodeGen/CGHLSLMS.cpp
  31. 4 2
      tools/clang/lib/CodeGen/CGHLSLRuntime.h
  32. 13 5
      tools/clang/lib/CodeGen/CGStmt.cpp
  33. 2 1
      tools/clang/lib/CodeGen/CodeGenFunction.h
  34. 1 2
      tools/clang/test/HLSLFileCheck/hlsl/control_flow/basic_blocks/cbuf_memcpy_replace.hlsl
  35. 0 2
      tools/clang/test/HLSLFileCheck/hlsl/control_flow/return/whole_scope_returned_loop.hlsl
  36. 228 0
      tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes.hlsl
  37. 39 0
      tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_force_zero_flag.hlsl
  38. 232 0
      tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_lib_6_3.hlsl
  39. 91 0
      tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_loop_live_vals.hlsl
  40. 62 0
      tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_replacememcpy.hlsl
  41. 4 2
      tools/clang/test/HLSLFileCheck/passes/hl/sroa_hlsl/memcpy_preuser.hlsl
  42. 1 0
      tools/clang/test/HLSLFileCheck/passes/llvm/simplifycfg/fold-cond-branch-on-phi.hlsl
  43. 1 1
      tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2HCS.hlsl
  44. 1 1
      tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2HDebugCS.hlsl
  45. 1 1
      tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2VCS.hlsl
  46. 1 1
      tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2VDebugCS.hlsl
  47. 1 1
      tools/clang/test/HLSLFileCheck/samples/MiniEngine/GenerateHistogramCS.hlsl
  48. 1 1
      tools/clang/test/HLSLFileCheck/samples/d3d11/ComputeShaderSort11.hlsl
  49. 1 0
      tools/clang/test/HLSLFileCheck/shader_targets/library/inout_struct_mismatch.hlsl
  50. 5 1
      tools/clang/test/HLSLFileCheck/shader_targets/library/lib_arg_flatten/lib_arg_flatten2.hlsl
  51. 1 2
      tools/clang/test/HLSLFileCheck/shader_targets/library/lib_arg_flatten/lib_empty_struct_arg.hlsl
  52. 2 0
      tools/clang/tools/dxcompiler/dxcompilerobj.cpp
  53. 279 11
      tools/clang/unittests/HLSL/ExecutionTest.cpp

+ 3 - 0
include/dxc/DXIL/DxilModule.h

@@ -64,6 +64,8 @@ public:
   void SetValidatorVersion(unsigned ValMajor, unsigned ValMinor);
   bool UpgradeValidatorVersion(unsigned ValMajor, unsigned ValMinor);
   void GetValidatorVersion(unsigned &ValMajor, unsigned &ValMinor) const;
+  void SetForceZeroStoreLifetimes(bool ForceZeroStoreLifetimes);
+  bool GetForceZeroStoreLifetimes() const;
 
   // Return true on success, requires valid shader model and CollectShaderFlags to have been set
   bool GetMinValidatorVersion(unsigned &ValMajor, unsigned &ValMinor) const;
@@ -335,6 +337,7 @@ private:
   unsigned m_DxilMinor;
   unsigned m_ValMajor;
   unsigned m_ValMinor;
+  bool m_ForceZeroStoreLifetimes;
 
   std::unique_ptr<OP> m_pOP;
   size_t m_pUnused;

+ 5 - 2
include/dxc/HLSL/HLModule.h

@@ -53,7 +53,7 @@ struct HLOptions {
   HLOptions()
       : bDefaultRowMajor(false), bIEEEStrict(false), bAllResourcesBound(false), bDisableOptimizations(false),
         bLegacyCBufferLoad(false), PackingStrategy(0), bUseMinPrecision(false), bDX9CompatMode(false),
-        bFXCCompatMode(false), bLegacyResourceReservation(false), unused(0) {
+        bFXCCompatMode(false), bLegacyResourceReservation(false), bForceZeroStoreLifetimes(false), unused(0) {
   }
   uint32_t GetHLOptionsRaw() const;
   void SetHLOptionsRaw(uint32_t data);
@@ -68,7 +68,8 @@ struct HLOptions {
   unsigned bDX9CompatMode          : 1;
   unsigned bFXCCompatMode          : 1;
   unsigned bLegacyResourceReservation : 1;
-  unsigned unused                  : 21;
+  unsigned bForceZeroStoreLifetimes : 1;
+  unsigned unused                  : 20;
 };
 
 typedef std::unordered_map<const llvm::Function *, std::unique_ptr<DxilFunctionProps>> DxilFunctionPropsMap;
@@ -87,6 +88,8 @@ public:
   const ShaderModel *GetShaderModel() const;
   void SetValidatorVersion(unsigned ValMajor, unsigned ValMinor);
   void GetValidatorVersion(unsigned &ValMajor, unsigned &ValMinor) const;
+  void SetForceZeroStoreLifetimes(bool ForceZeroStoreLifetimes);
+  bool GetForceZeroStoreLifetimes() const;
 
   // HLOptions
   void SetHLOptions(HLOptions &opts);

+ 2 - 0
include/dxc/Support/HLSLOptions.h

@@ -189,6 +189,8 @@ public:
   bool ResMayAlias = false; // OPT_res_may_alias
   unsigned long ValVerMajor = UINT_MAX, ValVerMinor = UINT_MAX; // OPT_validator_version
   unsigned ScanLimit = 0; // OPT_memdep_block_scan_limit
+  bool ForceZeroStoreLifetimes = false; // OPT_force_zero_store_lifetimes
+  bool EnableLifetimeMarkers = false; // OPT_enable_lifetime_markers
 
   // Optimization pass enables, disables and selects
   std::map<std::string, bool> DxcOptimizationToggles; // OPT_opt_enable & OPT_opt_disable

+ 4 - 0
include/dxc/Support/HLSLOptions.td

@@ -271,6 +271,10 @@ def validator_version : Separate<["-", "/"], "validator-version">, Group<hlslcom
   HelpText<"Override validator version for module.  Format: <major.minor> ; Default: DXIL.dll version or current internal version.">;
 def print_after_all : Flag<["-", "/"], "print-after-all">, Group<hlslcomp_Group>, Flags<[CoreOption, HelpHidden]>,
   HelpText<"Print LLVM IR after each pass.">;
+def force_zero_store_lifetimes : Flag<["-", "/"], "force-zero-store-lifetimes">, Group<hlslcomp_Group>, Flags<[CoreOption]>,
+  HelpText<"Instead of generating lifetime intrinsics (SM >= 6.6) or storing undef (SM < 6.6), force fall back to storing zeroinitializer.">;
+def enable_lifetime_markers : Flag<["-", "/"], "enable-lifetime-markers">, Group<hlslcomp_Group>, Flags<[CoreOption]>,
+  HelpText<"Enable generation of lifetime markers">;
 
 // Used with API only
 def skip_serialization : Flag<["-", "/"], "skip-serialization">, Group<hlslcore_Group>, Flags<[CoreOption, HelpHidden]>,

+ 2 - 1
include/llvm/Transforms/IPO/PassManagerBuilder.h

@@ -133,7 +133,8 @@ public:
   bool HLSLResMayAlias = false; // HLSL Change
   unsigned ScanLimit = 0; // HLSL Change
   bool EnableGVN = true; // HLSL Change
-  bool StructurizeLoopExitsForUnroll; // HLSL Change
+  bool StructurizeLoopExitsForUnroll = false; // HLSL Change
+  bool HLSLEnableLifetimeMarkers = false; // HLSL Change
 
 private:
   /// ExtensionList - This is list of all of the extensions that are registered.

+ 9 - 0
lib/DXIL/DxilModule.cpp

@@ -113,6 +113,7 @@ DxilModule::DxilModule(Module *pModule)
 , m_DxilMinor(DXIL::kDxilMinor)
 , m_ValMajor(1)
 , m_ValMinor(0)
+, m_ForceZeroStoreLifetimes(false)
 , m_pOP(llvm::make_unique<OP>(pModule->getContext(), pModule))
 , m_pTypeSystem(llvm::make_unique<DxilTypeSystem>(pModule))
 , m_bDisableOptimizations(false)
@@ -182,6 +183,10 @@ void DxilModule::SetValidatorVersion(unsigned ValMajor, unsigned ValMinor) {
   m_ValMinor = ValMinor;
 }
 
+void DxilModule::SetForceZeroStoreLifetimes(bool ForceZeroStoreLifetimes) {
+  m_ForceZeroStoreLifetimes = ForceZeroStoreLifetimes;
+}
+
 bool DxilModule::UpgradeValidatorVersion(unsigned ValMajor, unsigned ValMinor) {
   // Don't upgrade if validation was disabled.
   if (m_ValMajor == 0 && m_ValMinor == 0) {
@@ -200,6 +205,10 @@ void DxilModule::GetValidatorVersion(unsigned &ValMajor, unsigned &ValMinor) con
   ValMinor = m_ValMinor;
 }
 
+bool DxilModule::GetForceZeroStoreLifetimes() const {
+  return m_ForceZeroStoreLifetimes;
+}
+
 bool DxilModule::GetMinValidatorVersion(unsigned &ValMajor, unsigned &ValMinor) const {
   if (!m_pSM)
     return false;

+ 3 - 0
lib/DxcSupport/HLSLOptions.cpp

@@ -641,6 +641,9 @@ int ReadDxcOpts(const OptTable *optionTable, unsigned flagsToInclude,
   opts.PrintAfterAll = Args.hasFlag(OPT_print_after_all, OPT_INVALID, false);
   opts.ResMayAlias = Args.hasFlag(OPT_res_may_alias, OPT_INVALID, false);
   opts.ResMayAlias = Args.hasFlag(OPT_res_may_alias_, OPT_INVALID, opts.ResMayAlias);
+  opts.ForceZeroStoreLifetimes = Args.hasFlag(OPT_force_zero_store_lifetimes, OPT_INVALID, false);
+  opts.EnableLifetimeMarkers = Args.hasFlag(OPT_enable_lifetime_markers, OPT_INVALID,
+                                            DXIL::CompareVersions(Major, Minor, 6, 6) >= 0);
 
   if (opts.DefaultColMajor && opts.DefaultRowMajor) {
     errors << "Cannot specify /Zpr and /Zpc together, use /? to get usage information";

+ 28 - 5
lib/HLSL/DxilCondenseResources.cpp

@@ -30,6 +30,7 @@
 #include "llvm/IR/Module.h"
 #include "llvm/IR/PassManager.h"
 #include "llvm/IR/DebugInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/Pass.h"
@@ -1177,6 +1178,10 @@ public:
     } else if (Constant *C = dyn_cast<Constant>(V)) {
       // skip @llvm.used entry
       return;
+    } else if (BitCastInst *BCI = dyn_cast<BitCastInst>(V)) {
+      DXASSERT(onlyUsedByLifetimeMarkers(BCI),
+               "expected bitcast to only be used by lifetime intrinsics");
+      return;
     } else if (bAlloca) {
       m_Errors.ReportError(ResourceUseErrors::AllocaUserDisallowed, V);
     } else {
@@ -2099,11 +2104,7 @@ void DxilLowerCreateHandleForLib::TranslateDxilResourceUses(
       DXASSERT(handleMapOnFunction.count(userF), "must exist");
       Value *handle = handleMapOnFunction[userF];
       ReplaceResourceUserWithHandle(static_cast<DxilResource &>(res), ldInst, handle);
-    } else {
-      DXASSERT(dyn_cast<GEPOperator>(user) != nullptr,
-               "else AddOpcodeParamForIntrinsic in CodeGen did not patch uses "
-               "to only have ld/st refer to temp object");
-      GEPOperator *GEP = cast<GEPOperator>(user);
+    } else if (GEPOperator *GEP = dyn_cast<GEPOperator>(user)) {
       Value *idx = nullptr;
       if (GEP->getNumIndices() == 2) {
         // one dim array of resource
@@ -2170,6 +2171,28 @@ void DxilLowerCreateHandleForLib::TranslateDxilResourceUses(
       if (Instruction *I = dyn_cast<Instruction>(GEP)) {
         I->eraseFromParent();
       }
+    } else if (BitCastInst *BCI = dyn_cast<BitCastInst>(user)) {
+      DXASSERT(onlyUsedByLifetimeMarkers(BCI),
+               "expected bitcast to only be used by lifetime intrinsics");
+      for (auto BCIU = BCI->user_begin(), BCIE = BCI->user_end(); BCIU != BCIE;) {
+        IntrinsicInst *II = cast<IntrinsicInst>(*(BCIU++));
+        II->eraseFromParent();
+      }
+      BCI->eraseFromParent();
+    } else if (ConstantExpr *CE = dyn_cast<ConstantExpr>(user)) {
+      // A GEPOperator can also be a ConstantExpr, so it must be checked before
+      // this code.
+      DXASSERT(CE->getOpcode() == Instruction::BitCast, "expected bitcast");
+      DXASSERT(onlyUsedByLifetimeMarkers(CE),
+               "expected ConstantExpr to only be used by lifetime intrinsics");
+      for (auto CEU = CE->user_begin(), CEE = CE->user_end(); CEU != CEE;) {
+        IntrinsicInst *II = cast<IntrinsicInst>(*(CEU++));
+        II->eraseFromParent();
+      }
+    } else {
+      DXASSERT(false,
+               "AddOpcodeParamForIntrinsic in CodeGen did not patch uses "
+               "to only have ld/st refer to temp object");
     }
   }
   // Erase unused handle.

+ 1 - 0
lib/HLSL/DxilGenerationPass.cpp

@@ -100,6 +100,7 @@ void InitDxilModuleFromHLModule(HLModule &H, DxilModule &M, bool HasDebugInfo) {
   H.GetValidatorVersion(ValMajor, ValMinor);
   M.SetValidatorVersion(ValMajor, ValMinor);
   M.SetShaderModel(H.GetShaderModel(), H.GetHLOptions().bUseMinPrecision);
+  M.SetForceZeroStoreLifetimes(H.GetHLOptions().bForceZeroStoreLifetimes);
 
   // Entry function.
   if (!M.GetShaderModel()->IsLib()) {

+ 104 - 5
lib/HLSL/DxilPreparePasses.cpp

@@ -364,7 +364,7 @@ public:
     }
   }
 
-  void patchDxil_1_6(Module &M, hlsl::OP *hlslOP) {
+  void patchDxil_1_6(Module &M, hlsl::OP *hlslOP, unsigned ValMajor, unsigned ValMinor) {
     for (auto it : hlslOP->GetOpFuncList(DXIL::OpCode::AnnotateHandle)) {
       Function *F = it.second;
       if (!F)
@@ -379,15 +379,108 @@ public:
     }
   }
 
+  // Replace llvm.lifetime.start/.end intrinsics with undef or zeroinitializer
+  // stores (for earlier validator versions) unless the pointer is a global
+  // that has an initializer.
+  // This works around losing scoping information in earlier shader models
+  // that do not support the intrinsics natively.
+  void patchLifetimeIntrinsics(Module &M, unsigned ValMajor, unsigned ValMinor, bool forceZeroStoreLifetimes) {
+    // Get the declarations. This may introduce them if there were none before.
+    Value *StartDecl = Intrinsic::getDeclaration(&M, Intrinsic::lifetime_start);
+    Value *EndDecl   = Intrinsic::getDeclaration(&M, Intrinsic::lifetime_end);
+
+    // Collect all calls to both intrinsics.
+    std::vector<CallInst*> intrinsicCalls;
+    for (Use &U : StartDecl->uses()) {
+      // All users must be call instructions.
+      CallInst *CI = dyn_cast<CallInst>(U.getUser());
+      DXASSERT(CI,
+               "Expected user of lifetime.start intrinsic to be a CallInst");
+      intrinsicCalls.push_back(CI);
+    }
+    for (Use &U : EndDecl->uses()) {
+      // All users must be call instructions.
+      CallInst *CI = dyn_cast<CallInst>(U.getUser());
+      DXASSERT(CI, "Expected user of lifetime.end intrinsic to be a CallInst");
+      intrinsicCalls.push_back(CI);
+    }
+
+    // Replace each intrinsic with an undef store.
+    for (CallInst *CI : intrinsicCalls) {
+      // Find the corresponding pointer (bitcast from alloca, global value, an
+      // argument, ...).
+      Value *voidPtr = CI->getArgOperand(1);
+      DXASSERT(voidPtr->getType()->isPointerTy() &&
+               voidPtr->getType()->getPointerElementType()->isIntegerTy(8),
+               "Expected operand of lifetime intrinsic to be of type i8*" );
+
+      Value *ptr = nullptr;
+      if (ConstantExpr *CE = dyn_cast<ConstantExpr>(voidPtr)) {
+        // This can happen if a local variable/array is promoted to a constant
+        // global. In this case we must not introduce a store, since that would
+        // overwrite the constant values in the initializer. Thus, we simply
+        // remove the intrinsic.
+        DXASSERT(CE->getOpcode() == Instruction::BitCast,
+                 "expected operand of lifetime intrinsic to be a bitcast");
+      } else {
+        // Otherwise, it must be a normal bitcast.
+        DXASSERT(isa<BitCastInst>(voidPtr),
+                 "Expected operand of lifetime intrinsic to be a bitcast");
+        BitCastInst *BC = cast<BitCastInst>(voidPtr);
+        ptr = BC->getOperand(0);
+
+        // If the original pointer is a global with initializer, do not replace
+        // the intrinsic with a store.
+        if (GlobalVariable *GV = dyn_cast<GlobalVariable>(ptr))
+          if (GV->hasInitializer() || GV->isExternallyInitialized())
+            ptr = nullptr;
+      }
+
+      if (ptr) {
+        // Determine the type to use when storing undef.
+        DXASSERT(ptr->getType()->isPointerTy(),
+                 "Expected type of operand of lifetime intrinsic bitcast operand to be a pointer");
+        Type *T = ptr->getType()->getPointerElementType();
+
+        // Store undef at the location of the start/end intrinsic.
+        // If we are targeting validator version < 6.6 we cannot store undef
+        // since it causes a validation error. As a workaround we store 0, which
+        // achieves mostly the same as storing undef but can cause overhead in
+        // some situations.
+        // We also allow to force zeroinitializer through a flag.
+        if (forceZeroStoreLifetimes || ValMajor < 1 || (ValMajor == 1 && ValMinor < 6))
+          IRBuilder<>(CI).CreateStore(Constant::getNullValue(T), ptr);
+        else
+          IRBuilder<>(CI).CreateStore(UndefValue::get(T), ptr);
+      }
+
+      // Erase the intrinsic call and, if it has no uses anymore, the bitcast as
+      // well.
+      DXASSERT_NOMSG(CI->use_empty());
+      CI->eraseFromParent();
+
+      // Erase the bitcast inst if it is not a ConstantExpr.
+      if (BitCastInst *BC = dyn_cast<BitCastInst>(voidPtr))
+        if (BC->use_empty())
+          BC->eraseFromParent();
+    }
+
+    // Erase the intrinsic declarations.
+    DXASSERT_NOMSG(StartDecl->use_empty());
+    DXASSERT_NOMSG(EndDecl->use_empty());
+    cast<Function>(StartDecl)->eraseFromParent();
+    cast<Function>(EndDecl)->eraseFromParent();
+  }
+
   bool runOnModule(Module &M) override {
     if (M.HasDxilModule()) {
       DxilModule &DM = M.GetDxilModule();
       unsigned ValMajor = 0;
       unsigned ValMinor = 0;
-      M.GetDxilModule().GetValidatorVersion(ValMajor, ValMinor);
+      DM.GetValidatorVersion(ValMajor, ValMinor);
       unsigned DxilMajor = 0;
       unsigned DxilMinor = 0;
-      M.GetDxilModule().GetDxilVersion(DxilMajor, DxilMinor);
+      DM.GetDxilVersion(DxilMajor, DxilMinor);
 
       bool IsLib = DM.GetShaderModel()->IsLib();
       // Skip validation patch for lib.
@@ -402,10 +495,16 @@ public:
           MarkUsedSignatureElements(DM.GetPatchConstantFunction(), DM);
       }
 
+      // Replace lifetime intrinsics if requested or necessary.
+      const bool forceZeroStoreLifetimes = DM.GetForceZeroStoreLifetimes();
+      if (forceZeroStoreLifetimes || DxilMinor < 6) {
+        patchLifetimeIntrinsics(M, ValMajor, ValMinor, forceZeroStoreLifetimes);
+      }
+
       // Remove store undef output.
-      hlsl::OP *hlslOP = M.GetDxilModule().GetOP();
+      hlsl::OP *hlslOP = DM.GetOP();
       if (DxilMinor < 6) {
-        patchDxil_1_6(M, hlslOP);
+        patchDxil_1_6(M, hlslOP, ValMajor, ValMinor);
       }
       RemoveStoreUndefOutput(M, hlslOP);
 

+ 16 - 2
lib/HLSL/DxilValidation.cpp

@@ -47,6 +47,7 @@
 #include "llvm/Bitcode/ReaderWriter.h"
 #include <unordered_set>
 #include "llvm/Analysis/LoopInfo.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include "llvm/IR/Dominators.h"
 #include "llvm/Analysis/PostDominators.h"
 #include "dxc/HLSL/DxilSpanAllocator.h"
@@ -204,7 +205,7 @@ const char *hlsl::GetValidationRuleText(ValidationRule value) {
     case hlsl::ValidationRule::TypesIntWidth: return "Int type '%0' has an invalid width.";
     case hlsl::ValidationRule::TypesNoMultiDim: return "Only one dimension allowed for array type.";
     case hlsl::ValidationRule::TypesNoPtrToPtr: return "Pointers to pointers, or pointers in structures are not allowed.";
-    case hlsl::ValidationRule::TypesI8: return "I8 can only be used as immediate value for intrinsic.";
+    case hlsl::ValidationRule::TypesI8: return "I8 can only be used as immediate value for intrinsic or as i8* via bitcast by lifetime intrinsics.";
     case hlsl::ValidationRule::SmName: return "Unknown shader model '%0'.";
     case hlsl::ValidationRule::SmDxilVersion: return "Shader model requires Dxil Version %0,%1.";
     case hlsl::ValidationRule::SmOpcode: return "Opcode %0 not valid in shader model %1.";
@@ -3189,6 +3190,8 @@ static bool IsLLVMInstructionAllowedForLib(Instruction &I, ValidationContext &Va
 static void ValidateFunctionBody(Function *F, ValidationContext &ValCtx) {
   bool SupportsMinPrecision =
       ValCtx.DxilMod.GetGlobalFlags() & DXIL::kEnableMinPrecision;
+  bool SupportsLifetimeIntrinsics =
+      ValCtx.DxilMod.GetShaderModel()->IsSM66Plus();
   SmallVector<CallInst *, 16> gradientOps;
   SmallVector<CallInst *, 16> barriers;
   CallInst *setMeshOutputCounts = nullptr;
@@ -3292,6 +3295,9 @@ static void ValidateFunctionBody(Function *F, ValidationContext &ValCtx) {
           if (ShuffleVectorInst *Shuf = dyn_cast<ShuffleVectorInst>(&I)) {
             legalUndef = op == I.getOperand(1);
           }
+          if (StoreInst *Store = dyn_cast<StoreInst>(&I)) {
+            legalUndef = op == I.getOperand(0);
+          }
 
           if (!legalUndef)
             ValCtx.EmitInstrError(&I,
@@ -3306,6 +3312,7 @@ static void ValidateFunctionBody(Function *F, ValidationContext &ValCtx) {
         }
         if (IntegerType *IT = dyn_cast<IntegerType>(op->getType())) {
           if (IT->getBitWidth() == 8) {
+            // We always fail if we see i8 as operand type of a non-lifetime instruction.
             ValCtx.EmitInstrError(&I, ValidationRule::TypesI8);
           }
         }
@@ -3318,7 +3325,10 @@ static void ValidateFunctionBody(Function *F, ValidationContext &ValCtx) {
         Ty = Ty->getArrayElementType();
       if (IntegerType *IT = dyn_cast<IntegerType>(Ty)) {
         if (IT->getBitWidth() == 8) {
-          ValCtx.EmitInstrError(&I, ValidationRule::TypesI8);
+          // Allow i8* cast for llvm.lifetime.* intrinsics.
+          if (!SupportsLifetimeIntrinsics || !isa<BitCastInst>(I) || !onlyUsedByLifetimeMarkers(&I)) {
+            ValCtx.EmitInstrError(&I, ValidationRule::TypesI8);
+          }
         }
       }
 
@@ -3418,6 +3428,10 @@ static void ValidateFunctionBody(Function *F, ValidationContext &ValCtx) {
         BitCastInst *Cast = cast<BitCastInst>(&I);
         Type *FromTy = Cast->getOperand(0)->getType();
         Type *ToTy = Cast->getType();
+        // Allow i8* cast for llvm.lifetime.* intrinsics.
+        if (SupportsLifetimeIntrinsics &&
+            ToTy == Type::getInt8PtrTy(ToTy->getContext()))
+            continue;
         if (isa<PointerType>(FromTy)) {
           FromTy = FromTy->getPointerElementType();
           ToTy = ToTy->getPointerElementType();

+ 5 - 2
lib/HLSL/HLLegalizeParameter.cpp

@@ -59,7 +59,7 @@ AllocaInst *createAllocaForPatch(Function &F, Type *Ty) {
 void copyIn(AllocaInst *temp, Value *arg, CallInst *CI, unsigned size) {
   if (size == 0)
     return;
-  // copy arg to temp befor CI.
+  // Copy arg to temp before CI.
   IRBuilder<> Builder(CI);
   Builder.CreateMemCpy(temp, arg, size, 1);
 }
@@ -67,7 +67,7 @@ void copyIn(AllocaInst *temp, Value *arg, CallInst *CI, unsigned size) {
 void copyOut(AllocaInst *temp, Value *arg, CallInst *CI, unsigned size) {
   if (size == 0)
     return;
-  // copy temp to arg after CI.
+  // Copy temp to arg after CI.
   IRBuilder<> Builder(CI->getNextNode());
   Builder.CreateMemCpy(arg, temp, size, 1);
 }
@@ -227,6 +227,7 @@ void ParameterCopyInCopyOut(hlsl::HLModule &HLM) {
       continue;
     unsigned size = DL.getTypeAllocSize(Ty);
     AllocaInst *temp = createAllocaForPatch(*CI->getParent()->getParent(), Ty);
+    // TODO: Adding lifetime intrinsics isn't easy here, have to analyze uses.
     if (data.bCopyIn)
       copyIn(temp, arg, CI, size);
     if (data.bCopyOut)
@@ -289,6 +290,7 @@ bool HLLegalizeParameter::runOnModule(Module &M) {
 
 void HLLegalizeParameter::patchWriteOnInParam(Function &F, Argument &Arg,
                                               const DataLayout &DL) {
+  // TODO: Adding lifetime intrinsics isn't easy here, have to analyze uses.
   Type *Ty = Arg.getType()->getPointerElementType();
   AllocaInst *temp = createAllocaForPatch(F, Ty);
   Arg.replaceAllUsesWith(temp);
@@ -300,6 +302,7 @@ void HLLegalizeParameter::patchWriteOnInParam(Function &F, Argument &Arg,
 
 void HLLegalizeParameter::patchReadOnOutParam(Function &F, Argument &Arg,
                                               const DataLayout &DL) {
+  // TODO: Adding lifetime intrinsics isn't easy here, have to analyze uses.
   Type *Ty = Arg.getType()->getPointerElementType();
   AllocaInst *temp = createAllocaForPatch(F, Ty);
   Arg.replaceAllUsesWith(temp);

+ 22 - 0
lib/HLSL/HLMatrixLowerPass.cpp

@@ -29,6 +29,7 @@
 #include "llvm/Transforms/Utils/Local.h"
 #include "llvm/Pass.h"
 #include "llvm/Support/raw_ostream.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include <unordered_set>
 #include <vector>
 
@@ -291,6 +292,11 @@ void HLMatrixLowerPass::getMatrixAllocasAndOtherInsts(Function &Func,
       // typically a global variable or alloca.
       if (isa<GetElementPtrInst>(&Inst)) continue;
 
+      // Don't lower lifetime intrinsics here, we'll handle them as we lower the alloca.
+      IntrinsicInst *Intrin = dyn_cast<IntrinsicInst>(&Inst);
+      if (Intrin && Intrin->getIntrinsicID() == Intrinsic::lifetime_start) continue;
+      if (Intrin && Intrin->getIntrinsicID() == Intrinsic::lifetime_end) continue;
+
       if (AllocaInst *Alloca = dyn_cast<AllocaInst>(&Inst)) {
         if (HLMatrixType::isMatrixOrPtrOrArrayPtr(Alloca->getType())) {
           MatAllocas.emplace_back(Alloca);
@@ -540,6 +546,22 @@ void HLMatrixLowerPass::replaceAllVariableUses(
       continue;
     }
 
+    if (BitCastInst *BCI = dyn_cast<BitCastInst>(Use.getUser())) {
+      // Replace bitcasts to i8* for lifetime intrinsics.
+      if (BCI->getType()->isPointerTy() &&
+          BCI->getType()->getPointerElementType()->isIntegerTy(8)) {
+        DXASSERT(onlyUsedByLifetimeMarkers(BCI),
+                 "bitcast to i8* must only be used by lifetime intrinsics");
+        Value *NewBCI = IRBuilder<>(BCI).CreateBitCast(LoweredPtr, BCI->getType());
+        // Replace all uses of the old bitcast with the new one.
+        BCI->replaceAllUsesWith(NewBCI);
+        // Detach the current use so the enclosing use iteration terminates.
+        Use.set(UndefValue::get(Use->getType()));
+        addToDeadInsts(BCI);
+        continue;
+      }
+    }
+
     // Recreate the same GEP sequence, if any, on the lowered pointer
     IRBuilder<> Builder(cast<Instruction>(Use.getUser()));
     Value *LoweredStackTopPtr = GEPIdxStack.size() == 1
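
For reference, the i8* bitcast pattern special-cased above typically looks like the following IR (an illustrative sketch; the matrix type name and sizes are made up, using the unsuffixed pre-LLVM-5 lifetime intrinsic signatures DXC is based on):

```llvm
%mat = alloca %class.matrix.float.2.2
%p = bitcast %class.matrix.float.2.2* %mat to i8*
call void @llvm.lifetime.start(i64 16, i8* %p)
; ... uses of %mat ...
call void @llvm.lifetime.end(i64 16, i8* %p)
```

After matrix lowering, the bitcast is recreated on the lowered pointer so the lifetime markers keep referring to live memory.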

+ 5 - 0
lib/HLSL/HLModule.cpp

@@ -28,6 +28,7 @@
 #include "llvm/IR/DIBuilder.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/IR/GetElementPtrTypeIterator.h"
+#include "llvm/Analysis/ValueTracking.h"
 
 using namespace llvm;
 using std::string;
@@ -1220,6 +1221,10 @@ void HLModule::MarkPreciseAttributeOnPtrWithFunctionCall(llvm::Value *Ptr,
           MarkPreciseAttributeOnValWithFunctionCall(CI, Builder, M);
         }
       }
+    } else if (BitCastInst *BCI = dyn_cast<BitCastInst>(U)) {
+      // Do not mark bitcasts. We only expect them here due to lifetime intrinsics.
+      DXASSERT(onlyUsedByLifetimeMarkers(BCI),
+               "expected bitcast to only be used by lifetime intrinsics");
     } else {
       // Must be GEP here.
       GetElementPtrInst *GEP = cast<GetElementPtrInst>(U);

+ 9 - 0
lib/HLSL/HLOperationLower.cpp

@@ -31,6 +31,7 @@
 #include "llvm/IR/GetElementPtrTypeIterator.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Module.h"
 #include "llvm/ADT/APSInt.h"
 
@@ -6203,6 +6204,14 @@ void TranslateCBAddressUserLegacy(Instruction *user, Value *handle,
       }
 
       CI->eraseFromParent();
+    } else if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(user)) {
+      if (II->getIntrinsicID() == Intrinsic::lifetime_start ||
+          II->getIntrinsicID() == Intrinsic::lifetime_end) {
+        DXASSERT(II->use_empty(), "lifetime intrinsic can't have uses");
+        II->eraseFromParent();
+      } else {
+        DXASSERT(0, "not implemented yet");
+      }
     } else {
       DXASSERT(0, "not implemented yet");
     }

+ 5 - 0
lib/HLSL/HLUtil.cpp

@@ -104,6 +104,11 @@ void analyzePointer(const Value *V, PointerStatus &PS, DxilTypeSystem &typeSys,
       PS.MarkAsLoaded();
     } else if (const CallInst *CI = dyn_cast<CallInst>(U)) {
       Function *F = CI->getCalledFunction();
+      if (F->isIntrinsic()) {
+        if (F->getIntrinsicID() == Intrinsic::lifetime_start ||
+            F->getIntrinsicID() == Intrinsic::lifetime_end)
+          continue;
+      }
       DxilFunctionAnnotation *annotation = typeSys.GetFunctionAnnotation(F);
       if (!annotation) {
         HLOpcodeGroup group = hlsl::GetHLOpcodeGroupByName(F);

+ 12 - 3
lib/Transforms/IPO/PassManagerBuilder.cpp

@@ -207,7 +207,7 @@ void PassManagerBuilder::populateFunctionPassManager(
 }
 
 // HLSL Change Starts
-static void addHLSLPasses(bool HLSLHighLevel, unsigned OptLevel, bool OnlyWarnOnUnrollFail, bool StructurizeLoopExitsForUnroll, hlsl::HLSLExtensionsCodegenHelper *ExtHelper, legacy::PassManagerBase &MPM) {
+static void addHLSLPasses(bool HLSLHighLevel, unsigned OptLevel, bool OnlyWarnOnUnrollFail, bool StructurizeLoopExitsForUnroll, bool EnableLifetimeMarkers, hlsl::HLSLExtensionsCodegenHelper *ExtHelper, legacy::PassManagerBase &MPM) {
 
   // Don't do any lowering if we're targeting high-level.
   if (HLSLHighLevel) {
@@ -255,6 +255,14 @@ static void addHLSLPasses(bool HLSLHighLevel, unsigned OptLevel, bool OnlyWarnOn
   // Special Mem2Reg pass that skips precise marker.
   MPM.add(createDxilConditionalMem2RegPass(NoOpt));
 
+  // Clean up inefficiencies around lifetime-marker cleanup blocks that would
+  // otherwise keep values live unnecessarily. This is the earliest possible
+  // location that does not interfere with HLSL-specific lowering.
+  if (!NoOpt && EnableLifetimeMarkers) {
+    MPM.add(createSROAPass());
+    MPM.add(createJumpThreadingPass());
+  }
+
   // Remove unneeded dxbreak conditionals
   MPM.add(createCleanupDxBreakPass());
 
@@ -354,6 +362,7 @@ void PassManagerBuilder::populateModulePassManager(
     addHLSLPasses(HLSLHighLevel, OptLevel,
       this->HLSLOnlyWarnOnUnrollFail,
       this->StructurizeLoopExitsForUnroll,
+      this->HLSLEnableLifetimeMarkers,
       this->HLSLExtensionsCodeGen,
       MPM);
 
@@ -389,12 +398,12 @@ void PassManagerBuilder::populateModulePassManager(
   MPM.add(createDxilRewriteOutputArgDebugInfoPass()); // Fix output argument types.
 
   MPM.add(createHLLegalizeParameter()); // legalize parameters before inline.
-  MPM.add(createAlwaysInlinerPass(/*InsertLifeTime*/false));
+  MPM.add(createAlwaysInlinerPass(/*InsertLifeTime*/this->HLSLEnableLifetimeMarkers));
   if (Inliner) {
     delete Inliner;
     Inliner = nullptr;
   }
-  addHLSLPasses(HLSLHighLevel, OptLevel, this->HLSLOnlyWarnOnUnrollFail, this->StructurizeLoopExitsForUnroll, HLSLExtensionsCodeGen, MPM); // HLSL Change
+  addHLSLPasses(HLSLHighLevel, OptLevel, this->HLSLOnlyWarnOnUnrollFail, this->StructurizeLoopExitsForUnroll, this->HLSLEnableLifetimeMarkers, HLSLExtensionsCodeGen, MPM); // HLSL Change
   // HLSL Change Ends
 
   // Add LibraryInfo if we have some.

+ 10 - 0
lib/Transforms/Scalar/DxilConditionalMem2Reg.cpp

@@ -25,6 +25,7 @@
 #include "dxc/DXIL/DxilUtil.h"
 #include "dxc/HLSL/HLModule.h"
 #include "llvm/Analysis/DxilValueCache.h"
+#include "llvm/Analysis/ValueTracking.h"
 
 using namespace llvm;
 using namespace hlsl;
@@ -233,6 +234,15 @@ public:
           }
           Store->eraseFromParent();
         }
+        else if (BitCastInst *BCI = dyn_cast<BitCastInst>(U)) {
+          DXASSERT(onlyUsedByLifetimeMarkers(BCI),
+                   "expected bitcast to only be used by lifetime intrinsics");
+          for (auto BCIU = BCI->user_begin(), BCIE = BCI->user_end(); BCIU != BCIE;) {
+            IntrinsicInst *II = cast<IntrinsicInst>(*(BCIU++));
+            II->eraseFromParent();
+          }
+          BCI->eraseFromParent();
+        }
         else {
           llvm_unreachable("Cannot handle non-store/load on precise vector allocas");
         }

+ 13 - 5
lib/Transforms/Scalar/DxilLoopUnroll.cpp

@@ -84,6 +84,7 @@
 #include "dxc/DXIL/DxilOperations.h"
 #include "dxc/HLSL/HLModule.h"
 #include "llvm/Analysis/DxilValueCache.h"
+#include "llvm/Analysis/ValueTracking.h"
 
 #include "DxilRemoveUnstructuredLoopExits.h"
 
@@ -543,8 +544,8 @@ static bool BreakUpArrayAllocas(bool AllowOOBIndex, IteratorT ItBegin, IteratorT
 
     GEPs.clear(); // Re-use array
     for (User *U : AI->users()) {
-      // Try to set all GEP operands to constant
       if (GEPOperator *GEP = dyn_cast<GEPOperator>(U)) {
+        // Try to set all GEP operands to constant
         if (!GEP->hasAllConstantIndices() && isa<GetElementPtrInst>(GEP)) {
           for (unsigned i = 0; i < GEP->getNumIndices(); i++) {
             Value *IndexOp = GEP->getOperand(i + 1);
@@ -565,11 +566,17 @@ static bool BreakUpArrayAllocas(bool AllowOOBIndex, IteratorT ItBegin, IteratorT
         else {
           GEPs.push_back(GEP);
         }
+
+        continue;
       }
-      else {
-        GEPs.clear();
-        break;
-      }
+
+      // Ignore users that are themselves only used by lifetime intrinsics.
+      if (onlyUsedByLifetimeMarkers(U))
+        continue;
+
+      // We've found something that prevents us from safely replacing this alloca.
+      GEPs.clear();
+      break;
     }
 
     if (!GEPs.size())
@@ -613,6 +620,7 @@ static bool BreakUpArrayAllocas(bool AllowOOBIndex, IteratorT ItBegin, IteratorT
         NewPointer = ScalarAlloca;
       }
 
+      // TODO: Inherit lifetime start/end locations from AI if available.
       GEP->replaceAllUsesWith(NewPointer);
     } 
 

+ 10 - 5
lib/Transforms/Scalar/HoistConstantArray.cpp

@@ -82,6 +82,7 @@
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Operator.h"
 #include "llvm/Support/Casting.h"
+#include "llvm/Analysis/ValueTracking.h"
 using namespace llvm;
 
 namespace {
@@ -132,7 +133,7 @@ namespace {
     void EnsureSize();
     void GetArrayStores(GEPOperator *gep,
                         std::vector<StoreInst *> &stores) const;
-    bool AllArrayUsersAreGEP(std::vector<GEPOperator *> &geps);
+    bool AllArrayUsersAreGEPOrLifetime(std::vector<GEPOperator *> &geps);
     bool AllGEPUsersAreValid(GEPOperator *gep);
     UndefValue *UndefElement();
   };
@@ -209,10 +210,14 @@ void CandidateArray::GetArrayStores(GEPOperator *gep,
     }
   }
 }
-// Check to see that all the users of the array are GEPs.
+// Check to see that all the users of the array are GEPs or lifetime intrinsics.
 // If so, populate the `geps` vector with a list of all geps that use the array.
-bool CandidateArray::AllArrayUsersAreGEP(std::vector<GEPOperator *> &geps) {
+bool CandidateArray::AllArrayUsersAreGEPOrLifetime(std::vector<GEPOperator *> &geps) {
   for (User *U : m_Alloca->users()) {
+    // Skip users that are themselves only used by lifetime intrinsics.
+    if (onlyUsedByLifetimeMarkers(U))
+      continue;
+
     GEPOperator *gep = dyn_cast<GEPOperator>(U);
     if (!gep)
       return false;
@@ -250,7 +255,7 @@ bool CandidateArray::AllGEPUsersAreValid(GEPOperator *gep) {
 
 // Analyze all uses of the array to see if it qualifies as a constant array.
 // We check the following conditions:
-//  1. Make sure alloca is only used by GEP.
+//  1. Make sure alloca is only used by GEP and lifetime intrinsics.
 //  2. Make sure GEP is only used in load/store.
 //  3. Make sure all stores have constant indices.
 //  4. Make sure all stores are constants.
@@ -258,7 +263,7 @@ bool CandidateArray::AllGEPUsersAreValid(GEPOperator *gep) {
 void CandidateArray::AnalyzeUses() {
   m_IsConstArray = false;
   std::vector<GEPOperator *> geps;
-  if (!AllArrayUsersAreGEP(geps))
+  if (!AllArrayUsersAreGEPOrLifetime(geps))
     return;
 
   for (GEPOperator *gep : geps)

+ 5 - 0
lib/Transforms/Scalar/LowerTypePasses.cpp

@@ -26,6 +26,7 @@
 #include "llvm/IR/Module.h"
 #include "llvm/Transforms/Scalar.h"
 #include "llvm/Transforms/Utils/Local.h"
+#include "llvm/Analysis/ValueTracking.h"
 #include <vector>
 
 using namespace llvm;
@@ -367,6 +368,10 @@ void DynamicIndexingVectorToArray::ReplaceVectorWithArray(Value *Vec, Value *A)
         EltSt->setAlignment(align);
       }
       stInst->eraseFromParent();
+    } else if (BitCastInst *castInst = dyn_cast<BitCastInst>(User)) {
+      DXASSERT(onlyUsedByLifetimeMarkers(castInst),
+               "expected bitcast to only be used by lifetime intrinsics");
+      castInst->setOperand(0, A);
     } else {
       // Vector parameter should be lowered.
       // No function call should use vector.

+ 68 - 2
lib/Transforms/Scalar/ScalarReplAggregatesHLSL.cpp

@@ -2435,7 +2435,38 @@ void SROA_Helper::RewriteBitCast(BitCastInst *BCI) {
   SrcTy = SrcTy->getPointerElementType();
 
   if (!DstTy->isStructTy()) {
-    assert(0 && "Type mismatch.");
+    // This bitcast only feeds llvm.lifetime.* intrinsics. Replace it with one bitcast per element.
+    SmallVector<IntrinsicInst*, 16> ToReplace;
+
+    DXASSERT(onlyUsedByLifetimeMarkers(BCI),
+             "expected struct bitcast to only be used by lifetime intrinsics");
+
+    for (User *User : BCI->users()) {
+      IntrinsicInst *Intrin = cast<IntrinsicInst>(User);
+      ToReplace.push_back(Intrin);
+    }
+
+    const DataLayout &DL = BCI->getModule()->getDataLayout();
+    for (IntrinsicInst *Intrin : ToReplace) {
+      IRBuilder<> Builder(Intrin);
+
+      for (Value *Elt : NewElts) {
+        assert(Elt->getType()->isPointerTy());
+        Type     *ElPtrTy = Elt->getType();
+        Type     *ElTy    = ElPtrTy->getPointerElementType();
+        Value    *SizeV   = Builder.getInt64( DL.getTypeAllocSize(ElTy) );
+        Value    *Ptr     = Builder.CreateBitCast(Elt, Builder.getInt8PtrTy());
+        Value    *Args[]  = {SizeV, Ptr};
+        CallInst *C       = Builder.CreateCall(Intrin->getCalledFunction(), Args);
+        C->setDoesNotThrow();
+      }
+
+      assert(Intrin->use_empty());
+      Intrin->eraseFromParent();
+    }
+
+    assert(BCI->use_empty());
+    BCI->eraseFromParent();
     return;
   }
 
@@ -3191,6 +3222,35 @@ static void CopyElementsOfStructsWithIdenticalLayout(
   }
 }
 
+static void removeLifetimeUsers(Value *V) {
+  // Copy the user list first because users are erased while iterating.
+  std::set<Value*> users(V->users().begin(), V->users().end());
+  for (Value *U : users) {
+    if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(U)) {
+      if (II->getIntrinsicID() == Intrinsic::lifetime_start ||
+          II->getIntrinsicID() == Intrinsic::lifetime_end) {
+        II->eraseFromParent();
+      }
+    } else if (isa<BitCastInst>(U) ||
+               isa<AddrSpaceCastInst>(U) ||
+               isa<GetElementPtrInst>(U)) {
+      // Recurse into bitcast, addrspacecast, GEP.
+      removeLifetimeUsers(U);
+      if (U->use_empty())
+        cast<Instruction>(U)->eraseFromParent();
+    }
+  }
+}
+
+// Conservatively remove all lifetime users of both source and target.
+// Otherwise, wrong lifetimes could be inherited either way.
+// TODO: We should be merging the lifetimes. For convenience, just remove them
+//       for now to be safe.
+static void updateLifetimeForReplacement(Value *From, Value *To) {
+  removeLifetimeUsers(From);
+  removeLifetimeUsers(To);
+}
+
 static bool DominateAllUsers(Instruction *I, Value *V, DominatorTree *DT);
 
 
@@ -3210,9 +3270,11 @@ static bool ReplaceMemcpy(Value *V, Value *Src, MemCpyInst *MC,
   if (Instruction *SrcI = dyn_cast<Instruction>(Src))
     if (!DominateAllUsers(SrcI, V, DT))
       return false;
+
   Type *TyV = V->getType()->getPointerElementType();
   Type *TySrc = Src->getType()->getPointerElementType();
   if (Constant *C = dyn_cast<Constant>(V)) {
+    updateLifetimeForReplacement(V, Src);
     if (TyV == TySrc) {
       if (isa<Constant>(Src)) {
         V->replaceAllUsesWith(Src);
@@ -3228,8 +3290,10 @@ static bool ReplaceMemcpy(Value *V, Value *Src, MemCpyInst *MC,
     }
   } else {
     if (TyV == TySrc) {
-      if (V != Src)
+      if (V != Src) {
+        updateLifetimeForReplacement(V, Src);
         V->replaceAllUsesWith(Src);
+      }
     } else if (!IsUnboundedArrayMemcpy(TyV, TySrc)) {
       Value* DestVal = MC->getRawDest();
       Value* SrcVal = MC->getRawSource();
@@ -3270,6 +3334,7 @@ static bool ReplaceMemcpy(Value *V, Value *Src, MemCpyInst *MC,
             MemcpySplitter::SplitMemCpy(MC, DL, annotation, typeSys);
             return true;
           } else {
+            updateLifetimeForReplacement(V, Src);
             DstPtr->replaceAllUsesWith(SrcPtr);
           }
         } else {
@@ -3278,6 +3343,7 @@ static bool ReplaceMemcpy(Value *V, Value *Src, MemCpyInst *MC,
         }
       }
     } else {
+      updateLifetimeForReplacement(V, Src);
       DXASSERT(IsUnboundedArrayMemcpy(TyV, TySrc), "otherwise mismatched types in memcpy are not unbounded array");
       ReplaceUnboundedArrayUses(V, Src);
     }

+ 66 - 9
lib/Transforms/Utils/PromoteMemoryToRegister.cpp

@@ -297,7 +297,8 @@ private:
 
   void ComputeLiveInBlocks(AllocaInst *AI, AllocaInfo &Info,
                            const SmallPtrSetImpl<BasicBlock *> &DefBlocks,
-                           SmallPtrSetImpl<BasicBlock *> &LiveInBlocks);
+                           SmallPtrSetImpl<BasicBlock *> &LiveInBlocks,
+                           BasicBlock *LifetimeStartBB);
   void RenamePass(BasicBlock *BB, BasicBlock *Pred,
                   RenamePassData::ValVector &IncVals,
                   std::vector<RenamePassData> &Worklist);
@@ -306,9 +307,11 @@ private:
 
 } // end of anonymous namespace
 
-static void removeLifetimeIntrinsicUsers(AllocaInst *AI) {
+static BasicBlock *determineLifetimeStartBBAndRemoveLifetimeIntrinsicUsers(AllocaInst *AI) {
   // Knowing that this alloca is promotable, we know that it's safe to kill all
   // instructions except for load and store.
+  bool DetermineLifetimeStartLoc = true;
+  BasicBlock *LifetimeStartBB = nullptr;
 
   for (auto UI = AI->user_begin(), UE = AI->user_end(); UI != UE;) {
     Instruction *I = cast<Instruction>(*UI);
@@ -318,16 +321,30 @@ static void removeLifetimeIntrinsicUsers(AllocaInst *AI) {
 
     if (!I->getType()->isVoidTy()) {
       // The only users of this bitcast/GEP instruction are lifetime intrinsics.
-      // Follow the use/def chain to erase them now instead of leaving it for
-      // dead code elimination later.
       for (auto UUI = I->user_begin(), UUE = I->user_end(); UUI != UUE;) {
-        Instruction *Inst = cast<Instruction>(*UUI);
+        IntrinsicInst *Inst = dyn_cast<IntrinsicInst>(*UUI);
+        assert(Inst);
         ++UUI;
+        if (DetermineLifetimeStartLoc && Inst->getIntrinsicID() == Intrinsic::lifetime_start) {
+          if (!LifetimeStartBB) {
+            // Remember the lifetime start block.
+            LifetimeStartBB = Inst->getParent();
+          } else {
+            // We currently don't handle alloca with multiple lifetime.start
+            // intrinsics because there can be lots of complicated cases such
+            // as multiple disjoint lifetimes in a single block.
+            // Clear the block and stop looking for a new one.
+            LifetimeStartBB = nullptr;
+            DetermineLifetimeStartLoc = false;
+          }
+        }
         Inst->eraseFromParent();
       }
     }
     I->eraseFromParent();
   }
+
+  return LifetimeStartBB;
 }
 
 /// \brief Rewrite as many loads as possible given a single store.
@@ -544,7 +561,7 @@ void PromoteMem2Reg::run() {
     assert(AI->getParent()->getParent() == &F &&
            "All allocas should be in the same function, which is same as DF!");
 
-    removeLifetimeIntrinsicUsers(AI);
+    BasicBlock *LifetimeStartBB = determineLifetimeStartBBAndRemoveLifetimeIntrinsicUsers(AI);
 
     if (AI->use_empty()) {
       // If there are no uses of the alloca, just delete it now.
@@ -562,9 +579,16 @@ void PromoteMem2Reg::run() {
     // analogous to finding the 'uses' and 'definitions' of each variable.
     Info.AnalyzeAlloca(AI);
 
+    // Determine whether this alloca has only one store *before* potentially
+    // adding a lifetime.start block as another definition. This allows us to
+    // rewrite such an alloca efficiently where possible, while still benefiting
+    // from correct lifetime information should the single store not dominate
+    // all loads.
+    const bool IsSingleStoreAlloca = Info.DefiningBlocks.size() == 1;
+
     // If there is only a single store to this value, replace any loads of
     // it that are directly dominated by the definition with the value stored.
-    if (Info.DefiningBlocks.size() == 1) {
+    if (IsSingleStoreAlloca) {
       if (rewriteSingleStoreAlloca(AI, Info, LBI, DT, AST)) {
         // The alloca has been processed, move on.
         RemoveFromAllocasList(AllocaNum);
@@ -616,7 +640,15 @@ void PromoteMem2Reg::run() {
     // Determine which blocks the value is live in.  These are blocks which lead
     // to uses.
     SmallPtrSet<BasicBlock *, 32> LiveInBlocks;
-    ComputeLiveInBlocks(AI, Info, DefBlocks, LiveInBlocks);
+    ComputeLiveInBlocks(AI, Info, DefBlocks, LiveInBlocks, LifetimeStartBB);
+
+    // If there is a lifetime start block, add it to the def blocks now.
+    // This ensures that the block holding the lifetime.start call is treated
+    // as a "definition" by the IDF computation, preventing phi nodes from
+    // being inserted into loop headers due to false dependencies.
+    if (LifetimeStartBB) {
+      DefBlocks.insert(LifetimeStartBB);
+    }
 
     // At this point, we're committed to promoting the alloca using IDF's, and
     // the standard SSA construction algorithm.  Determine which blocks need phi
@@ -808,10 +840,19 @@ void PromoteMem2Reg::run() {
 /// These are blocks which lead to uses.  Knowing this allows us to avoid
 /// inserting PHI nodes into blocks which don't lead to uses (thus, the
 /// inserted phi nodes would be dead).
+///
+/// The lifetime start block is important for cases where lifetime is restricted
+/// to a loop and not all loads are dominated by stores. When walking up the
+/// CFG, stopping at this block prevents the value from being considered live in
+/// the loop header, which in turn prevents the value from being live across
+/// multiple loop iterations through a phi with undef as input from the preheader.
+/// Also, the block must not even be considered as a starting point for the CFG
+/// traversal, since the value can't be live-in before its lifetime starts.
 void PromoteMem2Reg::ComputeLiveInBlocks(
     AllocaInst *AI, AllocaInfo &Info,
     const SmallPtrSetImpl<BasicBlock *> &DefBlocks,
-    SmallPtrSetImpl<BasicBlock *> &LiveInBlocks) {
+    SmallPtrSetImpl<BasicBlock *> &LiveInBlocks,
+    BasicBlock *LifetimeStartBB) {
 
   // To determine liveness, we must iterate through the predecessors of blocks
   // where the def is live.  Blocks are added to the worklist if we need to
@@ -827,6 +868,18 @@ void PromoteMem2Reg::ComputeLiveInBlocks(
     if (!DefBlocks.count(BB))
       continue;
 
+    // If the lifetime of this value starts in this block, it isn't live-in.
+    // Even if the block is in a loop, a load before the lifetime intrinsic is
+    // dead by definition, since the lifetime must also end within the same
+    // loop iteration. In fact, this very condition prevents false dependencies
+    // across loop iterations that would in turn cause phi nodes.
+    if (BB == LifetimeStartBB) {
+      LiveInBlockWorklist[i] = LiveInBlockWorklist.back();
+      LiveInBlockWorklist.pop_back();
+      --i, --e;
+      continue;
+    }
+
     // Okay, this is a block that both uses and defines the value.  If the first
     // reference to the alloca is a def (store), then we know it isn't live-in.
     for (BasicBlock::iterator I = BB->begin();; ++I) {
@@ -873,6 +926,10 @@ void PromoteMem2Reg::ComputeLiveInBlocks(
       if (DefBlocks.count(P))
         continue;
 
+      // The value is not live into a predecessor if its lifetime starts there.
+      if (P == LifetimeStartBB)
+        continue;
+
       // Otherwise it is, add to the worklist.
       LiveInBlockWorklist.push_back(P);
     }
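
To illustrate why the lifetime.start block is excluded from the live-in set, consider a loop like the following (hypothetical IR; block and value names are invented for the example):

```llvm
loop:                                    ; loop header
  br label %body
body:
  %p = bitcast i32* %x to i8*
  call void @llvm.lifetime.start(i64 4, i8* %p)
  br i1 %cond, label %init, label %use
init:
  store i32 42, i32* %x
  br label %use
use:
  %v = load i32, i32* %x                 ; not dominated by the store
  call void @llvm.lifetime.end(i64 4, i8* %p)
  br i1 %done, label %exit, label %loop
```

Without the lifetime information, the upward CFG walk would mark the loop header as live-in for %x, forcing a phi that carries the conditionally initialized value across iterations. Treating %body as a definition stops the walk there, since %x cannot be live before its lifetime starts.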

+ 4 - 0
tools/clang/include/clang/Frontend/CodeGenOptions.h

@@ -228,6 +228,10 @@ public:
   std::map<std::string, std::string> HLSLOptimizationSelects;
   /// Debug option to print IR after every pass
   bool HLSLPrintAfterAll = false;
+  /// Force-replace lifetime intrinsics with zeroinitializer stores.
+  bool HLSLForceZeroStoreLifetimes = false;
+  /// Enable lifetime marker generation.
+  bool HLSLEnableLifetimeMarkers = false;
   // HLSL Change Ends
 
   // SPIRV Change Starts

+ 1 - 0
tools/clang/lib/CodeGen/BackendUtil.cpp

@@ -340,6 +340,7 @@ void EmitAssemblyHelper::CreatePasses() {
                         CodeGenOpts.HLSLOptimizationToggles.count("structurize-loop-exits-for-unroll") &&
                         CodeGenOpts.HLSLOptimizationToggles.find("structurize-loop-exits-for-unroll")->second;
 
+  PMBuilder.HLSLEnableLifetimeMarkers = CodeGenOpts.HLSLEnableLifetimeMarkers;
   // HLSL Change - end
 
   PMBuilder.DisableUnitAtATime = !CodeGenOpts.UnitAtATime;

+ 4 - 2
tools/clang/lib/CodeGen/CGCleanup.cpp

@@ -942,7 +942,7 @@ bool CodeGenFunction::isObviouslyBranchWithoutCleanups(JumpDest Dest) const {
 /// be known, in which case this will require a fixup.
 ///
 /// As a side-effect, this method clears the insertion point.
-void CodeGenFunction::EmitBranchThroughCleanup(JumpDest Dest) {
+void CodeGenFunction::EmitBranchThroughCleanup(JumpDest Dest, llvm::BranchInst *PreExistingBr) {
   assert(Dest.getScopeDepth().encloses(EHStack.stable_begin())
          && "stale jump destination");
 
@@ -950,7 +950,9 @@ void CodeGenFunction::EmitBranchThroughCleanup(JumpDest Dest) {
     return;
 
   // Create the branch.
-  llvm::BranchInst *BI = Builder.CreateBr(Dest.getBlock());
+  // HLSL Change Begin - use pre-existing branch if one exists
+  llvm::BranchInst *BI = PreExistingBr ? PreExistingBr : Builder.CreateBr(Dest.getBlock());
+  // HLSL Change End - use pre-existing branch if one exists
 
   // Calculate the innermost active normal cleanup.
   EHScopeStack::stable_iterator

+ 3 - 4
tools/clang/lib/CodeGen/CGDecl.cpp

@@ -873,11 +873,10 @@ llvm::Value *CodeGenFunction::EmitLifetimeStart(uint64_t Size,
   // For now, only in optimized builds.
   if (CGM.getCodeGenOpts().OptimizationLevel == 0)
     return nullptr;
+
   // HLSL Change Begins
-  // Don't emit the intrinsic for hlsl for now.
-  // Enable this will require SROA_HLSL to support the intrinsic.
-  // Will do it later when support lifetime marker in HLSL.
-  if (CGM.getLangOpts().HLSL)
+  // Don't emit the intrinsic for HLSL unless lifetime markers are explicitly enabled.
+  if (!CGM.getCodeGenOpts().HLSLEnableLifetimeMarkers)
     return nullptr;
   // HLSL Change Ends
 

+ 3 - 2
tools/clang/lib/CodeGen/CGExpr.cpp

@@ -3883,6 +3883,7 @@ RValue CodeGenFunction::EmitCall(QualType CalleeType, llvm::Value *Callee,
 
   // HLSL Change Begins
   llvm::SmallVector<LValue, 8> castArgList;
+  llvm::SmallVector<LValue, 8> lifetimeCleanupList;
   // The argList of the CallExpr may be updated for out parameters
   llvm::SmallVector<const Stmt *, 8> argList(E->arg_begin(), E->arg_end());
   ConstExprIterator argBegin = argList.data();
@@ -3895,7 +3896,7 @@ RValue CodeGenFunction::EmitCall(QualType CalleeType, llvm::Value *Callee,
   if (getLangOpts().HLSL) {
     if (const FunctionDecl *FD = E->getDirectCallee())
       CGM.getHLSLRuntime().EmitHLSLOutParamConversionInit(*this, FD, E,
-                                                          castArgList, argList, MapTemp);
+                                                          castArgList, argList, lifetimeCleanupList, MapTemp);
   }
   // HLSL Change Ends
 
@@ -3940,7 +3941,7 @@ RValue CodeGenFunction::EmitCall(QualType CalleeType, llvm::Value *Callee,
   // out param conversion
   // conversion and copy back after the call
   if (getLangOpts().HLSL)
-    CGM.getHLSLRuntime().EmitHLSLOutParamConversionCopyBack(*this, castArgList);
+    CGM.getHLSLRuntime().EmitHLSLOutParamConversionCopyBack(*this, castArgList, lifetimeCleanupList);
   // HLSL Change Ends
 
   return CallVal;

+ 6 - 4
tools/clang/lib/CodeGen/CGExprCXX.cpp

@@ -83,6 +83,7 @@ RValue CodeGenFunction::EmitCXXMemberOrOperatorCall(
 
   // HLSL Change Begins
   llvm::SmallVector<LValue, 8> castArgList;
+  llvm::SmallVector<LValue, 8> lifetimeCleanupList;
   // The argList of the CallExpr may be updated for out parameters
   llvm::SmallVector<const Stmt *, 8> argList(CE->arg_begin(), CE->arg_end());
   // out param conversion
@@ -93,7 +94,7 @@ RValue CodeGenFunction::EmitCXXMemberOrOperatorCall(
   if (getLangOpts().HLSL) {
     if (const FunctionDecl *FD = CE->getDirectCallee())
       CGM.getHLSLRuntime().EmitHLSLOutParamConversionInit(*this, FD, CE,
-                                                          castArgList, argList, MapTemp);
+                                                          castArgList, argList, lifetimeCleanupList, MapTemp);
   }
   // HLSL Change Ends
 
@@ -106,7 +107,7 @@ RValue CodeGenFunction::EmitCXXMemberOrOperatorCall(
   // out param conversion
   // conversion and copy back after the call
   if (getLangOpts().HLSL)
-    CGM.getHLSLRuntime().EmitHLSLOutParamConversionCopyBack(*this, castArgList);
+    CGM.getHLSLRuntime().EmitHLSLOutParamConversionCopyBack(*this, castArgList, lifetimeCleanupList);
   // HLSL Change Ends
   return CallVal;
 }
@@ -118,6 +119,7 @@ RValue CodeGenFunction::EmitCXXStructorCall(
   CallArgList Args;
   // HLSL Change Begins
   llvm::SmallVector<LValue, 8> castArgList;
+  llvm::SmallVector<LValue, 8> lifetimeCleanupList;
   // The argList of the CallExpr may be updated for out parameters
   llvm::SmallVector<const Stmt *, 8> argList(CE->arg_begin(), CE->arg_end());
   // out param conversion
@@ -128,7 +130,7 @@ RValue CodeGenFunction::EmitCXXStructorCall(
   if (getLangOpts().HLSL) {
     if (const FunctionDecl *FD = CE->getDirectCallee())
       CGM.getHLSLRuntime().EmitHLSLOutParamConversionInit(*this, FD, CE,
-                                                          castArgList, argList, MapTemp);
+                                                          castArgList, argList, lifetimeCleanupList, MapTemp);
   }
   // HLSL Change Ends
   commonEmitCXXMemberOrOperatorCall(*this, MD, Callee, ReturnValue, This,
@@ -140,7 +142,7 @@ RValue CodeGenFunction::EmitCXXStructorCall(
   // out param conversion
   // conversion and copy back after the call
   if (getLangOpts().HLSL)
-    CGM.getHLSLRuntime().EmitHLSLOutParamConversionCopyBack(*this, castArgList);
+    CGM.getHLSLRuntime().EmitHLSLOutParamConversionCopyBack(*this, castArgList, lifetimeCleanupList);
   // HLSL Change Ends
   return CallVal;
 }

+ 27 - 6
tools/clang/lib/CodeGen/CGHLSLMS.cpp

@@ -237,17 +237,19 @@ public:
       CodeGenFunction &CGF, const FunctionDecl *FD, const CallExpr *E,
       llvm::SmallVector<LValue, 8> &castArgList,
       llvm::SmallVector<const Stmt *, 8> &argList,
+      llvm::SmallVector<LValue, 8> &lifetimeCleanupList,
       const std::function<void(const VarDecl *, llvm::Value *)> &TmpArgMap)
       override;
   void EmitHLSLOutParamConversionCopyBack(
-      CodeGenFunction &CGF, llvm::SmallVector<LValue, 8> &castArgList) override;
+      CodeGenFunction &CGF, llvm::SmallVector<LValue, 8> &castArgList,
+      llvm::SmallVector<LValue, 8> &lifetimeCleanupList) override;
 
   Value *EmitHLSLMatrixOperationCall(CodeGenFunction &CGF, const clang::Expr *E,
                                      llvm::Type *RetType,
                                      ArrayRef<Value *> paramList) override;
 
   void EmitHLSLDiscard(CodeGenFunction &CGF) override;
-  void EmitHLSLCondBreak(CodeGenFunction &CGF, llvm::Function *F, llvm::BasicBlock *DestBB, llvm::BasicBlock *AltBB) override;
+  BranchInst *EmitHLSLCondBreak(CodeGenFunction &CGF, llvm::Function *F, llvm::BasicBlock *DestBB, llvm::BasicBlock *AltBB) override;
 
   Value *EmitHLSLMatrixSubscript(CodeGenFunction &CGF, llvm::Type *RetType,
                                  Value *Ptr, Value *Idx, QualType Ty) override;
@@ -370,6 +372,7 @@ CGMSHLSLRuntime::CGMSHLSLRuntime(CodeGenModule &CGM)
   opts.bAllResourcesBound = CGM.getCodeGenOpts().HLSLAllResourcesBound;
   opts.PackingStrategy = CGM.getCodeGenOpts().HLSLSignaturePackingStrategy;
   opts.bLegacyResourceReservation = CGM.getCodeGenOpts().HLSLLegacyResourceReservation;
+  opts.bForceZeroStoreLifetimes = CGM.getCodeGenOpts().HLSLForceZeroStoreLifetimes;
 
   opts.bUseMinPrecision = CGM.getLangOpts().UseMinPrecision;
   opts.bDX9CompatMode = CGM.getLangOpts().EnableDX9CompatMode;
@@ -4440,12 +4443,11 @@ void CGMSHLSLRuntime::EmitHLSLDiscard(CodeGenFunction &CGF) {
 // This allows the block containing what would have been an unconditional break to be included in the loop
 // If the block uses values that are wave-sensitive, it needs to stay in the loop to prevent optimizations
 // that might produce incorrect results by ignoring the volatile aspect of wave operation results.
-void CGMSHLSLRuntime::EmitHLSLCondBreak(CodeGenFunction &CGF, Function *F, BasicBlock *DestBB, BasicBlock *AltBB) {
+BranchInst *CGMSHLSLRuntime::EmitHLSLCondBreak(CodeGenFunction &CGF, Function *F, BasicBlock *DestBB, BasicBlock *AltBB) {
   // If not a wave-enabled stage, we can keep everything unconditional as before
   if (!m_pHLModule->GetShaderModel()->IsPS() && !m_pHLModule->GetShaderModel()->IsCS() &&
       !m_pHLModule->GetShaderModel()->IsLib()) {
-    CGF.Builder.CreateBr(DestBB);
-    return;
+    return CGF.Builder.CreateBr(DestBB);
   }
 
   // Create a branch that is temporarily conditional on a constant
@@ -4453,6 +4455,7 @@ void CGMSHLSLRuntime::EmitHLSLCondBreak(CodeGenFunction &CGF, Function *F, Basic
   llvm::Type *boolTy = llvm::Type::getInt1Ty(Context);
   BranchInst *BI = CGF.Builder.CreateCondBr(llvm::ConstantInt::get(boolTy,1), DestBB, AltBB);
   m_DxBreaks.emplace_back(BI);
+  return BI;
 }
 
 static llvm::Type *MergeIntType(llvm::IntegerType *T0, llvm::IntegerType *T1) {
@@ -5491,6 +5494,7 @@ void CGMSHLSLRuntime::EmitHLSLOutParamConversionInit(
     CodeGenFunction &CGF, const FunctionDecl *FD, const CallExpr *E,
     llvm::SmallVector<LValue, 8> &castArgList,
     llvm::SmallVector<const Stmt *, 8> &argList,
+    llvm::SmallVector<LValue, 8> &lifetimeCleanupList,
     const std::function<void(const VarDecl *, llvm::Value *)> &TmpArgMap) {
   // Special case: skip first argument of CXXOperatorCall (it is "this").
   unsigned ArgsToSkip = isa<CXXOperatorCallExpr>(E) ? 1 : 0;
@@ -5659,6 +5663,11 @@ void CGMSHLSLRuntime::EmitHLSLOutParamConversionInit(
     IRBuilder<> AllocaBuilder(dxilutil::FindAllocaInsertionPt(F));
     tmpArgAddr = AllocaBuilder.CreateAlloca(CGF.ConvertTypeForMem(ParamTy));
 
+    if (CGM.getCodeGenOpts().HLSLEnableLifetimeMarkers) {
+      const uint64_t AllocaSize = CGM.getDataLayout().getTypeAllocSize(CGF.ConvertTypeForMem(ParamTy));
+      CGF.EmitLifetimeStart(AllocaSize, tmpArgAddr);
+    }
+
     // add it to local decl map
     TmpArgMap(tmpArg, tmpArgAddr);
 
@@ -5677,6 +5686,10 @@ void CGMSHLSLRuntime::EmitHLSLOutParamConversionInit(
       }
     }
 
+    // save to generate lifetime end after call
+    if (CGM.getCodeGenOpts().HLSLEnableLifetimeMarkers)
+      lifetimeCleanupList.emplace_back(tmpLV);
+
     // cast before the call
     if (Param->isModifierIn() &&
         // Don't copy object
@@ -5719,7 +5732,8 @@ void CGMSHLSLRuntime::EmitHLSLOutParamConversionInit(
 }
 
 void CGMSHLSLRuntime::EmitHLSLOutParamConversionCopyBack(
-    CodeGenFunction &CGF, llvm::SmallVector<LValue, 8> &castArgList) {
+    CodeGenFunction &CGF, llvm::SmallVector<LValue, 8> &castArgList,
+    llvm::SmallVector<LValue, 8> &lifetimeCleanupList) {
   for (uint32_t i = 0; i < castArgList.size(); i += 2) {
     // cast after the call
     LValue tmpLV = castArgList[i];
@@ -5780,6 +5794,13 @@ void CGMSHLSLRuntime::EmitHLSLOutParamConversionCopyBack(
     } else
       tmpArgAddr->replaceAllUsesWith(argLV.getAddress());
   }
+
+  for (LValue &tmpLV : lifetimeCleanupList) {
+    QualType ParamTy = tmpLV.getType().getNonReferenceType();
+    Value *tmpArgAddr = tmpLV.getAddress();
+    const uint64_t AllocaSize = CGM.getDataLayout().getTypeAllocSize(CGF.ConvertTypeForMem(ParamTy));
+    CGF.EmitLifetimeEnd(CGF.Builder.getInt64(AllocaSize), tmpArgAddr);
+  }
 }
 
 ScopeInfo *CGMSHLSLRuntime::GetScopeInfo(Function *F) {
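The `EmitLifetimeStart`/`EmitLifetimeEnd` calls added above bracket each temporary out-param alloca so its storage is only considered live around the call, instead of for the whole caller. A minimal self-contained C++ sketch of that bracketing discipline (the event log and function names are illustrative stand-ins, not DXC API; sizes come from `getTypeAllocSize` in the real code):

```cpp
#include <cassert>
#include <vector>

// Models the paired lifetime markers emitted around an out-param temporary:
// a start event when the alloca is created in EmitHLSLOutParamConversionInit,
// an end event in EmitHLSLOutParamConversionCopyBack after the call returns.
struct LifetimeLog {
  std::vector<long long> events; // +size for lifetime.start, -size for .end
  void start(long long size) { events.push_back(+size); }
  void end(long long size)   { events.push_back(-size); }
};

long long callWithTmpArg(LifetimeLog &log, long long allocaSize) {
  log.start(allocaSize);   // CGF.EmitLifetimeStart(AllocaSize, tmpArgAddr)
  long long result = 42;   // stand-in for the actual call plus copy-back
  log.end(allocaSize);     // CGF.EmitLifetimeEnd(...) from lifetimeCleanupList
  return result;
}
```

The cleanup list exists because start and end are emitted by two different hooks; the log above makes the pairing explicit.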

+ 4 - 2
tools/clang/lib/CodeGen/CGHLSLRuntime.h

@@ -76,15 +76,17 @@ public:
       CodeGenFunction &CGF, const FunctionDecl *FD, const CallExpr *E,
       llvm::SmallVector<LValue, 8> &castArgList,
       llvm::SmallVector<const Stmt *, 8> &argList,
+      llvm::SmallVector<LValue, 8> &lifetimeCleanupList,
       const std::function<void(const VarDecl *, llvm::Value *)> &TmpArgMap) = 0;
   virtual void EmitHLSLOutParamConversionCopyBack(
-      CodeGenFunction &CGF, llvm::SmallVector<LValue, 8> &castArgList) = 0;
+      CodeGenFunction &CGF, llvm::SmallVector<LValue, 8> &castArgList,
+      llvm::SmallVector<LValue, 8> &lifetimeCleanupList) = 0;
   virtual void MarkRetTemp(CodeGenFunction &CGF, llvm::Value *V,
                           clang::QualType QaulTy) = 0;
   virtual llvm::Value *EmitHLSLMatrixOperationCall(CodeGenFunction &CGF, const clang::Expr *E, llvm::Type *RetType,
       llvm::ArrayRef<llvm::Value*> paramList) = 0;
   virtual void EmitHLSLDiscard(CodeGenFunction &CGF) = 0;
-  virtual void EmitHLSLCondBreak(CodeGenFunction &CGF, llvm::Function *F, llvm::BasicBlock *DestBB, llvm::BasicBlock *AltBB) = 0;
+  virtual llvm::BranchInst *EmitHLSLCondBreak(CodeGenFunction &CGF, llvm::Function *F, llvm::BasicBlock *DestBB, llvm::BasicBlock *AltBB) = 0;
 
   // For [] on matrix
   virtual llvm::Value *EmitHLSLMatrixSubscript(CodeGenFunction &CGF,

+ 13 - 5
tools/clang/lib/CodeGen/CGStmt.cpp

@@ -1202,11 +1202,19 @@ void CodeGenFunction::EmitBreakStmt(const BreakStmt &S) {
 
   // HLSL Change Begin - incorporate unconditional branch blocks into loops
   // If it has a continue location, it's a loop
-  if (BreakContinueStack.back().ContinueBlock.getBlock() && (BreakContinueStack.size() < 2 ||
-      BreakContinueStack.back().ContinueBlock.getBlock() != BreakContinueStack.end()[-2].ContinueBlock.getBlock())) {
-    assert(EHStack.getInnermostActiveNormalCleanup() == EHStack.stable_end() && "HLSL Shouldn't need cleanups");
-    CGM.getHLSLRuntime().EmitHLSLCondBreak(*this, CurFn, BreakContinueStack.back().BreakBlock.getBlock(),
-                                           BreakContinueStack.back().ContinueBlock.getBlock());
+  llvm::BasicBlock *lastContinueBlock = BreakContinueStack.back().ContinueBlock.getBlock();
+  if (lastContinueBlock && (BreakContinueStack.size() < 2 ||
+      lastContinueBlock != BreakContinueStack.end()[-2].ContinueBlock.getBlock())) {
+    // We execute this if
+    // - we are in an unnested loop, or
+    // - we are in a nested control construct but the continue block of the enclosing loop is different from the current continue block.
+    // The second condition can happen for switch statements inside loops, which share the same continue block.
+    llvm::BasicBlock *lastBreakBlock = BreakContinueStack.back().BreakBlock.getBlock();
+    llvm::BranchInst *condBr = CGM.getHLSLRuntime().EmitHLSLCondBreak(*this, CurFn, lastBreakBlock, lastContinueBlock);
+
+    // Insertion of lifetime.start/end intrinsics may require a cleanup, so we
+    // pass the branch that we already generated into the handler.
+    EmitBranchThroughCleanup(BreakContinueStack.back().BreakBlock, condBr);
     Builder.ClearInsertionPoint();
   } else
   // HLSL Change End - incorporate unconditional branch blocks into loops

+ 2 - 1
tools/clang/lib/CodeGen/CodeGenFunction.h

@@ -685,7 +685,8 @@ public:
   /// EmitBranchThroughCleanup - Emit a branch from the current insert
   /// block through the normal cleanup handling code (if any) and then
   /// on to \arg Dest.
-  void EmitBranchThroughCleanup(JumpDest Dest);
+  // HLSL Change - allow to use pre-generated branch
+  void EmitBranchThroughCleanup(JumpDest Dest, llvm::BranchInst *PreExistingBr = nullptr);
   
   /// isObviouslyBranchWithoutCleanups - Return true if a branch to the
   /// specified destination obviously has no cleanups to run.  'false' is always

+ 1 - 2
tools/clang/test/HLSLFileCheck/hlsl/control_flow/basic_blocks/cbuf_memcpy_replace.hlsl

@@ -170,7 +170,6 @@ int uncond_if_else(uint i, int j)
 //CHECK: define i32 @"\01?entry_memcpy@@YAHHH@Z"(i32 %i, i32 %ct)
 //CHECK: call %dx.types.CBufRet.i32 @dx.op.cbufferLoadLegacy.i32
 //CHECK: extractvalue %dx.types.CBufRet.i32
-//CHECK: icmp eq i32
 //CHECK: phi i32
 //CHECK: ret i32
 // This should allow the complete RAUW replacement
@@ -181,7 +180,7 @@ int entry_memcpy(int i, int ct)
 
   istruct = istructs[i];
 
-  int ival;
+  int ival = 0;
 
   for (; i < ct; ++i)
     ival += istruct.ival;

+ 0 - 2
tools/clang/test/HLSLFileCheck/hlsl/control_flow/return/whole_scope_returned_loop.hlsl

@@ -6,8 +6,6 @@ float main(float4 a:A) : SV_Target {
 // Init bReturned.
 // CHECK:%[[bReturned:.*]] = alloca i1
 // CHECK-NEXT:store i1 false, i1* %[[bReturned]]
-// Init retVal to 0.
-// CHECK:store float undef
   float r = a.w;
 
 // CHECK: [[if_then:.*]] ; preds =

+ 228 - 0
tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes.hlsl

@@ -0,0 +1,228 @@
+// RUN: %dxc -T lib_6_6 %s  | FileCheck %s
+
+//
+// Non-SSA arrays should have lifetimes within the correct scope.
+//
+// CHECK: define i32 @"\01?if_scoped_array@@YAHHH@Z"
+// CHECK: alloca
+// CHECK: icmp
+// CHECK: br i1
+// CHECK: call void @llvm.lifetime.start
+// CHECK: br label
+// CHECK: load i32
+// CHECK: call void @llvm.lifetime.end
+// CHECK: br label
+// CHECK: phi i32
+// CHECK: load i32
+// CHECK: store i32
+// CHECK: br i1
+// CHECK: phi i32
+// CHECK: ret i32
+export
+int if_scoped_array(int n, int c)
+{
+  int res = c;
+
+  if (n > 0) {
+    int arr[200];
+
+    // Fake some dynamic initialization so the array can't be optimized away.
+    for (int i = 0; i < n; ++i) {
+        arr[i] = arr[c - i];
+    }
+
+    res = arr[c];
+  }
+
+  return res;
+}
+
+//
+// Escaping structs should have lifetimes within the correct scope.
+//
+// CHECK: define void @"\01?loop_scoped_escaping_struct@@YAXH@Z"(i32 %n)
+// CHECK: %[[alloca:.*]] = alloca %struct.MyStruct
+// CHECK: ret
+// CHECK: phi i32
+// CHECK-NEXT: bitcast
+// CHECK-NEXT: call void @llvm.lifetime.start
+// CHECK-NEXT: call float @"\01?func@@YAMUMyStruct@@@Z"(%struct.MyStruct* nonnull %[[alloca]])
+// CHECK-NEXT: call void @llvm.lifetime.end
+// CHECK: br i1
+struct MyStruct {
+  float x;
+};
+
+float func(MyStruct data);
+
+export
+void loop_scoped_escaping_struct(int n)
+{
+  for (int i = 0; i < n; ++i) {
+    MyStruct data;
+    func(data);
+  }
+}
+
+//
+// Loop-scoped structs that are passed as inout should have lifetimes
+// within the correct scope and should not produce values live across multiple
+// loop iterations (= no loop phi nodes).
+//
+// CHECK: define i32 @"\01?loop_scoped_escaping_struct_write@@YAHH@Z"(i32 %n)
+// CHECK: %[[alloca:.*]] = alloca %struct.MyStruct
+// CHECK: br i1
+// CHECK: phi i32
+// CHECK-NEXT: ret
+// CHECK: phi i32
+// CHECK-NEXT: phi i32
+// CHECK-NOT: phi float
+// CHECK-NEXT: bitcast
+// CHECK-NEXT: call void @llvm.lifetime.start
+// CHECK-NOT: store
+// CHECK-NEXT: call void @"\01?func2@@YAXUMyStruct@@@Z"(%struct.MyStruct* nonnull %[[alloca]])
+// CHECK-NEXT: getelementptr
+// CHECK-NEXT: load
+// CHECK: call void @llvm.lifetime.end
+// CHECK: br i1
+void func2(inout MyStruct data);
+
+export
+int loop_scoped_escaping_struct_write(int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i) {
+    MyStruct data;
+    func2(data);
+    res += data.x;
+  }
+  return res;
+}
+
+//
+// Loop-scoped structs that can be promoted to registers should not produce
+// values considered live across multiple loop iterations (= no loop phi nodes).
+//
+// Make sure there is only one loop phi node, which is the induction var.
+// CHECK: define i32 @"\01?loop_scoped_struct_conditional_init@@YAHHH@Z"(i32 %n, i32 %c1)
+// CHECK-NOT: alloca
+// CHECK: phi i32
+// CHECK-NEXT: ret i32
+// CHECK-NOT: phi float
+// CHECK: phi i32
+// CHECK-NOT: phi float
+// CHECK-NEXT: call void @"\01?expensiveComputation
+// CHECK-NEXT: icmp
+// CHECK-NEXT: br i1
+// CHECK: dx.op.rawBufferLoad
+// CHECK: phi float
+RWStructuredBuffer<MyStruct> g_rwbuf : register(u0);
+
+void expensiveComputation();
+
+export
+int loop_scoped_struct_conditional_init(int n, int c1)
+{
+  int res = n;
+
+  for (int i = 0; i < n; ++i) {
+    expensiveComputation(); // s must not be considered live here.
+
+    MyStruct s;
+
+    // Initialize struct conditionally.
+    // NOTE: If some optimization decides to flatten the if statement or if the
+    //       computation could be hoisted out of the loop, the phi with undef
+    //       below will be replaced by the non-undef value (which is a valid
+    //       "specialization" of undef).
+    if (c1 < 0)
+      s.x = g_rwbuf[i - c1].x;
+
+    res = s.x; // Buffer value or undef.
+  }
+
+  return res; // n if the loop wasn't executed, otherwise the last value of s.x.
+}
+
+//
+// Another real-world use-case for loop-scoped structs that can be promoted
+// to registers. Again, this should not produce values that are live across
+// multiple loop iterations (= no loop phi nodes).
+// Both the consume and produce calls must be inlined, otherwise the alloca
+// can't be promoted.
+//
+// CHECK: define i32 @"\01?loop_scoped_struct_conditional_assign_from_func_output@@YAHHH@Z"(i32 %n, i32 %c1)
+// CHECK-NOT: alloca
+// CHECK: phi i32
+// CHECK-NOT: phi i32
+// CHECK-NOT: phi float
+void consume(int i, in MyStruct data)
+{
+  // This must be inlined, otherwise the alloca can't be promoted.
+  g_rwbuf[i] = data;
+}
+
+bool produce(in int c, out MyStruct data)
+{
+  if (c > 0) {
+    MyStruct s;
+    s.x = 13;
+    data = s; // <-- Conditional assignment of out-qualified parameter.
+    return true;
+  }
+  return false; // <-- Out-qualified parameter left uninitialized.
+}
+
+export
+int loop_scoped_struct_conditional_assign_from_func_output(int n, int c1)
+{
+  for (int i=0; i<n; ++i) {
+    MyStruct data;
+    bool valid = produce(c1, data); // <-- Without lifetimes, inlining this generates a loop phi using prior iteration's value.
+    if (valid)
+      consume(i, data);
+    expensiveComputation(); // <-- Said phi is alive here, inflating register pressure.
+  }
+  return n;
+}
+
+//
+// Global constants should have lifetimes.
+// The constant array should be hoisted to a constant global.
+//
+// CHECK: define i32 @"\01?global_constant@@YAHH@Z"(i32 %n)
+// CHECK: call void @llvm.lifetime.start
+// CHECK: load i32
+// CHECK: call void @llvm.lifetime.end
+// CHECK: ret i32
+int compute(int i)
+{
+  int arr[] = {0, 1, 2, 3, 4, 5, -1, 13};
+  return arr[i % 8];
+}
+
+export
+int global_constant(int n)
+{
+  return compute(n);
+}
+
+//
+// Global constants should have lifetimes within the correct scope.
+// The constant array should be hoisted to a constant global with lifetime
+// only inside the loop.
+//
+// CHECK: define i32 @"\01?global_constant2@@YAHH@Z"(i32 %n)
+// CHECK: phi i32
+// CHECK: phi i32
+// CHECK: call void @llvm.lifetime.start
+// CHECK: load i32
+// CHECK: call void @llvm.lifetime.end
+export
+int global_constant2(int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i)
+    res += compute(i);
+  return res;
+}
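The `loop_scoped_struct_conditional_init` case above can be mirrored in plain C++ to see why hoisting the alloca changes semantics: once the variable lives outside the loop, a conditionally-written field legally carries its value into later iterations, which is exactly the loop phi the lifetime markers prevent. A hedged sketch (function and data are illustrative, not part of the test suite):

```cpp
#include <cassert>
#include <vector>

// Analogue of the HLSL loop: `sx` models the struct field s.x. Declared
// outside the loop (as a hoisted alloca would be), its conditionally-assigned
// value persists across iterations, so SSA construction needs a phi in the
// loop header to carry it, even though the HLSL source scoped it per iteration.
int hoistedFieldLoop(const std::vector<int> &buf, int n, int c1) {
  int res = n;
  int sx = -1; // hoisted out of the loop: value survives iterations
  for (int i = 0; i < n; ++i) {
    if (c1 < 0)
      sx = buf[i - c1]; // conditional init, as in the HLSL test
    res = sx;           // may observe a previous iteration's value
  }
  return res;
}
```

With lifetime markers (or the zero/undef-store fallback), the optimizer knows the per-iteration value cannot flow between iterations and the phi disappears.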

+ 39 - 0
tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_force_zero_flag.hlsl

@@ -0,0 +1,39 @@
+// RUN: %dxc -T lib_6_6 -force-zero-store-lifetimes %s  | FileCheck %s
+
+//
+// Same test as in lifetimes.hlsl, but expecting zeroinitializer store
+// instead of lifetime intrinsics due to flag -force-zero-store-lifetimes.
+//
+// CHECK: define i32 @"\01?if_scoped_array@@YAHHH@Z"
+// CHECK: alloca
+// CHECK: icmp
+// CHECK: br i1
+// CHECK: store [200 x i32] zeroinitializer
+// CHECK: br label
+// CHECK: load i32
+// CHECK: store [200 x i32] zeroinitializer
+// CHECK: br label
+// CHECK: phi i32
+// CHECK: load i32
+// CHECK: store i32
+// CHECK: br i1
+// CHECK: phi i32
+// CHECK: ret i32
+export
+int if_scoped_array(int n, int c)
+{
+  int res = c;
+
+  if (n > 0) {
+    int arr[200];
+
+    // Fake some dynamic initialization so the array can't be optimized away.
+    for (int i = 0; i < n; ++i) {
+        arr[i] = arr[c - i];
+    }
+
+    res = arr[c];
+  }
+
+  return res;
+}

+ 232 - 0
tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_lib_6_3.hlsl

@@ -0,0 +1,232 @@
+// RUN: %dxc -T lib_6_3 -enable-lifetime-markers %s  | FileCheck %s
+
+// This file is identical to lifetimes.hlsl except that it tests for
+// undef stores instead of lifetime intrinsics (fallback for earlier
+// SM and validator versions).
+
+//
+// Non-SSA arrays should have lifetimes within the correct scope.
+//
+// CHECK: define i32 @"\01?if_scoped_array@@YAHHH@Z"
+// CHECK: alloca
+// CHECK: icmp
+// CHECK: br i1
+// CHECK: store [200 x i32] undef
+// CHECK: br label
+// CHECK: load i32
+// CHECK: store [200 x i32] undef
+// CHECK: br label
+// CHECK: phi i32
+// CHECK: load i32
+// CHECK: store i32
+// CHECK: br i1
+// CHECK: phi i32
+// CHECK: ret i32
+export
+int if_scoped_array(int n, int c)
+{
+  int res = c;
+
+  if (n > 0) {
+    int arr[200];
+
+    // Fake some dynamic initialization so the array can't be optimized away.
+    for (int i = 0; i < n; ++i) {
+        arr[i] = arr[c - i];
+    }
+
+    res = arr[c];
+  }
+
+  return res;
+}
+
+//
+// Escaping structs should have lifetimes within the correct scope.
+//
+// CHECK: define void @"\01?loop_scoped_escaping_struct@@YAXH@Z"(i32 %n)
+// CHECK: %[[alloca:.*]] = alloca %struct.MyStruct
+// CHECK: ret
+// CHECK: phi i32
+// CHECK-NEXT: store %struct.MyStruct undef
+// CHECK-NEXT: call float @"\01?func@@YAMUMyStruct@@@Z"(%struct.MyStruct* nonnull %[[alloca]])
+// CHECK-NEXT: store %struct.MyStruct undef
+// CHECK: br i1
+struct MyStruct {
+  float x;
+};
+
+float func(MyStruct data);
+
+export
+void loop_scoped_escaping_struct(int n)
+{
+  for (int i = 0; i < n; ++i) {
+    MyStruct data;
+    func(data);
+  }
+}
+
+//
+// Loop-scoped structs that are passed as inout should have lifetimes
+// within the correct scope and should not produce values live across multiple
+// loop iterations (= no loop phi nodes).
+//
+// CHECK: define i32 @"\01?loop_scoped_escaping_struct_write@@YAHH@Z"(i32 %n)
+// CHECK: %[[alloca:.*]] = alloca %struct.MyStruct
+// CHECK: br i1
+// CHECK: phi i32
+// CHECK-NEXT: ret
+// CHECK: phi i32
+// CHECK-NEXT: phi i32
+// CHECK-NOT: phi float
+// CHECK-NEXT: store %struct.MyStruct undef
+// CHECK-NOT: store
+// CHECK-NEXT: call void @"\01?func2@@YAXUMyStruct@@@Z"(%struct.MyStruct* nonnull %[[alloca]])
+// CHECK-NEXT: getelementptr
+// CHECK-NEXT: load
+// CHECK: store %struct.MyStruct undef
+// CHECK: br i1
+void func2(inout MyStruct data);
+
+export
+int loop_scoped_escaping_struct_write(int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i) {
+    MyStruct data;
+    func2(data);
+    res += data.x;
+  }
+  return res;
+}
+
+//
+// Loop-scoped structs that can be promoted to registers should not produce
+// values considered live across multiple loop iterations (= no loop phi nodes).
+//
+// Make sure there is only one loop phi node, which is the induction var.
+// CHECK: define i32 @"\01?loop_scoped_struct_conditional_init@@YAHHH@Z"(i32 %n, i32 %c1)
+// CHECK-NOT: alloca
+// CHECK: phi i32
+// CHECK-NEXT: ret i32
+// CHECK-NOT: phi float
+// CHECK: phi i32
+// CHECK-NOT: phi float
+// CHECK-NEXT: call void @"\01?expensiveComputation
+// CHECK-NEXT: icmp
+// CHECK-NEXT: br i1
+// CHECK: dx.op.rawBufferLoad
+// CHECK: phi float
+RWStructuredBuffer<MyStruct> g_rwbuf : register(u0);
+
+void expensiveComputation();
+
+export
+int loop_scoped_struct_conditional_init(int n, int c1)
+{
+  int res = n;
+
+  for (int i = 0; i < n; ++i) {
+    expensiveComputation(); // s must not be considered live here.
+
+    MyStruct s;
+
+    // Initialize struct conditionally.
+    // NOTE: If some optimization decides to flatten the if statement or if the
+    //       computation could be hoisted out of the loop, the phi with undef
+    //       below will be replaced by the non-undef value (which is a valid
+    //       "specialization" of undef).
+    if (c1 < 0)
+      s.x = g_rwbuf[i - c1].x;
+
+    res = s.x; // Buffer value or undef.
+  }
+
+  return res; // n if the loop wasn't executed, otherwise the last value of s.x.
+}
+
+//
+// Another real-world use-case for loop-scoped structs that can be promoted
+// to registers. Again, this should not produce values that are live across
+// multiple loop iterations (= no loop phi nodes).
+// Both the consume and produce calls must be inlined, otherwise the alloca
+// can't be promoted.
+//
+// CHECK: define i32 @"\01?loop_scoped_struct_conditional_assign_from_func_output@@YAHHH@Z"(i32 %n, i32 %c1)
+// CHECK-NOT: alloca
+// CHECK: phi i32
+// CHECK-NOT: phi i32
+// CHECK-NOT: phi float
+void consume(int i, in MyStruct data)
+{
+  // This must be inlined, otherwise the alloca can't be promoted.
+  g_rwbuf[i] = data;
+}
+
+bool produce(in int c, out MyStruct data)
+{
+  if (c > 0) {
+    MyStruct s;
+    s.x = 13;
+    data = s; // <-- Conditional assignment of out-qualified parameter.
+    return true;
+  }
+  return false; // <-- Out-qualified parameter left uninitialized.
+}
+
+export
+int loop_scoped_struct_conditional_assign_from_func_output(int n, int c1)
+{
+  for (int i=0; i<n; ++i) {
+    MyStruct data;
+    bool valid = produce(c1, data); // <-- Without lifetimes, inlining this generates a loop phi using prior iteration's value.
+    if (valid)
+      consume(i, data);
+    expensiveComputation(); // <-- Said phi is alive here, inflating register pressure.
+  }
+  return n;
+}
+
+//
+// The constant array should be hoisted to a constant global.
+// There should be no store of undef or 0 that would overwrite
+// the initializer.
+//
+// CHECK: define i32 @"\01?global_constant@@YAHH@Z"(i32 %n)
+// CHECK-NOT: store [8 x i32] undef
+// CHECK: load i32
+// CHECK-NOT: store [8 x i32] undef
+// CHECK: ret i32
+int compute(int i)
+{
+  int arr[] = {0, 1, 2, 3, 4, 5, -1, 13};
+  return arr[i % 8];
+}
+
+export
+int global_constant(int n)
+{
+  return compute(n);
+}
+
+//
+// The constant array should be hoisted to a constant global with lifetime
+// only inside the loop.
+// There should be no store of undef or 0 that would overwrite
+// the initializer.
+//
+// CHECK: define i32 @"\01?global_constant2@@YAHH@Z"(i32 %n)
+// CHECK: phi i32
+// CHECK: phi i32
+// CHECK-NOT: store [8 x i32] undef
+// CHECK: load i32
+// CHECK-NOT: store [8 x i32] undef
+export
+int global_constant2(int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i)
+    res += compute(i);
+  return res;
+}
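The three lifetime test variants (intrinsics for lib_6_6, undef stores for earlier targets, zero stores under `-force-zero-store-lifetimes` or validator < 1.6) follow the fallback policy described in the commit message. A self-contained C++ sketch of that decision; the enum and helper are named illustratively and are not taken from DxilPreparePasses.cpp:

```cpp
#include <cassert>

// How a lifetime.start/.end marker is lowered, per the PR description:
// keep the intrinsic when the target/validator supports it, otherwise
// overwrite the alloca with undef, or with zero when undef stores are
// disallowed (validator < 1.6) or zero stores are explicitly forced.
enum class LifetimeLowering { KeepIntrinsic, UndefStore, ZeroStore };

LifetimeLowering lowerLifetimeMarker(bool targetSupportsLifetimes,
                                     unsigned valMajor, unsigned valMinor,
                                     bool forceZeroStoreLifetimes) {
  if (forceZeroStoreLifetimes)
    return LifetimeLowering::ZeroStore;     // -force-zero-store-lifetimes
  if (targetSupportsLifetimes)
    return LifetimeLowering::KeepIntrinsic; // e.g. the lib_6_6 test above
  if (valMajor == 1 && valMinor < 6)
    return LifetimeLowering::ZeroStore;     // undef store disallowed pre-1.6
  return LifetimeLowering::UndefStore;      // backwards-compatible default
}
```

Each branch corresponds to one of the FileCheck patterns in these tests: `llvm.lifetime.start`, `store ... undef`, or `store ... zeroinitializer`.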

+ 91 - 0
tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_loop_live_vals.hlsl

@@ -0,0 +1,91 @@
+// RUN: %dxc -T lib_6_6 %s | FileCheck %s
+
+// Regression tests where enabling lifetimes caused some inefficiencies due to
+// missing cleanup optimizations.
+// The first two tests are modeled to make it easy to compare against stock
+// LLVM: the code translates directly to standard C++.
+
+//------------------------------------------------------------------------------
+// CHECK: define void @"\01?test@@YAXHAIAM@Z"
+// CHECK-NOT: undef
+bool done(int);
+float loop_code();
+
+void fn(in int loopCount, out float res) {
+  for (int i = 0; i < loopCount; i++) {
+    float f = loop_code();
+    if (done(i)) {
+      res = f;
+      return;
+    }
+  }
+  res = 1;
+}
+
+export
+void test(in int loopCount, out float res) {
+  res = 0;
+  float f;
+  fn(loopCount, f);
+  if (f > 0)
+    res = f;
+}
+
+//------------------------------------------------------------------------------
+// CHECK: define void @"\01?fn2@@YAXHAIAM@Z"
+// CHECK-NOT: undef
+export
+void fn2(in int loopCount, out float res) {
+  for (int i = 0; i < loopCount; i++) {
+    float f = loop_code();
+    if (done(i)) {
+      res = f;
+      return;
+    }
+    if (done(-i)) {
+      res = 2;
+      return;
+    }
+  }
+  res = 1;
+}
+
+export
+void test2(in int loopCount, out float res) {
+  float f;
+  fn2(loopCount, f);
+  res = f;
+}
+
+//------------------------------------------------------------------------------
+// There must not be any phi with undef (or any undef in general) in the final
+// code.
+// There can be 'undef' in the metadata, so we limit the check until metadata
+// starts.
+
+// CHECK: define void @"\01?main@@YAXAIAM@Z"
+// CHECK-NOT: undef
+// CHECK: !dx.version
+int loopCountGlobal;
+
+void fn3(out float res) {
+  for (int i = 0; i < loopCountGlobal; i++) {
+    float f = loop_code();
+    if (done(i)) {
+      res = f;
+      return;
+    }
+    if (done(-i)) {
+      res = 2;
+      return;
+    }
+  }
+  res = 1;
+}
+
+export
+void main(out float res : OUT) {
+  float f;
+  fn3(f);
+  res = f;
+}

+ 62 - 0
tools/clang/test/HLSLFileCheck/hlsl/lifetimes/lifetimes_replacememcpy.hlsl

@@ -0,0 +1,62 @@
+// RUN: %dxc -T lib_6_6 %s  | FileCheck %s
+// RUN: %dxc -T lib_6_3 %s  | FileCheck %s
+
+//
+// Regression test for a case where a memcpy gets replaced. If lifetime
+// intrinsics are not correctly removed or made conservative, this can
+// lead to cases with invalid lifetimes.
+//
+
+// CHECK: @[[constname:.*]] = internal unnamed_addr constant [2 x float] [float 1.000000e+00, float 3.000000e+00]
+// CHECK: define float @"\01?memcpy_replace@@YAMH@Z"(i32 %i)
+// CHECK: getelementptr inbounds [2 x float], [2 x float]* @[[constname]], i32 0, i32 %i
+// CHECK: load float
+// CHECK: ret float
+struct MyStruct2 {
+ float x[2];
+ float y;
+};
+
+MyStruct2 init() {
+  // The layout of the struct and the way we initialize it has to be
+  // complex enough that a memcpy is generated.
+  MyStruct2 s;
+  s.y = 3;
+  s.x[0] = 1;
+  s.x[1] = s.y;
+  return s;
+}
+
+export
+float memcpy_replace(int i) {
+  MyStruct2 s = init();
+  // Memcpy from inlined alloca to local alloca of s happens here.
+  //   Memcpy replacement replaces s by the inlined one.
+  // Lifetime of inlined alloca ends here
+
+  // Access local variable here again.
+  // If everything works correctly, this should be a GEP to a constant
+  // that has 1, 3 as its elements. When lifetimes were not removed
+  // conservatively during memcpy removal, they caused this to load
+  // uninitialized memory, represented by a load from a zeroinitialized
+  // constant.
+  return s.x[i];
+}
+
+// This is a slightly more complex variant that exposes the same issue.
+//
+//float func3(in float x);
+//
+//export
+//float memcpy_replace2() {
+//  float res = 0;
+//  MyStruct2 s = init();
+//
+//  [loop]
+//  for (uint i = 0; i < 3; i++) {
+//    res += func3(s.y);
+//  }
+//  res *= s.x[res];
+//
+//  return res;
+//}

+ 4 - 2
tools/clang/test/HLSLFileCheck/passes/hl/sroa_hlsl/memcpy_preuser.hlsl

@@ -19,7 +19,9 @@
 // CHECK: fadd
 // CHECK: select i1
 // CHECK: cbufferLoadLegacy
-// CHECK: ret void
+// CHECK: add nuw nsw
+// CHECK: icmp
+// CHECK: br i1
 struct OuterStruct
 {
   float fval;
@@ -34,9 +36,9 @@ cbuffer cbuf : register(b1)
 float main(int doit : A) : SV_Target
 {
   float res = 0.0;
+  OuterStruct oStruct;
   // Need a loop so the dest user can come before the memcpy
   for (int i = 0; i < doit; i++) {
-    OuterStruct oStruct;
     // This should be expressable as a select unless a bunch of mem stuff gets crammed in
     if(i%2 == 0) {
       res += oStruct.fval2;

+ 1 - 0
tools/clang/test/HLSLFileCheck/passes/llvm/simplifycfg/fold-cond-branch-on-phi.hlsl

@@ -7,6 +7,7 @@
 // CHECK: %[[cond:.+]] = phi i1
 // CHECK-SAME: [ false
 // CHECK: br i1 %[[cond]]
+// CHECK: @main
 
 cbuffer cb : register(b0) {
   uint a,b,c,d,e,f,g,h,i,j,k,l,m,n;

+ 1 - 1
tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2HCS.hlsl

@@ -4,9 +4,9 @@
 // CHECK: bufferLoad
 // CHECK: textureLoad
 // CHECK: FAbs
-// CHECK: sampleLevel
 // CHECK: FMin
 // CHECK: FMax
+// CHECK: sampleLevel
 // CHECK: textureStore
 
 

+ 1 - 1
tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2HDebugCS.hlsl

@@ -4,9 +4,9 @@
 // CHECK: bufferLoad
 // CHECK: textureLoad
 // CHECK: FAbs
-// CHECK: sampleLevel
 // CHECK: FMin
 // CHECK: FMax
+// CHECK: sampleLevel
 // CHECK: textureStore
 
 //

+ 1 - 1
tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2VCS.hlsl

@@ -4,9 +4,9 @@
 // CHECK: bufferLoad
 // CHECK: textureLoad
 // CHECK: FAbs
-// CHECK: sampleLevel
 // CHECK: FMin
 // CHECK: FMax
+// CHECK: sampleLevel
 // CHECK: textureStore
 
 //

+ 1 - 1
tools/clang/test/HLSLFileCheck/samples/MiniEngine/FXAAPass2VDebugCS.hlsl

@@ -4,9 +4,9 @@
 // CHECK: bufferLoad
 // CHECK: textureLoad
 // CHECK: FAbs
-// CHECK: sampleLevel
 // CHECK: FMin
 // CHECK: FMax
+// CHECK: sampleLevel
 // CHECK: textureStore
 
 //

+ 1 - 1
tools/clang/test/HLSLFileCheck/samples/MiniEngine/GenerateHistogramCS.hlsl

@@ -3,8 +3,8 @@
 // CHECK: flattenedThreadIdInGroup
 // CHECK: threadId
 // CHECK: barrier
-// CHECK: textureLoad
 // CHECK: AtomicAdd
+// CHECK: textureLoad
 
 //
 // Copyright (c) Microsoft. All rights reserved.

+ 1 - 1
tools/clang/test/HLSLFileCheck/samples/d3d11/ComputeShaderSort11.hlsl

@@ -6,10 +6,10 @@
 // CHECK: addrspace(3)
 // CHECK: barrier
 // CHECK: addrspace(3)
-// CHECK: barrier
 // CHECK: addrspace(3)
 // CHECK: barrier
 // CHECK: addrspace(3)
+// CHECK: barrier
 // CHECK: bufferStore
 
 //--------------------------------------------------------------------------------------

+ 1 - 0
tools/clang/test/HLSLFileCheck/shader_targets/library/inout_struct_mismatch.hlsl

@@ -5,6 +5,7 @@
 // CHECK-NOT: bitcast
 // CHECK-NOT: CallStruct
 // CHECK: ParamStruct
+// CHECK: call void @llvm.lifetime.start
 // CHECK-NOT: bitcast
 // CHECK-NOT: CallStruct
 // CHECK-LABEL: ret <4 x float>

+ 5 - 1
tools/clang/test/HLSLFileCheck/shader_targets/library/lib_arg_flatten/lib_arg_flatten2.hlsl

@@ -1,6 +1,10 @@
-// RUN: %dxc -T lib_6_3 -auto-binding-space 11 -default-linkage external %s | FileCheck %s
+// RUN: %dxc -T lib_6_6 -auto-binding-space 11 -default-linkage external %s | FileCheck %s
 
 // Make sure no undef in test3.
+// NOTE: With the introduction of lifetime intrinsics, this test was moved to
+//       lib_6_6 since for earlier versions, the fallback mechanism would
+//       generate store undef instead of lifetime.end, which accidentally
+//       failed this test.
 // CHECK: define <4 x float>
 // CHECK: insertelement <2 x float> undef
 // CHECK: insertelement <4 x float> undef

+ 1 - 2
tools/clang/test/HLSLFileCheck/shader_targets/library/lib_arg_flatten/lib_empty_struct_arg.hlsl

@@ -1,4 +1,4 @@
-// RUN: %dxc -T lib_6_3 -auto-binding-space 11 -default-linkage external %s | FileCheck %s
+// RUN: %dxc -T lib_6_6 -auto-binding-space 11 -default-linkage external %s | FileCheck %s
 
 // Make sure calls with empty struct params are well-behaved
 
@@ -7,7 +7,6 @@
 // CHECK-NOT:load
 // CHECK-NOT:store
 // CHECK-DAG: call float @"\01?test@@YAMUT@@@Z"(%struct.T*
-// CHECK: ret float
 
 
 struct T {

+ 2 - 0
tools/clang/tools/dxcompiler/dxcompilerobj.cpp

@@ -1146,6 +1146,8 @@ public:
     compiler.getCodeGenOpts().HLSLPreciseOutputs = Opts.PreciseOutputs;
     compiler.getCodeGenOpts().MainFileName = pMainFile;
     compiler.getCodeGenOpts().HLSLPrintAfterAll = Opts.PrintAfterAll;
+    compiler.getCodeGenOpts().HLSLForceZeroStoreLifetimes = Opts.ForceZeroStoreLifetimes;
+    compiler.getCodeGenOpts().HLSLEnableLifetimeMarkers = Opts.EnableLifetimeMarkers;
 
     // Translate signature packing options
     if (Opts.PackPrefixStable)

+ 279 - 11
tools/clang/unittests/HLSL/ExecutionTest.cpp

@@ -291,6 +291,7 @@ public:
   TEST_METHOD(SaturateTest);
   TEST_METHOD(SignTest);
   TEST_METHOD(Int64Test);
+  TEST_METHOD(LifetimeIntrinsicTest);
   TEST_METHOD(WaveIntrinsicsTest);
   TEST_METHOD(WaveIntrinsicsDDITest);
   TEST_METHOD(WaveIntrinsicsInPSTest);
@@ -582,7 +583,7 @@ public:
   template <class Ty>
   const wchar_t* BasicShaderModelTest_GetFormatString();
                                       
-  void CompileFromText(LPCSTR pText, LPCWSTR pEntryPoint, LPCWSTR pTargetProfile, ID3DBlob **ppBlob) {
+  void CompileFromText(LPCSTR pText, LPCWSTR pEntryPoint, LPCWSTR pTargetProfile, ID3DBlob **ppBlob, LPCWSTR *pOptions = nullptr, int numOptions = 0) {
     VERIFY_SUCCEEDED(m_support.Initialize());
     CComPtr<IDxcCompiler> pCompiler;
     CComPtr<IDxcLibrary> pLibrary;
@@ -592,7 +593,7 @@ public:
     VERIFY_SUCCEEDED(m_support.CreateInstance(CLSID_DxcCompiler, &pCompiler));
     VERIFY_SUCCEEDED(m_support.CreateInstance(CLSID_DxcLibrary, &pLibrary));
     VERIFY_SUCCEEDED(pLibrary->CreateBlobWithEncodingFromPinned(pText, (UINT32)strlen(pText), CP_UTF8, &pTextBlob));
-    VERIFY_SUCCEEDED(pCompiler->Compile(pTextBlob, L"hlsl.hlsl", pEntryPoint, pTargetProfile, nullptr, 0, nullptr, 0, nullptr, &pResult));
+    VERIFY_SUCCEEDED(pCompiler->Compile(pTextBlob, L"hlsl.hlsl", pEntryPoint, pTargetProfile, pOptions, numOptions, nullptr, 0, nullptr, &pResult));
     VERIFY_SUCCEEDED(pResult->GetStatus(&resultCode));
     if (FAILED(resultCode)) {
       CComPtr<IDxcBlobEncoding> errors;
@@ -605,25 +606,29 @@ public:
     VERIFY_SUCCEEDED(pResult->GetResult((IDxcBlob **)ppBlob));
   }
 
-  void CreateComputeCommandQueue(ID3D12Device *pDevice, LPCWSTR pName, ID3D12CommandQueue **ppCommandQueue) {
+  void CreateCommandQueue(ID3D12Device *pDevice, LPCWSTR pName, ID3D12CommandQueue **ppCommandQueue, D3D12_COMMAND_LIST_TYPE type) {
     D3D12_COMMAND_QUEUE_DESC queueDesc = {};
     queueDesc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;
-    queueDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
+    queueDesc.Type = type;
     VERIFY_SUCCEEDED(pDevice->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(ppCommandQueue)));
     VERIFY_SUCCEEDED((*ppCommandQueue)->SetName(pName));
   }
 
-  void CreateComputePSO(ID3D12Device *pDevice, ID3D12RootSignature *pRootSignature, LPCSTR pShader, ID3D12PipelineState **ppComputeState) {
+  void CreateComputeCommandQueue(ID3D12Device *pDevice, LPCWSTR pName, ID3D12CommandQueue **ppCommandQueue) {
+    CreateCommandQueue(pDevice, pName, ppCommandQueue, D3D12_COMMAND_LIST_TYPE_COMPUTE);
+  }
+
+  void CreateComputePSO(ID3D12Device *pDevice, ID3D12RootSignature *pRootSignature, LPCSTR pShader, LPCWSTR pTargetProfile, ID3D12PipelineState **ppComputeState, LPCWSTR *pOptions = nullptr, int numOptions = 0) {
     CComPtr<ID3DBlob> pComputeShader;
 
     // Load and compile shaders.
     if (UseDxbc()) {
 #ifndef _HLK_CONF
-      DXBCFromText(pShader, L"main", L"cs_6_0", &pComputeShader);
+      DXBCFromText(pShader, L"main", pTargetProfile, &pComputeShader);
 #endif
     }
     else {
-      CompileFromText(pShader, L"main", L"cs_6_0", &pComputeShader);
+      CompileFromText(pShader, L"main", pTargetProfile, &pComputeShader, pOptions, numOptions);
     }
 
     // Describe and create the compute pipeline state object (PSO).
@@ -635,7 +640,8 @@ public:
   }
 
   bool CreateDevice(_COM_Outptr_ ID3D12Device **ppDevice,
-                    D3D_SHADER_MODEL testModel = D3D_SHADER_MODEL_6_0, bool skipUnsupported = true) {
+                    D3D_SHADER_MODEL testModel = D3D_SHADER_MODEL_6_0, bool skipUnsupported = true,
+                    bool enableRayTracing = false) {
     if (testModel > HIGHEST_SHADER_MODEL) {
       UINT minor = (UINT)testModel & 0x0f;
       LogCommentFmt(L"Installed SDK does not support "
@@ -647,7 +653,7 @@ public:
 
       return false;
     }
-    const D3D_FEATURE_LEVEL FeatureLevelRequired = D3D_FEATURE_LEVEL_11_0;
+    const D3D_FEATURE_LEVEL FeatureLevelRequired = enableRayTracing ? D3D_FEATURE_LEVEL_12_0 : D3D_FEATURE_LEVEL_11_0;
     CComPtr<IDXGIFactory4> factory;
     CComPtr<ID3D12Device> pDevice;
 
@@ -1099,6 +1105,11 @@ public:
   }
 
   void RunRWByteBufferComputeTest(ID3D12Device *pDevice, LPCSTR shader, std::vector<uint32_t> &values);
+  void RunLifetimeIntrinsicTest(ID3D12Device *pDevice, LPCSTR shader, D3D_SHADER_MODEL shaderModel, bool useLibTarget, LPCWSTR *pOptions, int numOptions, std::vector<uint32_t> &values);
+  void RunLifetimeIntrinsicComputeTest(ID3D12Device *pDevice, LPCSTR pShader, CComPtr<ID3D12DescriptorHeap>& pUavHeap, CComPtr<ID3D12RootSignature>& pRootSignature,
+                                       LPCWSTR pTargetProfile, LPCWSTR *pOptions, int numOptions, std::vector<uint32_t> &values);
+  void RunLifetimeIntrinsicLibTest(ID3D12Device5 *pDevice, LPCSTR pShader, CComPtr<ID3D12DescriptorHeap>& pUavHeap, CComPtr<ID3D12RootSignature>& pRootSignature,
+                                   LPCWSTR pTargetProfile, LPCWSTR *pOptions, int numOptions, std::vector<uint32_t> &values);
 
   void SetDescriptorHeap(ID3D12GraphicsCommandList *pCommandList, ID3D12DescriptorHeap *pHeap) {
     ID3D12DescriptorHeap *const pHeaps[1] = { pHeap };
@@ -1218,7 +1229,7 @@ void ExecutionTest::RunRWByteBufferComputeTest(ID3D12Device *pDevice, LPCSTR pSh
 
   // Create pipeline state object.
   CComPtr<ID3D12PipelineState> pComputeState;
-  CreateComputePSO(pDevice, pRootSignature, pShader, &pComputeState);
+  CreateComputePSO(pDevice, pRootSignature, pShader, L"cs_6_0", &pComputeState);
 
   // Create a command allocator and list for compute.
   VERIFY_SUCCEEDED(pDevice->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&pCommandAllocator)));
@@ -1272,6 +1283,263 @@ void ExecutionTest::RunRWByteBufferComputeTest(ID3D12Device *pDevice, LPCSTR pSh
   WaitForSignal(pCommandQueue, FO);
 }
 
+void ExecutionTest::RunLifetimeIntrinsicComputeTest(ID3D12Device *pDevice, LPCSTR pShader, CComPtr<ID3D12DescriptorHeap>& pUavHeap, CComPtr<ID3D12RootSignature>& pRootSignature,
+                                                    LPCWSTR pTargetProfile, LPCWSTR *pOptions, int numOptions, std::vector<uint32_t> &values) {
+  // Create command queue.
+  CComPtr<ID3D12CommandQueue> pCommandQueue;
+  CreateComputeCommandQueue(pDevice, L"RunLifetimeIntrinsicTest Command Queue", &pCommandQueue);
+
+  FenceObj FO;
+  InitFenceObj(pDevice, &FO);
+
+  // Compile shader "main" and create pipeline state object.
+  CComPtr<ID3D12PipelineState> pComputeState;
+  CreateComputePSO(pDevice, pRootSignature, pShader, pTargetProfile, &pComputeState, pOptions, numOptions);
+
+  // Create a command allocator and list for compute.
+  CComPtr<ID3D12CommandAllocator> pCommandAllocator;
+  CComPtr<ID3D12GraphicsCommandList> pCommandList;
+  VERIFY_SUCCEEDED(pDevice->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&pCommandAllocator)));
+  VERIFY_SUCCEEDED(pDevice->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE, pCommandAllocator, pComputeState, IID_PPV_ARGS(&pCommandList)));
+  pCommandList->SetName(L"ExecutionTest::RunLifetimeIntrinsicTest Command List");
+
+  // Set up UAV resource.
+  const UINT valueSizeInBytes = (UINT)values.size() * sizeof(uint32_t);
+  CComPtr<ID3D12Resource> pUavResource;
+  CComPtr<ID3D12Resource> pReadBuffer;
+  CComPtr<ID3D12Resource> pUploadResource;
+  CreateTestUavs(pDevice, pCommandList, values.data(), valueSizeInBytes, &pUavResource, &pReadBuffer, &pUploadResource);
+  VERIFY_SUCCEEDED(pUavResource->SetName(L"RunLifetimeIntrinsicTest UAV"));
+  VERIFY_SUCCEEDED(pReadBuffer->SetName(L"RunLifetimeIntrinsicTest UAV Read Buffer"));
+  VERIFY_SUCCEEDED(pUploadResource->SetName(L"RunLifetimeIntrinsicTest UAV Upload Buffer"));
+
+  // Close the command list and execute it to perform the GPU setup.
+  pCommandList->Close();
+  ExecuteCommandList(pCommandQueue, pCommandList);
+  WaitForSignal(pCommandQueue, FO);
+  VERIFY_SUCCEEDED(pCommandAllocator->Reset());
+  VERIFY_SUCCEEDED(pCommandList->Reset(pCommandAllocator, pComputeState));
+
+  // Run the compute shader and copy the results back to readable memory.
+  {
+    D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
+    uavDesc.Format = DXGI_FORMAT_R32_TYPELESS;
+    uavDesc.ViewDimension = D3D12_UAV_DIMENSION_BUFFER;
+    uavDesc.Buffer.FirstElement = 0;
+    uavDesc.Buffer.NumElements = (UINT)values.size();
+    uavDesc.Buffer.StructureByteStride = 0;
+    uavDesc.Buffer.CounterOffsetInBytes = 0;
+    uavDesc.Buffer.Flags = D3D12_BUFFER_UAV_FLAG_RAW;
+    CD3DX12_CPU_DESCRIPTOR_HANDLE uavHandle(pUavHeap->GetCPUDescriptorHandleForHeapStart());
+    CD3DX12_GPU_DESCRIPTOR_HANDLE uavHandleGpu(pUavHeap->GetGPUDescriptorHandleForHeapStart());
+    pDevice->CreateUnorderedAccessView(pUavResource, nullptr, &uavDesc, uavHandle);
+    SetDescriptorHeap(pCommandList, pUavHeap);
+    pCommandList->SetComputeRootSignature(pRootSignature);
+    pCommandList->SetComputeRootDescriptorTable(0, uavHandleGpu);
+  }
+
+  static const int DispatchGroupX = 1;
+  static const int DispatchGroupY = 1;
+  static const int DispatchGroupZ = 1;
+  pCommandList->Dispatch(DispatchGroupX, DispatchGroupY, DispatchGroupZ);
+  RecordTransitionBarrier(pCommandList, pUavResource, D3D12_RESOURCE_STATE_UNORDERED_ACCESS, D3D12_RESOURCE_STATE_COPY_SOURCE);
+  pCommandList->CopyResource(pReadBuffer, pUavResource);
+  pCommandList->Close();
+  ExecuteCommandList(pCommandQueue, pCommandList);
+  WaitForSignal(pCommandQueue, FO);
+  {
+    MappedData mappedData(pReadBuffer, valueSizeInBytes);
+    uint32_t *pData = (uint32_t *)mappedData.data();
+    memcpy(values.data(), pData, (size_t)valueSizeInBytes);
+  }
+  WaitForSignal(pCommandQueue, FO);
+}
+
+void ExecutionTest::RunLifetimeIntrinsicLibTest(ID3D12Device5 *pDevice, LPCSTR pShader, CComPtr<ID3D12DescriptorHeap>& pUavHeap, CComPtr<ID3D12RootSignature>& pRootSignature,
+                                                LPCWSTR pTargetProfile, LPCWSTR *pOptions, int numOptions, std::vector<uint32_t> &values) {
+  // Create command queue.
+  CComPtr<ID3D12CommandQueue> pCommandQueue;
+  CreateCommandQueue(pDevice, L"RunLifetimeIntrinsicTest Command Queue", &pCommandQueue, D3D12_COMMAND_LIST_TYPE_DIRECT);
+
+  FenceObj FO;
+  InitFenceObj(pDevice, &FO);
+
+  // Compile raygen shader.
+  CComPtr<ID3DBlob> pShaderLib;
+  CompileFromText(pShader, L"RayGen", pTargetProfile, &pShaderLib, pOptions, numOptions);
+
+  // Describe and create the RT pipeline state object (RTPSO).
+  CD3DX12_STATE_OBJECT_DESC stateObjectDesc(D3D12_STATE_OBJECT_TYPE_RAYTRACING_PIPELINE);
+  auto lib = stateObjectDesc.CreateSubobject<CD3DX12_DXIL_LIBRARY_SUBOBJECT>();
+  CD3DX12_SHADER_BYTECODE byteCode(pShaderLib);
+  lib->SetDXILLibrary(&byteCode);
+  lib->DefineExport(L"RayGen");
+
+  const int payloadCount = 4;
+  const int attributeCount = 2;
+  const int maxRecursion = 2;
+  stateObjectDesc.CreateSubobject<CD3DX12_RAYTRACING_SHADER_CONFIG_SUBOBJECT>()->Config(payloadCount * sizeof(float), attributeCount * sizeof(float));
+  stateObjectDesc.CreateSubobject<CD3DX12_RAYTRACING_PIPELINE_CONFIG_SUBOBJECT>()->Config(maxRecursion);
+
+  // Create a local root signature subobject and associate it with the shader.
+  auto localRootSigSubObj = stateObjectDesc.CreateSubobject<CD3DX12_LOCAL_ROOT_SIGNATURE_SUBOBJECT>();
+  localRootSigSubObj->SetRootSignature(pRootSignature);
+  auto x = stateObjectDesc.CreateSubobject<CD3DX12_SUBOBJECT_TO_EXPORTS_ASSOCIATION_SUBOBJECT>();
+  x->SetSubobjectToAssociate(*localRootSigSubObj);
+  x->AddExport(L"RayGen");
+
+  CComPtr<ID3D12StateObject> pStateObject;
+  VERIFY_SUCCEEDED(pDevice->CreateStateObject(stateObjectDesc, IID_PPV_ARGS(&pStateObject)));
+
+  // Create a command allocator and list.
+  CComPtr<ID3D12CommandAllocator> pCommandAllocator;
+  CComPtr<ID3D12GraphicsCommandList4> pCommandList;
+  VERIFY_SUCCEEDED(pDevice->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&pCommandAllocator)));
+  VERIFY_SUCCEEDED(pDevice->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT, pCommandAllocator, nullptr, IID_PPV_ARGS(&pCommandList)));
+  pCommandList->SetPipelineState1(pStateObject);
+  pCommandList->SetName(L"ExecutionTest::RunLifetimeIntrinsicTest Command List");
+
+  // Close the command list and execute it to kick off compilation in the driver.
+  // NOTE: We don't care about anything else, so we're not setting up any resources and don't actually execute the shader.
+  pCommandList->Close();
+  ExecuteCommandList(pCommandQueue, pCommandList);
+  WaitForSignal(pCommandQueue, FO);
+}
+
+void ExecutionTest::RunLifetimeIntrinsicTest(ID3D12Device *pDevice, LPCSTR pShader, D3D_SHADER_MODEL shaderModel, bool useLibTarget,
+                                             LPCWSTR *pOptions, int numOptions, std::vector<uint32_t> &values) {
+  LPCWSTR pTargetProfile;
+  switch (shaderModel) {
+      default: pTargetProfile = useLibTarget ? L"lib_6_3" : L"cs_6_0"; break; // Default to 6.3 for lib, 6.0 otherwise.
+      case D3D_SHADER_MODEL_6_0: pTargetProfile = useLibTarget ? L"lib_6_0" : L"cs_6_0"; break;
+      case D3D_SHADER_MODEL_6_3: pTargetProfile = useLibTarget ? L"lib_6_3" : L"cs_6_3"; break;
+      case D3D_SHADER_MODEL_6_5: pTargetProfile = useLibTarget ? L"lib_6_5" : L"cs_6_5"; break;
+      case D3D_SHADER_MODEL_6_6: pTargetProfile = useLibTarget ? L"lib_6_6" : L"cs_6_6"; break;
+  }
+
+  // Describe a UAV descriptor heap.
+  D3D12_DESCRIPTOR_HEAP_DESC heapDesc = {};
+  heapDesc.NumDescriptors = 1;
+  heapDesc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
+  heapDesc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
+
+  // Create the UAV descriptor heap.
+  CComPtr<ID3D12DescriptorHeap> pUavHeap;
+  VERIFY_SUCCEEDED(pDevice->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&pUavHeap)));
+
+  // Create root signature.
+  CComPtr<ID3D12RootSignature> pRootSignature;
+  {
+    CD3DX12_DESCRIPTOR_RANGE ranges[1];
+    ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, 0, 0, 0);
+
+    CD3DX12_ROOT_PARAMETER rootParameters[1];
+    rootParameters[0].InitAsDescriptorTable(1, &ranges[0], D3D12_SHADER_VISIBILITY_ALL);
+
+    CD3DX12_ROOT_SIGNATURE_DESC rootSignatureDesc;
+    D3D12_ROOT_SIGNATURE_FLAGS rootSigFlag = useLibTarget ? D3D12_ROOT_SIGNATURE_FLAG_LOCAL_ROOT_SIGNATURE : D3D12_ROOT_SIGNATURE_FLAG_NONE;
+    rootSignatureDesc.Init(_countof(rootParameters), rootParameters, 0, nullptr, rootSigFlag);
+
+    CreateRootSignatureFromDesc(pDevice, &rootSignatureDesc, &pRootSignature);
+  }
+
+  if (useLibTarget)
+    RunLifetimeIntrinsicLibTest(reinterpret_cast<ID3D12Device5*>(pDevice), pShader, pUavHeap, pRootSignature, pTargetProfile, pOptions, numOptions, values);
+  else
+    RunLifetimeIntrinsicComputeTest(pDevice, pShader, pUavHeap, pRootSignature, pTargetProfile, pOptions, numOptions, values);
+}
+
+TEST_F(ExecutionTest, LifetimeIntrinsicTest) {
+  // The only thing we test here is that the presence of lifetime intrinsics
+  // or their fallback replacements (a store of undef or of zeroinitializer)
+  // does not cause any issues in the runtime and driver stack.
+  // The easiest way to force placement of the intrinsics is to create an
+  // array in a local scope that is dynamically indexed. The array must not be
+  // optimized away, so we perform some bogus initialization that prevents
+  // this. Since all of this code is guarded by a conditional that is
+  // dynamically always false, the net effect of the shader is that the same
+  // value that was read is written back.
+  static const char* pShader = R"(
+    RWByteAddressBuffer g_bab : register(u0);
+
+    void fn(uint GI) {
+      const uint addr = GI * 4;
+      const int val = g_bab.Load(addr);
+      int res = val;
+      if (val < 0) { // Never true.
+        int arr[200];
+        for (int i = 0; i < 200; ++i) {
+            arr[i] = arr[val - i];
+        }
+        res += arr[val];
+      }
+      g_bab.Store(addr, (uint)res);
+    }
+
+    [numthreads(8,8,1)]
+    void main(uint GI : SV_GroupIndex) {
+      fn(GI);
+    }
+
+    [shader("raygeneration")]
+    void RayGen() {
+      const uint d = DispatchRaysIndex().x;
+      const uint g = d > 64 ? 63 : d;
+      fn(g);
+    }
+  )";
+  static const int NumThreadsX = 8;
+  static const int NumThreadsY = 8;
+  static const int NumThreadsZ = 1;
+  static const int ThreadsPerGroup = NumThreadsX * NumThreadsY * NumThreadsZ;
+  static const int DispatchGroupCount = 1;
+
+  // TODO: A lot of the rest of this test could probably be stripped away.
+
+  CComPtr<ID3D12Device5> pDevice;
+  if (!CreateDevice(reinterpret_cast<ID3D12Device**>(&pDevice), D3D_SHADER_MODEL_6_5, true, true)) // TODO: We need 6.6!
+    return;
+
+  std::vector<uint32_t> values;
+  SetupComputeValuePattern(values, ThreadsPerGroup * DispatchGroupCount);
+
+  // Run a number of tests for different configurations that will cause
+  // lifetime intrinsics to be placed directly, be replaced by a zeroinitializer
+  // store, or be replaced by an undef store.
+  LPCWSTR pOptions15[] = {L"/validator-version 1.5"};
+  LPCWSTR pOptions16[] = {L"/validator-version 1.6", L"/Vd"};
+
+  VERIFY_ARE_EQUAL(values[1], (uint32_t)1);
+
+  // Test regular shader with zeroinitializer store.
+  RunLifetimeIntrinsicTest(pDevice, pShader, D3D_SHADER_MODEL_6_0, false, pOptions15, _countof(pOptions15), values);
+  VERIFY_ARE_EQUAL(values[1], (uint32_t)1);
+
+  // Test library with zeroinitializer store.
+  RunLifetimeIntrinsicTest(pDevice, pShader, D3D_SHADER_MODEL_6_3, true, pOptions15, _countof(pOptions15), values);
+  VERIFY_ARE_EQUAL(values[1], (uint32_t)1);
+
+  // Testing SM 6.6 and validator version 1.6 requires experimental shaders
+  // to be enabled.
+  if (!m_ExperimentalModeEnabled)
+      return;
+
+  // Test regular shader with undef store.
+  RunLifetimeIntrinsicTest(pDevice, pShader, D3D_SHADER_MODEL_6_0, false, pOptions16, _countof(pOptions16), values);
+  VERIFY_ARE_EQUAL(values[1], (uint32_t)1);
+
+  // Test library with undef store.
+  RunLifetimeIntrinsicTest(pDevice, pShader, D3D_SHADER_MODEL_6_3, true, pOptions16, _countof(pOptions16), values);
+  VERIFY_ARE_EQUAL(values[1], (uint32_t)1);
+
+  // Test regular shader with lifetime intrinsics.
+  RunLifetimeIntrinsicTest(pDevice, pShader, D3D_SHADER_MODEL_6_5, false, pOptions16, _countof(pOptions16), values); // TODO: Test 6.6 here!
+  VERIFY_ARE_EQUAL(values[1], (uint32_t)1);
+
+  // Test library with lifetime intrinsics.
+  RunLifetimeIntrinsicTest(pDevice, pShader, D3D_SHADER_MODEL_6_5, true, pOptions16, _countof(pOptions16), values); // TODO: Test 6.6 here!
+  VERIFY_ARE_EQUAL(values[1], (uint32_t)1);
+}
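For reference, the pattern this test is built around looks roughly like the following LLVM IR. This is a hand-written sketch of the markers the front end is expected to emit around the conditionally-used local array, not actual compiler output; all names are illustrative. When the markers are stripped for backwards compatibility, the `lifetime.end` call is replaced by a store of undef (validator 1.6) or of zeroinitializer (validator < 1.6), as described in the change summary.

```llvm
; Illustrative sketch only: lifetime markers around a local array whose
; uses are guarded by a dynamically-false condition.
define void @fn(i32 %GI) {
entry:
  %arr = alloca [200 x i32], align 4
  br i1 %cond, label %if.then, label %if.end

if.then:
  %p = bitcast [200 x i32]* %arr to i8*
  call void @llvm.lifetime.start(i64 800, i8* %p)  ; 200 x 4 bytes
  ; ... loop that initializes and reads %arr ...
  call void @llvm.lifetime.end(i64 800, i8* %p)
  br label %if.end

if.end:
  ret void
}
```

Without the markers, promoting `%arr` to SSA after inlining would force its value to be carried across iterations and across the caller's whole body, which is exactly the pessimization the test guards against.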
+
 TEST_F(ExecutionTest, BasicComputeTest) {
 #ifndef _HLK_CONF
   //
@@ -1700,7 +1968,7 @@ TEST_F(ExecutionTest, WaveIntrinsicsTest) {
 
   // Create pipeline state object.
   CComPtr<ID3D12PipelineState> pComputeState;
-  CreateComputePSO(pDevice, pRootSignature, pShader, &pComputeState);
+  CreateComputePSO(pDevice, pRootSignature, pShader, L"cs_6_0", &pComputeState);
 
   // Create a command allocator and list for compute.
   VERIFY_SUCCEEDED(pDevice->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&pCommandAllocator)));