OK, really slow first run (with tons of cpu usage), really fast second run, definitely sounds like a compilation issue.
NOTE: Some of the following may be buried in those stacktraces ... I'll leave that for Bret to ascertain :-)
-----
optimization timeout limit (default 10)
sproc optimize timeout limit (default 40)
Would be interesting to see what you've got for these 2 config settings. [Both represent a percentage of *estimated* execution time - query vs sproc - at which point the optimization phase is aborted and the 'best' plan at that point is used.]
If either is set higher than the default values (or worse, set to 0 == no timeout), then it would be interesting to see what happens if you set these parameters back to their defaults and run the test again. [Obviously non-zero settings cause the optimizer to quit formulating query plans ... possibly before the best query plan is found ... while a high (or zero) setting lets the optimizer keep grinding through plans, which would fit the cpu-heavy first run.]
If you are running with the defaults then there's an issue with what the optimizer is using as the *estimated* execution time.
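Something along these lines should show what the two settings are currently set to, and put them back to the defaults if need be. Treat it as a sketch: the parameter names and default values are the ones listed above, and I'm assuming you have the sa/sso privileges needed to change config settings.

```sql
-- Display the current (and default) values of the two timeout settings
exec sp_configure "optimization timeout limit"
go
exec sp_configure "sproc optimize timeout limit"
go

-- If either differs from the defaults quoted above (10 and 40), reset and retest
exec sp_configure "optimization timeout limit", 10
go
exec sp_configure "sproc optimize timeout limit", 40
go
```

Remember to rerun the proc 'with recompile' afterwards, otherwise you'll just be reusing the plan that's already sitting in procedure cache.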
optimization goal
The next item I'd look at is your default optimization goal.
Are you running the same optimization goal in both the 15.0.3 and 15.7 dataservers? If not, what happens if you run your 15.7 test under the same optimization goal as is in use in 15.0.3? [NOTE: You can either change the optgoal at the sp_configure level or at the session level via 'set plan optgoal <optgoal>'.]
Hmmm ... is the 15.0.3 version of the proc by any chance running under the basic_optimization goal, or perhaps in (12.5) compatibility mode?
allrows_dss and allrows_mix require the optimizer to consider several additional features/options that aren't considered with allrows_oltp.
If you're not running allrows_oltp, would be interesting to see what happens if you force the proc to run under allrows_oltp (to eliminate some of the options/features the optimizer has to work through). [set plan optgoal allrows_oltp; go; exec <proc> with recompile]
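A rough sketch of how I'd check (and then override) the optimization goal on each dataserver. Note that 'my_proc' is just a placeholder for your proc, and I'm assuming your 15.0.3 build is recent enough to have the 'enable compatibility mode' config parameter; if sp_configure doesn't recognize it, just skip that step.

```sql
-- Server-wide default optimization goal
exec sp_configure "optimization goal"
go

-- Optimization goal in effect for the current session
select @@optgoal
go

-- Is compatibility mode enabled? (assumption: parameter exists in your build)
exec sp_configure "enable compatibility mode"
go

-- Override the goal for this session only, then force a fresh compile
set plan optgoal allrows_oltp
go
exec my_proc with recompile   -- 'my_proc' is a hypothetical name
go
```

Run the first three checks on both the 15.0.3 and 15.7 dataservers and compare the results before changing anything.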
process vs threaded kernel mode
I haven't attempted to tune queries by changing the kernel mode. I'm not saying it won't help (to switch back to process mode), I just can't provide any input ... one way or the other ... regarding the effect of the kernel mode on the optimizer.
simplification of queries
Assuming the 15.0.3 and 15.7 dataservers are configured similarly (config settings, memory settings, cache settings, etc), and the metadata is comparable (table/column/index structures, availability and value of statistics, etc), I'm probably going to fall back on some tuning basics ...
- datatype mismatches
- query complexity (to include ambiguity, eg, non-ANSI 'group by' statements) [It's been my experience that if you have problems understanding the logic of a query, then chances are the optimizer will, too]
- if there are multiple queries then try to isolate each query (eg, place in separate child procs) to see if there's one particular query that's eating up most of the compilation time
- if you can track most of the time to a particularly complex query ... see if it's possible to simplify the query in order to reduce the optimizer's workload; this could entail breaking a large multi-UNION(all) query into standalone queries; this could include pulling a derived table out into a separate #table, etc
- consider having the optimizer spit out the associated abstract query plans (AQPs) for the proc's queries (set option show_abstract_plan on); you could then hardcode those AQPs into the proc thus eliminating the need for the optimizer to process the queries [NOTE: If you haven't used AQPs ... they're quite powerful but poorly documented, and will likely be hard/impossible to understand for the next person that attempts to modify the proc; you've been warned ;-) ]
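To put some numbers on the above, here's a minimal sketch for measuring compile time and dumping the abstract plans in a single run. Again 'my_proc' is a placeholder, and the dbcc traceon(3604) is an assumption on my part: depending on your version you may or may not need it to get the 'set option show_*' output routed back to your session.

```sql
-- Report parse/compile time separately from execution time
set statistics time on
go

-- Send diagnostic output to this session (may not be needed on your version)
dbcc traceon(3604)
go

-- Have the optimizer print the abstract plan for each query it compiles
set option show_abstract_plan on
go

-- Force a fresh compile so there's something to measure
exec my_proc with recompile   -- 'my_proc' is a hypothetical name
go

-- Clean up the session settings
set option show_abstract_plan off
go
set statistics time off
go
```

If one statement stands out in the compile-time numbers, that's the query to simplify (or pin with an AQP) first.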
While I would expect the 15.7 optimizer to perform better/faster than the 15.0.3 optimizer, I wouldn't be surprised to see a degradation if a) you were running in compatibility mode in 15.0.3 but allrows_* in 15.7 or b) you've hit a bug/limitation of the newer 15.7 optimizer.