Is this delay seen if the procedure is executed a second time fairly soon?
Since it sounds like the problem was constant while "with recompile" was on, I'm wondering whether a good deal of the time is being spent compiling the procedure's query plan. That wouldn't have to be done on a second execution unless the plan had subsequently aged out of the procedure cache.
Does the proc contain any obviously complicated queries (for example, any single query with a lot of joins)?
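One way to check whether compilation accounts for the delay is to compare timings across back-to-back executions. This is only a rough sketch: my_slow_proc is a made-up name standing in for your procedure, and any parameters it takes are left out. With statistics time on, the server reports parse and compile time separately from execution time, so a large compile figure on the first run that shrinks on the second would point at plan compilation.

```sql
-- my_slow_proc is a hypothetical name; substitute the real procedure
set statistics time on
go
-- first run: a plan may have to be compiled if none is in procedure cache
exec my_slow_proc
go
-- second run, right away: should be able to reuse the cached plan
exec my_slow_proc
go
-- forced recompile, for comparison with the two runs above
exec my_slow_proc with recompile
go
set statistics time off
go
```

If the second run is as slow as the first, compile time is probably not the main factor.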
Executing dbcc stacktrace against the spid a few times while it is in this state may give some insight into what it is doing. Switch 3604 sends the dbcc output to your own session.
-- route dbcc output to this client session
set switch on 3604
go
-- repeat a few times while the spid appears stuck
dbcc stacktrace(<spid>)
go
-bret