Let's start from a few basics and then get into the nitty-gritty a bit. The difference between threaded mode and process mode can be slight - a lot depends on the application, use of engine groups/thread pools, workload factors, etc. The big differences are:
1 - In process mode, the OS kernel doesn't realize that the engines are all part of the same parent process and consequently schedules them however it sees fit. In threaded mode, the OS kernel knows that the engines are all threads of the same parent process and can more evenly distribute the workload across the available cpu cores/threads. For example, on Linux (x86/64), in process mode, you can simulate threaded mode workload distribution/performance by binding (cpubind) the engines to every odd virtual cpu so the engines span the cores more evenly (see the affinity sketch after this list). With threaded mode, you don't have to do the cpubind - the OS automatically distributes the workload better (and this was the design goal of threaded kernel - to allow the OS/HW to better manage workload distribution).
2 - In process mode, we have to do tcp socket migration to a network engine since each engine is a separate process. The spid (in ASE) can run on any engine (within whatever restrictions the engine groups impose), but whenever it needs to do network IO, we have to put the task to sleep and wake it up on the network engine. In threaded mode, there is only a single process - so no more tcp socket migration. Instead there are network task(s) that manage the network IO. Again, how much difference this makes depends on how often tasks are put to sleep waiting to run on their network engine.
3 - IO polling has been moved to separate threads vs. every engine doing disk/network polling. How much difference this makes depends on the amount of IO being done.
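To make the cpubind idea in point 1 concrete, here is a minimal sketch (Python, Linux only) of binding each engine process to an odd-numbered virtual cpu. The engine PIDs below are placeholders - in practice you would collect them from ps/pgrep on the dataserver processes, and you may prefer taskset/numactl or ASE's own affinity options instead.

```python
import os

# Hypothetical list of ASE engine process ids (process-mode kernel: one PID per engine).
engine_pids = [12345, 12346, 12347, 12348]

# Odd-numbered virtual cpus on this host, as Linux numbers them (0, 1, 2, ...).
odd_vcpus = [c for c in range(os.cpu_count()) if c % 2 == 1]

for i, pid in enumerate(engine_pids):
    vcpu = odd_vcpus[i % len(odd_vcpus)]
    os.sched_setaffinity(pid, {vcpu})   # pin this engine process to a single virtual cpu
    print(f"engine pid {pid} bound to vcpu {vcpu}")
```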
Earlier, you posed a question about threading vs. engines - it is a question that has been asked/answered a number of times. Unfortunately, the answer depends on which platform you are on and how much actual disk IO you are doing. I think you are on Linux (x86/64), in which case, if you have HT enabled, we do not recommend exceeding the number of cores + 50% (so a 32 core/64 thread host would support a max of 48 engines). Disk & network IO threads are not counted against that number, nor is the blocking pool (as it is usually the IO that needs the blocking calls). However, any user-created thread pools *do* count and are also counted against the 'maximum online engines' configuration.
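As a quick worked example of that sizing rule of thumb (a starting point, not a hard limit):

```python
def max_recommended_engines(physical_cores: int, hyperthreading: bool = True) -> int:
    """Rule of thumb from above for Linux x86/64: with HT enabled, don't exceed
    cores + 50%. Disk/network IO threads and the blocking pool don't count
    against this; user-created thread pools do."""
    return int(physical_cores * 1.5) if hyperthreading else physical_cores

print(max_recommended_engines(32))   # 48 engines on a 32 core / 64 thread host
```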
Another major difference between threaded and process mode is how they use CPU resources. Under process mode, the scheduler used the runnable process search count (rpsc - default was 2000): if we spun through the search loop rpsc times and didn't have any outstanding disk IOs, we would yield the cpu. CPU utilization times were somewhat approximated, loosely based on cpu clock times in milliseconds. However, if you remember, we had no real insight into the actual OS cpu utilization and could really only report how much of the cpu that ASE got was spent on IO, idle or user tasks. Soooo.....100% in ASE could be 50% in the OS.
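If it helps, here is a toy model of that process-mode scheduler loop - not ASE source, just a sketch of the idea, with a stand-in run queue and IO counter:

```python
import time
from collections import deque

run_queue = deque()          # runnable tasks (callables) - normally fed by ASE itself
outstanding_disk_ios = 0     # pretend count of in-flight disk IOs

def engine_loop(rpsc: int = 2000) -> None:
    """Simplified process-mode engine loop: spin looking for runnable work, and
    only yield the CPU after rpsc empty searches with no disk IO outstanding."""
    searches = 0
    while True:
        if run_queue:
            task = run_queue.popleft()
            task()                   # run the task up to its next yield point
            searches = 0             # found work, so restart the search count
        else:
            searches += 1
            if searches >= rpsc and outstanding_disk_ios == 0:
                time.sleep(0)        # stand-in for yielding the CPU back to the OS
                searches = 0
```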
In threaded mode, we actually started counting the number of clock ticks the threads were active and switched to microseconds to estimate the clock time for the ticks. The most accurate measurement of ASE cpu utilization comes from monThread with the various *Ticks columns. Internally we then estimate the actual times and report them in percentages as before - the net effect is that monEngine (and sp_sysmon) can report different CPU utilization for the exact same workload when run under the two different kernel modes. As for me, in threaded kernel, I tend to use the *Ticks columns for reporting CPU usage.
For example, in the above sp_sysmon - your engine utilization is showing ~70% average for User, ~14% IO and ~15% Idle.... but that is 70% of the ~37% thread utilization. Think of it this way....you have an idle timeout of 100 (100 microseconds)...so the thread is going to run for 100 microseconds. Of that time, it really is only active for 37 microseconds and the other 63 microseconds it is idle. Of the 37 active microseconds, 70% (26us) is spent on user tasks, 14% (5us) is spent on IO and 15% (6us) is idle time within the spids. Does that make sense???
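In code form, the arithmetic from that paragraph looks like this (the numbers are the ones quoted above, rounded):

```python
idle_timeout_us = 100     # thread runs/polls for 100 microseconds before sleeping
thread_util     = 0.37    # ~37% thread (OS-level) utilization from the *Ticks columns

active_us = idle_timeout_us * thread_util   # ~37 us of the 100 us is actually active
idle_us   = idle_timeout_us - active_us     # ~63 us the thread is idle

user_us      = active_us * 0.70   # ~26 us on user tasks        (sp_sysmon "User" ~70%)
io_us        = active_us * 0.14   # ~5 us on IO                 (sp_sysmon "IO"   ~14%)
spid_idle_us = active_us * 0.15   # ~6 us idle within the spids (sp_sysmon "Idle" ~15%)

print(active_us, idle_us, user_us, io_us, spid_idle_us)
```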
Then we get to the interrupted sleeps....after the 100 microseconds, the thread is put to sleep. When an IO completes (e.g. a disk IO), the OS is forced to wake up the thread - which is an interrupted sleep. Generally, we don't like to see a lot of interrupted sleeps, and consistent quantities of them likely indicate that you need to increase the idle timeout.....A better starting number for OLTP might be 250 instead of 100....but only if you can keep the engines busy....in your case, your engines/threads aren't very busy.....so scaling them back a tad might be more useful (e.g. drop to 60 engines vs. 70+).
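Again as a toy model only (not the actual kernel code), the idle timeout / interrupted sleep behaviour looks roughly like this:

```python
import threading
import time

io_completed = threading.Event()    # a (simulated) IO completion sets this

def engine_thread(idle_timeout_us: int = 100) -> None:
    """Simplified threaded-kernel engine: poll for work for idle_timeout
    microseconds, then go to sleep; an IO completion that forces the OS to
    wake the sleeping thread is what shows up as an interrupted sleep."""
    while True:
        deadline = time.monotonic() + idle_timeout_us / 1_000_000
        while time.monotonic() < deadline:
            pass                                       # would poll the run queue here
        interrupted = io_completed.wait(timeout=1.0)   # sleep until woken or timed out
        if interrupted:
            io_completed.clear()                       # this wake-up is an interrupted sleep
```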
A big question that comes up is how many disk/network controllers are needed (syb_system_pool). A rule of thumb I often use as a starting point is that you need 1 network controller per every 4 engines on x86/64 - but this is a bit of a fuzzy number. In reality, what you want to watch is the idleTicks in monThread for the threads in monIOController. If you see the network or disk tasks starting to get much above 70% utilization (i.e. below ~30% idle ticks) then you might want to consider adding one.

I would be a little more careful adding disk controllers - a lot of the issue there is OS dependent/other limitations that adding disk controllers is likely not going to overcome. For example, in a recent discussion vis-à-vis the Solaris ZFS volume manager (zvol) it was noted that the default number of concurrent (outstanding) IOs per zvol was 10. The more devices in the volume, however, the quicker the zvol manager could pass off the IO - so while the setting was tunable, the first thing to do was to increase the number of devices and also increase the limit. This is also true of Linux, where the number of IOs per device defaults to 128 (and we often suggest increasing this to 1024). IOs beyond these limits are held in an OS queue (e.g. the max AIO OS kernel setting sets the size of the AIO queue) until it fills, at which point the backlog spills back into ASE. The point being that it is extremely easy for even a single disk task to saturate the IO subsystem - adding a second is not likely going to help. It is also helpful to understand that the engines still submit the disk IOs - the disk task simply does the IO poll and alerts the spids that IO has completed.

On the other hand, network tasks seem to do a better job of offloading work from each other (you can measure ms/wait on waitevent 251 as a key consideration) - especially if the OS kernel has been tuned appropriately. However, even with network controller threads you have to tune and understand the OS....for example, Solaris 10 doesn't support Receive Side Scaling (RSS) while Solaris 11 does....what does this mean - it means that under Solaris 10, all network (and disk IO) interrupts are handled by socket 0 - which translates to the fact that while multiple network tasks are likely necessary, there will be significant limits under Solaris 10 as to how well they scale vs. Solaris 11. BTW - most later (e.g. RHEL 6.2+) Linux kernels support RSS.
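A trivial way to express that rule of thumb (the 1-per-4 ratio and the ~30% idle ticks threshold are the starting points I use, not hard numbers):

```python
def consider_adding_network_controller(engines: int,
                                       net_controllers: int,
                                       controller_idle_tick_pct: float) -> bool:
    """Rough starting point: ~1 network controller per 4 engines on x86/64, and
    think about adding one once a controller drops much below ~30% idle ticks
    (i.e. runs above ~70% busy) in monThread for the monIOController threads."""
    too_few  = net_controllers < max(1, engines // 4)
    too_busy = controller_idle_tick_pct < 30.0
    return too_few or too_busy

# Example: 60 engines, 15 network controllers, but only 25% idle ticks -> consider adding one
print(consider_adding_network_controller(60, 15, 25.0))   # True
```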
BTW - a key consideration is that while we still do essential QA under process kernel, many of the new features/aspects are really stress tested/heavily QA'd under threaded kernel. For example, there is some consideration as to whether the new CI interface for RepAgent with synchronous rep will even be supported under process kernel - certainly for the beta we only support it under threaded kernel. In addition, we only support threaded kernel for SAP apps - so it gets a lot more attention (and additional testing in that regard).....sooooo.....even if you don't see any difference in production, sticking with process kernel may cause significant issues/limitations in the future.
As with MDA, if you want a nice pretty book answer - read the books. If you want (instead) my recommendations on what to do and what really works.....I will tell you. If you don't like the answer - sorry - but it is what works based on the customers I interact with. YMMV.