Performance Optimization: Rules of Engagement

by Marcin Kaluza 17. August 2009 08:24

On my way to work from London’s Waterloo Station one day I noticed a building on Southwark St which intrigued me: “Kirkaldy’s Testing and Experimenting Works”. As it turns out there is a rather fascinating industrial history behind the building, but it would not be worth mentioning here were it not for the motto above the entrance: “Facts Not Opinions”. I can hardly imagine an area of software development where the motto would be more applicable than performance engineering. You see, far too many problems with performance come from the fact that we spend our time and resources “optimizing” code in areas which do not need optimization at all. We oftentimes do it because “everybody knows that you should do XYZ” or because we want to mitigate perceived performance risks by taking “proactive” action (aka premature optimization). If we were to follow the mantra of Mr Kirkaldy, we could avoid all of the above by doing just one thing: testing and measuring (and perhaps experimenting). So if you were to stop reading right now, please take this no. 1 rule of performance optimization with you: measure first. Measuring is not only important when fixing code: it is also vital if you want to evaluate the risk of a potential design approach. So instead of doing “XYZ because everybody knows we should”, whack together a quick prototype and take it for a spin in a profiler.


One of my favourite performance myths is that you should “always cache WCF service proxies because they are expensive to create” (and of course everybody knows that). As I have heard this technique specifically mentioned in the context of an ASP.NET web app running in IIS, I could immediately hear alarm bells ringing for miles… The problems with sharing proxies between IIS sessions/threads are numerous, but I will not bother you with the details here; my main doubt was whether a WCF proxy can be shared efficiently between multiple threads executing methods on it at the same time. So I created a simple WCF service with one method simulating a 5 second wait. I set the instance mode to “per call” and then started calling the service from 5 threads on the client side, with the same proxy shared between all of them. I used a ManualResetEvent to start the threads simultaneously and expected them to finish 5 seconds later (give or take a millisecond or two). Guess what: they did not, as they blocked each other on some of the WCF internals and the whole process took 20 seconds instead of 5. So now imagine what would happen if you used this approach on a busy website: your “clients” would effectively be queuing to get access to the WCF service and you would end up with a potentially massive scalability issue. To make things worse, creating WCF proxies is nowadays relatively cheap (provided that you know how to do it efficiently). The moral of the story is simple: when in doubt, measure. Do not apply performance “optimisations” blindly simply because everybody knows that you should…
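The shape of that experiment can be sketched without WCF at all. The snippet below is only an illustration under stated assumptions: FakeProxy is a hypothetical stand-in (not a WCF type) whose lock models the blocking observed in the test, not the actual WCF internals, and the 5 second call is scaled down to milliseconds so the sketch runs quickly. What it does show is the measuring technique: release all threads at once with a ManualResetEvent and time the whole batch with a Stopwatch.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical stand-in for a service proxy; NOT a WCF type.
// Each instance lets only one call through at a time (an assumed model of the blocking).
public sealed class FakeProxy
{
    private readonly object _gate = new object();

    // Simulates a slow service call by sleeping while holding the instance lock.
    public void SlowCall(int milliseconds)
    {
        lock (_gate)
        {
            Thread.Sleep(milliseconds);
        }
    }
}

public static class SharedProxyExperiment
{
    // Starts threadCount threads at the same moment and returns the
    // wall-clock time until the last one has finished its call.
    public static TimeSpan Measure(Func<FakeProxy> proxyForThread, int threadCount, int callMilliseconds)
    {
        using (var startSignal = new ManualResetEvent(false))
        {
            var threads = new Thread[threadCount];
            for (int i = 0; i < threadCount; i++)
            {
                FakeProxy proxy = proxyForThread();
                threads[i] = new Thread(() =>
                {
                    startSignal.WaitOne();          // wait for the common start
                    proxy.SlowCall(callMilliseconds);
                });
                threads[i].Start();
            }

            var stopwatch = Stopwatch.StartNew();
            startSignal.Set();                      // release all threads together
            foreach (var thread in threads)
            {
                thread.Join();
            }
            stopwatch.Stop();
            return stopwatch.Elapsed;
        }
    }
}
```

Passing a factory that always returns the same FakeProxy instance reproduces the queuing effect (the calls run one after another), while passing `() => new FakeProxy()` lets the calls overlap. Against a real service you would swap the stand-in for the real proxy and keep the timing harness as it is.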

As good and beneficial as performance “measuring” can be, when doing so you may often come across a phenomenon known in physics as the observer effect (and popularly associated with Schrödinger's cat). To put it simply, by measuring you may (and most likely will) influence the value being measured. It is important to mention it here as profiling a live system may become infeasible simply because it would slow it down to an unacceptable level. The level of performance degradation may vary from several percent (in the case of tracing SQL being executed using SQL Profiler) to several hundred percent when using a code profiler. Keep that in mind when testing your software, as this once again illustrates that it is far better to do performance testing in development than to fight problems in production, where your ability to measure may be seriously hampered.

One of the funniest performance bugs I have ever come across was caused by a “tracing infrastructure” which, strangely enough, took an extreme amount of time to do its job. As it turned out, someone decided that it would be great to produce output in XML so that it could be processed later in a more structured way than plain text. The only problem was that the XmlSerializer used to create this output was created every time anyone tried to produce some trace output. In comparison with WCF proxies, XmlSerializers are extremely expensive to create, and this obviously had a detrimental impact on any application using tracing extensively. I find it rather amusing as tracing is one of the basic tools which can help you measure performance, as long, of course, as it does not influence it too much… :)
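The usual remedy for this kind of bug is to create the serializer once and reuse it. A minimal sketch, assuming a hypothetical TraceEntry record and XmlTrace helper (neither appears in the original tracing code, which is not shown here):

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical trace record; the real type is not shown in the post.
public class TraceEntry
{
    public DateTime Timestamp { get; set; }
    public string Message { get; set; }
}

public static class XmlTrace
{
    // Created once per process and reused for every trace call,
    // instead of constructing a new XmlSerializer on each call.
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(TraceEntry));

    // Serializes one entry to an XML string using the shared serializer.
    public static string Format(TraceEntry entry)
    {
        using (var writer = new StringWriter())
        {
            Serializer.Serialize(writer, entry);
            return writer.ToString();
        }
    }
}
```

Whether this is enough in a given application is, in the spirit of this post, something to confirm with a profiler rather than assume.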

If there is one thing which is certain about software performance though, it is the fact that you can take pretty much any piece of code and make it run faster. For starters, if you do not like managed code and the overheads of JIT and garbage collection you can go unmanaged and rewrite the piece in, say, C/C++. You could take it further and perhaps go down to assembler. Still not fast enough? So how about assembler optimised for a particular processor, making use of its unique features? Or maybe offload some of the workload to the GPU? I could go along these lines for quite a while, but the truth is that every step you make along this route is exponentially more expensive and at a certain point you will make very little progress for a lot of someone’s money. So the next golden rule of performance optimisation is: make it only as fast as it needs to be (keep it cheap). This rule more or less eliminates vague performance requirements along the lines of “the site is slow” or “make the app faster please”. In order to tackle any performance problem, the statement of it has to be more precise, more along the lines of “the process of submitting an order is taking 15 seconds server-side and we want it to take no more than 3 seconds under an average load of 250 orders/minute”. In other words you have to know exactly what the problem is and what the acceptance criteria are before you start any work.

I have to admit here that oftentimes I am tasked with “just sorting this out” when the performance of a particular part of the application becomes simply unacceptable from the user’s perspective. Lack of clear performance expectations in such cases is perhaps understandable: it is quite difficult to expect the end user to state that “opening a document should take 1547ms per MB of content”. Other than this, the acceptability will depend on how often the task has to be performed, how quickly the user needs it done etc. So sometimes you just have to take the user through an iterative process which stops when they say “yeah, that’s pretty good, I can live with that”.

So say that you have a clear problem statement and agreed expectations, you fire up a profiler and method X() comes up on top of the list consuming 90% of the time. It would be easy to assume that all we have to do now is to somehow optimise X(), but surprisingly this would probably be… a mistake! Rule no. 4 of code optimisation is to fully understand the call stack before you start optimising anything. Way too many times I have seen developers “jump into action” and try to optimise the code of a method which could be completely eliminated! Elimination is by far the cheapest option: deleting code does not cost much and you immediately improve performance by an almost infinite number of percent (I’ll leave it to you to provide a proof for the latter statement :). It may seem as if I am not being serious here, but you would be surprised how many times I have seen an application execute a piece of code just to discard the results immediately.

And last but not least, as developers we sometimes fall into the trap of gold plating: it is often tempting to fix issues you may spot here and there while profiling, but the first and foremost question you should be asking is what the benefit will be. A method may seem to be inefficient (by the looks of the code), say a sequential search which could be replaced with a more efficient dictionary-type lookup, but if the profiler indicates that the code is responsible for 1% of overall execution time, my advice is simple: do not bother. I have fallen into this trap in the past and before you know it you end up with “small” changes in 50 different source files and suddenly none of the unit tests seem to work. So the last rule is: go for maximum results with minimum changes, even if it means that you have to leave behind some ugly code which you would love to fix. Once your bottleneck has been eliminated, sure as hell another one will rear its ugly head, so keep tackling them one by one until you reach acceptable results. And when you reach a situation where making one thing faster slows something else down, as often happens in database optimization, it means that you are “herding cats”, as we call it on my project, and you probably have to undertake a major refactoring exercise.

My current project has a dialog box with a tree view which used to take several seconds to open. On closer investigation we realised that the problem lay in how the child elements of each tree node were retrieved: the algorithm used a sequential search through a list of all elements stored in memory, along the lines of var myChildren = allElements.Where(element => element.ParentID == this.ID).ToList(). As the dialog used WPF with a hierarchical data template, each element in the list had to perform a sequential search for its children, which gives a not so nice O(n²) type of algorithm. The performance was bad with ~1000 elements, but when the number of elements increased overnight to 4000, the resulting 16 fold increase in execution time was unacceptable. You may think that the solution would be to rework the algorithm, and this indeed was considered for a while. But in line with the “measure”, “keep it cheap” and “make it only as fast as it needs to be” rules, the fix proved to be very simple. As it turned out, the major problem was not the algorithm as such but the fact that the ParentID property was expensive to evaluate, and even more so if it had to be invoked 16 000 000 times. The final solution was a new, 3 lines long method IsChildOf(int parentID) which reduced the execution time by a factor of 60. Now that is what I call a result: 6000% improvement for 3 lines of code.
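To make the arithmetic concrete, here is a rough sketch of the idea rather than the project's actual code. The Element class, its fields and the read counter are all hypothetical, and since the reason why ParentID was expensive is not given, the property below is cheap and merely counts its own evaluations so that the n × n call pattern becomes visible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type; the real class is not shown in the post.
public class Element
{
    private readonly int _parentId;      // plain field, cheap to compare

    // Counts ParentID evaluations for this demo only (not thread-safe).
    public static long ParentIdReads;

    public Element(int id, int parentId)
    {
        ID = id;
        _parentId = parentId;
    }

    public int ID { get; private set; }

    // Stands in for a property that is costly to evaluate (assumed).
    public int ParentID
    {
        get { ParentIdReads++; return _parentId; }
    }

    // Sketch of a cheap test that answers the same question
    // without going through the costly property.
    public bool IsChildOf(int parentId)
    {
        return _parentId == parentId;
    }
}

public static class TreeQueries
{
    // One ParentID evaluation per element per call: n calls give n * n evaluations.
    public static List<Element> ChildrenViaProperty(List<Element> all, int id)
    {
        return all.Where(e => e.ParentID == id).ToList();
    }

    // Same sequential search, but each comparison is a plain field check.
    public static List<Element> ChildrenViaIsChildOf(List<Element> all, int id)
    {
        return all.Where(e => e.IsChildOf(id)).ToList();
    }
}
```

Note that the second query is still a sequential search, so the algorithm stays quadratic; only the cost of each comparison changes, which is exactly the “keep it cheap” trade-off described above.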

Performance Matters

by Marcin Kaluza 8. August 2009 08:26

The very definition of software performance will vary depending on whom you ask. If you asked the end user he would immediately mention the “speed” of the application he has to work with. If you asked the CIO he would probably define performance as “throughput” measured in transactions per second. Finally, if you asked an IT guy who has to deal with the hardware end of the system, he would say that he needs scalability so that his duties are limited to provisioning more hardware when demand increases. All these elements (response time, throughput and scalability) are desired components of software performance.

I have spent the last 12 months working pretty much continuously on performance optimisation and James Saul asked me to share some of my findings with a wider audience. To start somewhere I went on to dig up some resources on Wikipedia and came across an interesting article on performance engineering. According to the article one of the objectives of this discipline is to “Increase business revenue by ensuring the system can process transactions within the requisite timeframe”. In other words performance is money, and there is probably no better example of how it is lost than the total meltdown of the Debenhams website which took place just before last Christmas. I have to admit here that I have no idea what went wrong at Debenhams, but I can easily imagine a number of ways to build a software product which breaks under heavy load. As they say, there is more than one way to skin a cat and to build poor quality software, but this time round I will focus primarily on the “process” issues rather than particular technical aspects.

Small database syndrome (aka SDS)

Personally I think that SDS is the major contributor towards building poorly performing programs: if the development team works against a tiny database, they are very likely to get into serious trouble further down the line, and there are a number of reasons for it. The most obvious is the fact that there will be more data (surprise, surprise), so naturally more work will be required to get whatever you want out of the database. Secondly, query plans will be turned upside down in light of larger tables, and the distribution of data will influence them heavily as well. And last but not least, when working against a small dataset it is impossible to spot any potential performance problems as everything will (or at least should) execute rather quickly.

The best example of a spectacular “volume related” failure I have witnessed not so long ago is an application which, when fired up for the first time against a fully populated database, executed 40 000 SQL queries during its start-up, and the whole procedure took the better part of 40 minutes. To add insult to injury, some of the tables involved in the queries were missing rather crucial indexes while others were indexed incorrectly (not that it matters a lot when you execute 40 000 queries to start one instance of the app). This potential fiasco made everyone involved in the project somewhat embarrassed and steps were taken to avoid such mishaps in the future. Luckily for the team this accident happened early enough in the project lifecycle and fixing it was relatively cheap and easy. But as you can hopefully see from this example, SDS is a serious risk, and I find it somewhat difficult to understand why people oftentimes try to find all possible excuses not to use a properly sized database for development and/or testing. The one I hear most often is related to cost, measured in terms of either time or money: resources which someone has to spend to produce the data. But given the availability of data generation tools like the one provided by Redgate, this is indeed a very poor explanation. It is even worse considering that the cost of maintaining such a dataset is just a fraction of the total cost of the project.

“We will have better hardware in production”

This is another one of my favourites, which I hear a lot when people testing an application realise that something is not quite right performance-wise. Accepting that the app is sluggish usually means that someone has to admit to a failure of sorts, and nobody likes that. So people usually go into denial and try to find excuses not to tackle the problem there and then. If you consider that most of us developers work on single processor machines, it is not hard to see how people may fall into this trap, but even so basic calculations often prove that hoping to kill the problem with hardware may be nothing more than wishful thinking. Let me illustrate it with an example: let's consider a sample operation which takes 10 seconds on a modern single processor PC with plenty of RAM. It is easy to imagine that production hardware may be 20 times more powerful, leading to the false conclusion that in production the same process will take 1/20th of 10 seconds, i.e. 500 ms. Job done. The flaw in such reasoning is first of all the assumption that the production hardware will be serving one user at a time, or that concurrent user load will have no influence on performance. Secondly, the more powerful hardware may indeed have 20 times the capacity of the PC, but this capacity will be available only when you are able to parallelise the algorithm! If the original process is sequential (single-threaded) in nature, adding more processors to the server will not change response time at all. So the only conclusion we can draw from running software on inferior hardware is that if it works on a PC, there is a chance that it will work on a big server, provided of course that the software is scalable. On the other hand, if it does not work well on your PC, the chances that it will ever work anywhere else under substantially heavier user load are close to zero.
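The “20 times more powerful” arithmetic can be checked with Amdahl's law, which bounds the speedup of a task when only part of it can run in parallel. The sketch below assumes, purely for illustration, that the extra capacity comes from 20 processors each as fast as the PC's single one, and that only one user is being served; the class and method names are made up for this example.

```csharp
using System;

public static class Amdahl
{
    // parallelFraction: share of the work that can run in parallel (0..1).
    // processors: number of equally fast processors (an assumption of this sketch).
    public static double Speedup(double parallelFraction, int processors)
    {
        if (parallelFraction < 0.0 || parallelFraction > 1.0)
            throw new ArgumentOutOfRangeException("parallelFraction");
        if (processors < 1)
            throw new ArgumentOutOfRangeException("processors");

        // Amdahl's law: the sequential part is untouched, the parallel part is divided up.
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
    }

    // Expected single-user response time on the bigger machine.
    public static double ResponseTimeSeconds(double baselineSeconds, double parallelFraction, int processors)
    {
        return baselineSeconds / Speedup(parallelFraction, processors);
    }
}
```

For the 10 second operation, ResponseTimeSeconds(10, 1.0, 20) gives the hoped-for 0.5 seconds only if the work is perfectly parallel; with half of it sequential the result is about 5.25 seconds, and for a fully sequential process it stays at 10 seconds no matter how many processors are added.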

“We have no [time|requirement|money|resources] for performance testing”

Some wise people say that if you have no time for testing, then you had better have time for fixing last minute bugs and patching the app. The same is pretty much true when it comes to performance. When building systems which will potentially face high user load it is absolutely imperative that load and stress testing is carried out, unless you want to face a similar fate to the website I mentioned earlier. I may be biased here because I like to load test software, but load testing the app is probably the best way to make sure that it actually works. Let me give you an example here: about 18 months ago I participated in a POC at Microsoft, working on a website for an airline. Together with another guy from EMC we were responsible for doing the back end of the system: the database and a WCF based app server. As we finished our job earlier than expected, I decided that it would not be a bad idea to actually make use of the available resources and take that thing for a spin and see what it could do. The app server was running on an 8-way 64 bit machine with ample amounts of RAM, so I whacked together some unit tests simulating users’ journeys through the website, plugged the whole lot into a VSTS load testing machine and pressed the green button. As soon as I pressed it we discovered that the whole thing ground to a rather embarrassing halt within several seconds of being started… After a bit of head scratching we decided that it is a rather good idea to close server connections once you are done with them and repeated the test, scoring a rather measly result of 100 method invocations per second.
To cut a long story short, during the next few days we discovered that from a performance point of view it is actually wiser to use ADO.NET rather than LINQ to SQL, that when building high performance systems it is better to have network cables which work at the full capacity of the switch they are connected to, and that SQL Server 2008 rocks: it would take load from 3 app servers before it became fully saturated. In the meantime our load testing machine ran out of puff and we needed to use two more 4-way boxes in order to generate enough load to saturate the app server. The end result of this exercise was an 18 fold (sic!) increase in system performance, not to mention the fact that the system was happy working for hours on end. And when it came to the presentation of the finished website everyone was raving about how quick the whole thing was. The moral of the story, however, is that things will inevitably break under heavy load. If you load test them before handing them to the users, chances are that the users will get a much more robust system and you as an application developer will save yourself a potentially huge embarrassment.

PS: I know that this post is barely technical, but I promise to improve next time round :)

Dog slow WPF transparency

by Naeem Khedarun 5. February 2009 22:50

It's been a while since my last post; I've been busy at work so had to take a break from my TeamCity exploits, and then I got side-tracked building a little utility for myself.

The application is built in WPF and is yet another .NET natural language command window, but with some neat tricks. It was, however, performing absolutely terribly, and I thought I only had myself to blame. Initially I thought it was down to me patching into some unmanaged functions for some of the jiggery-pokery, as outlined below.

zrclip-001p2b8c83ba.png

However, I was unable to work out why SOO much time was being taken in the GetMessageW and DispatchMessage functions; it was a real mind-f**k. After exhausting all possibilities, I tried some random attacks here and there. One of them was turning off the transparency on the main window and, lo and behold, the application is now performing super-quick, but why? Some googling turned up this.

So unfortunately, if you're on an unpatched Vista or XP, you will need to get one of the following patches...

Vista

XP

According to the thread it seems the hotfix is already in Vista SP1 and XP SP3. Glad I found this one before... doesn't bear thinking about; my fault for not updating to SP1.

Categories: WPF | Performance