Archive | May, 2011

My Thoughts on Out Of Memory Issues

28 May

Recently I came across a post in the LinkedIn Performance Specialist discussion group where the poster asked how to identify, reproduce, and fix out of memory exceptions. A lot of people suggested a sound process, and I agree those are in fact some of the steps we follow while fixing out of memory errors. However, in addition to those steps, there are normally many other things one has to take care of while debugging OOM issues. There are many steps of analysis, such as measuring heap size, checking finalizer threads, and reading thread traces, which need to be done after taking dumps. Dumps will give you the information required to troubleshoot the issue, but they will never say precisely where the issue lies; it is up to the person using that information to work out why we are getting out of memory errors, after reading and interpreting everything in the dump file. So obviously skill sets and basics come into the picture here. I would call it a kind of reverse software engineering: we read the information backwards, towards the written code.

I have come across many such projects with OOM issues across Java and .NET platforms, and to some extent we have successfully debugged OOM issues in the past. However, I would like to point out that OOM issues cannot be debugged by the performance engineer alone. These issues also require inputs from the environment folks and the development folks. Without their involvement we will never have a permanent fix, for the simple reason that the environment folks are the ones who monitor and control the environment, and the development folks are the ones who know the code best, since they are the ones who coded the application. Memory issues manifest either due to bad environment settings, faulty code, or incorrect scenario settings.

So first let me share some of my experience with OOM exceptions and my thoughts on why we normally see them.

The first case where I have seen out of memory exceptions is in applications which display hundreds of thousands of records, out of which the user selects one or two to satisfy his business purpose. The core problem with displaying huge amounts of data is that it consumes a lot of memory: your platform needs to hold all of that data in RAM. Though keeping data in memory brings some performance improvement, it has an impact on memory management. You are holding data in RAM that the users do not require. In fact, this is a typical symptom of bad design.

The second case where I have seen out of memory exceptions is when the application code throws a lot of exceptions and does not handle them correctly. Exceptions consume memory resources and are unhealthy for application performance.

The third case is where you have a large amount of view state or session state. I remember testing some applications where the view state generated was in the range of 1 to 5 MB, and in some cases went up to 10 MB. The problem with this approach is that you are passing too much data back and forth across the wire, which increases the overhead of serialization and deserialization, which in turn consumes a lot of memory.

The fourth case was where the server configuration was totally incorrect. Almost every server provides settings for physical and virtual memory. Normally there isn't a fixed setting everyone can use, and most people initially go with the default settings that ship with the web or application server. In most cases the defaults do the job; however, there are cases where they are not of much use, and then it makes more sense to run a couple of load tests and determine the optimum memory settings for your application. In addition to the memory settings, there are other settings which impact memory consumption.
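
As a hedged illustration on the Java side (the flags are standard HotSpot JVM options; the jar name and dump path are hypothetical): the heap limits can be set explicitly, and a dump can be captured automatically the moment an OOM occurs, which feeds directly into the kind of analysis this post talks about.

    # A minimal sketch, assuming a HotSpot JVM; paths are hypothetical.
    java -Xms1024m -Xmx2048m \
         -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/var/dumps \
         -jar myapp.jar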

The fifth case was one where the application was encountering deadlocks among threads, and data was being written frequently into memory by those threads. Deadlocks can happen for a variety of reasons; whenever deadlocks happen among CLR or JVM threads, the possibility of OOM conditions increases. Sometimes large objects do not get garbage collected during these deadlocks, and we see that they are not collected because the finalizer thread was blocked for various reasons.

We do have many tools to identify out of memory conditions and to take memory dumps and thread dumps. However, the person taking and analyzing the dumps needs to be thoroughly skilled in various aspects of memory management, garbage collection methods and processes, thread management, and the other technical areas involved. He also needs to be aware of the programming concepts used in the application. Since this topic is very vast, I will not go into details here, and will instead share some links which I use whenever I get an OOM issue.
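
To give a flavour of what that dump analysis looks like, here is a minimal sketch of a typical WinDbg/SOS session against a .NET dump of that era; the object address is a placeholder, and the exact module to load SOS from depends on your CLR version:

    $$ Load the SOS extension that matches the CLR in the dump
    .loadby sos mscorwks

    $$ Histogram of the managed heap by type: the usual starting point
    !dumpheap -stat

    $$ List the managed threads; check whether the finalizer is blocked
    !threads
    !finalizequeue

    $$ For a suspicious object address taken from !dumpheap output,
    $$ walk its roots to see what is keeping it alive
    !gcroot <object_address>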

Here are those links. Please note that these are for the .NET platform; I am sure they will provide you with excellent insights into what exactly out of memory exceptions are and how to deal with them. For the Java platform I will write a separate post, as there are some points which differ to some extent compared to .NET.

http://support.microsoft.com/kb/911716

http://blogs.iis.net/webtopics/archive/2009/05/22/troubleshooting-system-outofmemoryexceptions-in-asp-net.aspx

http://blogs.msdn.com/b/tom/archive/2007/11/26/high-memory-cpu-or-other-performance-problems-with-net.aspx

http://www.onpreinit.com/2009/09/windbg-sos-dumpdatatables-aspxpages-etc.html

http://www.stevestechspot.com/default.aspx#a994b5c77-6fbc-4994-9623-3772b0505469

http://blogs.iis.net/webtopics/archive/2009/03/12/high-memory-due-to-system-weakreference.aspx

http://naveensrinivasan.com

http://support.microsoft.com/kb/892277

http://blogs.msdn.com/b/tess/archive/2009/04/16/net-exceptions-quick-windbg-sos-tip-on-how-to-dump-all-the-net-exceptions-on-the-heap.aspx

http://blogs.msdn.com/b/tess/archive/2005/11/30/are-you-aware-that-you-have-thrown-over-40-000-exceptions-in-the-last-3-hours.aspx#9557706

http://www.slideshare.net/CoryFoy/debugging-net-applications-with-windbg/download

http://blogs.msdn.com/tess/archive/2008/03/17/net-debugging-demos-lab-6-memory-leak-review.aspx

http://www.codeproject.com/KB/dotnet/Memory_Leak_Detection.aspx

http://msdn.microsoft.com/en-us/magazine/cc163491.aspx

http://support.microsoft.com/kb/919790

http://blogs.msdn.com/b/dougste/archive/2006/10/17/clrdebug.aspx

http://www.simple-talk.com/dotnet/.net-framework/a-look-at-exceptions-in-.net-applications/

http://blogs.msdn.com/b/santiagocanepa/archive/2011/02/28/memory-based-recycling-in-iis-6-0.aspx


Processes: When to Follow and When Not to Follow

27 May

Processes sometimes help and sometimes do not. There are situations which call for process improvement, and there are also situations where following the process becomes a bottleneck in itself. Blindly following processes means wasting valuable time and energy without knowing whether the process is indeed bringing any value to the table.

There are many situations where processes bring value to the table: release management, environment management, testing, and estimation and costing processes, among others. However, these same processes can become a bottleneck in certain conditions, for example releasing the product to the client to meet his time demands without fully educating him about the risks, or bypassing the development team altogether while troubleshooting a critical issue like high CPU usage or a memory problem, just because the environment where the error shows up is owned by the environment team.

Sometimes following the process itself consumes a lot of time, and only later do people understand that the issue they are triaging does not belong to them and so cannot be solved by them. Yet they are bound to work in those areas for some time, simply because they own that functional area. Sometimes people also use processes as a tool to defend, avoid, or initiate activities which may or may not bring positive results to the project.

In large companies people follow processes very strictly for various reasons, but working in a large company does not mean we should stop checking for value. If involving the development team early helps fix an issue quickly, then it makes more sense to take their help and fix it, rather than wait for the environment folks to do some trial and error and come up with a fix. Look at the amount of time we lose there.

So the best way to judge whether a process is bringing any value to the table is to ask yourself questions like: will following this process get my job done faster, or do I still have better options to explore? If you find you can do better, then I suggest you go ahead and show that you can indeed bring value to the table. Saving project time is also a good value add.

Tips to Reproduce Performance Issues

23 May

Whenever a performance tester logs an incident for a performance issue, the first question the development team asks is “how do we reproduce the incident?”, and sometimes they refuse to agree that any performance issue exists in the application at all, for the simple reason that the incident highlighted is often not reproducible in their environment.

There is a norm in the software development industry that if an incident cannot be reproduced by the tester or by the person highlighting it, then it cannot be resolved, for the simple reason that not enough information is available to understand the issue.

Performance incidents are hard to reproduce and do not often occur in functional or manual environments, so it becomes really hard to say what exactly happened to the application under load. However, most load testing tools provide features which can help performance engineers reproduce the defect and know what inputs were given to the application at the point the issue surfaced under load.

In order to reproduce incidents, the performance engineer needs to understand the functionality of the application along with its technical details. If that information is not available, he needs to ask the relevant stakeholders for it before logging an incident; this really helps keep everyone involved on the same page. So in this post I will highlight some of the features LoadRunner provides which, when used effectively, can be helpful in reproducing incidents.

LoadRunner provides a rich set of features which can be used for reproducing incidents that often cannot be reproduced manually. Below are some of those settings:

  • Enable Snapshot on Error: I believe this feature was introduced in the 8.x versions of LoadRunner. Whenever an error occurs, it takes a snapshot of the page and saves it in the Vuser logs; it needs to be enabled in the run-time settings of the script. Often a snapshot or screenshot is exactly what the development team requires to believe that an error has indeed occurred, so I suggest enabling this feature if you believe you have performance issues in the application. Do keep in mind, however, that enabling it also consumes load generator resources.
  • Logging Functions: LoadRunner provides a rich set of functions that can be used to log messages to file. I prefer to use the output message function (lr_output_message) and disable logging completely; that way I do not need full logging enabled, and I still see all the information I want to see. I also suggest logging all correlated values along with user-defined parameters using the output message function, because under load one can then find out exactly what values were given, what values were captured from the server response, and what values were not captured. Once we have the data from the output message function, we can reuse the same data and try to reproduce the issue manually (see the sketch after this list).
  • Extended logging: LoadRunner also lets us see the client requests made and the server responses received. There may be cases where we want to see what request the client sent that triggered the error under load; in such cases extended logging can be enabled. Please note, however, that it impacts response times and consumes a lot of load generator resources. Extended logging shows what user-defined data parameters were used, what response was sent by the server, and an extended trace of the function calls made by LoadRunner. In short, it shows the complete trace of what flows over the wire for the user, but it requires the performance engineer to have sufficient knowledge to read and interpret the data captured.
  • Iteration Number: LoadRunner provides a feature whereby users can log the iteration number. Under load, each user performs many iterations and uses many data points from the user-defined data files, and it becomes confusing to know which data was used in which iteration while reproducing an error. So I suggest logging the iteration number along with the parameters used in the script, either at the beginning of the Action block or as required. The iteration number and logged information, when correlated with snapshots on error, help in most cases to come out with a clear understanding of the issue found.
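
A minimal sketch of what such logging might look like in a C Vuser script. The names pIteration, pOrderId and cSessionId are hypothetical; pIteration is assumed to be defined in the script as an Iteration Number parameter, and the boundaries and URL are placeholders for your application:

    Action()
    {
        /* Capture a server-generated value from the next response;
           the boundaries are placeholders for your application. */
        web_reg_save_param("cSessionId", "LB=sessionId=", "RB=;", LAST);

        web_url("home", "URL=http://myapp.example.com/home", LAST);

        /* Log the iteration number, input data and correlated value so
           that a failure can be tied back to its exact inputs later. */
        lr_output_message("Iteration=%s OrderId=%s SessionId=%s",
                          lr_eval_string("{pIteration}"),
                          lr_eval_string("{pOrderId}"),
                          lr_eval_string("{cSessionId}"));

        return 0;
    }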

However, there are some incidents where, in spite of having all this information, one might still not be able to reproduce the issue. For such cases, I suggest you isolate the scenario and run it with the relevant stakeholders monitoring at their respective ends.

Basics of Scenario Design

15 May

Let’s talk about scenario design for load testing. Recently I received a mail from an old friend saying that someone from one of the Big 3 companies had asked him the following questions about scenario design for load testing:

  • We are running a test with 2000 users for a duration of 4 hours. One iteration takes about 10 minutes; how many transactions will be achieved in 4 hours?
  • We want to run a test for 8 hours and complete 8 iterations. How do we design the scenario?

A scenario, as I see it, imitates the end user’s journey through the application using a scripted approach. So in addition to following the end user’s path, the scenario needs to incorporate the browser behavior the end user exhibits while walking through the business process. The scenario also needs to capture the overall picture of application usage: how many people are using the application at any point in time, whether those people will use it for the entire 8 hours, or whether some will drop off after a while.

For some applications, the scenario also contains details of the specific environment the end users will use to access the application, and those environment-specific parameters need to be incorporated into the scenario. Mobile applications are a good example.

However, most scenarios should contain the details below:

  • Ramp Up: Does the number of application users increase over time? If yes, then the scenario design needs to reflect this.
  • Ramp Down: Does application usage taper off towards the end of the day or hour? If yes, then the scenario design needs to reflect this.
  • Steady State: Does the number of users remain constant throughout? If yes, then the scenario design needs to reflect this.
  • Duration: How long will the users be on the system? Do they perform transactions all day or only for a specific amount of time?
  • Location: Where are the users of the application based? Are they from the same city or from different countries? If users are distributed across the globe, that adds another layer of math to evaluate a host of other parameters.
  • Volumes: These could be transactions or hits per second, throughput, pages per second, or any other metric we are attempting to achieve, so that we can assess whether the system can meet the business requirement successfully. A scenario always needs to achieve some predetermined objective; only then can we say the scenario was successful. (A small worked example follows this list.)
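
As a worked illustration of the volumes point (my own example, not from the original questions): a common rule of thumb for sizing a scenario is Little's Law, N = X * R, where N is the number of concurrent vusers, X the target transaction rate, and R the time one complete iteration takes, responses plus think time.

    Target rate        X = 2 transactions/second
    Iteration time     R = 30 seconds (responses + think time)
    Required vusers    N = X * R = 2 * 30 = 60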

All the above are high-level points one needs to take care of while designing a scenario. Since we follow a scripted path to achieve our objectives, there are also certain low-level parameters in the scripts that impact scenario design and need to be taken care of:

  • Think time in the scripts: This is nothing but the pause time taken by end users between steps.
  • Pacing in the scripts: Since the scripted approach is much faster than the manual approach, it is important to have some kind of pacing to control the behavior of the scripts. If a user performs 10 similar transactions per business process, then in real life we assume he pauses for some time after every few of those transactions; he may feel like having a coffee after each one to celebrate its success. Pacing in scripts reduces the load on the servers (see the sketch after this list).
  • Browser emulation: In real life most users’ browsers cache frequently used content, so it is important that the scripts have a similar setup.
  • Network connection: If the users are on the open internet, then it makes more sense to connect to the application directly, without proxy servers. It is basically about how we connect to the application; network connections via scripts or proxy servers add another layer of math to scenario design.
  • Iteration settings: Sometimes, to achieve the business volumes, we also need to set an iteration count in the scenario design.
  • User behavior: How are users using the application? Do they log in only once, or many times during the day? If they log in only once, the scripts need to reflect this behavior. If authentication happens via third-party servers, it makes more sense to check this behavior, since most SSO servers are not transactional servers; once retrieved, they cache the credentials. This is my understanding; please check with your SSO admins on how they have implemented SSO.
  • Cookies: Sometimes cookies need to be taken care of explicitly. Session cookies are a good example where custom handling is required.
  • Headers: Sometimes custom headers need to be handled explicitly to achieve certain technical objectives.
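
A minimal sketch of how a few of these low-level knobs appear inside a C Vuser script; the URLs and the cookie are hypothetical, and pacing itself is normally configured in the run-time settings rather than in code:

    Action()
    {
        /* Explicitly set a cookie when automatic handling is not
           enough; name, value and domain here are hypothetical. */
        web_add_cookie("region=emea; DOMAIN=myapp.example.com");

        web_url("search", "URL=http://myapp.example.com/search?q=books", LAST);

        /* Pause like a real user would between steps (seconds). */
        lr_think_time(8);

        web_url("results", "URL=http://myapp.example.com/results", LAST);

        return 0;
    }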

All these high-level and low-level points need to be communicated to and signed off by the relevant stakeholders before commencing load tests on the application. Some folks also add this information to their test plan, which I feel is best practice.

Given that there are a host of factors one needs to take care of while designing a scenario, the questions asked of my friend seem to lack most of this information. However, I do agree that they test the high-level math involved in scenario design. Now, coming back to the questions:

We are running a test with 2000 users for a duration of 4 hours. One iteration takes about 10 minutes; how many transactions will be achieved in 4 hours?

One iteration takes around 10 minutes for one user. Let’s assume there is no think time in the scripts and no browser cache involved. If one iteration contains x transactions, then a user achieves x transactions every 10 minutes, which is 6 iterations (6x transactions) per hour and 24 iterations (24x transactions) over the 4 hours. With 2000 users running in parallel, that is at most 48,000 iterations, or 48,000x transactions, across the test. That is all the information that can be determined from the question as given. However, achieving even this number is sometimes not possible, given the host of factors which impact transactions: if we get errors midway, or if response times are high, then it is quite possible that the given numbers cannot be achieved.
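
The arithmetic, spelled out; this is an upper bound that assumes no think time, no errors, and stable response times:

    iterations per user   = (4 hours * 60 minutes) / 10 minutes = 24
    transactions per user = 24x        (x = transactions per iteration)
    total transactions    = 2000 users * 24x = 48,000x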

We want to run a test for 8 hours and complete 8 iterations. How do we design the scenario?

There are a number of ways to achieve this, such as running the script once, noting down the time it takes to complete, and then adding the rest of the time either as think time or by asking the script to sleep, so that each iteration occupies one hour. However, running this scenario is a challenge in itself for various reasons.
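
A minimal sketch of that padding approach in a C Vuser script, assuming the business process itself takes roughly 10 minutes; the cleaner alternative is to set pacing in the run-time settings to start each iteration at fixed one-hour intervals:

    Action()
    {
        /* ... recorded business process steps go here,
           measured at roughly 600 seconds ... */

        /* Pad the remainder of the one-hour slot so that
           8 iterations exactly fill the 8-hour window. */
        lr_think_time(3600 - 600);

        return 0;
    }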

Hard to reproduce incidents – 1

11 May

Does functional testing add value to performance testing? Do performance testers need to know functional testing? I would say yes, it adds a lot of value. Performance testing is nothing but testing the functionality of the application concurrently and seeing whether it performs well under many concurrent users.

Performance testing also uncovers functionality issues which cannot be found by functional testing alone. The main reason functional testing by itself is inadequate is that it follows the “I” process rather than the “WE” process: “I” being singular, meaning at any point in time only one person is executing the business process, whereas “WE” means many people performing activities on that functionality at once.

There are many examples to show why functional testing by itself is insufficient, but I would like to point to one specific example where it proved inadequate.

We had a multi-tier application in which authentication happened via SiteMinder, and SiteMinder was responsible for providing page-level authentication to the users.

For each page it would send the SMSESSION cookie back to the client, and this cookie was updated on every request, so the client was getting an updated cookie from SiteMinder every time. In short, these cookies were a kind of identity to the SiteMinder boxes, proof that the right users with the right credentials were in fact browsing the site protected by it.

Now let’s assume we have around 30 HTTP requests for a business process, so the client will get at most 30 updated cookies from the SiteMinder web agent for a single user. Everything works fine with a single user in a manual browser session, no issues at all. With the help of the browser we test the SiteMinder functionality to ensure that the cookie is in fact updated on each request, and check that it has all the required attributes, like expiry date and length, as per the specification. So everything looks good for a single user using the application.
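
In a load test the same check can be scripted. A minimal sketch: capture the SMSESSION value from the response headers before each request and log it, so that under load we can verify the cookie really does change on every step (the URL is hypothetical; SMSESSION is the actual SiteMinder cookie name):

    /* Capture the SMSESSION value from the response headers of the
       next request; must be registered before that request. */
    web_reg_save_param("cSmSession",
                       "LB=SMSESSION=",
                       "RB=;",
                       "Search=Headers",
                       LAST);

    web_url("step1", "URL=http://myapp.example.com/step1", LAST);

    /* Under load, compare this value across steps and users. */
    lr_output_message("SMSESSION after step1: %s",
                      lr_eval_string("{cSmSession}"));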

However, for various reasons, we sometimes start seeing the page below, which is nothing but a redirect by IIS towards the web agent, which in turn sends the request to the policy servers to check the user’s login credentials. Everything is perfect by design and works excellently for single user sessions; the page displayed is also perfectly fine as per the design.

[Image: SiteMinder redirect page]

However, if during load testing we continuously start seeing this page after every click, then it becomes a problem and a bottleneck to the functionality itself. Imagine a Google Checkout flow protected by SiteMinder, where users start seeing this page after filling in their payment details. Questions will immediately pop up in the users’ minds: what will happen to the transaction that was entered into the page? Is there a risk that the transaction might become corrupt, or that they might lose the transaction altogether, depending on what is causing this page to pop up midway? All the doubt caused by this sudden change of page can make you lose the confidence of your end users.
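
When hunting for this under load, one simple trick is to make the script itself flag every occurrence of the redirect page. A minimal sketch (the search text and URL are hypothetical; use a string that appears only on your SiteMinder login/redirect page, and pIteration is the same assumed Iteration Number parameter as in the earlier sketch):

    /* Count occurrences of the redirect page in the next response;
       registered before the request it applies to. */
    web_reg_find("Text=SiteMinder Login",
                 "SaveCount=redirectCount",
                 LAST);

    web_url("checkout", "URL=http://myapp.example.com/checkout", LAST);

    /* Log an error so Snapshot on Error fires and the page is kept. */
    if (atoi(lr_eval_string("{redirectCount}")) > 0)
        lr_error_message("Unexpected SiteMinder redirect on checkout, iteration %s",
                         lr_eval_string("{pIteration}"));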

I believe functional testing, if done correctly and with an out-of-the-box thinking strategy, can help mitigate this type of issue, along with saving at least 40% of the timeline for the performance or security testing which project teams start in the later part of the project lifecycle.

This particular SiteMinder issue was a very interesting one. The reason it was interesting was that the page above was not easily reproducible; it was happening once in months while doing load testing. Since SiteMinder solutions are normally implemented across a portfolio of projects, I believe the impact of this issue on the bottom line would be very high.

I have done some work on how this error can be reproduced and why we were seeing it during load testing. Since this issue is logged with CA for their investigation, I will wait a couple more days to get some more clarity on it. However, I assure you that I will post my findings.
