Cutting Through Opinions
There are IT application problems that have been unresolved for many weeks or even years, or there may be a new problem as a result of a recent move or change in the application infrastructure environment. Often, different IT departments are engaged in extensive troubleshooting and reach a stage where each is convinced that the problem is not in their area. This can result in a long-standing problem that is often never completely resolved.
This document describes how an expert Network & Application Performance Analyst (NAPA) can solve these problems using a root cause analysis approach. Resolution is the primary goal. The same approach can be used to establish a baseline for an application in its existing environment before it moves to a new server environment. This involves single-transaction testing or small load-testing exercises that record how the application behaves under controlled circumstances. Ideally, this is done in the original production environment first. The baseline gives the NAPA a reference against which to measure the application’s behavior, and therefore the user’s experience, in the new production environment, and it is valuable when troubleshooting. With a properly set and documented baseline, the success rate of finding performance problems in a new production environment is extremely high.
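As a rough illustration of how a documented baseline might be used, the sketch below compares response times for the same single transaction measured in the original and the new environment. The timings, the function name and the 20% tolerance are assumptions made for this example, not values from any real test.

```python
from statistics import median

def compare_to_baseline(baseline_ms, current_ms, tolerance=0.20):
    """Compare median response times of one transaction in two environments.

    tolerance is a hypothetical threshold: a relative slowdown above it
    is flagged as a regression worth investigating.
    """
    base = median(baseline_ms)
    cur = median(current_ms)
    change = (cur - base) / base  # relative change against the baseline
    return {
        "baseline_ms": base,
        "current_ms": cur,
        "change": change,
        "regressed": change > tolerance,
    }

# Hypothetical timings in milliseconds for repeated runs of one transaction.
baseline = [410, 395, 402, 420, 398]   # original production environment
new_env = [640, 655, 610, 700, 630]    # new production environment

result = compare_to_baseline(baseline, new_env)
print(result)
```

The median is used here rather than the mean so that a single slow run does not distort the baseline; other statistics could serve equally well.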
Application & Network Root Cause Analysis
The Interpath Technologies Root Cause Analysis approach is a step-by-step method that can be used to set an application performance baseline and to discover the root cause of application failures. Every application failure happens for a number of reasons. In a multi-tiered enterprise environment, a progression of actions and consequences leads to a failure. A Root Cause Analysis (RCA) investigation traces the cause-and-effect trail from the symptoms back to the root cause, much like a detective solving a crime.
Finding the root cause of an application or network problem is extremely challenging due to the number of different components and the relationships between these components. The RCA approach analyzes application execution at the transaction level looking at the lowest common element in the IT networking environment – the TCP/IP packets.
Pre-Requisite Skills and Tool Sets
Many tools provide metrics, but only a protocol analyzer such as Sniffer®, Ethereal® or Wireshark®, in the hands of an experienced analyst who specializes in the tool, lets users see every TCP/IP packet that moves between the components of an enterprise system. Analysis of the packets shows their exact contents and reveals the following:
- How much time it takes to get from device to device.
- Exactly how long each device took to reply and anything out of the normally expected pattern.
- Any network or application issues.
- Irrefutable proof of what is discovered, which is key in getting vendors and different internal departments aligned.
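The timing points above can be illustrated with a minimal sketch that pairs each request with its reply and reports how long the reply took. It assumes the packet timestamps have already been exported from a protocol analyzer into simple records, and that each request and each reply is a single packet. The host names and timings are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    ts: float   # capture timestamp in seconds
    src: str    # sending host
    dst: str    # receiving host
    info: str   # free-text description of the packet

def reply_delays(packets):
    """Pair each request with the next packet in the opposite direction.

    Simplifying assumption: one packet per request and one per reply.
    Returns (requester, responder, delay_seconds) tuples in reply order.
    """
    pending = {}  # (src, dst) -> request still waiting for its reply
    delays = []
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        reverse = (p.dst, p.src)  # a reply swaps source and destination
        if reverse in pending:
            req = pending.pop(reverse)
            delays.append((req.src, req.dst, round(p.ts - req.ts, 6)))
        else:
            pending[(p.src, p.dst)] = p
    return delays

# Hypothetical trace of one transaction crossing two tiers.
trace = [
    Packet(0.000, "client", "web", "HTTP request"),
    Packet(0.002, "web", "db", "SQL query"),
    Packet(0.450, "db", "web", "SQL result"),
    Packet(0.455, "web", "client", "HTTP response"),
]

delays = reply_delays(trace)
for requester, responder, delay in delays:
    print(f"{responder} took {delay:.3f}s to reply to {requester}")
```

In this made-up trace almost all of the user's wait is spent between the query and the result from the database tier, which is the kind of evidence that helps align departments and vendors.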
Effective transaction analysis requires more than a suite of tools. It requires highly specialized skills as well as strong skills in all aspects of enterprise networking, including servers, switches/routers, operating systems, security, documentation, diplomacy and project management. The Network & Application Performance Analyst must also have excellent research and analysis skills.
The following information will ensure the effective use of the Network & Application Performance Analyst's time and the smooth, efficient execution of the Root Cause Analysis process.
- LAN diagrams – required for proper placement of the Sniffer®, Ethereal® or Wireshark® protocol analyzers. The analyst must understand the physical path that the packets will take as they move from server to server to router, etc.
- WAN information – required to learn about the pathways between participating networks/locations.
- Host name/IP address of all server interfaces – required to identify the servers involved with the problem application.
- Server types – required to understand the role of each server (e.g., file server, database server, application server, web server, scheduler, etc.)
- Application design – including data flow diagrams and/or an interview with the application subject matter expert. The NAPA must understand the following about the application:
- TCP/UDP ports used and their roles.
- Port Data Flow – in most cases, the NAPA will need to create this diagram showing the flow of data between all servers involved and the ports on which it moves.
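One way to record a port data flow in machine-readable form is sketched below. The addresses, ports and roles are hypothetical, and the helper that builds a capture filter is an illustration only; it emits a filter expression in the pcap-filter syntax accepted by libpcap-based analyzers.

```python
from typing import NamedTuple

class Flow(NamedTuple):
    src: str     # IP address of the initiating host
    dst: str     # IP address of the listening host
    proto: str   # "tcp" or "udp"
    port: int    # destination port on the listening host
    role: str    # what this flow does in the application

# Hypothetical port data flow for a three-tier application.
FLOWS = [
    Flow("10.0.1.10", "10.0.2.20", "tcp", 443, "user to web server (HTTPS)"),
    Flow("10.0.2.20", "10.0.3.30", "tcp", 8080, "web server to application server"),
    Flow("10.0.3.30", "10.0.4.40", "tcp", 1433, "application server to database"),
]

def capture_filter(flows):
    """Build a pcap-filter expression that matches only the listed flows."""
    clauses = [
        f"(host {f.src} and host {f.dst} and {f.proto} port {f.port})"
        for f in flows
    ]
    return " or ".join(clauses)

print(capture_filter(FLOWS))
```

Keeping the flows as data means the same list can drive both the diagram and the capture filters, so the two are less likely to drift apart.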
RCA Process Overview
Step 1: Kickoff
The Network & Application Performance Analyst meets with the application SME(s), and together they diagram the application's Port Data Flow (the Application Flow Diagram). The Network & Application Performance Analyst and the client's Subject Matter Expert become the first two members of the Network Application Performance Analysis Team. The Application or Network Flow Diagram shows the flow of all traffic between all participants such as web servers, internal and external firewalls, application servers, database servers, etc.
The Network or Application Flow Diagram allows the Network & Application Performance Analyst to understand what is happening at the protocol level as well as the network level. The flow of data from the user, through all other devices and back again to the user is the Interpath. Diagramming the Interpath makes it possible to place the Sniffer®, Ethereal®, Wireshark® or other protocol analyzers where they give visibility into the entire transaction. This process has a very high success rate in finding the root causes of long-standing problems.
Step 2: Deploy Protocol Analyzers
(Sniffer®, Ethereal®, Wireshark® or other tool.)
The diagrams are used to trace out the actual physical data paths taken by packets to and from switches, servers, routers, firewalls and any other infrastructure components involved. Based on this knowledge and the number of protocol analyzers available for testing, the Test Plan is created defining where to place the protocol analyzers, and what transactions will be analyzed. The goal is to gain 100% visibility into the movement of data between all participating servers, switches, etc. Often, this is not possible for reasons such as time, budgets, resources or technical issues. It is not necessary to have 100% visibility, but it is highly desirable where possible.
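The visibility goal can be made concrete with a small sketch that checks a proposed analyzer placement against the segments of the data path. The segment names and the placement are hypothetical; a real test plan would take them from the LAN and WAN diagrams.

```python
def visibility(path_segments, monitored):
    """Return the fraction of path segments covered and the ones left blind.

    path_segments: ordered list of segment names along the data path.
    monitored: set of segment names where a protocol analyzer will capture.
    """
    missing = [s for s in path_segments if s not in monitored]
    covered = len(path_segments) - len(missing)
    return covered / len(path_segments), missing

# Hypothetical data path with four segments and three available analyzers.
segments = ["user-firewall", "firewall-web", "web-app", "app-db"]
placement = {"firewall-web", "web-app", "app-db"}

fraction, blind_spots = visibility(segments, placement)
print(f"visibility: {fraction:.0%}, not captured: {blind_spots}")
```

Listing the uncovered segments in the test plan makes it explicit which parts of a transaction will have to be inferred rather than observed.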
With the test plan complete, the equipment is shipped to the defined locations. A local resource is required to physically place the Sniffer®, Ethereal®, Wireshark® or other tools once they arrive. Port mirroring (SPAN) sessions are created as required, within the limits of what the local switches and routers support.
Step 3: Perform Transaction Testing
The application SME works with the NAPA by executing specific transactions while the Sniffer®, Ethereal®, Wireshark® or other tool captures trace files (every packet moving in the transaction). Through this process, the NAPA gains an understanding of how the application data flows in transactions generated by a typical user. Many transactions are tested, but only one at a time.
Step 4: Analyze and Diagnose Problem
Using the trace files, the Network & Application Performance Analyst works with the application SME to spot where the application or network failed to perform to specifications. With the problem areas identified, the Network & Application Performance Analyst will start to investigate specific components within that area. Root causes of application infrastructure problems will typically be in one or more of the following areas:
- Errors in firewall rules, or rule changes that are incompatible with the ports required by applications on the other side.
- Firewall blocking or dropping monitoring protocols used by application monitoring tools.
- Negative impact due to inefficiently deployed Network Intrusion Detection Sensors.
- Incorrect OS or software configurations in the server or workstation
- Replication issues, also possible with server
- Slow database or trouble with the data
- Incorrect server hardware configuration
- Addressing issues
- DNS issues
- Directory issues
- Authentication issues
- Incorrect switch, router or NIC configuration
- WAN issues
- Conflicts with other applications on the server(s)
- Bad components
- Build issues
- Replication issues, also possible with application
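As one example of narrowing down a problem area from trace data, the sketch below flags likely TCP retransmissions, a common sign of packet loss on a network segment. It uses a deliberately simplified heuristic (a data-carrying segment whose sequence number was already seen on the same connection and direction), and the input tuples are hypothetical records exported from a trace file. Protocol analyzers apply more thorough checks.

```python
from collections import Counter

def find_retransmissions(segments):
    """Return data segments whose sequence number was already seen.

    segments: iterable of (src, dst, sport, dport, seq, payload_len) tuples.
    Simplified heuristic, assumed for illustration only.
    """
    seen = Counter()
    retransmitted = []
    for seg in segments:
        src, dst, sport, dport, seq, length = seg
        if length == 0:
            continue  # pure ACKs legitimately repeat sequence numbers
        key = (src, dst, sport, dport, seq)
        seen[key] += 1
        if seen[key] > 1:
            retransmitted.append(seg)
    return retransmitted

# Hypothetical segments from an application server to a database server.
capture = [
    ("10.0.3.30", "10.0.4.40", 50123, 1433, 1000, 512),
    ("10.0.3.30", "10.0.4.40", 50123, 1433, 1512, 512),
    ("10.0.4.40", "10.0.3.30", 1433, 50123, 7000, 0),    # ACK, no payload
    ("10.0.3.30", "10.0.4.40", 50123, 1433, 1512, 512),  # repeated sequence number
]

suspects = find_retransmissions(capture)
print(f"{len(suspects)} suspected retransmission(s)")
```

A cluster of suspected retransmissions on one segment would point the investigation toward the network items in the list above, such as switch, router or NIC configuration and WAN issues.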
Step 5: Document
The Network & Application Performance Analyst will create a “living” document during the entire process to provide Technical Staff and Project Managers with an audit trail into the progress of the team and the ability to answer questions given to them by management.
The final document must provide incontestable proof of the root cause to allow all departments to accept the results. It will provide an executive summary of the causes and recommended solutions to the problem. It will contain a detailed technical section that clearly explains the process of the testing and the reasons for the conclusions as well as any recommendations.
Such documentation will contain snapshots of protocol analyzer trace files and diagrams illustrating the interaction of the various components. The application port data flow diagram will also be included.