A Wider View: Experimentations and ruminations on Oracle performance analysis


What Is That Light-Green Oracle Database CPU Wait Time?

Tue, 2015-02-10 20:10
This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.
What Really Is That Light-Green Oracle Database CPU Wait Time?


Have you ever wondered what that light-green "cpu wait time" really means in Oracle Enterprise Manager? It's what I call the "gap" time. The "gap" time is the "missing" or "leftover" time when DB Time does not equal DB CPU (foreground process CPU consumption) plus the non-idle wait time. And it happens more often than you might think.

If you have ever noticed that the database time seems too large, then you need to read this article. And, if you really want to know what the light-green "cpu wait time" in your OEM charts is, then you need to read this article. It's that good.

If you're serious about Oracle performance tuning and analysis, you'll want to know I just posted my complete 2015 public training schedule. It's on the main OraPub.com page HERE. Remember, alumni receive a 50% discount...crazy I know.

My Experiment Shows...
My experiment shows a strong relationship between the "gap" time and operating system CPU utilization. This means that a significant portion of the "gap" time is Oracle foreground processes sitting in the CPU run queue, ready to consume CPU. This CPU run queue time is not part of DB CPU but it is part of DB Time. So, when the CPU run queue time increases, so does DB Time and so does the "gap" time. And I have the data to show it! And you can run the same experiment yourself.

Let me put this another way. Most of the DB Time "gap" is Oracle foreground processes waiting in the operating system CPU run queue so they can eventually and truly consume CPU.

This is really important: When an Oracle foreground process is not consuming CPU but is sitting in the CPU run queue, the Oracle Active Session History (ASH) facility records the session sample state as "CPU". And if the Oracle process is a foreground process (not a background process), Oracle's time model records this time as DB Time but not DB CPU. So in both the ASH and time model cases, someone must do some math to calculate this "cpu wait time".

But that name... "cpu wait"!

CPU Wait Time Is A Lousy Name

"CPU wait time" is a lousy name. Why? Mainly because it has caused lots of confusion and speculation. The name would be more appropriately called something like, "cpu queue time." Three reasons come to mind.

First, wait time means something special to Oracle DBAs. To an Oracle DBA, anything associated with a "wait" should have a wait event name and a wait occurrence, and the time should be instrumented (i.e., measured) and recorded in the many wait interface related views, such as v$system_event or v$session.

Second, from an Oracle perspective the process is truly "on cpu" because the process is not "waiting." Remember, an Oracle session is either in one of two states: CPU or WAIT. There is no third choice. So the words "CPU Wait" are really confusing.

Third, from an OS perspective or simply a non-Oracle perspective, the Oracle process is sitting in the CPU run queue.

I'm sure in some Oracle Corporation meeting the words "cpu wait" were considered a great idea, but they have caused lots of confusion. And I'm sure they're here to stay.

What Does This "CPU WAIT" Look Like In OEM?

In OEM, the "cpu wait" is a light green color. I grabbed a publically available screenshot off the internet and posted it below. Look familiar? 


OK, so it's really easy to spot in OEM. And if you've seen it before you know EXACTLY what I'm referring to.

What Is CPU Wait Time?

First, let's review what we do know.

1. DB CPU is true Oracle foreground process CPU consumption as reported by the OS through a system call, such as getrusage.

2. CPU Wait time is derived, that is, somebody at Oracle wrote code to calculate the "cpu wait" time.

3. CPU Wait time is a lousy name because it causes lots of confusion.

4. CPU Wait time is shown in OEM as a light green color. DB CPU is shown as a dark/normal green color.

Second, I need to define what I'll call the DB Time "gap." This is not an error and I am not implying something is wrong with database time, that it's not useful or anything like that. All I am saying is that sometimes DB Time does not equal DB CPU plus the non-idle wait time. Let's put that in a formula:

DB Time = DB CPU + non Idle Wait Time + gap
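
If you want to see the gap on your own system, here is a minimal sketch using the same views (just the formula above rearranged; the values are cumulative since instance startup, so in practice you difference two runs to get the gap over an interval, and time_waited_micro_fg assumes 11gR2 or later):

select db_time_s - (db_cpu_s + ni_wait_s) gap_s,
       db_time_s, db_cpu_s, ni_wait_s
from (select value/1000000 db_time_s from v$sys_time_model where stat_name = 'DB time'),
     (select value/1000000 db_cpu_s  from v$sys_time_model where stat_name = 'DB CPU'),
     (select sum(time_waited_micro_fg)/1000000 ni_wait_s
        from v$system_event where wait_class != 'Idle');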

Really, What Is CPU Wait Time?

Now I'm ready to answer the question, "What is CPU WAIT time?" Here is the answer stated multiple ways.

"CPU Wait" time is Oracle foreground process OS CPU run queue time.

I ran an experiment (detailed below) and as the OS CPU utilization increased, so did the DB Time "gap" implying that the gap is CPU run queue time or at least a significant part of it.

I ran an experiment and there was a strong correlation between OS CPU utilization and the DB Time "gap" implying that the gap is CPU run queue time.

I ran an experiment and using queuing theory I was able to predict the "gap" time implying that the gap is CPU run queue time. (Whoops... sorry. That's what I'll present in my next post!)

So I'm very comfortable stating that when DB Time is greater than Oracle process CPU consumption plus the non-idle wait time, it's probably the result of Oracle foreground process CPU run queue time.

Yes, there could be some math problems on Oracle's side, there could be uninstrumented time (for sure it's happened before), the operating system could be reporting bogus values or a host of other potential issues. But unless there is an obvious wrong value, I'm sticking with the experimental evidence.

Now I'm going to show the experimental evidence, that is, that the DB Time "gap" correlates with the OS CPU utilization.

Let The Data Drive Our Understanding

You can download all the data collection scripts, raw experimental data, Mathematica notebook files, graphic files, etc. HERE in a single zip file.

You should be able to run the experiment on any Linux Oracle test system. All you need is a logical IO load and for that I used my free opload tool, which you can download HERE.

The experiment placed an increasing logical IO load on a Linux Oracle 12c system until the operating system CPU utilization exceeded 90%. The load was increased 18 times. During each of the 18 loads, I gathered 31 three-minute samples. Each sample contains busy time (v$osstat), idle time (v$osstat), logical IO (v$sysstat "session logical reads"), non-idle wait time (v$system_event where wait_class != 'Idle'), DB CPU (v$sys_time_model), background cpu time (v$sys_time_model), database time (v$sys_time_model DB time) and the sample time (dual table current_timestamp).
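
For reference, here is a minimal sketch of the kind of snapshot described above (my approximation, not the actual collection script, which is in the downloadable zip; each experimental sample is the difference between two such snapshots taken three minutes apart):

-- one snapshot; difference two of these to form a sample
select (select value from v$osstat  where stat_name = 'BUSY_TIME')              busy_cs,
       (select value from v$osstat  where stat_name = 'IDLE_TIME')              idle_cs,
       (select value from v$sysstat where name = 'session logical reads')       lio,
       (select sum(time_waited_micro_fg)/1000000
          from v$system_event where wait_class != 'Idle')                       ni_wait_s,
       (select value/1000000 from v$sys_time_model where stat_name = 'DB CPU')  db_cpu_s,
       (select value/1000000 from v$sys_time_model
         where stat_name = 'background cpu time')                               bg_cpu_s,
       (select value/1000000 from v$sys_time_model where stat_name = 'DB time') db_time_s,
       current_timestamp                                                        sample_time
from dual;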

The CPU utilization was calculated using the "busy idle" method that I blog about HERE. This method is detailed in my Utilization On Steroids online video seminar.
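
In SQL terms, the busy-idle calculation looks roughly like this (a sketch; v$osstat times are cumulative centiseconds, so in practice you difference two snapshots rather than use the raw values):

select busy.value / (busy.value + idle.value) utilization
from (select value from v$osstat where stat_name = 'BUSY_TIME') busy,
     (select value from v$osstat where stat_name = 'IDLE_TIME') idle;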

The workload is defined as the logical IOs per second, lio/s.

Below is a table summarizing the experimental data. The times shown are the averages. If you look at the actual raw experimental data contained in the analysis pack, you'll notice the data is very consistent. This is not surprising since the load I placed should produce a very consistent workload.

Do you see the gaps? Look closely at load 18. The DB Time is 8891.4 seconds. But the sum of DB CPU (996.8 seconds) and the non-idle wait time (2719.2 seconds) only equals 3716.0. Yet DB Time is 8891.4. So the "gap" is 5175.4, which is DB Time (8891.4) minus DB CPU (996.8) minus the non-idle wait time (2719.2).

Note: Loads 11 and 12 were excluded because of a problem with my data collection. Sorry.

While we can numerically see the DB Time "gap" increase as the CPU utilization increases, check out the graphic in the next section!

The Correlation Between CPU Utilization And DB Time Gap
We can numerically and visually see that as the CPU utilization increases, so does the DB Time "gap." But is there a strong mathematical correlation? To determine this, I used all the experimental samples (except loads 11 and 12). Because there were 17 different workloads and with each workload I gathered 31 samples, the correlation comprises something like 527 samples. A pretty good sample set, I'd say.

The correlation coefficient is a strong 0.891. The strongest is 1.0 and the weakest is 0.

Graphically, here is the scatterplot showing the relationship between the CPU utilization and the DB Time "gap."

Don't expect the DB Time "gap" and OS CPU utilization correlation to be perfect. Remember that DB Time does not include Oracle background process CPU consumption, yet it is obviously part of the OS CPU utilization.

Summary
My experiment indicated the light-green "CPU wait time" is primarily Oracle foreground process operating system CPU run queue time. This is DB Time "gap" time.

My experiment also showed the "gap" time is highly correlated with CPU utilization, which means that as the CPU utilization increases, so does the "gap" time.

If there are Oracle Database instrumentation bugs or a host of other potential problems, that will also affect the "gap" time.

If you want a more complete and detailed DB Time formula, it would be this:

DB Time = DB CPU + Non Idle Wait Time + gap time

In my next post, I'll show you how to calculate the gap time based on queuing theory!

Thanks for reading!

Craig.








Categories: DBA Blogs

How To Approach Different Oracle Database Performance Problems

Tue, 2015-02-03 19:32
This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.
How To Approach Different Oracle Database Performance Problems

Jump Start Your Oracle Database Tuning Effort

Every Oracle Database Administrator will tell you no two performance problems are the same. But a seasoned Oracle DBA recognizes there are similarities...patterns. Fast problem pattern recognition allows us to minimize diagnosis time, so we can focus on developing amazing solutions.

I tend to group Oracle performance problems into four patterns. Quickly exploring these four patterns is what this article is all about.


You Can Not Possibly List Every Problem And Solution
When I teach, some Oracle Database Administrators want me to outline every conceivable problem along with the solution. Not only is the thought of this exhausting, it's not possible. Even my Stori product uses pattern matching. One of the keys to becoming a fantastic performance analyst is the ability to quickly look at a problem and then decide which diagnosis approach is best. For example, if you don't know the problem SQL (assuming there is one), tracing is not likely to be your best approach.

The Four Oracle Database Performance Patterns
Here are the four performance patterns I tend to group problems into.

The SQL Is Known
Many times there is a well-known SQL statement that is responsible for the poor performance. While I will always do a quick Oracle Time Based Analysis (see below) and verify the accused SQL, I will directly attack this problem by tuning with SQL-specific diagnostic and tuning tools.

But... I will also ask a senior application user if the users are using the application correctly. Sometimes new application users try to use a new application like their old application. It's like trying to drive a car by moving your feet as if you were riding a bicycle... not going to work and it's dangerous!

Business Process Specific

I find that when the business is seriously affected by application performance issues, that's when the "limited budget" is suddenly not so limited. When managers and their businesses are affected, they want action.

When I'm approached to help solve a problem, I always ask how the business is being affected. If I keep hearing about a specific business process or application module I know two things.

First, there are many SQL statements involved. Second, the problem is bounded by a business process or application. This is when I start the diagnostic process with an Oracle Time Based Analysis approach, which will result in multiple solutions to the same problem.

As I teach in my online seminar How To Tune Oracle With An AWR Report, users feel performance through time. So, if our analysis is time based, we can create a quantitative link between our analysis and their experience. If our analysis creates solutions that reduce time, then we can expect the user experience to improve. This, combined with my "3 Circle" approach, yields spot-on solutions very quickly.

While an Oracle Time Based Analysis is amazing, because Oracle does not instrument the details of its CPU consumption, we can't answer the question, "What's Oracle doing with all that CPU?" If you want to drill into this topic check out my online seminar, Detailing Oracle CPU Consumption: The Missing Link.

It's Just Slow
How many times have I experienced this... It's Just Slow!


If what the user is attempting to explain is true, the performance issue is affecting a wide range of business processes. The problem is probably not a single issue (but could be) and clearly the key SQL is not known. Again, this is a perfect problem scenario to apply an Oracle Time Based Analysis.

The reason I say this is because an OTBA will look at the problem from multiple perspectives, categorize Oracle time and develop solutions to reduce those big categories of time. If you also do a Unit Of Work Time Based Analysis, you can even anticipate the impact of your solutions! Do an OraPub website search HERE or search my blog for UOWTBA.

Random Incident That Quickly Appears And Vanishes
This is the most difficult problem to fix. Mainly because the problem "randomly" appears and can't be duplicated. (Don't even bother calling Oracle Support to help in this situation.) Furthermore, it's too quick for an AWR report to show its activity and you don't want to impact the production system by gathering tons of detailed performance statistics.

Even a solid Oracle Time Based Analysis will likely not help in this situation. Again, the problem is performance data collection and retention. The instrumented AWR or Statspack data does not provide enough detail. What we need is step-by-step activity... like a timeline.

Because this type of problem scares both DBAs and business managers, you will likely need to answer questions like this:

  • What is that blip all about?
  • Did this impact users?
  • Has it happened before?
  • Will it happen again?
  • What should we do about it?

The only way I know how to truly diagnose a problem like this is to do a session-level time-line analysis. Thankfully, this is possible using the Oracle Active Session History data. Both v$active_session_history and dba_hist_active_sess_history are absolutely key in solving problems like this.

ASH samples Oracle Database session activity once each second (by default). This is very different than measuring how long something takes, which is the data an AWR report is based upon. Because sampling is non-continuous, a lot of detail can be collected, stored and analyzed.
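
To give a rough idea of how a timeline falls out of ASH, here is a minimal sketch (a hypothetical query, not one of my OSM tools, and remember v$active_session_history requires the Diagnostics Pack license):

select to_char(sample_time, 'HH24:MI:SS.FF3') sample_tm,
       session_id,
       session_state,
       case session_state when 'WAITING' then event else 'ON CPU' end activity,
       sql_id
from v$active_session_history
where sample_time > sysdate - interval '15' minute
order by sample_time, session_id;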

A time-line type of analysis is so important, I enhanced my ASH tools in my OraPub System Monitor (OSM) toolkit to provide this type of analysis. If you want to check them out, download the OSM toolkit HERE, install it and read the osm/interactive/ash-readme.txt file.

As an example, using these tools you can construct an incident time-line like this:

HH:MM:SS.FFF User/Process Notes
------------ ------------- -----------------
15:18:28.796 suspect (837) started the massive update (see SQL below)

15:28:00.389 user (57) application hung (row lock on TM_SHEET_LINE_EXPLOR)
15:28:30.486 user (74) application hung (row lock on TM_SHEET_LINE_EXPLOR)
15:29:30.??? - row locks becomes the top wait event (16 locked users)
15:29:50.749 user (83) application hung (row lock on TM_SHEET_LINE_EXPLOR)

15:30:20.871 user (837) suspect broke out of update (implied)
15:30:20.871 user (57) application returned
15:30:20.871 user (74) application returned
15:30:20.871 user (83) application returned

15:30:30.905 smon (721) first smon action since before 15:25:00 (os thread startup)
15:30:50.974 user (837) first wait for undo - suspect broke out of update
15:30:50.974 - 225 active session, now top event (wait for a undo record)

15:33:41.636 smon (721) last PQ event (PX Deq: Test for msg)
15:33:41.636 user (837) application returned to suspect. Undo completed
15:33:51.670 smon (721) last related event (DFS lock handle)

Without ASH, seemingly random problems would be a virtually impossible nightmare scenario for an Oracle DBA.

Summary
It's true. You need the right tool for the job. And the same is true when diagnosing Oracle Database performance. What I've done above is group probably 90% of the problems we face as Oracle DBAs into four categories. And each of these categories needs a special kind of tool and/or diagnosis method.

Once we recognize the problem pattern and get the best tool/method involved to diagnose the problem, then we will know the time spent developing amazing solutions is time well spent.

Enjoy your work!

Craig.


Categories: DBA Blogs

I Have Lots Of Oracle Database Server Power But Performance Is Slow/Bad

Mon, 2015-01-19 19:14
This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.
I Have Lots Of Oracle Database Server Power But Performance Is Slow/Bad

Oracle Database parallelism and serialization are what we as Oracle Database Administrators live and die for. You have a screaming fast Oracle Database system and there is lots of computing power available.

But performance is unacceptable; users are screaming, the phone is ringing, and those fancy dashboards are flashing like it's Christmastime.

What is going on?! What can I do about it?! That's what this post is about.

Learn By Doing

Back in December on the third day of my Oracle Performance Firefighting class, I had each student bring in an AWR report from one of their systems that was giving them problems. (I keep my classes small, giving each student time to do their analysis and time for us to talk about it as a class.)
Get this:
Half of the systems had a similar "problem." I think it's important every DBA understands this "problem" because it's more common than most people believe.

There was plenty of computing power and the key SQL statement they cared about was a batch job. What was the core problem? The quick answer is "serialization," that is, a lack of parallelism. Exploring this using a very large production system AWR report and coming up with solutions is what this posting is all about.

Serialization Is Death

In Oracle systems, serialization is death and parallelism is life. Follow this line: business, end user, application designer, DBA, Oracle Database Kernel Architects (or whatever their title is), OS Administrators, OS designers, CPU designers and IO subsystem designers all have something in common. They work hard to parallelize tasks. Just one example: Oracle is designed to have multiple background and foreground processes running in parallel.

But all this parallelization effort can be wasted and minimized if a process turns into a serial work stream (at any level: Oracle, OS, business, etc.). The result is "slowness" because the wall time increases.

Available Power And Slowness Equals Opportunity

When I tune Oracle Database systems, I look for opportunities. And each of my solutions will specifically target an opportunity. When I see unused power and complaints of slowness, I look for ways to increase parallelism. Why? Because having available power combined with slowness likely means a serialization limitation exists.

For sure serialization may be necessary. Two examples come to mind: Oracle database memory serialization control (think: latches and mutexes) and business rules.

But if I can find a way to increase performance by using up available power by increasing parallelism, I'll likely be able to turn a slow serialization situation into a screaming fast parallelization situation!

How To Recognize A Serially Constrained System

It's easy to recognize a serially constrained system. Ask yourself these two questions. First, is there available CPU or IO power? Second, are there complaints of application "slowness"? If the answer to both of these questions is "Yes" then there is likely a serialization issue. Furthermore, the general solution is to use the available resources to our advantage. That is, find areas to increase parallelization, which will use the available resources and improve performance.

If you have the power, use it! What are you saving it for?

(There may be a very good answer to the "saving" but I'll save that for another article.)

Can I Be Out Of CPU And Be Serially Constrained?
Yes. An Oracle Database system can be serially constrained and be out of OS resources. A great example of this is when there is a raging Oracle memory serialization issue. If you see both significant Oracle latching or mutex wait time combined with a raging CPU bottleneck, you likely have a serialization issue... an Oracle Database memory structure access serialization issue.

So, while available power on a "slow" system likely means we have a serially constrained system there are situations in Oracle with a raging CPU bottleneck that also means there is likely a serialization issue.

Find Out: Is There Available CPU Power?

Here Is A Real Life Situation. To simplify, I'm going to focus on only instance number one. Look at instance number one in the picture below.


The above AWR report snippet shows RAC node #1 OS CPU utilization at 15%. This means that over the AWR report snapshot interval, the average CPU utilization was 15%. I never initially trust an AWR report for calculated results. Plus it's good practice to do the math yourself. If you use the super fast busy-idle method I have outlined in THIS POST and detailed in my online seminar, Utilization On Steroids, the utilization calculates to 16% ( 0.5/(0.5+2.7)=0.16 ). So the AWR Report's 15% for CPU "% Busy" looks to be correct.

Clearly with an average CPU utilization of 15%, we have an opportunity to use the unused CPU power to our advantage.

Find Out: Is There Available IO Power?

I am looking for fast IO responsiveness. That is, a low response time. A great way to get a quick view of IO subsystem responsiveness is to look at the average wait time for the event, db file sequential read.

The wait event, db file sequential read, is the time it takes to read a single block synchronously. I like to call it a pure IO read call: a) what time is it? b) make the IO call and wait until you get it, c) what time is it? d) calculate the delta and you have the wait time... and the IO read call response time! If you want more details, I wrote about this HERE, which includes a short video.
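
For a quick instance-wide check, here is a minimal sketch against v$system_event (values are cumulative since instance startup; an AWR report gives you the same average per snapshot interval):

select event,
       total_waits,
       time_waited_micro / total_waits / 1000 avg_wait_ms
from v$system_event
where event = 'db file sequential read';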

For our system, let's figure out the single block IO subsystem read response time. Using the same AWR report, here is a screen shot of the Top Time Events.


Again, I'm just going to focus on the first instance. If you look closely (middle right area), you'll see for instance number one, the average db file sequential read wait time is 2.22ms. That's fast!

There is no way a physical spinning disk is going to return a block in 2.22ms. This means that many of Oracle's single block read calls are being satisfied through some non-Oracle cache. Perhaps an OS cache or an IO subsystem cache. We can't tell, but we do know the block was NOT in Oracle's buffer cache, because a db file sequential read wait means the block was not found in Oracle's buffer cache.

A single block synchronous IO read call averaging 2.22ms means there is available IO read capacity and probably available write capacity as well. Again, just like with the OS CPU subsystem, we have unused power that we will try to use to our advantage.

At this point, I will assume there is also plenty of memory and network capacity available. So, the bottom line is we have a "slow" system combined with available CPU and available IO power. Wow! That is a great situation to be in. I call this, "low hanging fruit."

Real Life: Looking For The "Slow" SQL

At the top of this post, I mentioned that in my Firefighting class in each of the "serialization" cases, there was a key SQL statement that was part of a larger batch process. Keep in mind, that at this point in the analysis I did NOT know this. All I knew was that users were complaining and there was plenty of CPU and IO resources.

Usually, in this situation there is a relatively long running process. There could be lots of quick SQL statements involved, but usually this is not the case. And I'm hoping there is a key long running SQL statement that can be parallelized.

Long running can roughly be translated into "high elapsed time." I have written a number of articles about elapsed time (search my blog for: elapsed time) and even have a free tool you can use to gather more than simply the average elapsed time. And I have online seminars that touch on this subject: Tuning Oracle Using An AWR Report and also, Using Skewed Performance Data To Your Advantage. So there are lots of useful resources on this topic.

In the AWR report, I'm going to look closely at the SQL Statistics, in particular the "SQL ordered by Elapsed Time (Global)." What I really want is the statistics only for instance one, that is, not global. But that's all I have available. Plus the DBAs will/should know if the key SQL statement(s) are run on instance one. Here's the report.


In the report above, look at the elapsed times (second column on the left). Now looking right, find the "Execs", that is, the executions column. The executions column is the number of completed executions within this AWR snapshot range. If executions is zero, the SQL did not complete during the snapshot interval, that is, before the ending snapshot.
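
If you don't have an AWR license, a similar picture comes from the cursor cache. Here is a minimal sketch (only SQL still cached shows up; elapsed_time is in microseconds):

select *
from (select sql_id,
             executions,
             elapsed_time / 1000000 elapsed_s,
             elapsed_time / greatest(executions, 1) / 1000000 elapsed_s_per_exec
      from v$sqlstats
      order by elapsed_time desc)
where rownum <= 10;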

If you're wondering, these top elapsed time SQL statements are involved in batch processing. When I look at this, I see opportunity, fruit waiting to be harvested!

And I love this: Every DBA in the class in this situation said, "Oh! I know about this SQL. It's always causing problems." Now it's time to do something about it!

Real Life: Putting This All Together

We have identified available CPU and IO capacity. And we have identified THE elapsed time SQL statement. While I'm a pretty laid back kind of guy, at this point I start to apply some pressure. Why? Because the users are complaining, and we have identified an opportunity, the cause of the problem and the general solutions.

There are two general solutions:

1. Do less work. You want to empty a candy dish faster? Then start with less candy in the dish! If you want a SQL statement to run faster, tune the SQL so it touches fewer blocks.

2. Do the same amount of work, but group the work and run each group at the same time. This is parallelization! This is why the total elapsed time will not decrease (it will probably increase a little) but the wall time will likely decrease... and dramatically! Here is a LINK to a posting that contains a short video demonstrating the difference between elapsed time and wall time.

How To Parallelize (in summary)

There are many different ways to parallelize. But the goal is the same: use the available resources to reduce wall time (not necessarily the elapsed time). Perhaps the application can be redesigned to run in parallel streams. But that can take a very long time and be a real hassle. But in many cases, it's the best long term solution.

If you are short on time, are licensed for Oracle Parallel Query and the SQL has been optimized (oh boy... how many times have we all heard that before), you likely can use Oracle PQ. And of course, even if the SQL is not optimized, you can still run PQ and performance may be fantastic.
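
For example, assuming your system and data layout can tolerate it, a parallel hint is the quickest way to test PQ on a single statement (big_table and the degree of 8 are purely illustrative):

-- request a degree-of-parallelism of 8 for this statement only
select /*+ parallel(t, 8) */ count(*)
from big_table t;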

By the way, adding faster IO disks or more IO disks (whatever a "disk" is nowadays) will likely NOT work. Remember, the IO subsystem is performing wonderfully.

Thanks for reading and enjoy the mystery of your work!

Craig.
Categories: DBA Blogs

Do The LGWRs Always Sleep For The Full Three Seconds?

Mon, 2015-01-12 16:01
This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.
Do Oracle Database LGWRs (10g, 11g, 12c) Always Sleep For The Full Three Seconds?
Back in June I wrote (included a video) about the Oracle Database log writer(s) "3 second sleep rule." That's the rule we were all taught by our instructors when we started learning about Oracle yet never really knew if it was true. In that post, I demonstrated Oracle Database log writer background processes are normally put to sleep for three seconds.

In this post, I want to answer a related but different question.

Do Oracle Database log writer background processes ALWAYS sleep for the full three seconds? Our initial response would likely be, "Of course not! Because what if a foreground process commits during the three second sleep? The log writer(s) must wake up." That would make sense.

But, is this really true and what else could we learn by digging into this? I created an experiment to check this out, and that is what this post is all about.

The Experiment
In my June post I demonstrated the Three Second Rule. You will see this again below. But in this experiment we are looking for a situation where one of the 12c log writers wakes BEFORE its three second sleep ends.

You can download the experimental script I detail below HERE.

This is really tricky to demonstrate because of all the processes involved. There is the Oracle foreground process and, in 12c, there are multiple log writer background processes. Because this experiment follows a timeline, I needed to gather the process activity data and then somehow merge it all together in a way that we humans can understand.

What I did was an operating system trace ( strace ) of each process ( strace -p $lgwr ) with the timestamp option ( strace -p $lgwr -tt ), sending each process's output to a separate file ( strace -p $lgwr -tt -o lgwr.txt ). This was done for all four processes and of course, I needed to start the scripts to run in the background. Shown directly below are the log writer strace details.

lgwr=`ps -eaf | grep $sid | grep lgwr | awk '{print $2}'`
lg00=`ps -eaf | grep $sid | grep lg00 | awk '{print $2}'`
lg01=`ps -eaf | grep $sid | grep lg01 | awk '{print $2}'`

echo "lgwr=$lgwr lg00=$lg00 lg01=$lg01"

strace -p $lgwr -tt -o lgwr.str &
strace -p $lg00 -tt -o lg00.str &
strace -p $lg01 -tt -o lg01.str &

Once the log writers were being traced, I connected to sqlplus and launched the script below in the background as well.

drop table bogus;
create table bogus as select * from dba_objects where object_id in (83395,176271,176279,176280);
select * from bogus;
commit;
exec dbms_lock.sleep(2.1);

exec dbms_lock.sleep(2.2);
exec dbms_lock.sleep(2.3);
update bogus set object_name='83395' where object_id=83395;
exec dbms_lock.sleep(3.1);
update bogus set object_name='176271' where object_id=176271;
exec dbms_lock.sleep(3.2);
update bogus set object_name='176279' where object_id=176279;
exec dbms_lock.sleep(3.3);
update bogus set object_name='176280' where object_id=176280;
exec dbms_lock.sleep(3.4);
commit;
exec dbms_lock.sleep(3.5);
update bogus set object_name='89567' where object_id=89567;
exec dbms_lock.sleep(3.6);
commit;
exec dbms_lock.sleep(3.7);
exit;

Once the sqlplus session was connected,

sqlplus system/manager @/tmp/runit.bogus &
sleep 2

I grabbed its OS process id and started an OS trace on it as well:

svpr=`ps -eaf | grep -v grep | grep oracle$sid | awk '{print $2}' `
echo "svpr=$svpr"

strace -p $svpr -tt -o svpr.str &

Then I slept for 30 seconds, killed the tracing processes (not the log writers!):

sleep 30

for pid in `ps -eaf | grep -v grep | grep strace | awk '{print $2}'`
do
echo "killing pid $pid"
kill -2 $pid
done

Then I merged the trace files, sorted them by time, got rid of stuff in the trace files I didn't want to see and put the results into a final "clean" file.

merge=/tmp/strace.merge.bogus
rm -f $merge
for fn in lgwr lg00 lg01 svpr
do
cat ${fn}.str | awk -v FN=$fn '{print $1 " " FN " " $2 " " $3 " " $4 " " $5 " " $6 " " $7 " " $8 " " $9}' >> $merge
done

ls -ltr $merge
date
cat $merge | sort > /tmp/final.bogus

cat /tmp/final.bogus | grep -v times | grep -v getrusage | grep -v "svpr lseek" | grep -v clock_gettime | grep -v gettimeofday | grep -v "svpr read" | grep -v "svpr write" > /tmp/final.bogus.clean

The amazing thing is... this actually worked! Here is the output below:

19:11:41.981934 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {2, 200000000}) =
19:11:42.859905 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:11:43.986421 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:11:44.186404 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {2, 300000000}) =
19:11:44.982768 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:11:45.860871 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:11:46.499014 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 100000000}) =
19:11:46.989885 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:11:47.983782 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:11:48.861837 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:11:49.608154 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 200000000}) =
19:11:49.993520 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:11:50.984737 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:11:51.862921 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:11:52.817751 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 300000000}) =
19:11:52.997116 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:11:53.985784 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:11:54.863809 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:11:55.998974 lgwr open("/proc/41955/stat", O_RDONLY) = 19
19:11:55.999029 lgwr read(19, "41955 (ora_pmon_prod35) S 1 4195"..., 999) =
19:11:55.999075 lgwr close(19) = 0
19:11:55.999746 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:11:56.127326 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 400000000}) =
19:11:56.986935 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:11:57.864930 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:11:59.003212 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:11:59.531161 svpr semctl(7503875, 16, SETVAL, 0x7fff00000001) = 0
19:11:59.531544 lgwr semctl(7503875, 18, SETVAL, 0x7fff00000001) = 0
19:11:59.532204 lg00 pwrite(256, "\1\"\0\0\311\21\0\0\354\277\0\0\20\200{\356\220\6\0\0\r\0\0\0\367^K\5\1\0\0\0"..., 2048, 2331136) = 2048
19:11:59.532317 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {2, 480000000}) =
19:11:59.532680 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {0, 100000000}) =
19:11:59.537202 lg00 semctl(7503875, 34, SETVAL, 0x7fff00000001) = 0
19:11:59.537263 lg00 semctl(7503875, 16, SETVAL, 0x7fff00000001) = 0
19:11:59.537350 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:11:59.538483 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {2, 470000000}) =
19:11:59.540574 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 500000000}) =
19:12:00.865928 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:12:02.011876 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:12:02.537887 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:12:03.050381 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 600000000}) =
19:12:03.866796 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:12:05.014819 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:12:05.538797 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:12:06.657075 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 700000000}) =
19:12:06.867922 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:12:08.017814 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:12:08.539750 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:12:09.868825 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}

There is a lot of detail in the above output. I'm only going to make a few comments that pertain to the objectives of this post.

Oracle is using the semaphore call semtimedop to sleep. The beauty of this call is that it allows the process to be woken (that is, signaled) by another process! Keep that in mind as you follow the timeline.

Here we go:

19:11:41.981934. Notice the server process' "2, 2" and later the "2, 3", "3, 1" and "3, 2"? These are the result of the dbms_lock.sleep commands contained in the sqlplus script!

19:11:42.859905. Notice lg01 and the other log writer background processes always have a "3, 0" semtimedop call? That is their "3 second sleep."

Look at the first few lgwr entries. I've listed them here:

19:11:43.986421
19:11:46.989885
19:11:49.993520
19:11:52.997116

Notice anything strange about the above times? They are all just about 3 seconds apart from each other. That's the 3 second sleep in action. But that's not the focus of this post. So let's move on.

Read this slowly: I want to focus on just one part of the output, which is shown below. Notice the server process is sleeping for 3.4 seconds. If you look at the sqlplus script (near the top of this post), immediately after the 3.4 second sleep the server process issues a commit. Because the 3.4 second sleep starts at 19:11:56.1, I'm expecting to see some log writer activity 3.4 seconds later, at 19:11:59.5. This could occur in the middle of a log writer's 3 second sleep, which means we will likely see a log writer kick into action before its 3 second sleep completes! Let's take a look.

19:11:56.127326 svpr semtimedop(7503875, {{34, -1, 0}}, 1, {3, 400000000}) =
19:11:56.986935 lg00 semtimedop(7503875, {{18, -1, 0}}, 1, {3, 0}) =
19:11:57.864930 lg01 semtimedop(7503875, {{19, -1, 0}}, 1, {3, 0}) =
19:11:59.003212 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {3, 0}) =
19:11:59.531161 svpr semctl(7503875, 16, SETVAL, 0x7fff00000001) = 0
19:11:59.531544 lgwr semctl(7503875, 18, SETVAL, 0x7fff00000001) = 0
19:11:59.532204 lg00 pwrite(256, "\1\"\0\0\311\21\0\0\354\277\0\0\20\200{\356\220\6\0\0\r\0\0\0\367^K\5\1\0\0\0"..., 2048, 2331136) = 2048
19:11:59.532317 lgwr semtimedop(7503875, {{16, -1, 0}}, 1, {2, 480000000})

We can see the server process' 3.4 second sleep starting at time 19:11:56.1 and we can see the sleep end and the server process' next command begin at the expected time of 19:11:59.5. Next in the trace file output is the result of the commit. The commit results in the waking of both the lgwr and lg00 background processes.

But notice the lgwr background process started one of its 3 second sleeps at 19:11:59.0, which means it doesn't want to wake until 19:12:02.0. But look at when the lgwr process woke up. It woke up at 19:11:59.5, which is clearly before the expected time of 19:12:02.0. What you just noticed was the lgwr background process being signaled to wake up before its three second sleep completed.

But why did the lgwr need to be woken up? Because the server process' redo must be immediately written.

But it gets even better because the lgwr background process doesn't do the redo write! The lgwr process signals the lg00 process to do the write, which we can see occurs at time 19:11:59.5. Wow. Amazing!

What We Can Learn From This
Personally, I love these kinds of postings because we can see Oracle in action, demonstrating what we believe to be true. So what does all this actually demonstrate? Here's a list:

  1. We can see the 12c log writers involved. Not only lgwr.
  2. All log writer background processes initiate a sleep for the default three seconds. I have seen situations where it is not three seconds, but it appears the default is three seconds.
  3. The server process signals the lgwr process to write immediately after a commit is issued.
  4. The server process signals the lgwr process to write using a semaphore.
  5. The log writers (starting in 12c) can signal each other using semaphores. We saw lgwr signal the lg00 background process to write.
  6. The server process was performing updates over a 10+ second period, yet its redo was not written to disk until it committed. This demonstrates that ALL redo is not flushed every three seconds. (This is probably not what you learned... unless you joined one of my Oracle Performance Firefighting classes.)
  7. The log writers, while normally put to sleep for three seconds, can be woken mid-sleep for an urgent task (like writing committed data to an online redo log).

I hope you enjoyed this post!

Thanks for reading,

Craig.
Categories: DBA Blogs

Can A Background Process Impact A Foreground Process And Its Database Time?

Mon, 2014-12-29 19:03
This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.
Can A Background Process Impact A Foreground Process And Its Database Time?
Have you ever heard someone say, "Background processes do not impact foreground processes because they run in the background and in parallel with foreground processes." I've heard this hundreds of times!

While doing some performance research I came across a great example of how an Oracle Database background process can directly and significantly impact a foreground process.

The above quote represents a masterfully constructed lie; it contains both a lie and a truth. The mix of a truth and a lie makes understanding the reality of the situation difficult. In this post, I'll explain the truth, delve into the lie and relate it all to foreground process database time.

By the way, I am in no way saying there is something wrong with or incorrect about DB Time. I want to ensure this is clear from the very beginning of this post.

Just so there is no confusion, an Oracle foreground process is sometimes also called a server process or a shadow process. These terms can be used interchangeably in this post.

The Truth
Clearly background and foreground processes operate in parallel. I don't think any DBA would deny this. As I frequently say, "serialization is death and parallelism is life!" A simple "ps" command will visually show both Oracle background and foreground processes at work. But this in no way implies they do not impact each other's activity and performance.

In fact, we hope they do impact each other! Can you imagine what performance would be with the background processes NOT running in parallel?! What a performance nightmare that would be. But this is where the "no impact" lie lives.

The Lie
Most senior DBAs can point to a specific situation where Oracle cache buffer chain latch contention affected multiple foreground sessions. In this situation, foreground sessions were frantically trying to acquire a popular cache buffer chain latch. But this is a foreground session versus foreground session situation. While this example is important, this post is about when a background process impacts a foreground process.

Have you ever committed a transaction and had it hang while the foreground process waits on the "log file switch (checkpoint incomplete)" or, even worse, the "log file switch (archiving needed)" event? All the foreground process knows is that its statement can't finish because a required log switch has not occurred because a checkpoint is incomplete. What the server process does not know is that the checkpoint (CKPT), the database writer (DBWR) and the log writer (LGWR) background processes are involved. There is a good chance the database writer is frantically writing dirty buffers to the database (dbf) files so the LGWR can safely overwrite the associated redo in the next online redo log.

For example, if a server process issued a commit during the checkpoint, it will wait until the checkpoint is complete and the log writer has switched and can write into the next redo log. So, while the log writer background process is probably waiting on "log file parallel write" and the database writer is burning CPU and waiting on "db file parallel write", the foreground processes are effectively hung.

This is a classic example of how a background process can impact the performance of a foreground process.
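
If you suspect this is happening on your system right now, here is a minimal check against v$session (a sketch; the event names are the real wait event names):

select sid, sql_id, event, seconds_in_wait
from v$session
where event like 'log file switch%'
  and state = 'WAITING';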

A Demonstration Of The Lie
Here's a quick demonstration of the above situation. On an existing database in my lab, I created two 4MB redo logs and dropped all the other redo logs. I started a DML intensive workload. According to the alert.log file, the redo logs were switching every couple of seconds! Take a look at this:
$ tail -f /home/oracle/base/diag/rdbms/prod30/prod30/trace/alert*log
Thread 1 cannot allocate new log, sequence 2365
Checkpoint not complete
Current log# 4 seq# 2364 mem# 0: /home/oradata/prod30/redoA1.log
Mon Dec 29 11:02:09 2014
Thread 1 advanced to log sequence 2365 (LGWR switch)
Current log# 5 seq# 2365 mem# 0: /home/oradata/prod30/redoA2.log
Thread 1 cannot allocate new log, sequence 2366
Checkpoint not complete
Current log# 5 seq# 2365 mem# 0: /home/oradata/prod30/redoA2.log
Thread 1 advanced to log sequence 2366 (LGWR switch)
Current log# 4 seq# 2366 mem# 0: /home/oradata/prod30/redoA1.log
Thread 1 cannot allocate new log, sequence 2367
Checkpoint not complete
Current log# 4 seq# 2366 mem# 0: /home/oradata/prod30/redoA1.log
Thread 1 advanced to log sequence 2367 (LGWR switch)
Current log# 5 seq# 2367 mem# 0: /home/oradata/prod30/redoA2.log
Thread 1 cannot allocate new log, sequence 2368
Checkpoint not complete
Current log# 5 seq# 2367 mem# 0: /home/oradata/prod30/redoA2.log
Mon Dec 29 11:02:20 2014

Obviously not what you want to see on a production Oracle system! (But my guess is many of you have.)

Using my OSM realtime session sampler tool (rss.sql - related blog posting HERE) I sampled the log writer every half a second. (There is only one log writer background process because this is an Oracle 11g database, not an Oracle Database 12c system.) If the log writer session showed up in v$session as an active session, it would be picked up by rss.sql.  Both "ON CPU" and "WAIT" states are collected. Here is a sample of the output.


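If you don't have rss.sql handy, a minimal one-shot sketch of this kind of probe looks like this (hypothetical; the real tool samples repeatedly on a timer and records the results):

-- one-shot sample: is the log writer active, and in what state?
select sid,
       program,
       decode(state, 'WAITING', 'WAIT', 'ON CPU') session_state,
       event
from v$session
where status = 'ACTIVE'
  and program like '%(LGWR)%';
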
It's very obvious the log writer is doing some writing. But we can't tell from the above output if the process is impacting other sessions. It would have also been very interesting to sample the database writer, but I didn't do that. To determine if the background processes are impacting other sessions, I needed to find a foreground session that was doing some commits. I noticed that session 133, a foreground process, was busy doing some DML and committing as it processed its work. Just as with the log writer background process, I sampled this foreground process once every 0.5 second. Here's a sample of the output.


Wow. The foreground process is waiting a lot for the current checkpoint to be completed! So... this means the foreground process is being effectively halted until the background processes involved with the checkpoint have finished their work.

This is a great example of how Oracle background processes can impact the performance of an Oracle foreground process.

But let's be clear. Without the background processes, performance would be even worse. Why? Because all that work done in parallel and in the background would have to be done by each foreground process AND all that work would have to be closely controlled and coordinated. And that would be a performance nightmare!

DB Time Impact On The Foreground Process

Just for the fun of it, I wrote a script to investigate DB Time, CPU consumption, non-idle wait time and the wait time for the "log file switch (checkpoint incomplete)" wait event for the foreground process mentioned above (session 133). The script simply gathers some session details, sleeps for 120 seconds, again gathers some session details, calculates the differences and displays the results. You can download the script HERE. Below is the output for the foreground process, session 133.
SQL> @ckpttest.sql 133

Table dropped.

Table created.

PL/SQL procedure successfully completed.

CPU_S_DELTA NIW_S_DELTA DB_TIME_S_DELTA CHECK_IMPL_WAIT_S
----------- ----------- --------------- -----------------
2.362 117.71 119.973692 112.42

1 row selected.

Here is a quick description of the output columns.

  • CPU_S_DELTA is the CPU seconds consumed by session 133, which is the time model statistic DB CPU.
  • NIW_S_DELTA is the non-idle wait time for session 133, in seconds.
  • DB_TIME_S_DELTA is the DB Time statistic for session 133, which is the time model statistic DB Time.
  • CHECK_IMPL_WAIT_S is the wait time only for event "log file switch (checkpoint incomplete)" for session 133, in seconds.

Does the time fit together as we expect? The "log file switch..." wait time is part of the non-idle wait time. The DB Time total is very close to the CPU time plus the non-idle wait time. Everything seems to add up nicely.
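
If you want to eyeball these values without the full delta script, here is a minimal point-in-time sketch (the values are cumulative for the session, so take two samples and subtract to get the deltas; 133 is the SID from above):

select (select value/1000000 from v$sess_time_model
         where sid = 133 and stat_name = 'DB time')  db_time_s,
       (select value/1000000 from v$sess_time_model
         where sid = 133 and stat_name = 'DB CPU')   db_cpu_s,
       (select sum(time_waited_micro)/1000000 from v$session_event
         where sid = 133 and wait_class != 'Idle')   niw_s,
       (select time_waited_micro/1000000 from v$session_event
         where sid = 133
           and event = 'log file switch (checkpoint incomplete)') ckpt_wait_s
from dual;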

To summarize: Oracle background processes directly impacted the database time for a foreground process.

In Conclusion...
First, for sure Oracle foreground and background processes impact each other... by design, for increased performance. Sometimes on real production Oracle Database systems things get messy and work that we hoped would be done in parallel must become momentarily serialized. The log file switch example above is an example of this.

Second, the next time someone tells you that an Oracle background process does not impact the performance of a foreground process, ask them if they have experienced a "log file switch (checkpoint incomplete)" situation. Pause until they say, "Yes." Then just look at them and don't say a word. After a few seconds you may see an "oh... I get it" look on their face. But if not, simply point them to this post.

Thanks for reading and enjoy your work!

Craig.




Categories: DBA Blogs

Is Oracle Database Time Correct? Something Is Not Quite Right.

Mon, 2014-12-22 19:17
This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.
Is Oracle Database Time Correct? Something Is Not Quite Right.

Oracle Database tuning and performance analysis is usually based on time. As I blogged HERE, the Oracle "database time" statistic is more interesting than simply "time spent in the database." It is the sum of CPU consumption and non-idle wait time. And Elapsed Time is the sum of all the database time related to perhaps a session or a SQL statement execution. However...

If you do the fundamental math, you'll notice the numbers don't always add up. In fact, they rarely match. In this posting, I want to demonstrate this mismatch and I want you to see this on your systems!

I'll include experimental data from a number of different systems, the statistical analysis (including pictures) and provide a tool you can download for free from OraPub.com to check out the situation on your systems.

Checking DB Time Math

DB Time is defined as "time spent in the database." This is the sum of Oracle process CPU consumption plus non-idle wait time. Usually we don't derive DB Time. The Time Model Statistics view v$sys_time_model contains the DB Time statistic. But this view also contains the DB CPU statistic. Since there is no sum of non-idle wait time, most people just assume everything is fine.

However, if you run the simple query below on your system, it could look something like this:
SQL> l
1 select db_time_s, db_cpu_s, tot_ni_wt_s
2 from (select value/1000000 db_time_s from v$sys_time_model where stat_name = 'DB time' ),
3 (select value/1000000 db_cpu_s from v$sys_time_model where stat_name = 'DB CPU' ),
4* (select sum(TIME_WAITED_MICRO_FG)/1000000 tot_ni_wt_s from v$system_event where wait_class != 'Idle' )
SQL> /

DB_TIME_S DB_CPU_S TOT_NI_WT_S
---------- ---------- -----------
330165.527 231403.925 119942.952

1 row selected.
If you add up the DB CPU and the total non-idle wait time, the value is 351,346.877. Woops! 351K does not equal 330K. What happened on my Oracle Database 12c (12.1.0.2.0) system? As I have demonstrated in this POSTING (which contains videos of this) and in my online seminar training HERE, many times DB Time does nearly equal DB CPU plus the non-idle wait time. But clearly in the above situation something does not seem quite right.

Checking DB Time On Your Systems
To demonstrate the possibility of a DB Time mismatch, I created a simple plsql tool. You can download this free tool or do an OraPub.com search for "db time tool". The tool, which is easily configurable, takes a number of samples over a period of time and displays the output.


Here is an example of the output.

OraPub DB Time Test v1a 26-Sep-2014. Enjoy but use at your own risk.
.
Starting to collect 11 180 second samples now...
All displayed times are in seconds.
.
anonymer Block abgeschlossen
..........................................................................
... RAW OUTPUT (keep the output for your records and analysis)
..........................................................................
.
sample#, db_time_delta_v , db_cpu_delta_v, tot_ni_wait_delta_v, derived_db_time_delta_v, diff_v, diff_pct_v
.
1, 128,4, 128,254, ,103, 128,357266, ,043, 0
2, 22,014, 3,883, 17,731, 21,614215, ,399, 1,8
3, 1,625, 1,251, ,003, 1,253703, ,371, 22,8
4, 13,967, 12,719, 1,476, 14,194999, -,228, -1,6
5, 41,086, 41,259, ,228, 41,486482, -,4, -1
6, 36,872, 36,466, ,127, 36,593884, ,278, ,8
7, 38,545, 38,71, ,137, 38,847459, -,303, -,8
8, 37,264, 37,341, ,122, 37,463525, -,199, -,5
9, 22,818, 22,866, ,102, 22,967141, -,149, -,7
10, 30,985, 30,614, ,109, 30,723831, ,261, ,8
11, 5,795, 5,445, ,513, 5,958586, -,164, -2,8
.
The test is complete.
.
All displayed times are in seconds.

The output is formatted to make it easy to statistically analyze. The far right column is the percent difference between the reported DB Time and the calculated DB Time. In the above example, they are pretty close. Get the tool and try it out on your systems.

Some Actual Examples
I want to quickly show you four examples from a variety of systems. You can download all the data in the "analysis pack" HERE. The data, for each of the four systems, contains the raw DB Time Test output (like in the section above), the statistical numeric analysis output from the statistical package "R", the actual "R" script and the visual analysis using "smooth histograms" also created using "R."

Below is the statistical numeric summary:


About the columns: only the "craig" system is mine; the others are real production or DEV/QA systems. The statistical columns all reference the far right column of the DB Time Test Tool's output, which is the percent difference between the reported DB Time and the calculated DB Time. Each sample set consists of eleven 180 second samples. A P-Value greater than 0.05 means the reported and calculated DB Time differences are normally distributed. This is not important in this analysis, but gives me clues if there is a problem with the data collection.

As you can easily see, two of the systems' "DB Time" differences are greater than 10% and one is over 20%. The data collected shows that something is not quite right... but that's about it.

What Does This Mean In Our Work?
Clearly something is not quite right. There are a number of possible reasons and this will be the focus of my next few articles.

However, I want to say that even though the numbers don't match perfectly and sometimes they are way off, this does not negate the value of a time based analysis. Remember, we are not trying to land a man on the moon. We are diagnosing performance to derive solutions that (usually) aim to reduce database time. I suspect that in all four cases shown, we would not be misled.

But this does highlight the requirement to also analyze performance from a non-Oracle-database-centric perspective. I always look at the performance situation from an operating system perspective, an Oracle-centric perspective and an application (think: SQL, processes, user experience, etc.) perspective. This "3 Circle" analysis will reduce the likelihood of making a tuning diagnosis mistake. So in case DB Time is completely messed up, by diagnosing performance from the other two "circles" you will know something is not right.

If you want to learn more about my "3-Circle" analysis, here are two resources:
  1. Paper. Total Performance Management. Do an OraPub search for "3 circle" and you'll find it.
  2. Online Seminar: Tuning Oracle Using An AWR Report. I go DEEP into an Oracle Time Based Analysis while keeping it day-to-day production system practical.
In my next few articles I will drill down into why there can be a "DB Time mismatch," what to do about it and how to use this knowledge to our advantage.

Enjoy your work! There is nothing quite like analyzing performance and tuning Oracle database systems!!

Craig.

Categories: DBA Blogs

Changing The Number Of Oracle Database 12c Log Writers

Mon, 2014-12-08 22:51
In an Oracle Database 12c instance you will likely see multiple log writer (LGWR) background processes. When you first start the Oracle instance you will likely see a parent and two redo workers. This is a very big deal and something many of us have been waiting for - for many years!

While I'm excited about the change, if I can't control the number of LGWRs I could easily find myself once again constrained by the lack of LGWRs!

So, my question is: how do I manipulate the number of LGWRs from the default? And what is the default based on? It's these types of questions that led me on this quest. I hope you enjoy the read!


Serialization Is Death
Multiple LGWRs is great news because serialization is death to computing performance. Think of it like this. A computer program is essentially lines of code and each line of code takes a little bit of time to execute. A CPU can only process N lines of code per second. This means every serially executing program has a maximum throughput capability. With a single log writer (LGWR) background process, the amount of redo that can be processed is similarly constrained.

An Example Of Serialization Throughput Limitation
Suppose a CPU can process 1000 instructions per millisecond. Also, assume through some research a DBA determined it takes the LGWR 10 instructions to process 10 KB of redo. (I know DBAs who have taken the time to figure this stuff out.) Given these two pieces of data, how much redo can the CPU theoretically process per second?

? MB of redo/sec = (1000 instr / 1 ms) * (10 KB redo / 10 instr) * (1000 ms / 1 sec) * (1 MB / 1000 KB) = 1000 MB of redo/sec

This is a best case scenario. As you can see, any sequential process can become a bottleneck. One solution to this problem is to parallelize.

Note: Back in April of 2010 I posted a series of articles about parallelism. If you are interested in this topic, I highly recommend you READ THE POSTS.

Very Cool! Multiple 12c LGWRs... But Still A Limit?

Since serialization is death... and parallelism is life, I was really excited when I saw that, by default, my 12c Oracle instance had two redo workers in addition to the "parent" log writer. On my Oracle version 12.1.0.2.0 Linux machine this is what I see:
$ ps -eaf|grep prod40 | grep ora_lg
oracle 54964 1 0 14:37 ? 00:00:00 ora_lgwr_prod40
oracle 54968 1 0 14:37 ? 00:00:00 ora_lg00_prod40
oracle 54972 1 0 14:37 ? 00:00:00 ora_lg01_prod40

This is important. While this is good news, unless Oracle or I have the ability to change and increase the number of LGWR redo workers, at some point the two redo workers will become saturated, bringing us back to the same serial LGWR process situation. So, I want and need some control.

Going Back To Only One LGWR
Interestingly, starting in Oracle Database version 12.1.0.2.0 there is an instance parameter _use_single_log_writer. I was able to REDUCE the number of LGWRs to only one by setting the instance parameter _use_single_log_writer=TRUE. But that's the wrong direction I want to go!
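For reference, here is a minimal sketch of how that change can be made. This is a hidden (underscore) parameter, so consider this a test-system-only exercise:

SQL> alter system set "_use_single_log_writer"=true scope=spfile;
SQL> shutdown immediate
SQL> startup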

More Redo Workers: "CPU" Instance Parameters
I tried a variety of CPU related instance parameters with no success. Always two redo workers.

More Redo Workers: Set Event...
Using my OSM script listeventcodes.sql I scanned the Oracle events (not wait events) but was unable to find any related Oracle events. Bummer...

More Redo Workers: More Physical CPUs Needed?
While talking to some DBAs about this, one of them mentioned they heard Oracle sets the number of 12c log writers based on the number of physical CPUs. Not the number of CPU cores, but the number of physical CPUs. On a Solaris box with 2 physical CPUs (verified using the command psrinfo -pv), upon startup there were still only two redo workers.

$ psrinfo -p
2
$ psrinfo -pv
The physical processor has 1 virtual processor (0)
UltraSPARC-III (portid 0 impl 0x14 ver 0x3e clock 900 MHz)
The physical processor has 1 virtual processor (1)
UltraSPARC-III (portid 1 impl 0x14 ver 0x3e clock 900 MHz)

More Redo Workers: Adaptive Behavior?
Looking closely at the Solaris LGWR trace file I repeatedly saw this:

Created 2 redo writer workers (2 groups of 1 each)
kcrfw_slave_adaptive_updatemode: scalable->single group0=375 all=384 delay=144 rw=7940

*** 2014-12-08 11:33:39.201
Adaptive scalable LGWR disabling workers
kcrfw_slave_adaptive_updatemode: single->scalable redorate=562 switch=23

*** 2014-12-08 15:54:10.972
Adaptive scalable LGWR enabling workers
kcrfw_slave_adaptive_updatemode: scalable->single group0=1377 all=1408 delay=113 rw=6251

*** 2014-12-08 22:01:42.176
Adaptive scalable LGWR disabling workers

It looks to me like Oracle has programmed in some sweeeeet logic to adapt the number of redo workers based on the redo load.

So I created six Oracle sessions that simply inserted rows into a table and ran all six at the same time. But it made no difference in the number of redo workers. No increase or decrease or anything! I let this DML load run for around five minutes. Perhaps that wasn't long enough, perhaps the load was not what Oracle was looking for, or perhaps it was something else. But the number of redo workers always remained at two.
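For the curious, each session ran something along the lines of the sketch below. The table name and row count are made up for illustration and my actual load scripts differ; the per-row commit is simply there to drive redo and commit activity:

create table redo_load_test (id number, pad varchar2(100));

begin
  for i in 1 .. 1000000 loop
    insert into redo_load_test values (i, rpad('x',100,'x'));
    commit;  -- commit per row to keep the log writers busy
  end loop;
end;
/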

Summary & Conclusions
It appears at instance startup the default number of Oracle Database 12c redo workers is two. It also appears that Oracle has either already built or is building the ability for Oracle to adapt to changing redo activity by enabling and disabling redo workers. Perhaps the number of physical CPUs (not CPU cores but physical CPUs) plays a part in this algorithm.

While this was not my research objective, I did discover a way to set the number of redo workers back to the traditional single LGWR background process.

While I enjoyed doing the research for this article, it was disappointing that I was unable to influence Oracle to increase the number of redo workers. I sure hope Oracle either gives me control or the adaptive behavior actually works. If not, two redo workers won't be enough for many Oracle systems.

All the best in your Oracle performance endeavors!

Craig.


Categories: DBA Blogs

The Perfect Gift For The Oracle DBA: Top 5 DBA T-Shirts

Mon, 2014-12-01 17:26
It's that time of year again and I can already hear it, "Dad, what do you want for Christmas?" This year I'm taking action. Like forecasting Oracle performance, I'm taking proactive action.

Like most people reading this, you have a, let's say, unique sense of humor. I stumbled across the ultimate geek website that has an astonishing variety of t-shirts aimed at those rare individuals like us who get a rush from understanding the meaning of an otherwise cryptic message on a t-shirt.

I picked my Top 5 DBA Geek T-Shirts based on the challenges, conflicts and joys of being an Oracle DBA. With each t-shirt I saw, a story came to mind almost immediately. I suspect you will have a similar experience that rings strangely true.

So here they are—the Top 5 T-Shirts For The Oracle DBA:
Number 5: Change Your Password
According to Slash Data the top password is now "Password". I guess the upper-case "P" makes people feel secure, especially since last year's top password was "123456" and EVERYBODY knows that's a stupid password. Thanks to new and improved password requirements, the next most popular password is "12345678". Scary but not surprising.

As Oracle Database Administrators, and as those who listened to Troy Ligon's presentation at last year's IOUG conference know, passwords are clearly not safe. ANY passwords. Hopefully in the coming years, passwords will be a thing of the past.


Number 4: Show Your Work
Part of my job as a teacher and consultant is to stop behavior like this: I ask a DBA, "I want to understand why you want to make this change to improve performance." And the reply is something like one of these:

  1. Because it has worked on our other systems.
  2. I did a Google search and an expert recommended this.
  3. Because the box is out of CPU power and there are latching issues, so increasing spin_count will help.
  4. Because we have got to do something, and quick!

I teach Oracle DBAs to think from the user experience down to the CPU cycles, developing a chain of cause and effect. If we can understand the cause and effect relationships, perhaps we can disrupt poor performance and turn it to our favor. "Showing your work" and actually writing it down can be really helpful.

Number 3: You Read My T-Shirt
Why do managers and users think their presence in close proximity to mine will improve performance or perhaps increase my productivity? Is that what they learn in Hawaii during "end user training"?

What's worse is when a user or manager wants to talk about it... while I'm obviously concentrating on a serious problem.

Perhaps if I wear this t-shirt, stand up, turn around and remain silent they will stop talking and get the point. We can only hope.

Number 2: I'm Here Because You Broke Something
Obnoxious but true. Why do users wonder why performance is "slow" when they do a blind query returning ten million rows and then scroll down looking for the one row they are interested in... Wow. The problem isn't always the technology... but you know that already.

Hint to Developers: Don't let users do a drop down or a lookup that returns millions or even thousands or even hundreds of rows... Please for the love of performance optimization!


Number 1 (drum roll): Stand Back! I'm Going To Try SCIENCE
One of my goals in optimizing Oracle Database performance is to be quantitative. And whenever possible, repeatable. Add some basic statistics and you've got science. But stand back because, as my family tells me, it does get a little strange sometimes.

But seriously, being a "Quantitative Oracle Performance Analyst" is always my goal because my work is quantifiable, reference-able and sets me up for advanced analysis.


So there you go! Five t-shirts for the serious and sometimes strange Oracle DBA. Not only will these t-shirts prove and reinforce your geeky reputation, but you'll get a small yet satisfying feeling your job is special...though a little strange at times.

All the best in your Oracle performance endeavors!

Craig.
Categories: DBA Blogs

Off May Not Be Totally Off: Is Oracle In-Memory Database 12c (12.1.0.2.0) Faster?

Mon, 2014-11-17 21:56
Most Oracle 12c installations will NOT be using the awesome Oracle Database in-memory features available starting in version 12.1.0.2.0. This experiment is about the performance impact of upgrading to 12c but disabling the in-memory features.

Every experiment I have performed comparing buffer processing rates clearly shows any version of 12c performs better than 11g. However, in my previous post, my experiment clearly showed a performance decrease after upgrading from 12.1.0.1.0 to 12.1.0.2.0.

This posting is about why this occurred and what to do about it. The bottom line is this: make sure "off" is "totally off."

Turn it totally off, not partially off
What I discovered is that by default the in-memory column store feature is not "totally disabled." My experiment clearly indicates that unless the DBA takes action, not only could there be a license agreement violation, but a partially disabled in-memory column store slightly slows logical IO processing compared to the 12c non in-memory column store option. Still, any 12c version processes buffers faster than 11g.

My experiment: specific and targeted
This is important: The results I published are based on a very specific and targeted test and not on a real production load. Do not use my results in making a "should I upgrade" decision. That would be stupid and an inappropriate use of my experimental results. But because I publish every aspect of my experiment and it is easily reproducible, it is a valid data point with which to have a discussion and also highlight various situations that DBAs need to know about.

You can download all my experimental results HERE. This includes the raw sqlplus output, the data values, the free R statistics package commands, spreadsheet with data nicely formatted and lots of histograms.

The instance parameter settings and results
Let me explain this by first showing the instance parameters and then the experimental results. There are some good lessons to learn!

Pay close attention to the inmemory_force and inmemory_size instance parameters.

SQL> show parameter inmemory

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
inmemory_clause_default string
inmemory_force string DEFAULT
inmemory_max_populate_servers integer 0
inmemory_query string ENABLE
inmemory_size big integer 0
inmemory_trickle_repopulate_servers_ integer 1
percent
optimizer_inmemory_aware boolean TRUE

SQL> show sga

Total System Global Area 7600078848 bytes
Fixed Size 3728544 bytes
Variable Size 1409289056 bytes
Database Buffers 6174015488 bytes
Redo Buffers 13045760 bytes

In my experiment using the above settings the median buffer processing rate was 549.4 LIO/ms. Looking at the inmemory_size and the SGA contents, I assumed the in-memory column store was disabled. If you look at the actual experimental result file "Full ds2-v12-1-0-2-ON.txt", which contains the explain plan of the SQL used in the experiment, there is no mention of the in-memory column store being used. My assumption, which I think is a fair one, was that the in-memory column store had been disabled.

As you'll see I was correct, but only partially correct.

The parameter settings below are when the in-memory column store was totally disabled. The key is changing the default inmemory_force parameter value from DEFAULT to OFF.

SQL> show parameter inmemory

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
inmemory_clause_default string
inmemory_force string OFF
inmemory_max_populate_servers integer 0
inmemory_query string ENABLE
inmemory_size big integer 0
inmemory_trickle_repopulate_servers_ integer 1
percent
optimizer_inmemory_aware boolean TRUE
SQL> show sga

Total System Global Area 7600078848 bytes
Fixed Size 3728544 bytes
Variable Size 1291848544 bytes
Database Buffers 6291456000 bytes
Redo Buffers 13045760 bytes

Again, the SGA does not show any in-memory memory space. In my experiment with the above "totally off" settings, the median buffer processing rate was 573.5 LIO/ms compared to the "partially off" 549.4 LIO/ms. Lesson: Make sure off is truly off.
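In case you're wondering how to make the "totally off" change, here is a minimal sketch of one way to do it (I recycled the instance after changing the parameter):

SQL> alter system set inmemory_force=off scope=spfile;
SQL> shutdown immediate
SQL> startup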

It is an unfair comparison!
It is not fair to compare the "partially off" with the "totally off" test results. Now that I know the default inmemory_force must be changed to OFF, the real comparison should be made with the non in-memory column store version 12.1.0.1.0 and the "totally disabled" in-memory column store version 12.1.0.2.0. This is what I will summarize below. And don't forget all 12c versions showed a significant buffer processing increase compared to 11g.

The key question: Should I upgrade?
You may be thinking, if I'm NOT going to license and use the in-memory column store, should I upgrade to version 12.1.0.2.0? Below is a summary of my experimental results followed by the key points.


1. The non column store version 12.1.0.1.0 was able to process 1.1% more buffers/ms (median: 581.7 vs 573.5) compared to the "totally disabled" in-memory column store version 12.1.0.2.0. While this is statistically significant, a 1.1% buffer processing difference is probably not going to make-or-break your upgrade.

2. Oracle Corporation, I'm told, knows about this situation and is working on a fix. But even if they don't fix it, in my opinion my experimental "data point" would not warrant not upgrading to the in-memory column store version 12.1.0.2.0 even if you are NOT going to use the in-memory features.

3. Visually (see below) the non in-memory version 12.1.0.1.0 and the "totally off" in-memory version 12.1.0.2.0 sample sets look different. But they are pretty close. And as I mentioned above, statistically they are "different."

Note for the statistically curious: The red color 12.1.0.1.0 non in-memory version data set is highly variable. I don't like to see this in my experiments. Usually this occurs when a mixed workload sometimes impacts performance, when I don't take enough samples, or when my sample duration is too short. To counteract this, in this experiment I captured 31 samples. I also performed the experiment multiple times and the results were similar. What I could have done was used more application data to increase the sample duration time. Perhaps that would have made the data clearer. I could have also used another SQL statement and method to create the logical IO load.

What I learned from this experiment
To summarize this experiment, four things come to mind:

1. If you are not using an Oracle Database feature, completely disable it. My mistake was thinking the in-memory column store was disabled when I set its memory size to zero and "confirmed" it was off by looking at the SGA contents.

2. All versions of 12c I have tested are clearly faster at processing buffers than any version of 11g.

3. There is a very slight performance decrease when upgrading from Oracle Database version 12.1.0.1.0 to 12.1.0.2.0.

4. It is amazing to me that with all the new features poured into each new Oracle Database version, the developers have been able to keep the core buffer processing rate nearly at the previous version's level. That is an incredible accomplishment. While some people may view this posting as a negative hit against the Oracle Database, it is actually a confirmation of how awesome the product is.

All the best in your Oracle performance tuning work!

Craig.

Categories: DBA Blogs

Is Oracle Database 12c (12.1.0.2.0) Faster Than Previous Releases?

Tue, 2014-11-11 01:31
I was wondering if the new Oracle Database 12c version 12.1.0.2.0 in-memory column store feature will SLOW performance when it is NOT being used. I think this is a fair question because most Oracle Database systems will NOT be using this feature.

While the new in-memory column store feature is awesome and significant, with each new Oracle feature there is additional kernel code. And if Oracle is not extremely careful, these new lines of Oracle kernel code can slow down the core of Oracle processing, that is, buffer processing in Oracle's buffer cache.

Look at it this way, if a new Oracle release requires 100 more lines of kernel code to be executed to process a single buffer, that will be reflected in how many buffers Oracle can process per second.

To put it bluntly, this article is the result of my research comparing core buffer processing rates between Oracle Database versions 11.2.0.2.0, 12.1.0.1.0 and 12.1.0.2.0.

With postings like this, it is very important for everyone to understand the results I publish are based on a very specific and targeted test and not on a real production load. Do not use my results in making a "should I upgrade" decision. That would be stupid and an inappropriate use of my experimental results. But because I publish every aspect of my experiment and it is easily reproducible, it is a valid data point with which to have a discussion and also highlight various situations that DBAs need to know about.

There are two interesting results from this research project. This article is about the first discovery and my next article will focus on the second. The second is by far the most interesting!

FYI. Back in August of 2013 I performed a similar experiment where I compared Oracle Database versions 11.2.0.2.0 with 12.1.0.1.0. I posted the article HERE.

Why "Faster" Means More Buffer Gets Processed Per Second
For this experiment when I say "faster" I am referring to raw buffered block processing. When a buffer is touched in the buffer cache it is sometimes called a buffer get or a logical IO. But regardless of the name, every buffer get increases the instance statistic, session logical reads.

I like raw logical IO processing experiments because they are central to all Oracle Database processing. Plus, with each new Oracle release, as additional functionality is inserted it is likely more lines of Oracle kernel code will exist. To maintain performance with added functionality is an incredible feat. It's more likely the core buffer processing will be slower because of the new features. Is this the case with Oracle's in-memory column store?

How I Setup The Experiment
I have included all the detailed output, scripts, R commands and output, data plots and more in the Analysis Pack that can be downloaded HERE.

There are a lot of ways I could have run this experiment. But two key items must exist for a fair comparison. First, all the processing must be in cache. There can be no physical read activity. Second, the same SQL must be run during the experiment and have the same execution plan. This implies the Oracle 12c column store will NOT be used. A different execution plan is considered "cheating," as a bad plan will clearly lose. Again, this is a very targeted and specific experiment.
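If you want to verify the plans yourself, one simple way (a generic approach, not necessarily how the Analysis Pack scripts do it) is to pull the plan from the cursor cache on each system:

SQL> select * from table(dbms_xplan.display_cursor('&sql_id', null, 'BASIC'));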

The experiment compares the buffer get rates for a given SQL statement. For each Oracle version, I gathered 33 samples and excluded the first two, just to ensure caching was not an issue. The SQL statement runs for around 10 seconds, processes around 10.2M rows and touches around 5M buffers. I checked to ensure the execution plans are the same for each Oracle version. (Again, all the details are in the Analysis Pack for your reading pleasure.)
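If you want to see the basic arithmetic without downloading the tool, below is a rough sketch (an assumed approach; the tool itself is more careful). It samples the session's own 'session logical reads' statistic before and after the test SQL and computes buffer gets per millisecond. Note that dbms_utility.get_time returns centiseconds, hence the multiply by 10:

variable lio0 number
variable t0 number

begin
  select m.value into :lio0
  from v$mystat m, v$statname n
  where m.statistic# = n.statistic# and n.name = 'session logical reads';
  :t0 := dbms_utility.get_time;  -- centiseconds
end;
/

-- ... run the LIO-intensive test SQL here ...

select (m.value - :lio0) / ((dbms_utility.get_time - :t0) * 10) lio_per_ms
from v$mystat m, v$statname n
where m.statistic# = n.statistic# and n.name = 'session logical reads';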

I ran the experiment on a Dell server. Here are the details:
$ uname -a
Linux sixcore 2.6.39-400.17.1.el6uek.x86_64 #1 SMP Fri Feb 22 18:16:18 PST 2013 x86_64 x86_64 x86_64 GNU/Linux
To make this easier for myself, I used my CPU Speed Test tool (version 1i) to perform the test. I blogged about this last month HERE. The latest version of this tool can be downloaded HERE.

The Results, Statistically
Shown below are the experimental results. Remember, the statistic I'm measuring is buffer gets per millisecond.


Details about the above table: The "Normal" column indicates whether the statistical distribution of the 31 samples is normal. If the p-value (far right column) is greater than 0.05 then I'll say they are normal. In all three cases, the p-value is less than 0.05. In fact, if you look at the histograms contained in the Analysis Pack, every histogram is visually clearly not normal. As you would expect, the "Average" and the "Median" are the statistical mean and median. The "Max" is the largest value in the sample set. The "Std Dev" is the standard deviation, which doesn't mean much since our sample sets are not normally distributed.

As I blogged about before, Oracle Database 12c buffer processing is faster than Oracle Database 11g. However, the interesting part is that the Oracle version with the in-memory column store, 12.1.0.2.0, is slower than the previous version of 12c, 12.1.0.1.0. In fact, in my experiment the in-memory column store version is around 5.5% slower! This means version 12.1.0.1.0 "out of the box" can process logical buffers around 5.5% faster! Interesting.

In case you're wondering, I used the default out-of-the-box in-memory column store settings for version 12.1.0.2.0. I checked the in-memory size parameter, inmemory_size, and it was indeed set to zero. Also, when I start up the Oracle instance there is no mention of the in-memory column store.

Statistically Comparing Each Version
As an important side bar, I did statistically compare the Oracle Database versions. Why? Because while a 5.5% decrease in buffer throughput may seem important, it may not be statistically significant, meaning the difference could simply be an artifact of our sample sets.

So going around saying version 12.1.0.2.0 is "slower" by 5.5% would be misleading. But in my experiment, it would NOT be misleading because the differences in buffer processing are statistically significant. The relevant experimental details are shown below.

Version A    Version B    Statistical   p-value
                          Difference
----------   ----------   -----------   -------
11.2.0.1.0   12.1.0.1.0   YES           0.0000
11.2.0.1.0   12.1.0.2.0   YES           0.0000
12.1.0.1.0   12.1.0.2.0   YES           0.0000

In all three cases the p-value was less than 0.05, signifying the two sample sets are statistically different. Again, all the details are in the Analysis Pack.

The chart above shows the histograms of both Oracle Database 12c version sample sets together. Visually they look very separated and different with no data crossover. So from both a numeric and visual perspective there is a real difference between 12.1.0.1.0 and 12.1.0.2.0.


What Does This Mean To Me
To me this is surprising. First, there is a clear buffer processing gain upgrading from Oracle 11g to 12c. That is awesome news! But I was not expecting a statistically significant 5.5% buffer processing decrease upgrading to the more recent 12.1.0.2.0 version. Second, this has caused me to do a little digging to perhaps understand the performance decrease. The results of my experimental journey are really interesting...I think more interesting than this posting! But I'll save the details for my next article.

Remember, if you have any questions or concerns about my experiment you can run the experiment yourself. Plus all the details of my experiment are included in the Analysis Pack.

All the best in your Oracle performance tuning work!

Craig.

Categories: DBA Blogs

Does Increasing An Oracle Background Process OS Priority Improve Performance?

Sat, 2014-11-01 22:00
Does increasing an Oracle Database background process operating system priority improve performance? As you might expect, the answer is, "It depends."

In this posting I will explain the results of an experiment where I increase the Oracle Database 12c log writer background processes operating system priority.

In my experiment I created a clear CPU bottleneck and the top wait event was log file parallel write. I gathered some data. Next I increased all the log writer background process priorities. Again, I gathered some data and then I analyzed the two data sets. The results were disappointing, not surprising, but a fundamental rule in performance tuning was demonstrated and reinforced.

You can download the "analysis pack" which contains the raw experimental data, histogram, statistical results and the "R" statistical package commands HERE. You can download the total time delta reporting script (ttpctx.sql) I show below, which is contained within my OraPub System Monitor (OSM) Toolkit HERE.

Before I could gather some data, I needed to create the appropriate system load, the data gather scripts and the data analysis scripts. Here's a quick overview of each.

Increasing The LGWR Processes Priority
If you are not familiar with changing Oracle Database background OS process priority, I blogged about how to do this HERE.

My experiment consisted of creating two situations and statistically comparing them to see if increasing the LGWR background process OS priority increased performance. The only difference in the "two situations" was the instance parameter, _high_priority_processes. For the "normal priority" part of the experiment, the default "LMS*|VKTM" was used. For the "high priority" part of the experiment the parameter was changed to "LMS*|VKTM|LG*". The "LG*" caused the increase in the Linux OS priority of all log writer processes from the default 19 to 41.

Ready for some version specifics? Oracle continues to make OS priority related instance parameter changes... even within 12c releases. Since this experiment was done with Oracle Database version 12.1.0.1.0 the parameter default was "LMS*|VKTM" not "LMS*" as with version 12.1.0.2.0. Also, in 12.1.0.2.0 VKTM is placed into a new parameter, _highest_priority_processes.

Generating The DML With CPU Bottleneck Load
To generate the DML workload, I used my OP Load Generator. You can download it HERE. It's quick, easy and I'm familiar with it. I kept increasing the number of DML processes until the CPU utilization was pegged at 100%. While the database server was clearly out of CPU power, the log writer background processes were rarely seen as the top CPU consuming process. I used the "top" program to monitor the process activity. The top CPU consuming processes were almost always the Oracle server/foreground/shadow processes. As I'll explain in the Results section below, this has significant ramifications on the results.

Oracle Time Based Analysis Summary
The data collection script was similar to the one I used back in 2012 when collecting data regarding Oracle's commit write facility. Essentially, I collected multiple three minute samples of the delta elapsed time, user commits, total non-idle wait time and CPU consumption. In both the normal and high priority runs, around 99% of the database time was CPU consumption and around 1% was Oracle non-idle wait time. Also, the top wait event (~70%) was log file parallel write (display name: log file redo write).
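Below is a simplified sketch of the kind of snapshot that collection takes (the actual script is more involved). Run it once, sleep for the sample period, run it again and subtract to get the deltas; the time model values are in microseconds:

select (select value from v$sysstat where name = 'user commits') user_commits,
       (select value from v$sys_time_model where stat_name = 'DB CPU') db_cpu_us,
       (select value from v$sys_time_model
         where stat_name = 'background cpu time') bg_cpu_us,
       (select sum(time_waited_micro) from v$system_event
         where wait_class != 'Idle') ni_wait_us
from dual;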

If you want to master an Oracle Time Based Analysis, check out my online seminar, Tuning Oracle Using An AWR Report. It will teach you how to use an AWR report to optimize Oracle performance so users will feel the difference.

The OraPub System Monitor Toolkit script ttpctx.sql report below was taken during the "normal priority" log writer load.

SQL> @ttpctx

Database: prod35 16-MAY-14 06:07pm
Report: ttpctx.sql OSM by OraPub, Inc. Page 1
Total Time Activity (39 sec interval)

Avg Time Time Wait
Time Component % TT % WT Waited (ms) (sec) Count(k)
------------------------------------- ------- ------- ----------- ----------- --------
CPU consumption: Oracle SP + BG procs 98.91 0.00 0.000 238.716 0
log file redo write 0.73 67.56 8.082 1.770 0
control file parallel write 0.18 16.41 33.077 0.430 0
target log write size 0.05 4.20 6.111 0.110 0
oracle thread bootstrap 0.03 3.05 40.000 0.080 0
os thread creation 0.02 1.53 20.000 0.040 0
commit: log file sync 0.01 0.76 10.000 0.020 0
enq: CR - block range reuse ckpt 0.01 0.76 20.000 0.020 0
Disk file operations I/O 0.00 0.00 0.000 0.000 0

The OSM report below was taken during the "high priority" log writer load.

SQL> @ttpctx

Database: prod35 16-MAY-14 09:25pm
Report: ttpctx.sql OSM by OraPub, Inc. Page 1
Total Time Activity (41 sec interval)

Avg Time Time Wait
Time Component % TT % WT Waited (ms) (sec) Count(k)
------------------------------------- ------- ------- ----------- ----------- --------
CPU consumption: Oracle SP + BG procs 98.92 0.00 0.000 238.733 0
log file redo write 0.83 77.01 8.272 2.010 0
control file parallel write 0.08 7.28 14.615 0.190 0
target log write size 0.05 4.98 5.909 0.130 0
oracle thread bootstrap 0.03 3.07 40.000 0.080 0
os thread creation 0.02 1.92 25.000 0.050 0
commit: log file sync 0.01 0.77 10.000 0.020 0
enq: CR - block range reuse ckpt 0.01 0.77 20.000 0.020 0
enq: RO - fast object reuse 0.00 0.38 10.000 0.010 0


Data Collection
For the normal priority load, 27 three minute samples were collected. For the high priority situation there were 30 three minute samples collected. (I forgot why there were only 27 samples collected for the normal priority.) I collected the elapsed time, total non-idle wait time, total CPU consumption (v$sys_time_model: db_cpu + background cpu time) and total user commits.

In this experiment more user commits processed per second means better performance.


Experimental Results
I used the free statistics package "R" (www.r-project.org) to analyze the data. I demonstrate how to get, install and use "R" in my online video seminar, Using Skewed Data To Your Advantage.

With the normal log writer process priority, an average of 984.5 commits/sec and a median of 983.0 commits/sec occurred. With the LG* high process priority, an average of 993.6 commits/sec and a median of 991.0 commits/sec occurred. While the "high priority" situation was able to process more commits per second, is this statistically significant?

The red "smoothed" histogram is the normal priority situation and the blue smoothed histogram is when the log writers were set to the higher priority. The more separated the two histograms, the more "different" the sample sets, the more likely there is a statistically significant difference and the more likely a user would feel the difference. Looking at the above histogram plot, there does not appear to be a real difference. But let's do a proper significance test!

Because both sample sets are normally distributed (details are in the Analysis Pack), I could use a simple t-test. R produced a p-value of 0.04451. To be statistically "different" I want the p-value to be less than 0.05 and it is. What does this mean?

While statistically and numerically the commit rates are different, I wouldn't expect any special performance tuning award! In fact, the hassles with cycling a production instance and setting underscore/hidden parameters would make it very unlikely I would increase the OS priority of the log writer background processes. I want to see a big performance difference.

To Summarize... What I Learned... Again
This situation is a perfect example of focusing on the wrong thing! While there is a clear operating system CPU bottleneck and the top wait event is about redo, the log writers are not suffering from a want/need of CPU resources. I suspect the server processes want more CPU resources, but they are NOT the processes whose OS priority we increased.

If the log writers were suffering from a lack of CPU resources and fighting for CPU resources, I would expect to see them consuming CPU resources along with the Oracle server processes. And I would definitely expect to see them near the top of the "top" process monitor... especially when their priority has been increased!

Because of this "misguided" tuning effort, this experiment does not build a case for or against changing the log writer priority. What it reinforces is in our quest to optimize performance, make sure we focus on the right thing.

As a side note, this is a great statistical analysis example for two reasons. First, our sample sets look similar, but statistically they are not. Second, while they are statistically different, the performance impact will not be very different. And my guess is the users won't feel a thing... except frustration and anger.

To Super Summarize
When considering increasing a background process's operating system priority, make sure the process is in need of CPU and is not able to get it. In this situation, the DBA could have been misled by the CPU bottleneck. But upon closer inspection of the log writers from an operating system perspective, and knowing the wait event "log file parallel write" is probably more about IO than CPU (Oracle perspective), it would be unlikely that increasing the log writer processes' OS priority would help increase the commits per second.

Thanks for reading!

Craig.

Categories: DBA Blogs

How To Change The Priority Of Oracle Background Processes

Mon, 2014-10-27 18:31

Before you get in a huff, it can be done! You can change an Oracle Database background process priority through an instance parameter! I'm not saying it's a good idea, but it can be done.

In this post I explore how to make the change, just how far you can take it and when you may want to consider changing an Oracle background process priority.

To get your adrenaline going, check out the instance parameter _high_priority_processes on one of your production Oracle systems with a version of 11 or greater. Here is an example using my OSM tool, ipx.sql, on my Oracle Database version 12.1.0.2.0.
SQL> @ipx _highest_priority_processes
Database: prod40 27-OCT-14 02:22pm
Report: ipx.sql OSM by OraPub, Inc. Page 1
Display ALL Instance Parameters

Instance Parameter and Value Description Dflt?
-------------------------------------------------- -------------------- -----
_highest_priority_processes = VKTM Highest Priority TRUE
Process Name Mask
Then at the Linux prompt, I did:
$ ps -eo pid,class,pri,nice,time,args | grep prod40
2879 TS 19 0 00:00:00 ora_pmon_prod40
2881 TS 19 0 00:00:01 ora_psp0_prod40
2883 RR 41 - 00:02:31 ora_vktm_prod40
2889 TS 19 0 00:00:01 ora_mman_prod40
2903 TS 19 0 00:00:00 ora_lgwr_prod40
2905 TS 19 0 00:00:01 ora_ckpt_prod40
2907 TS 19 0 00:00:00 ora_lg00_prod40
2911 TS 19 0 00:00:00 ora_lg01_prod40
...
Notice the "pri" for priority of the ora_vktm_prod40 process? It is set to 41 while all the rest of the Oracle background processes are set to the default of 19. Very cool, eh?

Surprised By What I Found
Surprised? Yes, surprised because changing Oracle process priority is a Pandora's box. Just imagine if an Oracle server (i.e., foreground) process has its priority lowered just a little and then attempts to acquire a latch or a mutex. If it doesn't get the latch quickly, it might never get it!

From a user experience perspective, sometimes performance is really quick and other times the application just hangs.

This actually happened to a customer of mine years ago when the OS started reducing a process's priority after it consumed a certain amount of CPU. I learned that when it comes to Oracle processes, they are programmed to expect an even process priority playing field. If you try to "game" the situation, do so at your own risk... not Oracle's.

Then why did Oracle Corporation allow background process priority to be changed? And why did Oracle Corporation actually change a background process's priority?!

Doing A Little Exploration
It turns out there are a number of "priority" related underscore instance parameters! On my 11.2.0.1.0 system there are 6 "priority" parameters. On my 12.1.0.1.0 system there are 8 "priority" parameters. On my 12.1.0.2.0 system there are 13 "priority" parameters! So clearly Oracle is making changes! In all cases, the parameter I'm focusing on, "_high_priority_processes", exists.
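If you don't have my ipx.sql handy, you can get a similar list with the classic x$ fixed-table query sketched below. It must be run as SYS, and the usual cautions about underscore parameters apply:

select p.ksppinm name, v.ksppstvl value
from x$ksppi p, x$ksppcv v
where p.indx = v.indx
  and p.ksppinm like '%priority%';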

In this posting, I'm going to focus on my Oracle Database 12c version 12.1.0.2.0 system. While you may see something different in your environment, the theme will be the same.

While I'll be blogging about all four of the below parameters, in this posting my focus will be on the _high_priority_processes parameter. Below are the defaults on my system:
_high_priority_processes        LMS*
_highest_priority_processes VKTM
_os_sched_high_priority 1
_os_sched_highest_priority 1

Messing With The LGWR Background Processes
I'm not testing this on a RAC system, so I don't have an LMS background process. When I saw the "LMS*" I immediately thought, "regular expression." Hmmm... I wonder if I can change the LGWR background process. So I made the instance parameter change and recycled the instance. Below shows the instance parameter change:
SQL> @ipx _high_priority_processes
Database: prod40 27-OCT-14 02:36pm
Report: ipx.sql OSM by OraPub, Inc. Page 1
Display ALL Instance Parameters

Instance Parameter and Value Description Dflt?
-------------------------------------------------- -------------------- -----
_high_priority_processes = LMS*|LGWR High Priority FALSE
Process Name Mask

Below is an operating system perspective using the ps command:

ps -eo pid,class,pri,nice,time,args | grep prod40
...
5521 RR 41 - 00:00:00 ora_vktm_prod40
5539 TS 19 0 00:00:00 ora_dbw0_prod40
5541 RR 41 - 00:00:00 ora_lgwr_prod40
5545 TS 19 0 00:00:00 ora_ckpt_prod40
5547 TS 19 0 00:00:00 ora_lg00_prod40
5551 TS 19 0 00:00:00 ora_lg01_prod40
...
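By the way, the change itself is just an ALTER SYSTEM on the underscore parameter followed by an instance recycle. A minimal sketch (again, this is a hidden parameter, so test systems only):

SQL> alter system set "_high_priority_processes"='LMS*|LGWR' scope=spfile;
SQL> shutdown immediate
SQL> startup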

How Far Can I Take This?
At this point in my journey, my mind was ablaze! The log file sync wait event can be really difficult to deal with, especially when there is a CPU bottleneck. Hmmm... Perhaps I can increase the priority of all the log writer background processes?

So I made the instance parameter change and recycled the instance. Below shows the instance parameter change:
SQL> @ipx _high_priority_processes
Database: prod40 27-OCT-14 02:44pm
Report: ipx.sql OSM by OraPub, Inc. Page 1
Display ALL Instance Parameters

Instance Parameter and Value Description Dflt?
-------------------------------------------------- -------------------- -----
_high_priority_processes = LMS*|LG* High Priority FALSE
Process Name Mask

Below is an operating system perspective using the ps command:

ps -eo pid,class,pri,nice,time,args | grep prod40
...
5974 TS 19 0 00:00:00 ora_psp0_prod40
5976 RR 41 - 00:00:00 ora_vktm_prod40
5994 TS 19 0 00:00:00 ora_dbw0_prod40
5996 RR 41 - 00:00:00 ora_lgwr_prod40
6000 TS 19 0 00:00:00 ora_ckpt_prod40
6002 RR 41 - 00:00:00 ora_lg00_prod40
6008 RR 41 - 00:00:00 ora_lg01_prod40
6014 TS 19 0 00:00:00 ora_lreg_prod40
...

So now all the log writer background processes have a high priority. My hope would be that if there is an OS CPU bottleneck and the log writer background processes wanted more CPU, I now have the power to give that to them! Another tool in my performance tuning arsenal!

Security Hole?
At this point, my exuberance began to turn into paranoia. I thought, "Perhaps I can increase the priority of an Oracle server process or perhaps any process." If so, that would be a major Oracle Database security hole.

With fingers trembling, I changed the instance parameters to match an Oracle server process and recycled the instance. Below shows the instance parameter change:

SQL> @ipx _high_priority_processes
Database: prod40 27-OCT-14 02:52pm
Report: ipx.sql OSM by OraPub, Inc. Page 1
Display ALL Instance Parameters

Instance Parameter and Value Description Dflt?
-------------------------------------------------- -------------------- -----
_high_priority_processes = High Priority FALSE
LMS*|LG*|oracleprod40 Process Name Mask

Below is an operating system perspective using the ps command:

$ ps -eo pid,class,pri,nice,time,args | grep prod40
...
6360 TS 19 0 00:00:00 ora_psp0_prod40
6362 RR 41 - 00:00:00 ora_vktm_prod40
6366 TS 19 0 00:00:00 ora_gen0_prod40
6382 RR 41 - 00:00:00 ora_lgwr_prod40
6386 TS 19 0 00:00:00 ora_ckpt_prod40
6388 RR 41 - 00:00:00 ora_lg00_prod40
6394 RR 41 - 00:00:00 ora_lg01_prod40
6398 TS 19 0 00:00:00 ora_reco_prod40
...
6644 TS 19 0 00:00:00 oracleprod40...
...

OK, that didn't work so how about this?

SQL> @ipx _high_priority_processes
Database: prod40 27-OCT-14 02:55pm
Report: ipx.sql OSM by OraPub, Inc. Page 1
Display ALL Instance Parameters

Instance Parameter and Value Description Dflt?
-------------------------------------------------- -------------------- -----
_high_priority_processes = High Priority FALSE
LMS*|LG*|*oracle* Process Name Mask

Let's see what happened at the OS.

$ ps -eo pid,class,pri,nice,time,args | grep prod40
...
6701 RR 41 - 00:00:00 ora_vktm_prod40
6705 RR 41 - 00:00:00 ora_gen0_prod40
6709 RR 41 - 00:00:00 ora_mman_prod40
6717 RR 41 - 00:00:00 ora_diag_prod40
6721 RR 41 - 00:00:00 ora_dbrm_prod40
6725 RR 41 - 00:00:00 ora_vkrm_prod40
6729 RR 41 - 00:00:00 ora_dia0_prod40
6733 RR 41 - 00:00:00 ora_dbw0_prod40
...
6927 RR 41 - 00:00:00 ora_p00m_prod40
6931 RR 41 - 00:00:00 ora_p00n_prod40
7122 TS 19 0 00:00:00 oracleprod40 ...
7124 RR 41 - 00:00:00 ora_qm02_prod40
7128 RR 41 - 00:00:00 ora_qm03_prod40

Oh Oh... That's not good! Now EVERY Oracle background process has a higher priority and my Oracle server process does not.

So my "*" wildcard caused all the Oracle processes to be included. If all the processes have a high priority, then the log writer processes have no advantage over the others. And to make matters even worse, my goal of increasing the server process priority did not occur.

However, this is actually very good news because it appears this is not an Oracle Database security hole! To me, it looks like the priority parameter is applied during instance startup for just the background processes. Since my server process was started after the instance was started and was for sure not included in the list of background processes, its priority was not affected. Good news for security, not such good news for a performance optimizing fanatic such as myself.

Should I Ever Increase A Background Process Priority?
Now that we know how to increase an Oracle Database background process priority, when would we ever want to do this? The short answer is probably never. But the long answer is the classic, "it depends."

Let me give you an example. Suppose there is an OS CPU bottleneck and the log writer background processes are consuming lots of CPU while handling all the associated memory management when server processes issue commits. In this situation, performance may benefit by making it easier for the log writer processes to get CPU cycles, therefore improving performance. But don't even think about doing this unless there is a CPU bottleneck. And even then, be very very careful.

In my next blog posting, I'll detail an experiment where I changed the log writer background processes' priority.

Thanks for reading!

Craig.



Categories: DBA Blogs

11 Tips To Get Your Conference Abstract Accepted

Wed, 2014-10-08 20:21
This is what happens when your abstract is selected!
Ready for some fun!? It's that time of year again and the competition will be intense. The "call for abstracts" for a number of Oracle Database conferences is about to close.

The focus of this posting is how you can get a conference abstract accepted.

As a mentor, Track Manager and active conference speaker I've been helping DBAs get their abstracts accepted for many years. If you follow my 11 tips below, I'm willing to bet you will get a free pass to any conference you wish in any part of the world.

1. No Surprises! 
Track Manager After A Surprise
The Track Manager wants no surprises, great content and a great presentation. Believe me when I say, they are looking for ways to reduce the risk of a botched presentation, a cancellation or a no show. Your abstract submission is your first way to show you are serious and will help make the track incredibly awesome.

Tip: In all your conference communications, demonstrate a commitment to follow through.

2. Creative Title.
The first thing everyone sees is the title. I can personally tell you, if the title does not pique my curiosity without sounding stupid, then unless I know the speaker is popular I will not read the abstract. Why do I do this? Because as a Track Manager, I know conference attendees will do the same thing! And as a Track Manager, I want attendees to want to attend sessions in my track.

Tip: Find two people, read the title to them and ask what they think. If they say something like, "What are you going to talk about?" that's bad. Rework the title.

3. Tell A Story
The abstract must tell a compelling story. Oracle conferences are not academic conferences! There needs to be some problem along with a solution complete with drama woven into the story.

Tip: People forget bullet points, but they never forget a good story.

4. Easy To Read
The abstract must be easy to review. The abstract reviewers may have over a hundred abstracts to review. Make it a good quick read for the reviewers and your chances increase.

Tip: Have your computer read your abstract back to you. If you don't say, "Wow!" rework the abstract. 

5. Be A Grown-Up
You can increase the perception that you will physically show up and put on a great show at the conference by NOT putting emoji, bullet points, or your name and title into your abstract, and by not pushing a product or service. NEVER copy/paste from a powerpoint outline into the abstract or outline. (I've seen people do this!)

Tip: Track Managers do not want to baby sit you. They want an adult who will help make their track great.

6. Submit Introductory Level Abstracts
I finally figured this out a couple years ago. Not everyone is ready for a detailed understanding of cache buffer chain architecture, diagnosis, and solution development. Think of it from a business perspective. Your market (audience) will be larger if your presentation is less technical. If this bothers you, read my next point.

Tip: Submit both an introductory level version and advanced level version of your topic.

7. Topics Must Be Filled
Not even the Track Manager knows what people will submit. And you do not know what the Track Manager is looking for. And you do not know what other people are submitting. Mash this together and it means you must submit more than one abstract. I know you really, really want to present on topic X. But would you rather not have an abstract accepted?

Tip: Submit abstracts on multiple topics. It increases your chances of being accepted.

8. Submit Abstract To Multiple Tracks
This is similar to submitting both an introductory and an advanced version of your abstract. Here's an example: If there is a DBA Bootcamp track and a Performance & Internals track, craft your Bootcamp version to have a more foundational/core feel to it. And craft your Performance & Internals version to feel more technical and advanced.

Do not simply change the title; the abstracts cannot be the same. If the conference managers or the Track Manager feel you are trying to game the conference, you present a risk to the conference and their track, and your abstracts will be rejected. So be careful and thoughtful.

Tip: Look for ways to adjust your topic to fit into multiple tracks.

9. Great Outline Shows Commitment
If the reviewers have read your title and abstract, they are taking your abstract seriously. Now is the time to close the deal by demonstrating you will put on a great show. And this means you already have in mind an organized and well thought out delivery. You convey this with a fantastic outline. I know it is difficult to create an outline BUT the reviewers also know this AND having a solid outline demonstrates to them you are serious, you will show up, and put on a great show.

Tip: Develop your abstract and outline together. This strengthens both and develops a kind of package the reviewers like to see.

10. Learning Objectives Show Value
You show the obvious value of your topic through the learning objectives. Personally, I use these to help keep me focused on my listener, not just what I'm interested in at the moment. Because I love my work, I tend to think everyone else does too... not so. I must force myself to answer the question, "Why would a DBA care about this topic?"

Tip: Develop your learning objectives by asking yourself, "When my presentation is over, what do I want the attendees to remember?"

11. Submit About Problems You Solved
Submit on the topics you have personally explored and found fascinating. Every year, every DBA has had to drill deep into at least one problem. This concentrated effort means you know the topic very well. And this means you are qualified to tell others about it! People love to hear from people who are fascinated about something. Spread the good news resulting from a "bad" experience.

Tip: Submit on topics you have explored and are fascinated with.

How Many Abstracts Should I Submit?
It depends on the conference, but for a big North America conference like ODTUG, RMOUG and IOUG I suggest at least four.

Based on what I wrote above, pick three topics, perhaps create both an introductory and advanced version and look to see if it makes sense to submit to multiple tracks. That means you'll probably submit at least four abstracts. It's not as bad as it sounds, because you will only have perhaps three core abstracts. All the others are modifications to fit a specific need. Believe me, when you receive the acceptance email, it will all be worth it!

See you at the conference!

Craig.


Categories: DBA Blogs

Comparing SQL Execution Times From Different Systems

Mon, 2014-10-06 19:17
Suppose it's your job to identify SQL that may run slower in the about-to-be-upgraded Oracle Database. It's tricky because no two systems are alike. Just because the SQL run time is faster in the test environment doesn't mean the decision to upgrade is a good one. In fact, it could be disastrous.

For example: if a SQL statement runs 10 seconds in production and runs 20 seconds in QAT, but the production system is twice as fast as QAT, is that a problem? It's difficult to compare SQL run times when the same SQL resides in different environments.

In this posting, I present a way to remove the CPU speed differences, so an appropriate "apples to apples" SQL elapsed time comparison can be made, thereby improving our ability to more correctly detect risky SQL that may be placed into the upgraded production system.

And, there is a cool, free, downloadable tool involved!

Why SQL Can Run Slower In Different Environments
There are a number of reasons why a SQL's run time is different in different systems. An obvious reason is a different execution plan. A less obvious and much more complex reason is a workload intensity or type difference. In this posting, I will focus on CPU speed differences. Actually, what I'll show you is how to remove the CPU speed differences so you can appropriately compare two SQL statements. It's pretty cool.

The Mental Gymnastics
If a SQL statement's elapsed time in production is 10 seconds and 20 seconds in QAT, that’s NOT an issue IF the production system is twice as fast.

If this makes sense to you, then what you did was mentally adjust one of the systems so it could be appropriately compared. This is how I did it:

10 seconds in production * production is 2 times as fast as QA  = 20 seconds 

And in QA the SQL ran in 20 seconds… so really they ran “the same” in both environments. If I am considering placing the SQL from the test environment into the production environment, then this scenario does not raise any risk flags. The "trick" is determining "production is 2 times as fast as QA" and then creatively using that information.

Determining The "Speed Value"

Fortunately, there are many ways to determine a system's "speed value." Basing the speed value on Oracle's ability to process buffers in memory has many advantages: a real load is not required or even desired, real Oracle code is being run at a particular version, real operating systems are being run and the processing of an Oracle buffer highly correlates with CPU consumption.

Keep in mind, this type of CPU speed test is not an indicator of scalability (the benefit of adding additional CPUs) in any way, shape or form. It is simply a measure of brute force Oracle buffer cache logical IO processing speed based on a number of factors. If you are architecting a system, other tests will be required.

As you might expect, I have a free tool you can download to determine the "true speed" rating. I recently updated it to be more accurate, require fewer Oracle privileges, and also show the execution plan of the speed test tool SQL. (A special thanks to Steve for the execution plan enhancement!) If the execution plan used in the speed tool is different on the various systems, then obviously we can't expect the "true speeds" to be comparable.

You can download the tool HERE.

How To Analyze The Risk

Before we can analyze the risk, we need the "speed value" for both systems. Suppose a faster system means its speed rating is larger. If the production system speed rating is 600 and the QAT system speed rating is 300, then production is deemed "twice as fast."

Now let's put this all together and quickly go through three examples.

This is the core math:

standardized elapsed time = sql elapsed time * system speed value

So if the SQL elapsed time is 25 seconds and the system speed value is 200, then the standardized "apples-to-apples" elapsed time is 5000 which is 25*200. The "standardized elapsed time" is simply a way to compare SQL elapsed times, not what users will feel and not the true SQL elapsed time.
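Here is a toy sketch of that calculation in SQL, using the hypothetical QAT/PRD numbers from the scenarios below:

with samples as (
  select 'QAT' env, 20 elapsed_sec, 300 speed_value from dual
  union all
  select 'PRD' env, 10 elapsed_sec, 600 speed_value from dual
)
select env, elapsed_sec * speed_value standardized_elapsed
from samples;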

To make this a little more interesting, I'll quickly go through three scenarios focusing on identifying risk.

1. The SQL truly runs the same in both systems.

Here is the math:

QAT standardized elapsed time = 20 seconds X 300 = 6000 seconds

PRD standardized elapsed time = 10 seconds X 600 = 6000 seconds

In this scenario, the true speed situation is QAT = PRD. This means the SQL effectively runs just as fast in QAT as in production. If someone says the SQL is running slower in QAT and therefore presents a risk to the upgrade, you can confidently say it's because the PRD system is twice as fast! In this scenario, the QAT SQL will not be flagged as presenting a significant risk when upgrading from QAT to PRD.

2. The SQL runs faster in production.

Now suppose the SQL runs for 30 seconds in QAT and for 10 seconds in PRD. Someone might say, "Well, of course it runs slower in QAT because QAT is slower than the PRD system." Really? Everything is OK? Again, to make a fair comparison, we must compare the systems using a standardizing metric, which I have been calling the "standardized elapsed time."

Here are the scenario numbers:

QAT standardized elapsed time = 30 seconds X 300 = 9000 seconds
PRD standardized elapsed time = 10 seconds X 600 = 6000 seconds

In this scenario, the QAT standardized elapsed time is greater than the PRD standardized elapsed time. This means the SQL is truly running slower in QAT compared to PRD. Specifically, the slower SQL in QAT cannot be fully explained by the slower QAT system. Said another way, while we expect the SQL in QAT to run slower than in the PRD system, we didn't expect it to be quite so slow in QAT. There must be another reason for this slowness, which we are not accounting for. In this scenario, the QAT SQL should be flagged as presenting a significant risk when upgrading from QAT to PRD.

3. The SQL runs faster in QAT.

In this final scenario, the SQL runs for 15 seconds in QAT and for 10 seconds in PRD. Suppose someone says, "Well, of course the SQL runs slower in QAT. Everything is OK." Really? Everything is OK? To get a better understanding of the true situation, we need to look at the standardized elapsed times.

QAT standardized elapsed time = 15 seconds X 300 = 4500 seconds
PRD standardized elapsed time = 10 seconds X 600 = 6000 seconds 

In this scenario, the QAT standardized elapsed time is less than the PRD standardized elapsed time. This means the SQL is actually running faster in QAT, even though the QAT wall time is 15 seconds and the PRD wall time is only 10 seconds. So while most people would flag this QAT SQL as "high risk," we know better! We know the QAT SQL is actually running faster in QAT than in production! In this scenario, the QAT SQL will not be flagged as presenting a significant risk when upgrading from QAT to PRD.

In Summary...

Identifying risk is extremely important when planning for an upgrade. It is unlikely the QAT and production systems will be identical in every way. This mismatch makes identifying risk more difficult. One of the common differences between systems is their CPU processing speed. What I demonstrated was a way to remove the CPU speed differences so an appropriate "apples to apples" SQL elapsed time comparison can be made, thereby improving our ability to correctly detect risky SQL before it is placed into the upgraded production system.

What's Next?

Looking at the "standardized elapsed time" based on Oracle LIO processing is important, but it's just one reason why a SQL may have a different elapsed time in a different environment. One of the big "gotchas" in load testing is comparing production performance to a QAT environment with a different workload. Creating an equivalent workload on different systems is extremely difficult to do. But with some very cool math and a clear understanding of performance analysis, we can also create a more "apples-to-apples" comparison, just like we have done with CPU speeds. But I'll save that for another posting.

All the best in your Oracle performance work!

Craig.




Categories: DBA Blogs

Top 7 Reasons Why Oracle Conferences Rock!

Wed, 2014-09-24 11:56
Why take the time and make the effort to attend an Oracle database conference or Oracle user group meeting? We're all busy, so there had better be some super good reasons! For me, the benefits definitely exceed the cost.

There are many different conferences to choose from. There are professional conferences, leadership conferences, scientific conferences and business focused conferences. So why an Oracle Database conference? What's the big deal?

This is difficult for me to explain, so I've summarized why I love Oracle conferences into a "Top 7" list.

Number 7. Get free stuff from vendors
I suspect 50% of Oracle DBAs attend conferences because they love getting free stuff. If you want to maximize the good stuff, timing is everything. Get to the booth when the exhibition hall first opens, because vendors have tons of stuff to give away and they are full of energy. Also, show up again just before the exhibition hall closes on the final day. Vendors are tired and want to get rid of as much stuff as possible... and you just happen to be there!

Number 6. Become known in your industry
If you want to focus and excel in your career in a particular area, then pass on what you know. If you hold on to what you learn, no one will know you're an expert but you... and that's lonely. One of the best ways to become known and enjoy conferences is to speak at them. If this is something you want to do, please email me. I'll mentor you. I'm serious.

Number 5. Talk to others, if you want to
Want to talk with people who care about Oracle technology as much as you do? Conferences are a great way to break out of your rut and think... and talk, if you want to. One of the reasons I like conferences is they give me a chance to meet with past students.

Number 4. Learn, if you want to
There are always opportunities to learn at conferences. How many times have I heard someone say they were in a lame session? But when I ask if they read the abstract first, they say no. If you want to increase your chances of attending great sessions: first, read the abstract; second, look for speakers you like; third, sit next to the door just in case you need to make a quick exit! The least you can do is look at the agenda and pick the sessions you want to attend. Do a little research and you'll be surprised how many good sessions there are.


Number 3. Be inspired!
How many places can you go to receive inspiration? Not many, is my answer! Because I can chill for a bit at conferences, observe what's going on in our industry and interact with a wide variety of people associated with Oracle technology, I tend to leave with a fresher and more refined view... or better said, "A Wider View." That in itself is worth the cost of any conference.

Number 2. Network with vendors, speakers, colleagues
I take it one step at a time. I always cruise the exhibition halls looking for new products and touching base with the vendors I know. Personally, I like to encourage the new vendors because it is a massive investment for them to exhibit. When I'm cruising, that's where I typically reconnect with students and other DBAs. It's a more comfortable and relaxing environment for me. If you want to speak with an expert face-to-face, a conference is a natural place to do this. If someone walks up to me with an AWR report, I'll make time for them. If they email me the week before, I'll do whatever I can to schedule some time with them.

Number 1. Be with friends
This one is personal. When I'm teaching or consulting, it's very intense and compressed. But at conferences, I get a chance to unwind, sit back, have a beer, and talk. I'm really involved with the IOUG conference and each year I look forward to reconnecting with this small group of friends. It's such a good feeling to be with "Oracle friends" without having to talk... about Oracle. I can just sit and listen without anyone wondering... What's wrong with Craig? You can do the same thing. Find an Oracle user group and volunteer.

Why Attend? Because Oracle conferences rock!
There you have it. Seven reasons why Oracle conferences and Oracle user groups rock! If you're not planning to attend any, you're missing a great way to connect, advance your career, and have a lot of fun. See you there!!

All the best in your Oracle performance work!

Craig.




Categories: DBA Blogs

Top 7 Reasons Why Oracle Conferences Are A Waste Of Time

Mon, 2014-09-15 17:38
Want to turn a lame Oracle Database conference experience into a great one? You may not believe this, but I think Oracle conferences are a waste of time. That is, unless I take action. I've been to hundreds of Oracle conferences, so I'm kind of an expert in this field.

Here is my "Top 7" list about why Oracle conferences are time suckers AND how you can turn them into a GREAT educational, social and networking experience.

Number 7. Sleeper Presentations. You forgot to read the session abstract before you sat down? You're not alone! Here's my secret: Sit by the door. But when you walk out, remember that the speaker probably knows how you feel but is doing the best they can. Out of respect for them and the poor souls who are actually enjoying the session, be quiet when you leave.

Number 6. My System Is Bigger Than Your System. How many times have we all heard someone talking about their "50 node RAC global system with 5 million active users"? Really? Is that even possible? Here are four proven options. Option 1 is to ask a question to shut them up. For example, "So how do you deal with the enqueue contention?" Option 2 is to simply walk away. If you feel that's rude, then Option 3 is to suddenly grab your phone to answer "that" call and walk away... never to return. If you're feeling feisty, then Option 4 combines Options 1 and 3: ask the question "So how do you deal with upgrading 50 nodes?" AND THEN suddenly grab your phone and walk away.

Number 5. Not Another New Feature Presentation! Oracle Corporation and their marketing minions love to talk about futures. If you're depressed from hearing about new features you won't get to touch for the next five years, here's what to do. First, remember that Oracle Corporation's sponsorship is important to user group conferences. Without the help from Oracle, most conferences would not occur. Second, the more Oracle sponsors an event, the more they influence the content. The money source always drives the content. The solution is to look at the number of presentations given by non-Oracle employees versus Oracle employees... before you register.

Number 4. Can't Afford It Unless Oracle Sales Pays. Yes you can! I know training and traveling budgets have been cut. What you need is a free pass, right? I figured this one out while still working for Oracle. Submit an abstract to speak. If speaking freaks you out, click HERE. If you won't speak, then volunteer. Believe me when I say, EVERY CONFERENCE NEEDS VOLUNTEERS.

Number 3. Boring Conference. I totally get that. Want to know the secret about boring conferences? Avoid them...unless you take action. My conference of choice is Collaborate/IOUG. I love that conference... because I'm really involved, have been for years and look forward to working with the conference team and seeing them each year. It's kind of like family to me.

Number 2. I Feel Like A Dork. Because I speak a lot, people don't realize that much of the time I feel like a social idiot. Strange, but put me on stage or in a mentoring situation and I'm fine. It's taken me 20 years of conferences to figure this one out. Here's what to do: Go with a colleague or volunteer so you'll always have someone to hang out with. If that doesn't work and you're a guy, look for the stupid games because that's where you'll find DBAs just like you (and me).

Number 1. Stupid Drunks. I get so tired of drunk people yelling in my face! Especially when I can feel the spit from their mouths poke me in the eye. It only took a few years to figure this one out. The solution? Step back a couple of feet. And if that doesn't work, then walk away. Don't worry about offending them, because they won't remember you anyways.

Make It A Great Conference!

There you have it, my "Top 7" list about why Oracle conferences are time suckers AND how you can turn them into a GREAT educational, social and networking experience.

See you at IOUG this April in Las Vegas!

All the best in your Oracle performance work,

Craig.
Categories: DBA Blogs

Watch Oracle DB Elapsed Time and Wall Time With Parallel Query

Sun, 2014-09-07 10:09
In my recent postings I wrote that when using Oracle Database parallel query, a SQL statement's wall time should be equal to its elapsed time divided by the number of parallel query slaves, plus some overhead.

That may seem correct, but is it really true? To check I ran an experiment and posted the results here. The results are both obvious and illuminating.

If you don't want to read but would rather just sit on the couch, have a beer and watch TV, you're in luck! I took a clip from my Tuning Oracle Using An AWR Report online video seminar and put it on YouTube. You can watch the video clip on YouTube HERE or simply click on the movie below.


The Math, For Review Purposes
In my previous related postings I detailed the key time parameters: DB Time, DB CPU, non-idle wait time, elapsed time, parallelism and effective parallelism. To save you some clicking, the key parameters and their relationships are shown below.

DB Time = DB CPU + NIWT

Elapsed Time = Sum of DB Time

Wall Time = ( Elapsed Time / Parallelism ) + Parallelism Overhead

Wall Time = Elapsed Time / Effective Parallelism


Test Results: When Oracle Parallel Query was NOT involved.
If you want to see my notes, snippets, etc. they can be found in this text file HERE.

Here is the non-parallel SQL statement.

select /*+ FULL(big2) NOPARALLEL (big2) */ count(*)
into   i_var
from   big2 
where  rownum < 9000000

When the SQL statement was running, I monitored the session using my Realtime Session Sampler OSM tool, rss.sql. Since I knew the server process session ID, wanted to sample every second, and wanted to see everything for just this session, this is the rss.sql syntax:
SQL>@rss.sql 16 16 827 827 % % 1
For details on any OSM tool syntax, run the OSM menu script, osmi.sql. You can download my OSM Toolkit HERE.

The rss.sql tool output is written to a text file, which I was watching with a "tail -f". Here is a very small snippet of the output. The columns are: sample number, sample time, session SID, session serial#, Oracle username, CPU or WAIT state, SQL_ID, OraPub wait category, wait event, and [p1,p2,p3].


We can see the session is consuming CPU and waiting. When waiting, the wait event is "direct path read", which represents asynchronous (we hope) block read requests to the IO subsystem; the blocks read will NOT be buffered in the Oracle buffer cache.

Now for the timing results, which are shown in the table below. I took five samples. It's VERY important to know that the wait time (WAIT_TIME_S), DB CPU (DB_CPU_S), and DB Time (DB_TIME_S) values are related to ONLY server process SID 16. In blazing contrast, the wall time (WALL_S), elapsed time (EL_VSQL_S), and SQL statement CPU consumption (CPU_VSQL_S) are related to the entire SQL_ID statement execution.

Here are the "no parallel" experimental results.
SQL> select * from op_results;

 SAMPLE_NO     WALL_S  EL_VSQL_S CPU_VSQL_S WAIT_TIME_S   DB_CPU_S  DB_TIME_S
---------- ---------- ---------- ---------- ----------- ---------- ----------
         1  35.480252  35.470015   9.764407       24.97   9.428506  34.152294
         2  35.670021  35.659748   9.778554       25.15   9.774984  35.541861
         3  35.749926  35.739473   9.774375       25.12    9.31266  34.126285
         4  35.868076  35.857752   9.772321       25.32   9.345398  34.273479
         5  36.193062   36.18378   9.712962       25.46   9.548465  35.499693
Let's check the math. For simplicity and clarity, please allow me to round and use only sample 5.
DB_TIME_S = DB_CPU_S + WAIT_TIME_S
35.5 = 9.5 + 25.5 = 35.0
The DB Time is pretty close (35.5 vs 35.0). Close enough to demonstrate the time statistic relationships.
Elapsed Time (EL_VSQL_S) = DB_TIME_S
35.5 = 34.2
The Elapsed Time is off by around 4% (35.5 vs 34.2), but still close enough to demonstrate the time statistic relationships.
Wall Time (WALL_S) = Elapsed Time (EL_VSQL_S) / Effective Parallelism
35.5 = 35.5 / 1
Nice! The Wall Time results matched perfectly. (35.5 vs 35.5)

To summarize: in a non-parallel query (i.e., single server process) situation, the time math results are what we expected (and hoped for)!

Test Results: When Oracle Parallel Query WAS involved.

The only difference between the "non parallel" SQL statement above and the SQL statement below is the parallel hint. Below is the "parallel" SQL statement.
select /*+ FULL(big2) PARALLEL(big2,3) */ count(*)
into   i_var
from   big2
where  rownum < 9000000
When the "parallel" SQL statement was running, because Oracle parallel query was involved resulting in multiple related Oracle sessions, when monitoring using my rss.sql tool, I need to open the session ID (and serial#) to include all sessions. I still sampled every second. Here is the rss.sql syntax:
SQL>@rss.sql 0 9999 0 9999 % % 1
The tool output is written to a text file, which I was doing a "tail -f" on. Here is a very small snippet of the output. I manually inserted the blank lines to make it easier to see the different sample periods.


There is only one SQL statement being run on this idle test system. And because there is no DML involved, we don't see much background process activity. If you look closely above, session 168 (see third column) must be the log writer background process because the wait event is "log file parallel write". I checked, and session 6 is a background process as well.

It's no surprise to typically see only four sessions involved. One session is the parallel query coordinator, and the other three are the parallel query slaves! Interestingly, the main server process session from which I executed the query is session number 16. It never showed up in any of my samples! I suspect it was "waiting" on an idle wait event, and I'm only showing processes consuming CPU or waiting on a non-idle wait event. Very cool.

Now for the timing results. I took five samples. Again, it's VERY important to know that the wait time (WAIT_TIME_S), DB CPU (DB_CPU_S), and DB Time (DB_TIME_S) values are related to ONLY the calling server process, which in this case is session 16. In blazing contrast, the wall time (WALL_S), elapsed time (EL_VSQL_S), and SQL statement CPU consumption (CPU_VSQL_S) are related to the entire SQL statement execution.

Here are the "parallel" experimental results.
SQL> select * from op_results;

 SAMPLE_NO     WALL_S  EL_VSQL_S CPU_VSQL_S WAIT_TIME_S   DB_CPU_S  DB_TIME_S
---------- ---------- ---------- ---------- ----------- ---------- ----------
         1  46.305951 132.174453   19.53818         .01   4.069579   4.664083
         2  46.982111 132.797536  19.371063         .02   3.809439   4.959602
         3   47.79761 134.338069  19.739735         .02   4.170921   4.555491
         4   45.97324 131.809249  19.397557         .01   3.790226   4.159572
         5  46.053922 131.765983  19.754143         .01   4.062703   4.461175
Let's check the math. For simplicity and clarity, please allow me to round and use sample 5.
DB_TIME_S = DB_CPU_S + WAIT_TIME_S
4.5 = 4.1 + 0
The DB Time shown above is kind of close... about 10% off (4.5 vs 4.1). But there is certainly some timing error in my collection script. I take the position that this is close enough to demonstrate the time statistic relationships. Now look below.
Elapsed Time (EL_VSQL_S)  = DB_TIME_S
131.7 != 4.5
Whoa! What happened here? (131.7 vs 4.5) Actually, everything is OK (so far, anyway) because the DB Time is related to the session (session ID 16), whereas the elapsed time is ALL the DB Time for ALL the processes involved in the SQL statement. Since parallel query is involved, resulting in four additional sessions (1 coordinator, 3 slaves), we would expect the elapsed time to be greater than the DB Time. Now let's look at the wall time.
Wall Time (WALL_S) = ( Elapsed Time (EL_VSQL_S) / Parallelism ) + overhead
46.1 = ( 131.8 / 3 ) + 2.2
Nice! Clearly the effective parallelism is a little less than 3 because there is some overhead (the 2.2 seconds). But the numbers make sense because:

1. The wall time is less than the elapsed time because parallel query is involved.

2. The wall time is close to the elapsed time divided by the parallelism. And we can even see the parallelism overhead.
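
As a quick sanity check (my own arithmetic using the rounded sample 5 numbers), we can back the effective parallelism out of the measurements with the wall time formula above:

select round(131.8 / 46.1, 2) effective_parallelism from dual;

This returns 2.86, a little less than the requested degree of 3. The shortfall is the parallelization overhead.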

So it looks like our time math is correct!


Reality And The AWR Or Statspack Report
This is really important. In the SQL Statement section of any AWR or Statspack Report, you will see the total elapsed time over the snapshot interval and perhaps the average SQL ID elapsed time per execution. So what is the wall time? What are users experiencing? The short answer is, we do NOT have enough information.

To know the wall time, we need to know the parallelism situation. If you are NOT using parallel query, then based on the time math demonstrated above, the elapsed time per execution will be close to what the user is experiencing (unless there is an issue outside of Oracle). However, if parallelism is involved, you can expect the wall time (i.e., the user's experience) to be much less than the elapsed time per execution shown in the AWR or Statspack report.
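
If you want a rough, back-of-the-envelope wall time estimate straight from v$sql, the px_servers_executions column can help. This is a sketch and an approximation only: it ignores the query coordinator's time and all parallelization overhead, and the &sql_id substitution is a placeholder.

select sql_id,
       round(elapsed_time/1000000/greatest(executions,1),1)  avg_elapsed_s,
       round(px_servers_executions/greatest(executions,1),1) px_per_exec,
       round((elapsed_time/1000000/greatest(executions,1)) /
             greatest(px_servers_executions/greatest(executions,1),1),1) rough_wall_s
from   v$sql
where  sql_id = '&sql_id';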

Another way of looking at this: if a user reports a query is taking 10 seconds, but the average elapsed time is showing as 60 seconds, parallel query is probably involved. Also, as I mentioned above, never forget the average value is not always the typical value. (More? Check out my video seminar entitled Using Skewed Data To Your Advantage HERE.)

Thanks for reading!

Craig.
Categories: DBA Blogs

What Is Oracle Elapsed Time And Wall Time With A Parallelism Twist

Mon, 2014-08-18 10:14
In this post I'm focusing on Oracle Database SQL elapsed time, adding parallelism into the mix and then revisiting wall time. What initially seems simple can take some very interesting twists!

If you are into tuning Oracle Database systems, you care about time. And if you care about time, then you need to understand the most important time parameters: what they are, their differences, how they relate to each other and how to use them in your performance tuning work.

A couple weeks ago I wrote about Oracle DB Time, non-idle wait time, and server process CPU consumption (DB CPU) time. If you have not read that posting, HERE is the link. It must be a good read because it quickly became my most viewed post ever! In this posting, the focus is SQL elapsed time, parallelism, and again wall time. Enjoy!

Quick Review
In my previous related post, I covered non-idle wait time, DB CPU, and DB Time. Here is a very quick summary of each.

Non-Idle Wait Time occurs when an Oracle process is not consuming CPU, the session pauses (i.e., waits), and Oracle considers the wait time important for performance tuning. An example of a non-idle wait event is direct path read temp. Examples of idle wait events are SQL*Net message from client and pmon timer.

DB CPU is Oracle server/foreground/shadow process CPU consumption. It does not include Oracle background process CPU consumption.

DB Time is DB CPU plus Non-Idle Wait Time. Remember that DB Time does not include background process CPU consumption and Oracle Corporation determines which wait events are considered idle.

Elapsed Time
Elapsed Time (ET) is all DB Time related to a defined task. A "defined task" could be a SQL statement, a group of SQL statements, a PL/SQL procedure, a batch job, etc. It is whatever makes sense in your tuning situation.

The elapsed time for a SQL_ID can be found in v$sql. But be careful, because this elapsed time is related to ALL of the SQL_ID's executions. Thankfully, there is an "executions" column in v$sql.


Elapsed time is displayed in a number of areas within an Oracle Database AWR and Statspack report. Looking at the above screen shot, the "top" elapsed time SQL has an elapsed time of 268561 seconds. This means that over the AWR report's snapshot interval, for all this SQL's executions, its total DB Time is 268561 seconds. Said another way, if we were to add up all this SQL's DB CPU and non-idle wait time for all its executions within the snapshot interval, the value should be 268561.

There is a lot of great information provided in the AWR and Statspack SQL reports. For example, because the elapsed time and the CPU time (DB CPU) is shown above, we can calculate the non-idle wait time for the "top" elapsed time SQL ID.

non idle wait time = elapsed time - cpu time
268465 = 268561 - 96

For the "top" elapsed SQL, its elapsed time 268561 and it's DB CPU is 96 therefore its non-idle wait time is 268465. Wow! This statement has tons of associated wait time compared to CPU consumption time.

But it gets even better! Because the total elapsed time and the total number of executions over the snapshot interval are displayed, we can determine the average elapsed time!

average elapsed time = total elapsed time / executions
7460.03 = 268561 / 36
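
By the way, all three of these numbers can be pulled straight from v$sql as well. Here is a minimal sketch; keep in mind v$sql is cumulative since the cursor was loaded, while an AWR report shows deltas over the snapshot interval, and the &sql_id substitution is a placeholder.

-- approximation: treats everything that is not CPU as non-idle wait
select sql_id,
       executions,
       round(elapsed_time/1000000,1)                         elapsed_s,
       round(cpu_time/1000000,1)                             cpu_s,
       round((elapsed_time - cpu_time)/1000000,1)            non_idle_wait_s,
       round(elapsed_time/1000000/greatest(executions,1),1)  avg_elapsed_s
from   v$sql
where  sql_id = '&sql_id';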

Do not be deceived! The average elapsed time is unlikely to be what the user is experiencing. Two possible causes of this deception are skewed elapsed times and parallelism.

For most DBAs this is unexpected. It also causes performance perception problems yet solutions are available to understand what's really going on.

I've spent so much time researching this topic, and seen it increase my consulting value, that I've posted a number of blog entries on this subject. Plus I created an OraPub Online Institute seminar focused specifically on this subject. It's called Using Skewed Performance Data To Your Advantage. Check it out. I'm really proud of how it turned out. I also have a couple of OSM scripts dedicated to this topic, sqlelget[11].sql.

Revisiting Wall Time With A Parallelism Twist
Now it's time to put this all together.

DB CPU is the Oracle server process CPU consumption.

Non-Idle Wait Time (NIWT) is the time when an Oracle process cannot consume CPU and must pause, and we care about this time.

DB Time is the Oracle server process CPU consumption and all non-idle wait time.

Elapsed Time (ET) is the sum of all DB Time related to a task, such as a SQL_ID.

Wall Time is what we hope the user experiences. I'll assume there is no time gap between Oracle and the user, therefore the wall time will equal the user's experience.

Effective Parallelism is the effective number of Oracle parallel slaves or some other form of parallelism, such as designed-in application parallelism. (For simplicity, I'm only going to mention Oracle parallel query.) If Oracle parallel query is not involved, then the effective parallelism is one. If two parallel query slaves are involved, then the effective parallelism will be a little less than 2.

Parallelism can reduce wall time because we can simultaneously "burn time" in multiple places. For example, 60 seconds of elapsed time with a process running serially, results in a wall time of 60 seconds. But if we have two parallel query slaves, while the elapsed time (i.e., all the DB Time) is still 60 seconds (plus some overhead time), the wall time will be around 30 seconds (plus some overhead time).

The math is really simple...that is until you factor in scalability (i.e., the overhead), which I won't. If you're interested, read the last chapter of my book, Forecasting Oracle Performance.

Let's simplify this by using some mathematical notation.

DB Time = DB CPU + NIWT

Elapsed Time = Sum of DB Time

Wall Time = Elapsed Time / Effective Parallelism

Pretty straightforward, eh? Below is a short video clip summarizing the key time parameters, taken from the OraPub Online Institute seminar Tuning Oracle Using an AWR Report - Part 2. If you can't see the video, click HERE to watch it on YouTube.



Test You Knowledge
True or False? If the total elapsed time is 60 seconds and parallel query is not involved, the total wall time will also be 60 seconds. True

True or False? If the elapsed time per execution is 60 seconds and the wall time is 30 seconds, then parallel query is involved. True

True or False? Bonus question yet very important to understand: If the elapsed time per execution is 60 seconds and two PQ slaves are involved, then the wall time will be 30 seconds.

The last question is false because there is overhead when parallelizing. Parallelism is not free. Because of this, the wall time will hopefully drop to perhaps 35 seconds. That extra 5 seconds is the parallelization overhead. Put differently, the effective parallelism works out to 60/35, which is about 1.7, noticeably less than 2.

Coming Up Next: Video Proof!
While the above may seem correct, I ran some SQL statements and captured the relevant time statistics. There is quite a bit of detail and I ran two different tests, so I'll post that in a week or two.

Thanks for reading,

Craig.
Categories: DBA Blogs

Watch Oracle DB Session Activity With The Real-Time Session Sampler

Mon, 2014-08-11 17:38
Watching session activity is a great way to diagnose and learn about Oracle Database tuning. There are many approaches to this. I wanted something simple, useful, modifiable, with no Oracle licensing issues, and that I could give away. The result is what I call the Oracle Real-Time Session Sampler (OSM: rss.sql).

The tool is simple to use. Based on a number of filtering command line inputs, it repeatedly samples active Oracle sessions and writes the output to a file in /tmp. You can do a "tail -f" on the file to watch session activity in real time!

The rss.sql tool is included in the OraPub System Monitor (OSM) toolkit (v13j), which can be downloaded HERE.

If you simply want to watch a video demo, watch below or click HERE.


The Back-Story
Over the past two months I have been creating my next OraPub Online Institute seminar about how to tune Oracle with an AWR/Statspack report using a quantitative time based approach. Yeah... I know the title is long. Technically I could have used Oracle's Active Session History view (v$active_session_history) but I didn't want anyone to worry about ASH licensing issues. And ASH is not available with Oracle Standard Edition.

The Real-Time Session Sampler is used in a few places in the online seminar where I teach about Oracle session CPU consumption and wait time. I needed something visual that would obviously convey the point I wanted to make. The Real-Time Session Sampler worked perfectly for this.

What It Does
Based on a number of command line inputs, rss.sql repeatedly samples active Oracle sessions and writes the output to a file in /tmp. The script contains no DML statements. You can do a "tail -f" on the output file to see session activity in real time. You can look at all sessions, a single session, sessions that are consuming CPU or waiting or both, etc. You can even change the sample rate; for example, once every 5.0 seconds or once every 0.25 seconds! It's very flexible, and it's fascinating to watch.

Here is an example of some real output.



How To Use RSS.SQL
The tool is run within SQL*Plus and the output is written to the file /tmp/rss_sql.txt. You need two windows: one to sample the sessions and the other to look at the output file. Here are the script parameter options:

rss.sql  low_sid  high_sid  low_serial  high_serial  session_state  wait_event_partial|%  sample_delay

low_sid is the low Oracle session id.
high_sid is the high Oracle session id.
low_serial is the low Oracle session's serial number.
high_serial is the high Oracle session's serial number.
session_state is the current state of the session at the moment of sampling: "cpu", "wait" or for both "%".
wait_event_partial selects, when the session is waiting, only sessions whose wait event matches this (partial) string. Always set this to "%" unless you want to tighten the filtering.
sample_delay is the delay between samples, in seconds.

Examples You May Want To Try
By looking at the below examples, you'll quickly grasp that this tool can be used in a variety of situations.

Situation: I want to sample a single session (sid:10 serial:50) once every five seconds.

SQL>@rss.sql  10 10 50 50 % % 5.0

Situation: I want to essentially stream a single session's (sid:10 serial:50) activity.

SQL>@rss.sql 10 10 50 50 % % 0.125

Situation: I want to see which sessions are waiting for a row level lock while sampling once every second.

SQL>@rss.sql 0 99999 0 99999 wait enq%tx%row% 1.0

Situation: I want to see which sessions are consuming CPU, while sampling once every half second.

SQL>@rss.sql 0 99999 0 99999 cpu % 0.50
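
Here is one more combination that follows the same parameter pattern. This one is my own illustration, not from the original example set.

Situation: I want to watch every session waiting on a log file sync, sampling once every two seconds.

SQL>@rss.sql 0 99999 0 99999 wait log%file%sync% 2.0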

Be Responsible... It's Not OraPub's Fault!
Have fun and explore... but watch out! Any time you sample repeatedly, you run the risk of impacting the system under observation. You can reduce this risk by sampling less often (perhaps once every 5 seconds), by limiting the sessions you want to sample (not 0 to 99999) and by only selecting sessions in either a "cpu" or "wait" state.

A smart, lower impact strategy would be to initially keep a broader selection criteria but sample less often, perhaps once every 15 seconds. Once you know what you want to look for, tighten the selection criteria and sample more frequently. If you have identified a specific session of interest, then stream its activity (if appropriate) every half second or perhaps every quarter second.

All the best in your Oracle Database tuning work,

Craig.
Categories: DBA Blogs

What Is Oracle DB Time, DB CPU, Wall Time and Non-Idle Wait Time

Tue, 2014-08-05 06:30
If you are into tuning Oracle Database systems, you care about time. And if you care about time, then you need to understand the most important time parameters: what they are, their differences, how they relate to each other and how to use them in your performance tuning work.

The key Oracle Database time parameters are elapsed time, database time (DB Time), non-idle wait time and server process CPU consumption (DB CPU) time.

This first post is pretty basic, yet core fundamental stuff. So in the following two posts I'll introduce elapsed time, add parallelism into the mix and revisit wall time. What initially seems simple can take some very interesting twists!

You probably know that I am all about quantitative Oracle performance analysis. I research, write, teach, and speak about it. I even have an OraPub Online Institute seminar about how to tune your Oracle Database systems from a standard AWR or Statspack report using an Oracle Time Based Analysis (OTBA) framework.

So let's get started!

Wall Time & Run Time
I'll start with Wall Time because that is close (hopefully) to what a user experiences. In fact, if there is no time gap between the Oracle Database and the user, then we can do a little math and figure out what the users are, on average, experiencing. I'll get back to wall time in the next post, where I include elapsed time and parallelism in the equation.

DB CPU
DB CPU is Oracle server/foreground/shadow process CPU consumption. Each Oracle server process gathers its own CPU consumption using the time and/or getrusage C function system call. So unless there is a major screw-up by either the operating system or the Oracle kernel developers, the time will be good... very good. The name DB CPU is taken from the actual statistic name, which is found in both v$sess_time_model and v$sys_time_model.

If you look at any AWR or Statspack report in the "Time Model" section, you will see DB CPU. The value shown will be all server process CPU consumption within the reporting snapshot interval, converted to seconds. (The raw statistic is stored in microseconds.)
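
You can also query the time model directly. Here is a minimal sketch; the raw values are microseconds, hence the divide by 1000000.

select stat_name, round(value/1000000,2) seconds
from   v$sys_time_model
where  stat_name in ('DB CPU', 'background cpu time')
order  by stat_name;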

Below is an example Time Model Statistics screen shot from a standard AWR report. I've highlighted DB CPU.



If you run one of my OraPub System Monitor (OSM) time related tools, like ttpctx.sql or rtpctx.sql, you will see a CPU time statistic. That contains both the DB CPU (i.e., server process) and "background process cpu" statistics. Here's an example.

SQL> @ttpctx.sql
Remember: This report must be run twice so both the initial and
final values are available. If no output, press ENTER about 11 times.

Database: prod35 31-JUL-14 12:09pm
Report: ttpctx.sql OSM by OraPub, Inc. Page 1
Total Time Activity (142 sec interval)

                                                         Avg Time        Time     Wait
Time Component                           % TT    % WT Waited (ms)       (sec) Count(k)
------------------------------------- ------- ------- ----------- ----------- --------
CPU consumption: Oracle SP + BG procs   95.95    0.00       0.000     347.212        0
PX Deq: Slave Session Stats              1.45   35.74       0.113       5.240       47
library cache: mutex X                   0.58   14.26       0.136       2.090       15
PX Deq: Slave Join Frag                  0.43   10.57       0.067       1.550       23
PX Deq: Signal ACK EXT                   0.29    7.16       0.045       1.050       23
control file parallel write              0.28    7.03      20.600       1.030        0
PX qref latch                            0.27    6.75       0.012       0.990       85
latch free                               0.20    4.91       0.090       0.720        8
log file parallel write                  0.16    4.02      12.826       0.590        0

Non-Idle Wait Time
When an Oracle process cannot consume CPU, it will pause. As Oracle DBAs, we know this as wait time. Sometimes a process waits and it's not a performance problem, so we call this Idle Wait Time. Oracle background processes typically have lots of idle wait time. However, when a user is waiting for something to complete and way down deep their Oracle server process is waiting, perhaps to get a lock or latch, this is Non-Idle Wait Time. Obviously, when tuning Oracle we care a lot about non-idle wait time.

Below is a simple query showing wait event classifications. In this system there are 119 Idle wait events, so all the rest would be classified as non-idle wait events.
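
If you want to reproduce that classification yourself, a query along these lines works. This is a sketch that assumes the wait_class column in v$event_name, which is present in 10gR2 and later.

select wait_class, count(*) num_events
from   v$event_name
group  by wait_class
order  by num_events desc;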

Oracle uses a variety of methods to determine wait time. I have a number of postings and educational content available about this. You'll see them if you do an OraPub or blog search for "time".

When working with non-idle wait time, remember the 80/20 rule. Most of the wait time we care about will be contained within the largest ("top") two to four wait events. Don't waste YOUR time focusing on the 20%.

Here's an example. In the screen shot below, while not shown, the total wait time is 1966 seconds. If you add up the displayed "top" four wait events, their combined wait time is 1857 seconds. This is about 95% of all the non-idle wait time. This is a good example demonstrating that most of the wait time is found in the top two to four events.

My OSM toolkit has many wait time related tools. Most start with "sw" for "session wait", but both ttpctx.sql and rtpctx.sql will contain the non-idle wait time and also CPU consumption. This is a good time to transition into DB Time.

DB Time
DB Time is a time model statistic that is the sum of Oracle server process CPU consumption (DB CPU) and non-idle wait time. When optimizing Oracle systems we typically focus on reducing "time", though many times database work is also part of the equation. This "time" is essentially DB Time, though sometimes I take control over what I consider idle wait time.

The name DB Time comes from the actual statistic name in both v$sess_time_model and v$sys_time_model.

If you look at any AWR or Statspack report in the "Time Model" section, you will see DB Time.
The DB Time value is technically all server process CPU consumption plus the non-idle wait time within the reporting snapshot interval, converted to seconds. (The raw statistic is stored in microseconds.) Surprisingly, Oracle does not include "background cpu time" in the DB Time statistic. There are both good and not so good reasons the background CPU time is not included, but that's a topic for another posting.

A Little Math
We have enough detail to relate DB Time, DB CPU and non-idle wait time together... using a little math.

DB Time = DB CPU + non_idle_wait_time

And of course,

non_idle_wait_time = DB Time - DB CPU

This is important, because there is no single statistic that shows all the non-idle wait time. This must be derived. Shown above is one way to derive the non-idle wait time. Take a look at the AWR report snippet below.

In the Non-Idle Wait Time section above, I stated that the total non-idle wait time was 1966 seconds. I derived this from the Time Model screen shown above. I simply did:

non_idle_wait_time = DB Time - DB CPU
1966.16 = 4032.03 - 2065.87
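
You can do the same derivation directly against the time model views. Here is a minimal sketch using v$sys_time_model; its values are cumulative microseconds since instance startup, while AWR works with deltas between snapshots.

select round((db.value - cpu.value)/1000000,2) non_idle_wait_s
from   v$sys_time_model db,
       v$sys_time_model cpu
where  db.stat_name  = 'DB time'
and    cpu.stat_name = 'DB CPU';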

Coming Up Next
I wanted to keep this post short, which means I left out the more interesting topics. So in the next post I'll merge into the picture elapsed time along with parallelism and revisit wall time. Then in the third post (that's my guess at this point), I'll actually demonstrate this in two different systems.

Thanks for reading,

Craig.
 
Categories: DBA Blogs
