Performance Issues - Linux Gurus Input Needed

Greebo

As you're all aware, we've been experiencing sporadic performance issues of late.

I'm unable to determine why. :(

Even telnetting into the server, I'm seeing very slow response times, but what I know to check indicates no problems on the server.

TOP shows the server is barely exerting itself:
Code:
 13:26:58  up 19:31,  4 users,  load average: 1.29, 1.30, 0.86
47 processes: 46 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:   0.4% user   0.0% system    0.0% nice   0.0% iowait  99.0% idle
CPU1 states:   1.0% user   0.1% system    0.0% nice   0.0% iowait  98.3% idle
CPU2 states:   1.0% user   0.0% system    0.0% nice   0.0% iowait  98.4% idle
CPU3 states:   0.0% user   0.2% system    0.0% nice   0.0% iowait  99.2% idle
Mem:  6196600k av, 6168124k used,   28476k free,       0k shrd, 2142092k buff
      2928384k active,            2528788k inactive
Swap: 12586916k av, 3850756k used, 8736160k free                 1390304k cache

NetStat shows a small number of connections - no indication that I can see of a DOS attack of any kind...

The few other things I know to look at also show OK.

So - where should I be looking?

(Oh, at the MOMENT mysql is churning, but that's only because I'm running an optimize - normally it's barely a blip.)
 
Chuck --

It's been almost 10 years since I retired so I've forgotten lots of stuff and no longer have manuals, and my experience was all unix, but that's pretty similar to linux (or really, the reverse is true...).

A couple of things come to mind.

1. You don't have a lot of free memory. Are you possibly getting excessive swapping? (See the vmstat sketch below.)

2. When did you last reboot? My experience was that some variants of unix started to get sick if you didn't periodically bounce them.

3. What kind of delays do you get when you ping the system?
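
On #1 - a quick way to check for swapping, assuming vmstat is on the box (column names can vary a little between versions), is to watch the swap-in/swap-out columns for a minute:
Code:
# one sample every 5 seconds for a minute; consistently non-zero si/so columns mean it's paging
vmstat 5 12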
 
I hear hamster food works..... :D

I'd look at memory swaps. The amount of used memory looks kinda high.
 
The amount of used memory reported by top or shown for a single process is pretty much useless in determining whether it is a memory issue. Linux by its very nature is going to use as much spare memory as it can for disk cache.

For example, if I type top on my primary server I get:
Code:
Mem:   2061032k total,  1979288k used,    81744k free,   120272k buffers
Swap:  1453872k total,      204k used,  1453668k free
Which makes it look like my system is using almost 2 gig of memory. This is not the case. By typing this command I get a better picture of the situation:
free -m
Code:
                 total       used       free     shared    buffers     cached
Mem:          2012       1933         79          0        117       1374
-/+ buffers/cache:        441       1571
Swap:         1419          0       1419
The part that says -/+ buffers/cache shows that 1571 megabytes are free. This is what's *actually* available to processes. Based on this information we know that my server does not need more memory.

There are also some common issues with how top/ps report the memory used by a single process. Sometimes you'll see that a pretty small process, for example a text editor, is using a massive amount of memory. Let me try to explain why. Linux has many common libraries that are shared amongst many different processes. When ps/top look at the memory usage of a process, they also count the memory used by the libraries it has loaded. You have to remember, though, that many different processes are using that same library, which makes each one's memory usage look very high. So *really* that small text editor is not what's responsible for all the memory usage - it's the combination of all your running processes that map the same library. Hopefully that makes a little sense.
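
If you want to see that for yourself, pmap (part of procps on most distributions, so treat this as a sketch if your box differs) shows how much of one process's footprint is really shared libraries:
Code:
# mappings for the current shell: most lines are .so libraries that every
# other process on the box maps too, so per-process totals double-count them
pmap $$ | grep '\.so'
# the "total" line is roughly the number top/ps attribute to this one process
pmap $$ | tail -1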

Now to the subject of why POA performs like crap... I'll monitor it a little more and try to determine for sure if it's a network issue or server resources issue but based on previous slowdowns that I've watched it appears to me to be a server resource issue.

Since Pilots of America is running on a virtual server I suspect that the load reports generated by "top" are only showing the load of *your* virtual server, not the load of the actual physical server. So the other three virtual servers on the same machine could be hammering the CPU like mad, which would in turn make POA perform like crap - and it would be completely unnoticeable by looking at resource usage. I suspect that the best way to verify this would be to benchmark the CPU or I/O during a period where POA seems fast. When it slows down, benchmark it again and see if it's slower. If it's slower, that means some other jackass on the virtual server is maxing it out...which is probably the case.

Another issue that I can think of probably has to do with the conservative default configuration files of Apache/MySQL. I'm sure a little bit of tweaking would go a long way.

Really the issue is going to come down to the fact that POA is on a virtual server and its performance is completely dependent on the resources being used by other people. There is no way to control this, nor is there really a way to monitor it. About the only decent way to fix it that I can think of would be to get *off* a virtual server and onto a full-blown dedicated server.

I'm aware of course that by doing so the operating costs would increase dramatically over the current level. But this is just one of those things.. The truth is you get what you pay for.

If you are interested in getting a dedicated server, or further help in the attempt to diagnose the current performance situation, let me know. I would be willing to pay the difference between the current operating expenses and the increase of a dedicated server if need be.

If you want me to figure it out...just a regular shell account would be all I'd need....and if you're worried about security, it wouldn't be an issue if everything is set up properly.
 
jangell said:
I've watched it appears to me to be a server resource issue.

Really the issue is going to come down to the fact that POA is on a virtual server and its performance is completely dependent on the resources being used by other people.

IOW, primarily memory. (and to a lesser extent processor). Which is where I suggested Chuck look.

FWIW, I run a couple of BSD boxen, and I've run BSD and other *x flavors for about 25 years (has it really been that long?), 15+ on FreeBSD alone. Yeah, there are some differences in Linux, and some differences in the report output, but when it comes down to it, most of the non-network performance issues come down to memory resources.
 
wsuffa said:
IOW, primarily memory. (and to a lesser extent processor). Which is where I suggested Chuck look.

FWIW, I run a couple of BSD boxen, and I've run BSD and other *x flavors for about 25 years (has it really been that long?), 15+ on FreeBSD alone. Yeah, there are some differences in Linux, and some differences in the report output, but when it comes down to it, most of the non-network performance issues come down to memory resources.
Problem is... there is no *great* way to look. I'm sure they are using Virtuozzo, which I do not believe lets you "oversell" memory. That means the memory dedicated to each virtual server is actually dedicated to that virtual server. It could also be the server bouncing off some limit in the default Apache or MySQL configuration files, however you installed them.
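
If it really is Virtuozzo (that's an assumption on my part), there is one file inside the container that does tell the truth about *your* limits - the beancounters. It only exists on Virtuozzo/OpenVZ guests and usually needs root to read:
Code:
# any row with failcnt > 0 means you've been bouncing off a limit the provider set
cat /proc/user_beancounters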

Processor usage is a whole different deal though. Virtuozzo has the capability to set a hard limit on the cycles allowed per virtual server per unit of time; it'll simply cap your performance there. But it's *very* common for a provider to oversell the amount of processor power available to each server, which means that one or two servers bouncing off their limit could easily be affecting every other virtual server. There would be absolutely no great way for you to monitor the overall server resource usage, as all the tools are going to lie to you and tell you *your* usage. Like I said, the best way to measure this would be by writing a quick bash script to benchmark cpu performance. When POA starts to go down the crapper, run this benchmark again, and if this is an overall server resource issue (some virtual server maxing the entire server out) it will show in the benchmark result. top/ps/any tool that shows you actual cpu usage is only going to show you your usage of the limit assigned by Virtuozzo.

I've had poor luck with virtual servers for hosting and finally came to the conclusion that the cost of using your own dedicated server is by far worth it.
 
My guess is memory or disk fragmentation. Reboot for former, defrag for the latter.
 
We ran for months without rebooting with no performance issues. In the last 2 months I've issued I think seven reboot requests. I can't think of any changes in the preceding months to account for this major performance change. Also, since it is so erratic, I'm inclined to go with Jesse's thinking on the virtual server issue - it's an unknown, so it's a good target...
At present, while things seem to be running well, here's the top and free reports:
Code:
 05:37:32  up 1 day, 11:42,  2 users,  load average: 0.06, 0.05, 0.08
41 processes: 40 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:   7.0% user   1.0% system    0.0% nice   0.0% iowait  90.0% idle
CPU1 states:   0.0% user   1.0% system    0.0% nice   0.0% iowait  98.0% idle
CPU2 states:  11.0% user   1.0% system    0.0% nice   0.0% iowait  86.0% idle
CPU3 states:   1.0% user   0.0% system    0.0% nice   0.0% iowait  98.0% idle
Mem:  6196600k av, 5896560k used,  300040k free,       0k shrd, 1364788k buff
      2786336k active,            2288980k inactive
Swap: 12586916k av, 3887732k used, 8699184k free                 1443612k cache
Code:
[05:37:39] [~] $free
             total       used       free     shared    buffers     cached
Mem:       6196600    5921276     275324          0    1364892    1452044
-/+ buffers/cache:    3104340    3092260
Swap:     12586916    3887732    8699184
Lots of free memory...

Defragmentation --- is that typically a problem in linux like it is in Windows? If so, how do I address it?
 
Chuck, on a BSD system you can check fragmentation with the FSCK command. It normally runs as part of the boot script; you can run it as superuser, but it won't usually make repairs unless you're in single-user mode. That's for BSD varieties of Unix (FreeBSD, NetBSD, etc).

I assume the same command applies in Linux, because FSCK is a standard unix command.

At the end of the disk check it will show % fragmentation.
 
Don't seem to have fsck on this one...
 
Greebo said:
Defragmentation --- is that typically a problem in linux like it is in Windows? If so, how do I address it?

Assuming that it's stock Red Hat (or a clone) and the default filesystem is ext3, there really isn't much that you can do to defragment. Linux, by default, does a much better job of file system organization.
 
Found the rpm for it, uploading it to the server now.
 
Hmph, the rpm installed...now to find the darn bin...

I'm so friggin rusty at linux.
 
Greebo said:
Hmph, the rpm installed...now to find the darn bin...

I'm so friggin rusty at linux.

Be careful, you can only fsck a partition read-only while it's mounted.

fsck should be installed by default (/sbin/fsck). It's part of the e2fsprogs package directly from Red Hat.
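
If it really looks missing, the package database will tell you whether e2fsprogs is installed and where it put its fsck binaries (a quick sketch, assuming an RPM-based box):
Code:
# is the package there, and which fsck binaries does it own?
rpm -q e2fsprogs
rpm -ql e2fsprogs | grep fsck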
 
fsck is not going to defrag it for you.. Listen to FlyNE, when it comes to Linux he makes me look like a total n00b.

Fragmentation is not the issue. It's kind of a thing of the past. You could defragment the ext3 filesystem if you downgrade the filesystem to ext2 and use e2defrag. You would have to unmount the filesystem which really isn't an option for you.

Best thing: Write a script to benchmark CPU and I/O. Have that run every few minutes and log the results. When you notice POA going slow look in that log to see if those numbers increased a bunch during that time. If they did, get off the virtual server.
 
aha! I don't have fsck, I have fsck.cramfs - but I kind of suspected fragging wasn't the issue with more modern versions of RedHat. I don't feel comfortable messing around with fsck and esp. not a version named differently - sounds like it could be dangerous. ;) So I'm not gonna run it...

jesse wanna write such a script? :)
 
Greebo said:
jesse wanna write such a script?

Sure..Does the system have make and some of the common libraries to compile code with? I'm thinking a decent way to measure cpu performance would be by timing how long it takes to compile something.

I don't really have a Redhat/Virtuozzo environment to verify that such things are installed.

Are you familiar with cron?

More or less I'm thinking I'll write a script that will compile code and record the amount of time it took into a log file. I'll also write something to measure I/O performance. From there it will have to be called from cron on a specific schedule, perhaps every 15 minutes or so. The script could be written so that it does not run for very long, so it shouldn't really hurt the performance of POA that much. By looking at the results from the log in 48 hours or so you would be able to determine if there are any resource issues based on any dramatic increases in time.
 
While at it, why not include bouncing a few packets to an outside server that has decent routing. That'll give you some indication of IO and packet delays in the internal routing. A keep-alive type ping might do the trick.
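
Something as simple as this would do for the packet check (4.2.2.2 is just an example of a well-connected host - swap in whatever you trust):
Code:
# five pings, keeping only the min/avg/max summary line for the log
ping -c 5 4.2.2.2 | tail -1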
 
Are you seeing slow OS response times, slow page load times, or slow DB times?

Although it has been a while, I'd focus on how busy the AMP engine (Apache / MySQL / Perl / PHP) is - how busy is apache, how many requests are you serving, etc.

Have you looked at your request log to make sure you aren't getting drilled by googlebot or similar?  Back in 2003, I was working on a project when the system crawled to a halt.  Google was indexing us with 18 separate, unique bots at once.  That was a killer.
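
A quick way to check (the log path is a guess - point it at wherever Apache actually logs, and it assumes the combined log format):
Code:
# count requests per user agent; a bot hammering the site will float to the top
awk -F\" '{print $6}' /var/log/httpd/access_log | sort | uniq -c | sort -rn | head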

Cheers,

-Andrew
reformed engineer
 
AuntPeggy said:
My guess is memory or disk fragmentation. Reboot for former, defrag for the latter.
AFAIK, you don't need to defrag *ix system file systems...like on this Mac here. :rolleyes:

I think it does help to reboot once in a while, especially on a system like the Mac that doesn't have a swap partition. It creates swap files on the fly as needed on the file system and you can end up with several between reboots.
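
(You can watch them pile up between reboots with something like this - the path is from memory, so treat it as a sketch:)
Code:
ls -lh /private/var/vm/swapfile*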
 
While most of my experience was on non-consumer hardware, I had systems with uptimes beyond 750 days in my little realm. The system has only been up a day...
 
mikea said:
AFAIK, you don't need to defrag *ix system file systems...like on this Mac here. :rolleyes:

I think it does help to reboot once in a while, especially on a system like the Mac that doesn't have a swap partition. It creates swap files on the fly as needed on the file system and you can end up with several between reboots.
I just had to move one of my primary linux servers from one cage in the data center to the other. It broke my heart to unplug it. I was sitting at like 700 days of uptime.

If you have to reboot a linux server...something's wrong with it.:yes:
 
FlyNE said:
I just had to move one of my primary linux servers from one cage in the data center to the other. It broke my heart to unplug it. I was sitting at like 700 days of uptime.

If you have to reboot a linux server...something's wrong with it.:yes:
I'm talking Mach/BSD on a Mac laptop with no swap partition. I still leave it up for a week at a time. I just traveled over a long weekend and never shut it down once. Just sleep.

Long ago I had a guy tell me I should be rebooting my Sun servers. I never thought to do that.

We also saved a screen capture of a Novell server that hit 700+ days when the NT guys said "of course" you need to reboot every weekend due to memory leaks or you were going to have problems. Guess which one was the advanced solution?

I saved a story from the University of North Carolina where they had to trace the network cables to find a Novell server and discovered it had been accidentally sealed inside a wall built four years before. No wonder they almost went out of business. How can you make money like that? And the hardware vendors didn't want to recommend anything that didn't make you have to upgrade the hardware every few years.
 
I'll have a script together shortly. I can't type for **** right now.
 
One of my FreeBSD systems was up over 2 years. Finally had to bring it down when I needed to do some work on the UPS.

I've got one running on an old (real old) laptop. Battery float on a UPS. Should be no reason to bring it down save for software upgrades.
 
Two hours...Wrote a decent script... and.. I made a *major* mistake. Take a look at what I did :( ... Man .. don't mix up cp for rm...

Code:
[root@server poa]# cp -RF test /home/test/
cp: invalid option -- F
Try `cp --help' for more information.
[root@server poa]# [B]rm -Rf test /home/test[/B]
[root@server poa]# ls
[root@server poa]# cd /home/test
bash: cd: /home/test: No such file or directory
[root@server poa]# ls
[root@server poa]# cd /home/te
[root@server poa]# cp -Rf tes
[root@server poa]# ls
[root@server poa]# ls
[root@server poa]# FUC U**** **** **** **** **** **** **** F
[root@server poa]# **** TUCKLFdsjklfsdjfk
[root@server poa]# ****!!!**** ****
[root@server poa]#
Notice where I realized what I just did... The part in bold was supposed to be a copy....cp....rm means delete..****..Time to start over :(...This is why you should only be in root when you have to be..and this is why you make backups of servers.
 
jangell said:
Two hours...Wrote one hell of a script... and.. I made a *major* mistake. Take a look at what I did :( ... Man .. don't mix up cp for rm...

Notice where I realized what I just did... The part in bold was supposed to be a copy....cp....rm means delete..****..Time to start over :(...This is why you should only be in root when you have to be..and this is why you make backups of servers.
Congratulations. You have just passed one of the required mileposts of being a Unix SA. It is necessary that you do such a thing because it's 2 AM and the system MUST be up by 8 AM.

Another is when you type

Code:
rm . hiddenfile

How about the SA who did
Code:
>ls
>ls -al
>ls -al > ls
>ls -al
-bash: ls: permission denied

Bonus question: What setting, not recommended, is required to cause that problem?

I was slightly mystified for a minute or three figuring out what could have been done to make ls stop working. Figuring it out was a minor *ix BWAHAHA!
 
Ok. This script isn't near as cool as the last one. But it still should be very effective. I just don't want to reinvent the wheel and redo the one I accidentally deleted.. Anyways here is the script:

Code:
#!/bin/bash
##################################################################################
##################Pilots of America Server Test Script############################
############################Jesse Angell##########################################
##################################################################################


##################################################################################
###This function first removes leftovers from failed previous run attempts########
###it will then make a copy of the testfile to use for this benchmark run#########
###after that it will time how long it takes to compress with gzip################
###it'll now remove that gzip file as we no longer need it########################
##################################################################################

function cputest() {
    rm /home/test/testfile_go.gz >/dev/null 2>&1
    cp /home/test/testfile /home/test/testfile_go >/dev/null 2>&1
    /usr/bin/time -f '%e' -o /home/test/time gzip /home/test/testfile_go
    rm /home/test/testfile_go.gz >/dev/null 2>&1
    }


##################################################################################
###This function pings google and strips the results for the time it took and#####
###puts that number into a variable for later use in this script##################
##################################################################################

function pingtest() {
    pingtest=`ping google.com -c 1 |grep time= |awk '{print $7}' |awk -F = '{print $2}'`    
}


##################################################################################
###this function is used to log the actual results of the test into a text file###
###we call the above two functions and use echo to place them in a human readable#
###format into /home/test/testresults.log after which it will remove the file#####
###with the saved cpu benchmark time as it's no longer needed#####################
##################################################################################

function logresult() {
    pingtest
    cputest
    echo "`date`   Ping Time: `echo "$pingtest"`   CPU Time: `cat /home/test/time`" >> /home/test/testresults.log
    rm /home/test/time >/dev/null 2>&1
    }

###################################################################################
###This if statement is pretty simple and launches everything above.  first it#####
###checks to see if we have a 50MB test file to compress for benchmarks.  if#######
###we do not, it will generate one randomly using dd and /dev/urandom.############
###it's important to never delete the testfile as our results could be ############
###changed if this is forced to generate a new test file###########################
###################################################################################

if [ -f /home/test/testfile ]
    then
     logresult
    else
     dd if=/dev/urandom of=/home/test/testfile bs=1024 count=51200 >/dev/null 2>&1
     logresult
fi
It is a very safe script. I tested/developed it on Red Hat ES 4. It took my system about 6 seconds to run the cpu test. You can make the test more stressful by increasing the count in dd; going from 51,200 to 100,000 or so would about double the time...

Here are some basic instructions for you to deploy this:

/usr/sbin/adduser test
su test
cd /home/test
wget http://www.jesseangell.com/test
chmod 755 -Rf /home/test/test

That will pretty much get the script rolling. You can test it by typing:
/home/test/test .. That will generate a log file at /home/test/testresults.log.

In order for this to be effective you need to add it to cron so that it will run itself every few minutes. As the user: test type the following:

crontab -e

That will bring up the crontab in vi. Press i on your keyboard and type in:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/test/test
Now press Escape to leave insert mode, then press SHIFT and : (colon).
Type in: wq! and press enter. This will now run every five minutes.
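
(If the cron on that box is a reasonably modern Vixie cron, the step syntax below does the same thing with less typing - the long form above always works, though:)
Code:
*/5 * * * * /home/test/test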

The basic theory behind this script is pretty simple. It will allow you to watch the actual performance of the server. It's impossible for you to really know what's going on unless you load test it. This script generates a 50 megabyte file of random crap (from /dev/urandom). Every time it runs it compresses this file with gzip, which is very cpu intensive. It also pings google.com (google always works) to test network connectivity. From there it logs the results. If POA gets slow, simply take a look in the log and see if the times went way up. If they did, get off the virtual server.

If you are actually serious about using this, I could improve it/add more features to suit POA. Let me know if you need any Linux server help. There is always a ton of tweaking that can be done...along with setting up automated database backups offsite, etc.


Excuse my writing/technical explanation of my script. I'm tired and burned out.
 
So I got an error message and I was trying to figure out what it meant, so I started running the script line by line (well kinda) to test possible problem areas...

So I deleted testfile_go and then typed cp testfile testfile_go

Over a minute later, 39 of the 52 megs have been copied.

I switched users a couple times from me to root to test etc... 10 seconds or so just for that to take hold.

Meanwhile top reports the same total lack of load and free reports tons of memory.

Something is slamming the machine all right...
 
Greebo said:
So I got an error message and I was trying to figure out what it meant, so I started running the script line by line (well kinda) to test possible problem areas...

So I deleted testfile_go and then typed cp testfile testfile_go

Over a minute later, 39 of the 52 megs have been copied.

I switched users a couple times from me to root to test etc... 10 seconds or so just for that to take hold.

Meanwhile top reports the same total lack of load and free reports tons of memory.

Something is slamming the machine all right...
Yeah, but you're on a shared server. I don't think you can be sure it's not something being done by another site sharing the same physical hardware.

Maybe you're one of the unlucky ones sharing the hardware with the Maccast. :p (I think he ran screaming to a dedicated host on another site when GoDaddy shut him down with no notice.)

Have you opened a ticket with GoDaddy support asking them to look at it?
 
jangell said:
Ok. This script isn't near as cool as the last one. But it still should be very effective. I just don't want to reinvent the wheel and redo the one I accidentally deleted.. Anyways here is the script:

...

Excuse my writing/technical explanation of my script. I'm tired and burned out.

Jesse, I'm impressed you would not only write the script but also write the documentation in the wee hours. I would have run out of words in there somewhere.

I'm bashing myself lately because I spend so much time doing similar things. We geeks are hopeless.

Note to Alan: I do, indeed, understand everything that script does. :p
 
Greebo said:
So I got an error message and I was trying to figure out what it meant, so I started running the script line by line (well kinda) to test possible problem areas...

So I deleted testfile_go and then typed cp testfile testfile_go

Over a minute later, 39 of the 52 megs have been copied.

I switched users a couple times from me to root to test etc... 10 seconds or so just for that to take hold.

Meanwhile top reports the same total lack of load and free reports tons of memory.

Something is slamming the machine all right...
I'm curious what the error message was. Let me know. You are probably just missing a binary or a binary is in the wrong location.

That 52 meg file should copy in a matter of seconds. I should have timed this operation also and put it in the log. Didn't think of that. I can add that.
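
Adding that would only take a couple of lines - something like this (a sketch reusing the same /usr/bin/time trick as the gzip test; the file names are just placeholders):
Code:
# time a straight copy of the 50MB test file as a crude I/O benchmark
/usr/bin/time -f '%e' -o /home/test/iotime cp /home/test/testfile /home/test/testfile_io
rm -f /home/test/testfile_io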

So Yeah, Some other guy on the server is really loading it up. You might be able to request that GoDaddy look into the performance issues. But at this point you are going to be dealing with this kind of thing now and into the future..

May I suggest....
http://ev1servers.com/Dedicated/RTG/servers/valuextreme.aspx

They start at $69 per month for the celeron version and $89 for a Pentium 4.

I'd be willing to split operating costs..Let me know.. I wouldn't even want root. All I'd want is a regular user account for the purpose of network monitoring for my other servers.

mikea said:
I was slightly mystified for a minute or three figuring out what could have been done to make ls stop working. Figuring it out was a minor *ix BWAHAHA!
It looks to me like he overwrote ls with the output of ls. I'm not sure if there would be a setting to protect you from doing this. I'm not going to try it to find out. lol.
 
A very linux savvy friend suggested something that sounded like "compare run time to wall clock time". I didn't ask how to do this but he said it would tell you what percentage of the processor throughput your virtual server was getting.
 
lancefisher said:
A very linux savvy friend suggested something that sounded like "compare run time to wall clock time". I didn't ask how to do this but he said it would tell you what percentage of the processor throughput your virtual server was getting.
In an ideal situation your cpu time will be very close to real time (wall clock).. I could have compared these two numbers and outputted the difference but there would be no advantage. Either way I have to load the server up and time the operation. If you were curious about current load at that exact moment it would be ideal to compare those two numbers. But I'm more interested in logging real time performance for later review.
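
If you did want to eyeball that ratio, something like this (a rough sketch using GNU time's format flags) prints both numbers for the same gzip run - on an idle box they should be close, and a big gap means you're waiting on someone else's load:
Code:
# real = wall clock, user/sys = CPU time actually spent on our work
/usr/bin/time -f 'real=%e  user=%U  sys=%S' gzip -c /home/test/testfile > /dev/null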

This is what the log file looks like. I'm running this script on my server right now. It will give you some leverage with GoDaddy to show the times that were extremely slow. I could write some logic in to have a special log that only shows the slow time. But I'd need a threshold to use..and I'd have to see the log of this running on the POA server to know what that would be.

Code:
Sat Oct  7 21:01:13 CDT 2006   Ping Time: 33.3   CPU Time: 10.26
Sat Oct  7 21:02:15 CDT 2006   Ping Time: 33.5   CPU Time: 13.22
Sat Oct  7 21:03:13 CDT 2006   Ping Time: 33.1   CPU Time: 10.70
Sat Oct  7 21:04:12 CDT 2006   Ping Time: 33.3   CPU Time: 10.51
Sat Oct  7 21:05:13 CDT 2006   Ping Time: 33.8   CPU Time: 10.67
Notice I'm running it every minute. Which is kinda overkill. lol.
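
Once it's been running on the POA box for a while, pulling out just the slow runs is a one-liner (the 20-second threshold here is only a placeholder until we see real numbers):
Code:
# show only the runs where the gzip benchmark took longer than 20 seconds
awk -F'CPU Time: ' '$2+0 > 20' /home/test/testresults.log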
 
mikea said:
Code:
>ls
>ls -al
>ls -al > ls
>ls -al
-bash: ls: permission denied
Bonus question: What setting, not recommended, is required to cause that problem?
He created a file called "ls" in his current directory with the output from the 'ls' command. The setting is that he had added './' to his PATH.
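
The reason './' in PATH is frowned on is easy to demonstrate in a throwaway directory (a sketch - don't try it anywhere you care about):
Code:
mkdir /tmp/lstrap && cd /tmp/lstrap
export PATH=./:$PATH                          # the not-recommended setting
printf '#!/bin/sh\necho gotcha\n' > ls && chmod +x ls
hash -r                                       # make bash forget where it last found ls
ls                                            # runs ./ls instead of /bin/ls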
 
FlyNE said:
He created a file called "ls" in his current directory with the output from the 'ls' command. The setting is that he had added './' to his PATH.

I suspected this also..Which would be no big deal.
 