[Home] [Downloads] [Search] [Help/forum]


Register forum user name Search FAQ

Gammon Forum

[Folder]  Entire forum
-> [Folder]  SMAUG
. -> [Folder]  SMAUG coding
. . -> [Subject]  loop in write_to_descriptor in smaug fuss

loop in write_to_descriptor in smaug fuss

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page


Pages: 1 2  

Posted by Starlick   (9 posts)  [Biography] bio
Date Tue 16 May 2006 09:38 PM (UTC)

Amended on Tue 16 May 2006 11:15 PM (UTC) by Starlick

Message
I run a mud that has a bit of smaug code in it, especially the socket code. I recently added mccp and took it from smaug fuss. I then noticed my mud started lagging very ocasionally when it never used to, sometimes the lag could be a couple of minutes, sometimes barely noticeable but often for a few seconds. I thought what have I done? surely this would be obvious to everyone else if it were a bug.. but a quick look at the code revealed a big problem and one very easy to fix too, well a quick fix at least :).
Anyway problem is below, I posted this to the smaugfuss site too but thought it was large enough to warrent posting here..

If the socket is blocked, you continue the loop.
What should happen is either you check how many times this has failed, or you just return false immediately
as it is it can lag a mud for minutes at a time if the socket blcoks more commonly it just lags the mud for a few seconds as sockets usually die..

Why bother using non blocking sockets when you basically block them waiting until you can write??

The same problem is also further down in write_to_descriptor also, best solution would be to put data back into buffer and let the mud try again later and perhaps a counter in descriptor_data that could count the bad retries each time flush_buffer is called and perhaps after 10 blocked writes dump the socket.

of course the easiest fix is just to return false straight away, but this could dump people when it isn't necessary as sockets can block temporarily for various reasons.


len = d->mccp->out_compress->next_out - d->mccp->out_compress_buf;
if( len > 0 )
{
for( iStart = 0; iStart < len; iStart += nWrite )
{
nBlock = UMIN( len - iStart, 4096 );
nWrite = send( d->descriptor, d->mccp->out_compress_buf + iStart, nBlock, 0 );
if( nWrite == -1 )
{
iErr = errno;
if( iErr == EWOULDBLOCK )
{
/*
* This is a SPAMMY little bug error. I would suggest
* not using it, but I've included it in case. -Orion
*
perror( "Write_to_descriptor: Send is blocking" );
*/
nWrite = 0;
continue;
}
else
{
perror( "Write_to_descriptor" );
return FALSE;
}
}

if( !nWrite )
break;
}

if( !iStart )
break;

if( iStart < len )
memmove( d->mccp->out_compress_buf, d->mccp->out_compress_buf + iStart, len - iStart );

d->mccp->out_compress->next_out = d->mccp->out_compress_buf + len - iStart;
}

<snip>

and same problem here for players not using mccp
probably best to put data back into outbuf and let it try again later



for( iStart = 0; iStart < length; iStart += nWrite )
{
nBlock = UMIN( length - iStart, 4096 );
nWrite = send( d->descriptor, txt + iStart, nBlock, 0 );
if( nWrite == -1 )
{
iErr = errno;
if( iErr == EWOULDBLOCK )
{
/*
* This is a SPAMMY little bug error. I would suggest
* not using it, but I've included it in case. -Orion
*
perror( "Write_to_descriptor: Send is blocking" );
*/
nWrite = 0;
continue;
}
else
{
perror( "Write_to_descriptor" );
return FALSE;
}
}
}
[Go to top] top

Posted by Nick Gammon   Australia  (23,016 posts)  [Biography] bio   Forum Administrator
Date Reply #1 on Tue 16 May 2006 10:46 PM (UTC)
Message
If you log onto the forum (top RH corner) you can edit your own posts.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Starlick   (9 posts)  [Biography] bio
Date Reply #2 on Tue 16 May 2006 10:52 PM (UTC)
Message
Excellent, thanks Nick.
[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #3 on Tue 16 May 2006 11:37 PM (UTC)
Message
Quote:
Why bother using non blocking sockets when you basically block them waiting until you can write??
The point of a non-blocking socket is not that you can always write to it and "forget" about it, it's that a blocking operation will return a WOULDBLOCK error. This prevents your program from blocking up uselessly on a network fluke, which obviously is something you don't want to do (blocking up, that is).

You are correct to point out that it is somewhat silly to keep pounding on the socket if it's blocking.

However you are also correct to point out that returning false really isn't a terribly good solution, because as you say it could boot people off incorrectly.

A better solution is to simply act as if the send had succeeded, and wait till the next loop to send whatever you need to send.

I rewrote the network code from scratch for my MUD, and what I do is basically this:


while ( there's still stuff to write )
  write a chunk of it
  if write failed:
    if error == would block
      return success (try again next time)
    else
      return failure, boot player
    end if
  end if
end while


The key is that on block I simply wait till next time. Now, one of the major changes from SMAUG's network code is that "next time" is not a whole tick away (roughly 0.5 seconds, if I remember correctly). My code continuously polls sockets, for both incoming and outgoing data; as soon as there's something to read or write, it tries to take care of that. Meanwhile it is keeping track of the time, and once the tick time has elapsed it calls the update function.

The result is that the MUD feels much more responsive, as input/output is handled immediately, as opposed to handling a batch of it, then pausing until the next update -- that is what SMAUG does.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

Posted by Starlick   (9 posts)  [Biography] bio
Date Reply #4 on Tue 16 May 2006 11:44 PM (UTC)
Message
"The point of a non-blocking socket is not that you can always write to it and "forget" about it, it's that a blocking operation will return a WOULDBLOCK error. "

Yeah I do realise what blocking is vs non blocking. I was just meaning that why did they even bother using non blocking sockets when they loop anyway :).

A nice idea about polling continuously, I might try that and see how it goes.


I just added code to add the unsent text back into the buffer that it couldnt send so it gets resent again on the next try.

It seems stock smaug just returns false on any error including a blocked socket.
[Go to top] top

Posted by Starlick   (9 posts)  [Biography] bio
Date Reply #5 on Wed 17 May 2006 01:23 AM (UTC)

Amended on Tue 11 Mar 2008 03:30 AM (UTC) by Nick Gammon

Message
Just incase anyone wants a fix..

i tried to submit it here but was too big..
you can see the fix here...
http://forums.smaugfuss.org/index.php?a=topic&t=695&amp

should do the trick, requeues any data it cant send due to blocks, as if you just returned true as it currently is you would lose data to that descriptor, or if you returned false the socket would be dropped :).

no more lag!

[EDIT - 11 March 2008] - The Smaug FUSS site is now http://www.smaugmuds.org/
[Go to top] top

Posted by Starlick   (9 posts)  [Biography] bio
Date Reply #6 on Mon 22 May 2006 03:57 AM (UTC)

Amended on Mon 22 May 2006 04:25 AM (UTC) by Starlick

Message
Just to let everyone know that I found a few issues with my fix and then found various other strange things in smaug/smaugfuss that didn't make much sense.. I have fixed these but our code is so different now that it will take a while to make a patch for regular smaug or smaugfuss..

The problems I found were

1) in regular smaug, if a socket blocks, smaug drops the connection..

2) in smaugfuss if a socket blocks, smaugfuss sits there in a loop waiting for the socket to unblock, this can be seconds or in worse case scenarios minutes.

3) Why does smaug/smaugfuss do a select on the sockets without a timeout? and then have to use the sleep command?
and why does it select on on outbound sockets when they have no data to be written to?


Anyway the first two i addressed by checking for a blocked socket, if it was blocked i took note of how many bytes were successfully written and then had to add the rest back into the buffer. this means that if someone has a blocked socket they dont get dropped and no one else suffers lag because the mud hammers their socket waiting to write.

the 3rd I am not sure what the writers were doing.. seems a strange mess of code in there really..

I then decided to give select a delay, remove the sleep.
I also only added descriptors to the outset of descriptors that actually had data waiting to go out as otherwise select is going to return instantly everytime as there will always be descriptors that are ready for output even though the mud might not have anything to send.

Then i got rid of the sleep, but added a loop to check the time, so that after each select, it checks if .25 seconds have passed, and only allows the mud activity to happen then.


This means by doing it this way, mud input/output is lightning fast, no delays at all and we are only looking at descriptors when there is actually something to send or receive.

I also increased the output to 4k each time instead of the miserable 512 bytes that it currently is..

I was suprised to find that every code base has these problems and no one has ever addressed them...

Why? - I have no idea.

If anyone actually wants to see what kind of change this makes to a mud (hit ethereal2.org 4000) and type idea list.

Once you make the change you can never go back to that old laggy code hehe






[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #7 on Mon 22 May 2006 05:39 AM (UTC)
Message
What you did regarding the sleeping/timeouts is basically what I did, and yes, it does make I/O lightning fast. The main problem is that you get a TICK_TIME seconds window to make your input; as you say by adding the sockets at every iteration even if they have nothing to send, select returns instantly.


I wrote up a small report on this a while ago. I suppose I should have mentioned this earlier. :-)

http://www.gammon.com.au/forum/bbshowpost.php?bbsubject_id=2989

There's some other interesting information in there.

As to your 'why' questions, I think Nick put it correctly:
Quote:
I'm taking a guess here that whoever did that made changes to code that was initially working properly without fully understanding what they were doing.


Basically, networking code is something that relatively few people actually understand. A lot of people hack around on SMAUG without understanding exactly what is going on in the big picture. Network code is the kind of thing that's fairly easy to drastically break if you don't know what all the functions are for, and my bet is that that's why the network code behaves so oddly.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

Posted by Starlick   (9 posts)  [Biography] bio
Date Reply #8 on Mon 22 May 2006 06:17 AM (UTC)
Message
Yeah seems we were on the same track.

it is funny because I have also changed our mud to using events instead of cycling through everything checking to see if something needs to be done.

I just used an array of 240 lists.. one list per .25 seconds and then put events in whatever list was necessary, and if i needed it to go happen beyond a minute just had an int in the event structure for how many rounds it should cycle etc.

Of course had to keep a record on the character structure or object structure of how many entries were in the lists so didnt extract that character before removing all entries or nasty things happen hehe.

Was it worth it? Well it certainly isnt a hard change to make but time consuming it was.. It does make it very easy to add cool things that you want to happen in the future etc, but on any cpu bought today it doesnt really make much difference in the cpu usage hehe. Only satisfaction is you know that the mud is doing things efficiently.


You maybe right about the network code. I downloaded quite a few distributions going way back, all seem to work with the same method. Maybe people just thought well it works, lets not change it..


[Go to top] top

Posted by Samson   USA  (683 posts)  [Biography] bio
Date Reply #9 on Mon 22 May 2006 01:04 PM (UTC)
Message
Well I don't know about others, but any improvement to the network performance is welcomed by me. I look forward to seeing the patch. Networking code isn't one of my best areas, so like others I've tended to leave it be since it "works well enough".
[Go to top] top

Posted by Zeno   USA  (2,871 posts)  [Biography] bio
Date Reply #10 on Fri 26 May 2006 03:49 AM (UTC)
Message
I may have run into this just now. I hit an infinite loop, with this backtrace:
#0  0x007be402 in __kernel_vsyscall ()
#1  0x008a5b51 in send () from /lib/libc.so.6
#2  0x080c4eb6 in _to (desc=17,
    txt=0x90ded38 "\n\r\033[1;37mMikomi dual-wields a small stick.\n\r\033[1;37mMikomi blocks the oncoming attack.\n\r\033[1;37mRowe scratches Mikomi. \033[0;32mShe does 89 points of damage.\033[0m\033[1;37m\n\r\033[1;37mMikomi blocks the oncoming "..., length=4004) at comm.c:2174
#3  0x080c9ae3 in flush_buffer (d=0x910bab8, fPrompt=0 '\0') at comm.c:1991
#4  0x080c568d in close_socket (dclose=0x910bab8, force=0 '\0') at comm.c:1576
#5  0x080c64f3 in check_playing (d=0x90da1a0, name=0x9108df8 "Sevi", kick=1 '\001') at comm.c:3721
#6  0x080c756c in nanny (d=0x90da1a0, argument=0xbfebf404 "carpet") at comm.c:2557
#7  0x080caa54 in game_loop () at comm.c:1246
#8  0x080cc68b in main (argc=17, argv=0x90df9c5) at comm.c:604

In write_to_descriptor, nWrite was negative. So of course this was the infinite loop:
for ( iStart = 0; iStart < length; iStart += nWrite )

Zeno McDohl,
Owner of Bleached InuYasha Galaxy
http://www.biyg.org
[Go to top] top

Posted by Starlick   (9 posts)  [Biography] bio
Date Reply #11 on Sat 27 May 2006 12:17 AM (UTC)

Amended on Sat 27 May 2006 12:18 AM (UTC) by Starlick

Message
Yeah thats the loop, it's got stuck writing to a blocked socket.

It happens a lot on reconnects because....
Normally a select is done which tells the mud which sockets it can write to, so it doesnt write to sockets that arent ready.. in the loop that calls flush_buffer it checks to see if that socket has been selected by the select.

However, when someone reconnects, it does a close_socket, when then calls flush_buffer on their old socket, so it starts trying to write to that old socket even though it's not ready and then causes a loop while it hammers a blocked socket, and those sockets can take minutes to die sometimes.

in the close socket code you should check if the socket is ready to be written to.. as there is no point in flushing the buffer if we cant write to it.

But of course aside from this in the write_to_descriptor loop you need to check for a blocket socket as it could become blocked during the process also. As a quick fix its better to return FALSE which will drop their connection rather than have the whole mud lockup.

void close_socket( DESCRIPTOR_DATA * dclose, bool force )
{
CHAR_DATA *ch;
DESCRIPTOR_DATA *d;
bool DoNotUnlink = FALSE;
SNOOP_DATA *snoop, *next_snoop;

/*
* flush outbuf
*/
if( !force && dclose->outtop > 0 &&
FD_ISSET( dclose->descriptor, &out_set ))
flush_buffer( dclose, FALSE );
[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #12 on Sat 27 May 2006 12:29 AM (UTC)
Message
If I could make a suggestion, when dealing with such fundamental code as the networking, it would really be best to avoid quick fixes, and to prefer proper fixes that understand the problem and solve it in some generally correct way. Making quick fixes to fix a specific issue without completely understanding how it fits into the overall picture can introduce very subtle errors into a program, especially when you're dealing with a fundamental component that provides I/O for the entire game.

This is another reason why I suspect the network code hasn't changed much; it's very easy to break things by making small tweaks, and so most people prefer to simply stay away from the whole thing because it (generally speaking) works "well enough".



The problem in this case is that the design behind flush_buffer is really not a terribly safe one, because it tries to write out the entire buffer before closing it. Instead, it would be much safer to mark the socket as closed so that it accepts no further output, and to let the socket flush itself normally. Optionally a timeout could be introduced such that if it doesn't flush within, say, five seconds, the socket is simply cut off, throwing away the entire buffer contents.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

Posted by Starlick   (9 posts)  [Biography] bio
Date Reply #13 on Sat 27 May 2006 12:39 AM (UTC)

Amended on Sat 27 May 2006 12:40 AM (UTC) by Starlick

Message
Checking if the socket is ready for writing before writing to it won't break anything, especially as it's in the close_socket function anyway so its about to kill the socket.

That is far better than what currently happens (the mud locking up when someone reocnnects and it starts trying to write to their old dead socket).

This is a change someone can make in seconds to their mud and is perfectly safe and will stop them getting huge lag at times.






[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #14 on Sat 27 May 2006 02:35 AM (UTC)
Message
Yes, clearly it is better than the MUD locking up. My point is that it is not correct to check only once if the socket is ready before flushing the buffer, because the socket might stop being ready but be ready later on. Don't forget that close_socket is used in situations other than reconnects, and you really want the flush_buffer function idea to work normally.

Basically, you want to keep checking if it's ready to write to all the way until either:
a) you've written everything you want to write
b) some error condition (or timeout) occurs

It just isn't right to only check once and then write, because you might check once, write half the data, and then throw away the rest, which is quite simply (but, yes, unfortunately) an incorrect solution.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.


50,085 views.

This is page 1, subject is 2 pages long: 1 2  [Next page]

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page

Go to topic:           Search the forum


[Go to top] top

Quick links: MUSHclient. MUSHclient help. Forum shortcuts. Posting templates. Lua modules. Lua documentation.

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.

[Home]


Written by Nick Gammon - 5K   profile for Nick Gammon on Stack Exchange, a network of free, community-driven Q&A sites   Marriage equality

Comments to: Gammon Software support
[RH click to get RSS URL] Forum RSS feed ( https://gammon.com.au/rss/forum.xml )

[Best viewed with any browser - 2K]    [Hosted at HostDash]