Gammon Forum : SMAUG : SMAUG coding : String Hashing

Entire forum ➜ SMAUG ➜ SMAUG coding ➜ String Hashing

String Hashing

It is now over 60 days since the last post. This thread is closed. Refresh page

Posted by

Greven Canada (835 posts) Bio

Date

Tue 09 Dec 2003 07:00 AM (UTC)

Message

From what I understand, fread_string and fread_string_nohash do fairly different things. fread_string will use STRALLOC to assign the memory, and fread_string_nohash uses someething else. Now, when your freeing the memory, you need to use either DISPOSE for fread_string_nohash, and STRFREE for fread_string.

Questions: Why does it hash the information, and how is it hashing it? When you use the wrong free macro, it will crash, and I assume this deals with the hashing, but does it for sure? Am I way off?

Nobody ever expects the spanish inquisition!

darkwarriors.net:4848
http://darkwarriors.net

Top

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #1 on Tue 09 Dec 2003 07:42 AM (UTC)

Message

OK... the basic idea between string hashing is to save memory.

For instance, you might want to have 30 mobs, each with the name "a goblin".

Now, without string hashing, you are storing those 9 bytes (8 letters + null space) 30 times. Not good...

The clever thing with string hashing is that you store it only once, and you keep a reference count on how many times that string is actually used. When the reference count hits 0, you deallocate the memory - but never before.

How the hashing works is a slightly different subject, but basically it keeps a list of string/count structures. When you use STRALLOC, it looks up in the table if the string already exists; if so, it returns the pointer to that and increments the reference count; if not, it allocates new memory, sets the reference count to one, and then returns that pointer.

Hash tables are actually a slightly more complicated concept; I don't think it's necessary to understand the concepts of hashing algorithms, buckets, hash collisions, etc., to understand string hashing.

The QUICKLINK macro basically means that you know that the string you're linking to is in the hash table, so you know where to increment the count. Personally I think that the way quicklink is implemented is dangerous, but that's another story. :)

Anyways, let's consider now that you have a hashed string. Because of how the implemented it (no type safety or anything) your string is basically a char *. This is good because it means you can use it exactly as you would a normal, non-hashed string, but bad because of the freeing.

What happens if two mobs share the same hashed name? There is only one memory allocation, with a reference count of two. If you use the wrong macro - DISPOSE for instance - you do not properly manage the hash table. So, it goes and erases the memory, and poof, the other mob is now pointing to invalid memory.

What STRFREE does is to actually go into the hash table, decrement the count, and only deallocate if there is nothing left.

What I think is very poor in SMAUG's implementation is that the only way to tell if a string is hashed is to look up its allocation, and that can be tedious. It would have been as simple as putting an h_ in front of the variable name, such as mob->h_name... and better yet, which is the solution I am adopting, create your own data structure that provides nicer type safety. In their defense, however, such a solution is much easier to do in C++ where you can have much stricter compile-time typing than in C. Still, they could have done a better job.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Top

Posted by

Greven Canada (835 posts) Bio

Date

Reply #2 on Tue 09 Dec 2003 04:53 PM (UTC)

Message

Ok, then is there any reason to use DISPOSE at all, if STRFREE will do the same job, but better, checking for the count on the string, and wipe it when its done, rather than just wipe it? Or does DISPOSE do something that STRFREE doesn't?

Nobody ever expects the spanish inquisition!

darkwarriors.net:4848
http://darkwarriors.net

Top

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #3 on Tue 09 Dec 2003 07:14 PM (UTC)

Message

Well, DISPOSE is just a plain old free, as far as I remember. The thing is you can't use STRFREE on a normally allocated string (or block of memory) because it wouldn't have the special hash data structure (string + count) that a hashed string allocated using STRALLOC would have.

If you use STRFREE on a normal string, it'll mess up because it'll start looking for a count that doesn't exist, and change it when it shouldn't. Bad times...

You could have every string allocated using STRALLOC and freed using STRFREE, I suppose, and I think that to some degree that's what the SMAUG people decided to do (mostly). It's just that sometimes you don't need to bother with the whole hashing algorithm and you save some time.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #4 on Wed 10 Dec 2003 10:19 AM (UTC)

Message

All this hashing gets quite complicated, I would agree with Ksilyan that it would be better if it was easier to work out what strings were hashed and what were not.

Either using a C++ class (which doesn't really apply to SMAUG which is written in C), or do away with it entirely. Remember a lot of this space-saving code was written when people had 8 Mb PCs, not 128 Mb, so the need to save space is a little reduced.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by

Samson USA (683 posts) Bio

Date

Reply #5 on Wed 10 Dec 2003 03:12 PM (UTC)

Message

Except that all this hashing stuff is still quite useful when you consider that the vast majority of muds are being used on shared hosts, most of whom impose some strict RAM quotas.

In my case, I did some testing. Currently we use 22MB with the code and its mix of hash/non-hash. Changing it to disable the hashing caused end usage to rise to 29MB - that's a 7MB difference, and would generally be enough to bump things into a higher account grade.

I haven't tried to do the opposite, but I can't imagine that you'd save anything more by making it all hashed. As has been pointed out, if the string doesn't exist the hash code allocates it and sets the count to 1. Only thing that really would do is waste space in the hash buckets.

You can't completely eliminate DISPOSE though, since it's also used to deallocate structures when they're no longer needed. Since this would require going over lots of code to fix, that would probably explain why it never happened :)

Top

Posted by David Haley USA (3,881 posts) Bio

Date Reply #6 on Wed 10 Dec 2003 06:43 PM (UTC)

Message

Yes, it's true that the hashing does save space and that's important on shared machines. Doesn't mean it has to be that hard to use, though, and that prone to subtle yet nearly fatal errors.

I'm not sure why they use DISPOSE, which is just a fancy schmancy version of free. I think it checks against freeing a null pointer, but no programmer should be disposing a null anyways...

My solution to this was to write a C++ hashed string class that when you assign to it automatically hashes the string into the table; it also automatically removes the string from the table before that, and when you deallocate it (or when it is deallocated automatically.)

I haven't implemented it everywhere yet because I don't feel like changing all of it just now, but I will at some point. Everything I do that needs one of those uses the new version; I guess sometime I'll have to attack the rest as well. :P

When I said make everything hashed, I meant all variables that actually stick around, not those little temporary ones declared in functions. Like if you need a small buffer for just a few calls, don't bother hashing it; that would be a waste of processor time. You'd probably save a few k, which is not a big deal considering the work it would be. Then again, results will vary a lot depending on how big your MUD is.

If you use a typedef in C, can't that at least semi-solve the problem? For example, you can say:
typedef char* HashedString;
And then declare:


struct MobileStruct
{
  HashedString name;
}

Will C let you use that as a normal char* or will it complain at you? I guess it depends on how strict your compiler is; I've found that gcc is not strict at all, and it's generally a much better idea to use g++, even on normal C.

For example, when I ported a Star Wars MUD to C++, without changing anything at all but moving from gcc to g++, roughly 1,000 errors (not just warnings) were generated from poor code that did very strange (and unsafe) things. I think it's good to bonk the programmer around like that, and make sure that s/he *really* knows what s/he is doing. Anyways... that's just me. :P

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #7 on Wed 10 Dec 2003 08:34 PM (UTC)

Message

Quote:

Currently we use 22MB with the code and its mix of hash/non-hash. Changing it to disable the hashing caused end usage to rise to 29MB - that's a 7MB difference, and would generally be enough to bump things into a higher account grade.

Good point. In that case an alternative would be compression. I know PennMUSH has used compression rather than hashing for a long time, and offer a number of algorithms, from no compression, to Huffman encoding, a token replacement system that I wrote, and maybe one or two more.

The problem with hashing is that the whole string has to exactly match (ie. give the same hash) whereas with compression, if hundreds of descriptions had the same words in them (eg. "room", "door", "exit", "Darkhaven", "you") then they compress down quite a bit. Some compression algorithms (like zLib) allow you to supply a pre-composed dictionary (eg. the words above) so that you get the compression even on short strings.

Generally compression is slower than decompression, because that is the time the dictionary lookups occur, and the compression would only occur once (when the area is loaded), and thus the decompression may not have a great speed penalty.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

Posted by

Greven Canada (835 posts) Bio

Date

Reply #8 on Sun 14 Dec 2003 04:54 AM (UTC)

Message

Do you happen to know where to get the compression code? I don't know if yours is publicly available, but if not, can someone maybe suggest one? It seems to me that if there is something to improve effeiceny with a reasonable amount of cost(processor time, I assume, with the compression), that it is something worth while, similiar to MCCP. Thanks for all the help and explanations, everyone.

Nobody ever expects the spanish inquisition!

darkwarriors.net:4848
http://darkwarriors.net

Top

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #9 on Sun 14 Dec 2003 06:04 AM (UTC)

Message

It is all publicly available in the PennMUSH source code, see www.pennmush.org.

However if I was doing it today I would use zLib, a public compression/decompression algorithm. It is the one used for MCCP.

See: http://www.gzip.org/zlib/

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.

21,118 views.

It is now over 60 days since the last post. This thread is closed. Refresh page

Go to topic: Search the forum

top