Network Working Group                                       T. Klingberg
Request for Comments: NNNN                                   R. Manfredi
Category: Informational                                        June 2002


                        Gnutella 0.6


Status of this Memo

   This is a draft.


Copyright Notice

   Copyright (C) 2002, Tor Klingberg & Raphael Manfredi
   All Rights Reserved.

   Permission is granted to make verbatim copies of this document, 
   provided the Copyright Notice is preserved.


Rights of Free Implementation

   The authors of the various proposals that make up this document
   grant the rights to anyone to freely implement those proposals.
   Gnutella is an Open Protocol, where the specifications are
   public and free of any patent.


Table of Contents

   1   Introduction
   1.1   Purpose
   1.2   Requirements
   1.3   Terminology
   1.4   Extending the protocol
   2   Protocol Definition
   2.1   Initiating a Connection
   2.2   Gnutella Messages
   2.2.1   Message Header
   2.2.2   Ping (0x00)
   2.2.3   Pong (0x01)
   2.2.4   Use of Ping and Pong messages
   2.2.4.1   A simple pong caching scheme
   2.2.4.2   Other pong caching schemes
   2.2.5   Query (0x80)
   2.2.6   Query Hit
   2.2.7   Use of Query and Query Hit
   2.2.7.1   Forwarding and routing of Query and Query Hit messages
   2.2.7.2   When and how to send new Query messages.
   2.2.7.3   When and how to respond with Query Hit messages.
   2.2.8   Push (0x40)
   2.2.9   Bye (0x02)
   2.3   GGEP Extension blocks
   2.3.1   GGEP Format
   2.3.2   Creating Extension IDs
   3   Protocol Usage
   3.1   Flow Control
   3.2   Network Structure
   3.2.1   Query Routing Protocol (unfinished)
   4   File Transfer
   4.1   Normal File Transfer
   4.2   Firewalled servants
   4.3   Busy Servants
   4.4   Sharing
   5   Security Considerations
   5.1   Threats against individual Gnutella participants
   5.2   Threats against the Gnutella network
   5.3   Threats against third parties
   6   Credits
   Appendix 1   HUGE (Hash/URN Gnutella Extensions)
   Appendix 2   XML
   Appendix 3   Finding a Gnutella host
   Appendix 4   When to open or accept new Gnutella connections
   Appendix 5   Gnutella network traffic compression


1 Introduction

1.1 Purpose     

Gnutella is a decentralized peer-to-peer system. It allows the
participants to share resources from their system for others to
see and get, and locate resources shared by others on the network.

Resources can be anything: mappings to other resources, cryptographic
keys, files of any type, meta-information on keyable resources, etc.
However, the semantics for locating and handling resources other than
plain files are not specified in this document.

Each participant launches a Gnutella program, which will seek out 
other Gnutella nodes to connect to.  This set of connected nodes 
carries the Gnutella traffic, which is essentially made of queries, 
replies to those queries, and also other control messages to 
facilitate the discovery of other nodes.

Users interact with the nodes by supplying them with the list of
resources they wish to share on the network, can enter searches for
other's resources, will hopefully get results from those searches,
and can then select those resources amongst the results: if those
resources are files, for instance, they can download them.  But one
can imagine other types of resources that, once fetched, will bring
more than their content value.

Resource data exchanges between nodes are negotiated using the 
standard HTTP protocol.  The Gnutella network is only used to locate 
the nodes sharing those resources.

This document is intended for readers with a fair knowledge of 
network programming, but do not require any previous Gnutella 
experience.  Still, other implementations of this protocol will give
useful information about implementation techniques that is not 
included in this document.  A list of Gnutella programs can be found 
at http://www.gnutelliums.com


1.2 Requirements

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [34].


1.2.1 The Gnutella Development Forum (the GDF)

The Gnutella Development Forum is a good place to find more Gnutella 
documentation, proposals about changes and extensions and to discuss 
Gnutella development with other developers. The message archive is 
also a good source for information about the protocol and its 
implementation. Some of the links in this document requires 
membership in the Gnutella Development Forum. Everyone is, of course,
allowed to become a member. The GDF is located at 
http://groups.yahoo.com/group/the_gdf

There are many other forums for discussing Gnutella development as 
well.


1.3 Terminology

Servant         A program participating in the Gnutella network is 
                called a servant. The words "peer", "node" and "host"
                have similar meanings, but refers to a network 
                participant rather than a program. When a servant 
                have a clear client or server role the words "client"
                or "server" may be used. The word "client" is 
                sometimes used as a synonym for servant. Some other 
                documents use the word "servent" instead of servant, 
                as a contraction of "SERVer" and "cliENT".

Message         Messages are the entity in which information is
                transmitted over the network. Sometimes the word 
                "packet" is used with the same meaning. Some other 
                documents use the word "descriptor"

GUID            Globally Unique IDentifier.  This is a 16-byte long
                value made of random bytes, whose purpose it is to
                identify servants and messages.  This identification
                is not a signature, just a way to identify network
                entities in a unique manner.

1.4 Extending the protocol

This document is the definition of the Gnutella 0.6 protocol. 
Servants MAY extend the protocol or even change parts of it (for 
example by compressing or encrypting the messages), but servants 
MUST always stay compatible with servants that follow this 
specification. 

If a servant, for example, wants to compress the Gnutella messages, 
it MUST first make sure the remote host of a connection can 
decompress the stream (during handshake), and otherwise leave the 
messages uncompressed. Servants MAY chose not to accept a connection 
with a servant that does not support a feature, but MUST always make 
sure that the Gnutella network is not split into separate networks. 

Separate networks for special purposes are, of course, allowed but 
then it is no longer the Gnutella network, but another network. 

This protocol also allows for extensions inside many messages. Such 
extensions can pass through servants that do not know about the 
extension to reach servants that do.


2 Protocol Definition

The Gnutella protocol defines the way in which servants communicate 
over the network. It consists of a set of messages used for 
communicating data between servants and a set of rules governing 
the inter-servant exchange of messages. Currently, the following 
messages are defined:

Ping            Used to actively discover hosts on the network. A 
                servant receiving a Ping message is expected to 
                respond with one or more Pong messages.

Pong            The response to a Ping. Includes the address of a 
                connected Gnutella servant, the listening port of
                that servant, and information regarding the amount
                of data it is making available to the network.

Query           The primary mechanism for searching the distributed 
                network. A servant receiving a Query message will
                respond with a Query Hit if a match is found against 
                its local data set.

QueryHit        The response to a Query. This message provides the
                recipient with enough information to acquire the data
                matching the corresponding Query.

Push            A mechanism that allows a firewalled servant to 
                contribute file-based data to the network.

Bye             An optional message used to inform the remote host 
                that you are closing the connection, and your reason 
                for doing so.


2.1 Initiating a Connection

A Gnutella servant connects itself to the network by establishing a 
connection with another servant currently on the network. 
Techniques for finding the first host are described in Appendix 3. 
Once the first connection is established, the addresses of more hosts
will be supplied over the network. The default Gnutella port is 6346,
but servants MAY use any unused port. If the desired port is used 
(probably by another Gnutella servant) the servant SHOULD attempt to 
listen on another port.  This listening port is advertised by the
servant through the Pong messages.

Techniques and rules for how to select what other Gnutella hosts to 
connect to and when to accept connection requests can be found in 
Appendix 4.

Once the address of another servant on the network is obtained, a 
TCP/IP connection to the servant is created, and a handshaking 
sequence is initiated. The client is the host initiating the 
connection and the server is the host receiving it. "<cr>" refers 
to ASCII character 13 (carriage return), and "<lf>" to 10 (new line).

   1. The client establishes a TCP connection with the server.
   2. The client sends "GNUTELLA CONNECT/0.6<cr><lf>".
   3. The client sends all capability headers--except for
      vendor-specific headers--each terminated by "<cr><lf>", with
      an extra "<cr><lf>" at the end.
   4. The server responds with "GNUTELLA/0.6 200 <string><cr><lf>".
      <string> SHOULD be "OK", but servants SHOULD just look for the
      "200" code.
   5. The server sends all its headers, in the same format as in (3).
   6. The client sends "GNUTELLA/0.6 200 OK<cr><lf>, as in (4) if
      after parsing the server's headers, it still wishes to connect.
      Otherwise, it needs to reply with an error code and close the
      connection.
   7. The client sends any vendor-specific headers as needed, in the
      same format as (3).
   8. Both client and server send binary messages at will, using the
      information gained in (3) and (5).  

All headers SHOULD be registered with the GDF database at
http://groups.yahoo.com/group/the_gdf/database?method=reportRows&tbl=9
(Requires GDF membership)

Headers follow the standards described in RFC822 and RFC2616.  Each
header is made of a field name, followed by a colon, and then the 
value.  Each line ends with the <cr><lf> sequence, and the end of the
headers is marked by a single <cr><lf> line.  Each line normally 
starts a new header, unless it begins with a space or an horizontal 
tab (ASCII codes 32 and 9 in decimal, respectively), in which case it
continues the preceding header line.  The extra spaces and tabs may 
be collapsed into a single space as far as the header value goes.  
For instance:

    First-Field: this is the value of the first field<cr><lf>
    Second-Field: this is the value<cr><lf>
        of the<cr><lf>
        second field<cr><lf>
    <cr><lf>

The header above is made of two fields, "First-Field" and "Second-
Field" whose values are respectively "this is the value of the first 
field" and "this is the value of the second field" (leading spaces of
the continuation were collapsed).  Note that the leading space 
between the ":" ending the field name and the start of the value 
string does not count.

Multiple header lines with the same field name are identical to one
header line where all the values of the fields would be separated by 
",". This means:

    Field: first<cr><lf>
    Field: second<cr><lf>

is strictly equivalent to saying:

    Field: first,second<cr><lf>

In other words, order matters in that case.

Here is a sample interaction between a client and a server.  Data 
sent from client to server is shown on the left; data sent from 
server to client is shown on the right.

       Client                           Server
       -----------------------------------------------------------
       GNUTELLA CONNECT/0.6<cr><lf>
       User-Agent: BearShare<cr><lf>
       Query-Routing: 0.2<cr><lf>
       Pong-Caching: 0.1<cr><lf>
       GGEP: 0.5<cr><lf>
       <cr><lf>
                                        GNUTELLA/0.6 200 OK<cr><lf>
                                        User-Agent: BearShare<cr><lf>
                                        Query-Routing: 0.1<cr><lf>
                                        Pong-Caching: 0.1<cr><lf>
                                        GGEP: 0.5<cr><lf>
                                        Private-Data: 5ef89a<cr><lf>
                                        <cr><lf>
       GNUTELLA/0.6 200 OK<cr><lf>
       Private-Data: a04fce<cr><lf>
       <cr><lf>

       [binary messages]                [binary messages]

A few notes about the responses: first, the client (server) SHOULD
disconnect if receiving any response other than "200" at step 4
(6).  There is no need to define these error codes now.  Second,
servants SHOULD ignore higher version numbers in steps (2), (4), and
(6).  For example, it is perfectly legal for a future client to
connect to a server and send "GNUTELLA CONNECT/0.7".  The server
SHOULD respond with "GNUTELLA/0.7 200 OK" if it supports the 0.7
protocol, or "GNUTELLA/0.6 200 OK" otherwise.

A few notes about the headers: servants SHOULD use standard HTTP
headers whenever appropriate.  For example, servants SHOULD use the
standard "User-Agent" header rather than make up a "Servant-Vendor"
header.  However, it is perfectly legal to add new headers (e.g.,
"Query-Routing") when no appropriate HTTP header exists, as long as
they follow HTTP syntax. Headers unknown to the servant MUST be 
ignored.

Some older servants will initiate the handshake by sending 
"GNUTELLA CONNECT/0.4<lf><lf>". The server SHOULD then reply with 
"GNUTELLA OK<lf><lf>" followed by binary messages, if it can accept 
the connection. Servants MAY retry using the 0.4 connect string if
the 0.6 connection attempt were rejected. No handshaking headers can 
be used in 0.4 handshaking.

When rejecting a connection, a servant MUST, if possible, provide the
remote host with a list of other Gnutella hosts, so it can try 
connecting to them. This SHOULD be done using the X-Try header.

An X-Try header can look like:

        X-Try:1.2.3.4:1234,3.4.5.6:3456

There MAY be a space after the colon and after each comma. There MAY 
be multiple X-Try headers in one header set. The header MAY end with
an extra comma.  The header MAY be formatted on several lines using
continuations.

Each item in the X-Try header gives the IP address of a servant
and its listening port number.  This is sometimes referred to as
being a "connection pong".  If the server sending the X-Try 
implements Pong-Caching, then the connection pongs being sent must be
fresh ones.

The normal status code for rejecting a connection because the servant
is busy is "503 " followed by "Busy" or another description string.


2.2 Gnutella Messages

Once a servant has connected successfully to the network, it 
communicates with other servants by sending and receiving Gnutella 
protocol messages. Each message is preceded by a Message Header with 
the byte structure given below.

Note 1: One IP packet may contain several Gnutella messages, and 
one Gnutella message may be split up on multiple IP-packets. This 
means one can never assume a Gnutella message ends when the chunk of 
data read from the socket ends.

Note 2: All fields in the following structures are in little-endian 
byte order unless otherwise specified.

Note 3: All IP addresses in the following structures are in IPv4 
format. For example, the IPv4 byte array

    0xD0     0x11     0x32     0x04
    byte 0   byte 1   byte 2   byte 3

represents the dotted address 208.17.50.4.


2.2.1 Message Header

The message header is 23 bytes divided into the following fields.

    Bytes:  Description:
    0-15    Message ID/GUID (Globally Unique ID)
    16      Payload Type
    17      TTL (Time To Live)
    18      Hops
    19-22   Payload Length

Message ID      A 16-byte string (GUID) uniquely identifying the
                message on the network. 
                        
                Servants SHOULD store all 1's (0xff) in byte 8 of the
                GUID.  (Bytes are numbered 0-15, inclusive.) This 
                serves to tag the GUID as being from a modern 
                servant.
        
                Servants SHOULD initially store all 0's in byte 15 of
                the GUID. This is reserved for future use.

                The other bytes SHOULD have random values.

Payload         Indicates the type of message
Type            0x00 = Ping
                0x01 = Pong
                0x02 = Bye
                0x40 = Push
                0x80 = Query
                0x81 = Query Hit

                Other Gnutella messages can be used, but if so the
                servant MUST first make sure that the remote host 
                supports this new message type. This can be done 
                using handshaking headers.

TTL             Time To Live. The number of times the message 
                will be forwarded by Gnutella servants before it is 
                removed from the network. Each servant will decrement
                the TTL before passing it on to another servant. When
                the TTL reaches 0, the message will no longer be 
                forwarded (and MUST not).

Hops            The number of times the message has been forwarded.
                As a message is passed from servant to servant, the
                TTL and Hops fields of the header must satisfy the 
                following condition:
                TTL(0) = TTL(i) + Hops(i)
                Where TTL(i) and Hops(i) are the value of the TTL and
                Hops fields of the message, and TTL(0) is maximum 
                number of hops a message will travel (usually 7).

Payload         The length of the message immediately following 
Length          this header. The next message header is located 
                exactly this number of bytes from the end of this 
                header i.e. there are no gaps or pad bytes in the 
                Gnutella data stream. Messages SHOULD NOT be larger
                than 4 kB.

The Payload Length field is the only reliable way for a servant to 
find the beginning of the next message in the input stream. 
Therefore, servants SHOULD rigorously validate the Payload Length 
field for each message received.  If a servant becomes out of synch 
with its input stream, it SHOULD close the connection associated with
the stream since the upstream servant is either generating, or 
forwarding, invalid messages.

Abuse of the TTL field in broadcasted messages (Query) will lead to 
an unnecessary amount of network traffic and poor network 
performance.  Therefore, servants SHOULD carefully check the TTL 
fields of received query messages and lower them as necessary.  
Assuming the servant's maximum admissible Query message life is 7 
hops, then if TTL + Hops > 7, TTL SHOULD be decreased so that TTL + 
Hops = 7.  Broadcasted messages with very high TTL values (>15) 
SHOULD be dropped.

Immediately following the message header, is a payload consisting 
of one of the following messages.


2.2.2 Ping (0x00)

Ping messages MAY contain a GGEP extension block (see Section 2.3),
but no other payload.


2.2.3 Pong (0x01)

Pong messages contains information about a Gnutella host. The 
message has the following fields

    Bytes:  Description:
    0-1     Port number. The port number on which the responding
            host can accept incoming connections.
    2-5     IP Address. The IP address of the responding host.
            Note: This field is in big-endian format.
    6-9     Number of shared files. The number of files that the
            servant with the given IP address and port is sharing
            on the network.
    10-13   Number of kilobytes shared. The number of kilobytes
            of data that the servant with the given IP address and
            port is sharing on the network.
    14-     OPTIONAL GGEP extension block. (see Section 2.3)

Pong messages are only sent in response to an incoming Ping 
message. It is valid for more than one Pong message to be sent in 
response to a single Ping message. This enables host caches to send 
cached servant address information in response to a Ping request.

The Message ID of a Pong message MUST be the Message ID of the Ping 
message it is sent in reply to.

The fields specifying the number of shared files and the number of 
kilobytes shared was intended to allow one to measure the amount of 
data available on the network.  With a very large Gnutella network, 
and minimized Ping and Pong message traffic, this can no longer be 
done.  Still, these fields SHOULD be filled out correctly. 


2.2.4 Use of Ping and Pong messages

In early versions Gnutella, Ping messages were broadcasted over the
network. Pong messages were then routed back to the originator of 
the Ping message the same way as Query Hits messages are routed
(se section 2.2.7). That system consumed a lot of network bandwidth, 
so modern Gnutella servants cache Pong messages, or use other means 
of minimizing the bandwidth used by Ping and Pong messages. 

There are different systems for handling Ping and Pong messages, 
but what they have in common is:

    * When a Ping message is received (TTL>1 and it was at least one 
      second since another Ping was received on that connection), a 
      servant MUST, if possible, respond with a number of Pong 
      Messages. These pongs MUST have the same message ID as the 
      incoming ping, and a TTL no lower than the hops value of the 
      ping. The number of pongs returned may vary, but 10 is a 
      reasonable number. Servants that are able to accept incoming 
      Gnutella SHOULD reply to these Ping messages.

    * The pongs sent SHOULD have a good quality. That includes
      high probability that they are connectable and a good spread 
      of hosts from across the network

    * The bandwidth used by Ping and Pong messages SHOULD be 
      minimized. Servants MUST never output very high quantities of 
      Ping and Pong messages.

    * An incoming Ping message with TTL = 1 and Hops = 0 or 1 is 
      used to probe the remote host of a connection, and MUST 
      always be replied to with a pong having information about the 
      host who received the ping.

    * An incoming Ping message with TTL = 2 and Hops = 0 is a 
      "Crawler Ping" used to scan the network. It and SHOULD be 
      replied to with pongs containing information about the host 
      receiving the ping and all other hosts it is connected to. The
      information about neighbour nodes can be provided either by 
      creating pongs on their behalf, or by forwarding the ping to 
      them, and forward the pongs returned to the crawler.

Servants fulfilling these requirements MUST provide a the header
"Pong-Caching: 0.1" (or a higher number if a later version is used) 
during the handshake. That allows other nodes to know if pong 
caching in any form is supported. Note that this applies to servants 
do not really cache pong messages as well, as long as the rules above
applies. Servants are strongly RECOMMENDED to follow the rules above,
and provide the Pong-Caching header.

When storing or forwarding Pong messages, any GGEP payload SHOULD be
included. When sending a Ping message, one cannot know if it will 
reach only the neighbour host, or many hosts on the network. It 
depends on what system for handling Ping and Pong messages other 
servants are using. Servants MUST NOT make assumptions of how far 
a Ping message (and its payload) will reach.


2.2.4.1 A simple pong caching scheme

This is one system for handling Ping and Pong messages. There are 
others available (see sect. 2.2.4.2), and any system that abides to 
the rules in sect. 2.2.4 is ok.

For each connection an array of Pong Messages are stored. 10 may 
be a good number. When a pong comes in, it overwrites the oldest 
stored pong in array of he connection the pong came from. The 
information that must be stored for each pong is:

 * IP Address
 * Port number
 * Number of files shared
 * Number of kilobytes shared
 * GGEP extension block (if present)
 * Hops value, i.e. how far away on the network the host using the 
   stored address is

When a Ping message, called P, is received over connection C, and
it has been at least one second since last time a ping was received 
over C, the servant will return a number of pongs (10 for example)
from its stored pongs. The pongs will be pick from all connections
except from C, since it would be no good sending pongs back where 
they came from. A servant should also return a pong with information 
about itself, if it can accept incoming connections.

The outgoing pong will have the same message ID as P, not the message
ID it had when the pong was received. The Hops is set to the stored 
hops value + 1, and TTL so that TTL+Hops=7. If the TTL is less than 
P's Hops value, the current stored pong will not be sent. This also 
means that pongs whose Hops value already is 7 will not be propagated
any further.

Exactly how to select which of the stored pongs to send in response 
to an incoming ping is up to each servant. A good idea is to pick 
pongs from different connections and with varying stored Hops values.

To keep the cache fresh, a ping (TTL=7, Hops=0) is sent over all 
connections at small interval (like every 3 seconds). This look like
very often, but remember that the neighbour servants will just 
respond with pongs from its own cache. The short time ensures that 
pongs are always fresh. To neighbour hosts who has not indicated 
that they support pong caching (using the Pong-Caching handshaking 
header), one ping per minute might be a better number.

Incoming pings with TTL=1 and Hops=0 or 1 (see above section 2.2.4) 
is replied to with a single pong containing information about the 
local host. Pings with TTL=2 and Hops=0 are replied to with one pong 
about the local host, and one about each other host the local host is
connected to. Information about the neighbour hosts is retrieved when
a new connection is started by sending a TTL=1, Hops=0 ping and 
storing the pong returned. This can be done using handshaking headers
instead.

The bandwidth used by this scheme is very limited. Assuming a ping is
sent every 3 seconds and that 10 pongs are returned to every ping. 
Since a (without extensions) is 23 bytes and a pong (without 
extensions) is 37 bytes, the amount of bandwidth used per connection 
is (23+10*37)/3 = 131 bytes/sec/connection. If extensions are used in
ping and/or pong messages, the bandwidth usage will increase, but 
will still be kept on an acceptable level. If the bandwidth usage 
must re decreased further, the interval between update pings could be
increased.


2.2.4.2 Other pong caching schemes

A slightly more advanced scheme for pong caching is available at
http://www.limewire.com/index.jsp/pingpong

A different, but compatible scheme can be found at
http://groups.yahoo.com/group/the_gdf/files/Proposals/PONG/Variants/
        pingreduce.txt

Other schemes might have been created after this was written.


2.2.5 Query (0x80)

Since Query messages are broadcasted to many nodes, the total size 
of the message SHOULD not be larger than 256 bytes. Servants MAY drop
Query messages larger that 256 bytes, and SHOULD drop Query messages 
larger than 4 kB.

A Query message has the following fields:

Bytes:  Description:
0-1     Minimum Speed. The minimum speed (in kb/second) of servants
        that should respond to this message. A servant receiving a 
        Query message with a Minimum Speed field of n kb/s SHOULD 
        only respond with a Query Hit if it is able to communicate at
        a speed >= n kb/s.
        
2-      Search Criteria. This field is terminated by a NUL (0x00).

        See section 2.2.7.3 for rules and information on how to 
        interpret the Search Criteria
        
Rest    OPTIONAL extensions block. The rest of the query message is
        used for extensions to the original query format. The allowed
        extension types are GGEP, HUGE and XML (see Section 2.3 and 
        Appendixes 1 and 2).
        
        If two or more of these extension types exist together, 
        they are separated by a 0x1C (file separator) byte. Since 
        GGEP blocks can contain 0x1C bytes, the GGEP block, if 
        present, MUST be located after any HUGE and XML blocks.
        
        The type of each block can be determined by looking for the 
        prefixes "urn:" for a HUGE block, "<" or "{" for XML and 0xC3 for
        GGEP.

        The extension block SHOULD NOT be followed by a null (0x00) 
        byte, but some servants wrongly do that.


2.2.6 Query Hit

Query Hit messages has the following fields:

Bytes:  Description:
0       Number of Hits. The number of query hits in the result set 
        (see below).
        
1-2     Port. The port number on which the responding host can accept
        incoming HTTP file requests. This is usually the same port as
        is used for Gnutella network traffic, but any port MAY be 
        used.
        
3-6     IP Address. The IP address of the responding host. 
        Note: This field is in big-endian format.

7-10    Speed The speed (in kb/second) of the responding host.
        
11-     Result Set. A set of responses to the corresponding Query. 
        This set contains Number_of_Hits elements, each with the 
        following structure:
        
        Bytes:  Description:
        0-3     File Index. A number, assigned by the responding 
                host, which is used to uniquely identify the file 
                matching the corresponding query.

        4-7     File Size. The size (in bytes) of the file whose 
                index is File_Index.

        8-      File Name. The name of the file whose index is 
                File_Index. Terminated by a null (i.e. 0x00) 

        x       Extensions block. Allowed extension types are HUGE, 
                GGEP and plain text metadata. This field is 
                terminated by a null (0x00), even if there are no 
                extensions (resulting in a double null). Also, the
                extensions block itself MUST NOT contain any null 
                bytes.
        
                If two or more of these extension types exist 
                together, they are separated by a 0x1C (file 
                separator) byte. Since GGEP blocks can contain 0x1C 
                bytes, the GGEP block, if present, MUST be located 
                after any HUGE and plan text blocks.
                
                The type of each block can be determined by looking 
                for the prefixes "urn:" for a HUGE block, 0xC3 for 
                GGEP and anything else is probably plain text 
                metadata.
        
                Plain text metadata is intended to be displayed 
                directly to the user. It was first invented by 
                Gnotella (a now discontinued Gnutella servant) to tag
                MP3 files. Examples:
                "192 kbps 44 kHz 3:23"
                "120 kbps(VBR) 44kHz 3:55" (variable bitrate)
                Other plan text formats MAY be used.
        
x       RECOMMENDED extra block. This block is not required, but 
        strongly recommended. It is sometimes called EQHD, or 
        (incorrectly) just QHD. It has the following format:
        
        Bytes:
        0-3     Vendor Code. Four case-insensitive characters 
                representing a vendor code. For example "LIME" for 
                LimeWire. See registered codes and register yours at
                http://groups.yahoo.com/group/the_gdf/database?
                        method=reportRows&tbl=6 
                        (Requires GDF membership)

        4       Open Data Size. Contains the length (in bytes) of the
                Open Data field. Set to 2 in most current 
                implementations, and 4 in those that support XML 
                metadata outside GGEP (see Section 2.3 and Appendix 2). 
                The Open Data area MAY be larger to allow future 
                extensions.
        
        x       Open Data. Contains two 1-byte flags fields with the 
                following layout and in the specified order:
                
                bit:    Description:
                7,6     Reserved for future use
                5       flagGGEP
                4       flagUploadSpeed
                3       flagHaveUploaded
                2       flagBusy
                1       Reserved for future use
                0       flagPush

                The first flag byte can be viewed as an "enabler" for
                the flags in the second byte, the "setter".  Only
                those bits that were enabled must be considered by
                the servant as being valid.  This logic is reversed
                for flagPush, which is set in the first byte and
                enabled in the second.  The enabling byte allows
                you to know which flags are supported by a given
                servant.

                Bits 5,4,3,2 in the first byte MUST be set if and 
                only if the corresponding flag in the second byte is 
                meaningful.

                Bit 0 in the second byte MUST be set if and only 
                if the corresponding flag in the second byte is 
                meaningful. Yes, the order is reversed for this flag.

                flagGGEP is set is set if and only if the private
                data block (see below) contains a GGEP block.

                flagUploadSpeed is set if and only if the Speed field
                of the QueryHit message contains the highest 
                average transfer rate (in kbps) of the last 10 
                uploads. Otherwise Speed field contains the hosts 
                total upload speed as set by the user, and therefore 
                less reliable.

                flagHaveUploaded is set if and only if the servant 
                has successfully uploaded at least one file.

                flagBusy is set if and only if the all of the 
                servant's upload slots are currently full.

                flagPush is set if and only if the servant is 
                firewalled or cannot accept incoming TCP connections 
                for any other reason.

                The reserved flags MUST not be set, unless they are 
                used for a future extension.

                If XML metadata (Appendix 2) is included in the 
                current Query Hit message, the following 2 bytes of
                Open Data area will contain the size of the XML 
                block. The XML block itself is placed in the private 
                area (see below).

x       Private Data. Undocumented vendor-specific data. This field 
        continues till the servant Identifier, which uses the last 16
        bytes of the message.

        If the flagGGEP in the open data block is set, this block
        contains a GGEP (see Section 2.3) extension block. The GGEP 
        block starts with a 0xC3 byte. Any data before or after the 
        GGEP block is vendor-specific data, and MUST be ignored, if 
        not recognized.

        Servants are NOT RECOMMENDED to use the private data area for
        vendor specific data. Servants SHOULD use GGEP extensions 
        instead.

        If the Open Data area indicates an XML block is will also be 
        placed in the private area (see Appendix 2). Assuming that 
        the two bytes in the Open Data area specifies an XML block of
        m bytes, that block can be found by extracting the last m 
        bytes of the private area. Both GGEP and XML can exist in the
        same Private Data area, but XML SHOULD be implemented inside 
        GGEP.
        [TODO: How about the nul after the XMP block? What is it good for?]

Last 16 Servant Identifier. A 16-byte string uniquely identifying the
        responding servant on the network. This SHOULD be constant 
        for all Query Hit messages emitted by a servant and is 
        typically some function of the servant's network address. The
        servant Identifier is mainly used for routing the Push 
        Message (see below).


2.2.7 Use of Query and Query Hit

2.2.7.1 Forwarding and routing of Query and Query Hit messages

A servant SHOULD forward incoming Query messages to all of its 
directly connected servants, except the one that delivered the 
incoming Query. Servants using Flow control or Ultrapeers (sections 
3.1 and 3.2) will not always forward every Query over every 
connection.

A servant MUST decrement a message header's TTL field, and 
increment its Hops field, before it forwards the message to any 
directly connected servant. If, after decrementing the header's TTL 
field, the TTL field is found to be zero, the message MUST NOT 
be forwarded along any connection.

A servant receiving a message with the same Payload Message and 
Message ID as one it has received before, MUST discard the 
message. It means the message has already been seen.

QueryHit messages MUST only be sent along the same path that 
carried the incoming Query message. This ensures that only those 
servants that routed the Query message will see the QueryHit 
message in response. A servant that receives a QueryHit message
with  Message ID = n, but has not seen a Query message with 
Message ID = n SHOULD remove the QueryHit message from the 
network.


2.2.7.2 When and how to send new Query messages.

Query messages are usually sent when the user initiates a search. A
servant MAY also create Queries automatically, to find more locations
of a resource for example. If doing so the servant MUST be very 
careful not overload the network. A servant SHOULD not send more than
one automatic query per hour.

Servants SHOULD NOT allow the user to create a large amount of 
queries by repeatedly clicking on a button.

Servants SHOULD watch queries originating from its neighbours
(Hops=0) If those queries are too frequent, are duplicates or 
indicate bad servants behavior in any other way, the servants SHOULD 
drop those queries or even close the connection.

The TTL value of a new query created by a servant SHOULD NOT be 
higher than 7, and MUST NOT be higher than 10. The hops value MUST be
set to 0.


2.2.7.3 When and how to respond with Query Hit messages.

When a servant receives an incoming Query message it SHOULD match 
the Search Criteria of the query against its local shared files. 

The Search Criteria is text, and it has never been specified which 
charset that text was encoded with.  Therefore, servants MUST assume 
it is pure ASCII only.  If any byte with the 7th bit set (high bit) 
is found, then either there is a GGEP extension specifying the 
encoding used, or the servant SHOULD guess the proper encoding.
Most likely, it will be ISO-latin-1 or UTF-8.

Exactly how to interpret the Search Criteria is not specified 
either, but here are some guidelines for interoperability between 
servants:

The Search Criteria is a string of keywords.  A servant SHOULD only 
respond with files that has all the keywords.  It is RECOMMENDED to 
break up the words on any non-alphanumeric characters (anything but 
letters and numbers).  A space is the standard separator between 
words.

Servants MAY also require that all matching terms be present in the 
same number and order as in the query.

Regular expressions are not supported and common regexp "meta-
characters" such as "*" or "." will either stand for themselves or be
ignored. The matching SHOULD be case insensitive.  Empty queries or 
queries containing only 1-letter words SHOULD be ignored.

GGEP extensions MAY be used to provide details on how to parse the 
Search Criteria (such as specifying that regular expressions matching
should be used), but a servant can never be sure other servants will 
understand the GGEP extension.

Servants MAY ignore queries whose Search Criteria is shorter than 
a chosen length. The reason is to ignore too broad searches.

Query messages with TTL=1, hops=0 and Search Criteria="    " (four
spaces) are used to index all files a host is sharing. Servants 
SHOULD reply to such queries with all its shared files. Multiple 
Query Hit messages SHOULD be used if sharing many files. Allowed 
reasons not to respond to index queries include privacy and 
bandwidth. 

Query Hit messages MUST have the same Message ID as the Query message
it is sent in reply to. The TTL SHOULD be set to at least the hops
value of the corresponding query plus 2, to allow the Query Hit to
take a longer route back, if necessary. The TTL value MUST be at
least the hops value of the corresponding query, and the initial
hops value of the Query Hit message MUST (as usual) be set to 0.
Some servants use a TTL of (2 * Query_TTL + 2) in their replies to
be sure that the reply will reach its destination.  Replies with
high TTL level SHOULD be allowed to pass through.


2.2.8 Push (0x40)

A Push message has the following fields:
Bytes:  Description:

0-15    Servant Identifier. The 16-byte string uniquely identifying 
        the servant on the network who is being requested to push the
        file with index File_Index. The servant initiating the push 
        request MUST set this field to the Servant_Identifier 
        returned in the corresponding QueryHit message. This is 
        used to route the Push message to the sender of the Query 
        Hit message.

16-19   File Index. The index uniquely identifying the file to be 
        pushed from the target servant. The servant initiating the 
        push request MUST set this field to the value of one of the 
        File_Index fields from the Result Set in the corresponding 
        QueryHit message.

20-23   IP Address. The IP address of the host to which the file with
        File_Index should be pushed. This field is in big-endian 
        format.

24-25   Port. The port number the receiver of this message should 
        push to.

26-     OPTIONAL GGEP extension block. (see Section 2.3)

A servant may send a Push message if it receives a QueryHit 
message from a servant that doesn't support incoming connections. 
This might occur when the servant sending the QueryHit message is 
behind a firewall.  When a servant receives a Push message, it SHOULD
act upon the push request if and only if the servant_Identifier field
contains the value of its servant identifier.  The Message_Id field
in the Message Header of the Push message SHOULD not contain the same
value as that of the associated QueryHit message, but SHOULD contain 
a new value generated by the servant's Message_Id generation 
algorithm.

Push messages are forwarded back to the originator of the Query Hits 
message using the Servant Identifier value.  This means multiple Push
messages can have the same Servant Identifier.  Push messages MUST 
only be considered as duplicates if the Message ID in the header is 
the same.  Since Push messages are not broadcasted, duplicate 
messages should be very rare.


2.2.9 Bye (0x02)

The Bye message is an OPTIONAL message used to inform the 
servant you are connected to that you are closing the connection. 

Servants supporting the Bye message MUST indicate that by sending 
the following header in the handshaking sequence:

        Bye-Packet: 0.1

Servants MUST NOT send Bye messages to hosts that has not indicated 
support using the above header.  Future versions will be backwards 
compatible, so Bye messages MAY also be sent to hosts providing the 
above header with a later version number.

A Bye packet MUST be sent with TTL=1 (to avoid accidental propagation
by an unaware servant), and hops=0 (of course).

A servant receiving a Bye message MUST close he connection 
immediately. The servant that sent the packet MUST wait a few 
seconds for the remote host to close the connection before closing 
it.  Other data MUST NOT be sent after the Bye message.  Make sure 
any send queues are cleared. 

The servant that sent by Bye message MAY also call shutdown() with 
'how' set to 1 after sending the Bye message, partially closing the 
connection.  Doing a full close() immediately after sending the Bye 
messages would prevent the remote host from possibly seeing the Bye 
message.

After sending the Bye message, and during the "grace period" when
we don't immediately close the connection, the servant MUST read
all incoming messages, and drop them unless they are Query Hits
or Push, which MAY still be forwarded (it would be nice to the
network).  The connection will be closed as soon as the servant
gets an EOF condition when reading, or when the "grace period"
expires.

A Bye message has the following fields:
Bytes:  Description:

0-1     Code. The presence of the Code allows for automated processing
        of the message, and the regular SMTP classification of error 
        code ranges should apply. Of particular interests are the 
        200..299, 400..499 and 500..599 ranges.
        Here is the general classification ("User" here refers to the 
        remote node that we are disconnecting from):

        2xx     The User did nothing wrong, but the servant chose to 
                close the connection: it is either exiting normally 
                (200), or the local manager of the servant requested 
                an explicit close of the connection (201).

        4xx     The User did something wrong, as far as the servant is
                concerned. It can send packets deemed too big (400), 
                too many duplicate messages (401), relay improper 
                queries (402), relay messages deemed excessively long-
                lived [hop+TTL > max] (403), send too many unknown 
                messages (404), the node reached its inactivity 
                timeout (405), it failed to reply to a ping with TTL=1
                (406), or it is not sharing enough (407).

        5xx     The servant noticed an error, but it is an "internal" 
                one. It can be an I/O error or other bad error (500), 
                a protocol desynchronization (501), the send queue 
                became full (502).

2-      NULL-terminated Description String. The format of the String 
        is the following (<cr> refers to "\r" and <lf> to "\n"):

                Error message, as descriptive as possible

        or optionally, something more qualified, with HTTP-like 
        headers giving out more information:

                Error message, as descriptive as possible<cr><lf>
                Server: some server/version<cr><lf>
                X-Gnutella-XXX: some specific Gnutella header<cr><lf>
                for instance telling the host about alternate<cr><lf>
                nodes it could connect to<cr><lf>
                <cr><lf>

        The presence of a <cr><lf> at the end of message indicates 
        that HTTP-like headers are present.  The absence of any 
        <cr><lf> indicates that the short error message form was used.

        Unless circumstances making that impossible (urgent 
        disconnection due to a memory fault), the HTTP-like headers 
        version SHOULD be used, with at least a Server: header, 
        allowing better tracing and debugging.

For further information about the Bye message, please refer to the
original documentation located at:
http://groups.yahoo.com/group/the_gdf/files/Proposals/BYE/


2.3 GGEP Extension blocks

The Gnutella Generic Extension Protocol (GGEP) allows arbitary 
extensions in Gnutella messages. A GGEP block is a framework for 
other extensions. If you wish to implement a new extension to a 
packet, you MUST do so inside GGEP. Some extensions that were 
invented before GGEP (XML metadata for example) are allowed to 
existoutside GGEP. 

Servants are RECOMMENDED to implement GGEP.  However, all servants 
MUST pass on GGEP extension blocks inside Gnutella messages. servants
that have support the forwarding of all packets that contain GGEP 
extensions (whether or not they can process them), MUST include a new
header in the Gnutella 0.6 connection handshake indicating this 
support.  This will allow other servants to know what types of 
packets this servant can accept.  The format of this header is 

GGEP : majorversion.minorversion

As the current version of GGEP is 0.5 when this was written the 
header would be

GGEP: 0.5

Servants SHOULD remove any GGEP blocks from Ping, Pong and Push 
messages before sending those messages to hosts that have not 
indicated GGEP support.

For the original GGEP documentation see
http://groups.yahoo.com/group/the_gdf/files/Proposals/GGEP/


2.3.1 GGEP Format

A GGEP block always starts with a magic byte used to help distinguish
GGEP extensions from legacy data which may exist.  It must be set to 
the value 0xC3.

When a GGEP block is used between the nulls in a result in a Query 
Hits message, it is not allowed to contain any null bytes. This 
requires some special tricks in the field format.

The magic byte is followed by any number of extensions. They SHOULD 
be processed in the order in which they appear. The following is the 
format of each extension:

Bytes used:     Field Name:
1               Flags
1-15            ID
1-3             Data Length
x               Extension Data

Flags:          These are options which describe the encoding of the 
                extension header and data.

                Bit    Name             
                7      Last Extension.  When set, this is the last 
                       extension in the GGEP block.

                6      Encoding.  The value contained in this field 
                       dictates the type of encoding which should be 
                       applied to the extension data (after possible 
                       compression).

                       0 = There is no encoding on the data. 
                       1 = The data is encoded using the COBS scheme.

                       Details about the COBS encoding scheme can be 
                       found at http://www.acm.org/sigcomm/sigcomm97/
                                    papers/p062.pdf  

                5      Compression.  The value contained in this 
                       field dictates the type of compression that 
                       should be applied to the extension data. 

                       0 = The extension data has not been compressed.
                       1 = The extension data should be decompressed 
                           using the deflate algorithm. 

                       One should only compress data if doing so will
                       make a material difference in the resulting 
                       packet size.                       

                       Details about the Deflate compression scheme 
                       may be found at http://www.gzip.org/zlib/
                       and http://www.faqs.org/rfcs/rfc1951.html 

                4      Reserved. This field is currently reserved for
                       future use.  It must be set to 0. 

                3-0    ID Len Value 1-15 can be stored here.  Since 
                       this will not be zero, it ensures this byte 
                       will not be 0x0. 

ID:             The raw binary data in this field is the extension ID.
                The length of this field can range between 1 and 15 
                bytes, and is determined by the Flags field. See 
                section 2.3.2 below on suggestions and rules for 
                creating extension IDs.  No byte in the extension 
                header may be 0x0.

Data Length:    This is the length of the raw extension data.  Please
                note that most Gnutella clients will drop messages, 
                and possibly connections if the message size is 
                larger than a certain threshold (which varies 
                according to message type).  Please pay attention to 
                these limits when creating and bundling new 
                extensions. 

                This field uses an encoding technique that ensures 
                that 0x0 is never the value of any byte.  Steps were 
                also taken to ensure that the encoding is compact. In
                this technique, a length field is the concatenation 
                of length chunks.  The format of each length chunk 
                (which contains 6 bits of length info) is described 
                in bit level below:

                Format:
                76543210
                MLxxxxxx
                        
                M = 1 if there is another length chunk in the 
                sequence, else 0

                L = 1 if this is the last length chunk in the 
                sequence, else 0

                xxxxxx = 6 bits of data.

                01aaaaaa ==> aaaaaa (2^6 values = 0-63)

                10bbbbbb 01aaaaaa ==> bbbbbbaaaaaa 
                (2^12 values = 0-4095)

                10ccccccc 10bbbbbb 01aaaaaa ==> ccccccbbbbbbaaaaaa 
                (2^18 values = 0-262143)

                Boundary Cases:
                0      =                       01 000000b = 0x40 
                63     =                       01 111111b = 0x7f 
                64     =            10 000001  01 000000b = 0x8140 
                4095   =            10 111111  01 111111b = 0xbf7f 
                4096   = 10 000001  10 000000  01 000000b = 0x818040
                262143 = 10 111111  10 111111  01 111111b = 0xbfbf7f
                
                As you see, when the bits are concatenated, the 
                number is in big endian format.

Extension Data  The actual extension data.  The format of this field 
                varies between extensions.  A servant that does not 
                recognize and extension will not be able to parse the
                Extension data, but since the length of this field is
                specified by Data Length, it can still skip to the 
                next extension.  Note that extensions MAY be empty.


2.3.2 Creating Extension IDs

The Extension ID field in the GGEP header is a binary field 
consisting of between 1 and 15 bytes.  It cannot contain the 
byte 0x0, and one must be able to compare IDs with a simple binary 
comparison.  Aside from those rules, GGEP does not mandate any 
particular format, but does encourage the creation of short IDs that 
are free from conflicts.  One should also note that Extension IDs are
meant to be consumed by machines.  Still, the following rules apply.

GDF Registered Extensions:
Any Extension ID of less than 4 bytes MUST be stored in the 
appropriate GDF database.  Any Extension ID of less than 3 bytes must
also be approved by the GDF.  The format of the extension data must 
also be registered.

VendorID Extensions
This simple technique allows for the creation of ExtensionIDs based 
upon uses the following format VendorID.BinaryID

VendorID for a Gnutella servant is a 4 byte value that has been 
registered in the GDF Peer Codes database.  In the QueryHit 
Descriptor, this case is case insensitive.  With ExtensionIDs, the 
case matters, as one must be able to perform a binary comparison on 
the ID.  This means an ExtensionID of "SWAP.1" and "swap.1" are 
different, but both "belong" the vendor who ones the code "SWAP."

This technique may be good for experimental and strictly vendor-
specific extensions, but should be avoided for extension that may be
useful for other vendors as well. Marking an extension by a vendor ID
makes it harder for other vendors to use the extension in their 
servants.

Extension implementers SHOULD publish the ID, format, and expected
data size for their extensions in the GDF database called 
"GGEP Extensions." located at
http://groups.yahoo.com/group/the_gdf/database?method=reportRows&tbl=10
(Requires GDF membership)


3 Protocol Usage

Apart from the protocol definition in section 2, there are also 
some guidelines on how to use the protocol. These are not absolutely
necessary to participate in the network, but very important for an
effective network.


3.1 Flow Control

It is very important that all servants have a system for regulating 
the data that passes through a connection.

The most simple way is to close a connection if it gets overloaded.
A better way is to drop broadcasted packets to reduce the amount of
bandwidth used.  A much better way is to do the following:

Implement an output queue, listing pending outgoing messages in
FIFO order.  As long as the queue is less than, say, 25% of its
max size (in bytes queued, not in amount of messages), do nothing.
If the queue gets filled above 50%, enter flow-control mode.  You
stay in flow-control mode (FC mode for short) as long as the queue
does not drop below 25%.  This is called "hysteresis".

The queue size SHOULD be at least 150% of the maximum admissible
message size.

In FC mode, all incoming queries on the connection are dropped.
The rationale is that we would not want to queue back potentially
large results for this connection since it has a throughput problem.

Messages to be sent to the node (i.e. queued on the output queue)
are prioritized:

* For broadcasted messages, the more hops the packet has traveled,
  the less prioritary it is.  Or the less hops, the more prioritary.
  This means your own queries are the most prioritary (hops = 0).

* For replies (query hits), the more hops the packet has traveled,
  the more prioritary it is.  This is to maximize network usefulness.
  The packet was relayed by many hosts, so it should not be dropped
  or the bandwidth it used would become truly wasted.

* Individual messages are prioritized thusly, from the most
  prioritary to the least: Push, Query Hit, Pong, Query, Ping.
  The Bye message being special, it is always sent (i.e. the queue
  cannot be in FC mode since it needs to be cleared before sending
  Bye).

Normally, all messages are accepted.  However, when the message to
enqueue would make the queue fill to more than 100% of its maximum
size, any queued message less prioritary in the queue is dropped.
If enough room could be made, enqueue the packet.  Otherwise, if the
message is a Query, a Pong or a Ping, drop it.  If not, send a
Bye 502 (Send Queue Overflow) message.

Other known flow-control algorithms:

SACHRIFC is a simple, but still very effective flow control system 
that drops less important packets first. It can be found at:
http://groups.yahoo.com/group/the_gdf/message/5726

A more advanced flow control system can be found at:
http://www.grouter.net/gnutella/


3.2 Network Structure

[TODO: Ultrapeer is so important that the information required to 
implement it will be included here.]

Originally, all Gnutella nodes were connected to each other randomly.
It worked fine for users with broadband connections, but not for 
users with slow modems. That problem can be solved by organizing the 
network in a more structured form. 

The Ultrapeer system has been found effective for this purpose. It is
a scheme to have a hierarchical Gnutella network by categorizing the 
nodes on the network as client-nodes and super-nodes. A client-node 
keeps only a small number of connections open, and that is to super-
nodes. A super-node acts as a proxy to the Gnutella network for the 
client nodes connected to it. This has an effect of making the 
Gnutella network scale, by reducing the number of nodes on the 
network involved in message handling and routing, as well as reducing
the actual traffic among them.

A super-node only forwards a query to a leaf-node if it believes to 
leaf-node can answer it, and leaf-nodes never relay queries between 
super-nodes. Super-peers are connected to each other and to "normal" 
Gnutella hosts (hosts that do not implement the Ultrapeer system).


3.2.1 Query Routing Protocol

The Query Routing Protocol (QRP for short) is an essential part of
the Ultrapeer specification: it governs how the Ultrapeer will filter
queries and only forward those to the leaf nodes most likely to have
a match.  This is done without even knowing the resource names, by
looking the query words through a big hash table, that is sent by the
leaf node to its Ultrapeer.

The aim of the QRP is to avoid forwarding a query that cannot match,
it is not to forward only those queries that will match.

The overall operation goes thusly:

* At the leaf node level:

  + Break all the resource names into individual words.  A word is
    made of a consecutive sequence of letters and digits.

  + Hash each word with a well-known hash function and insert a
    "present" flag in the corresponding hash table slot.
    Note that this hash table is a big array, and we don't store
    the key, only the fact that a key ended up filling some slot.
    All words are lower-cased and all accents are removed from
    them, i.e. "d�j�" is transformed into "deja", so that only
    ASCII characters remain.  Only those words that are made of
    at least 3 letters are retained.

  + All words are re-hashed with their trailing 1, 2, or 3 letters
    removed, provided the word length after such trimming is at
    least 3 letter long.  This is a simple attempt to remove plural
    from words.  Optionally, nodes can chop off more letters from the
    end, provided that each hashed word is at least 3 character long.

  + The "boolean vector" built at later stage is optionally 
    compressed, broken up in small messages, and sent mixed with 
    regular Gnet traffic to the ultrapeer.

* At the Ultrapeer level:

  + Until the whole "boolean vector" is received from a leaf node, 
    all queries are forwarded to that node.

  + When the "boolean vector" is fully received, it is going to be
    used as the Query Routing table for that leaf node: queries are
    broken into individual words, all accentuated letters are 
    removed.

  + For each leaf node with a Query Routing table:

    . Each word is then hashed and looked up in the Query Routing
      table.
  
    . Depending on the query matching rules (see 2.2.7.3), either ALL
      the words will be required to be found in the Query Routing, or
      only some of them, to declare a Query Routing Hit.

    . Only those queries that were declared a Hit at the previous
      stage will be forwarded to a given leaf node.

The remaining sections define the hashing function, the mechanism
used to build up the "boolean vector" and compress it, the protocol
to transmit the vector to the Ultrapeer, and finally give operating
hints for the table sizing.

[TODO: finish the QRP description --RAM]

The Ultrapeer specification can be found at:
http://www.limewire.com/developer/Ultrapeers.html

The Query Routing Protocol (QRP) used in Ultrapeer can be found at:
http://www.limewire.com/developer/query_routing/keyword%20routing.htm

It is RECOMMENDED that servants implement the Ultrapeer system, or 
another system for decreasing the bandwidth load on modem users.


4 File Transfer

4.1 Normal File Transfer

Once a servant receives a QueryHit message, it may initiate the 
direct download of one of the files described by the message's Result
Set. Files are downloaded out-of-network i.e. a direct connection 
between the source and target servant is established in order to 
perform the data transfer. File data is never transferred over the 
Gnutella network.

The file download protocol is HTTP. It is RECOMMENDED to use HTTP 1.1
(RFC 2616), but HTTP 1.0 (RFC 1945) can be used instead. The full 
specifications are available in those RFCs. The following includes 
only the basic things. The following examples assumes that HTTP 1.1 
is used. 

The servant initiating the download sends a request string on the 
following form to the target server:

    GET /get/<File Index>/<File Name> HTTP/1.1<cr><lf>
    User-Agent: Gnutella<cr><lf>
    Host: 123.123.123.123:6346<cr><lf>
    Connection: Keep-Alive<cr><lf>
    Range: bytes=0-<cr><lf>
    <cr><lf>

where <File Index> and <File Name> are one of the File Index/File 
Name pairs from a QueryHit message's Result Set. For example, if the 
Result Set from a QueryHit message contained the entry

    File Index: 2468
    File Size: 4356789
    File Name: Foobar.mp3

then a download request for the file described by this entry would be
initiated as follows:

    GET /get/2468/Foobar.mp3 HTTP/1.1<cr><lf>
    User-Agent: Gnutella<cr><lf>
    Host: 123.123.123.123:6346<cr><lf>
    Connection: Keep-Alive<cr><lf>
    Range: bytes=0-<cr><lf>
    <cr><lf>

Servants MUST encode the filename in GET requests according the 
standard URL/URI encoding rules. Servants MUST accept URL-encoded GET
requests. Since some old servants does not support encoding, servants
SHOULD accept non-encoded requests and MAY try a non-encoded requests
if a 404 Not Found error is returned for the initial request. 

The Host header is required by HTTP 1.1 and specifies what address 
you have connected to. It is usually not used by the receiving 
servant, but its presence is required by the protocol.

The allowable values of the User-Agent string are defined by the HTTP
standard. Servant developers cannot make any assumptions about the 
value here. The use of 'Gnutella' is for illustration purposes only. 

The server receiving this download request responds with HTTP 1.1 
compliant headers such as

    HTTP/1.1 200 OK<cr><lf>
    Server: Gnutella<cr><lf>
    Content-type: application/binary<cr><lf>
    Content-length: 4356789<cr><lf>
    <cr><lf>

The file data then follows and should be read up to, and including, 
the number of bytes specified in the Content-length provided in the 
server's HTTP response.

Note: Servants SHOULD use HTTP version 1.1 for file transfer, but 
some support only HTTP version 1.0. Servants MUST accept incoming 
HTTP/1.0 requests, and SHOULD retry with HTTP/1.0 if the remote host
is not HTTP/1.1 compliant.

Though it is strongly RECOMMENDED to have full HTTP/1.1 
support, some servants do not. The most important features for 
Gnutella, range requests and Persistent Connections MUST be 
supported. Some old servants, however, do not.

Range requests are on the form

    GET /get/2468/Foobar.mp3 HTTP/1.1<cr><lf>
    User-Agent: Gnutella<cr><lf>
    Host: 123.123.123.123:6346<cr><lf>
    Connection: Keep-Alive<cr><lf>
    Range: bytes=4932766-5066083<cr><lf>
    <cr><lf>

Note that the Range header does not have to specify both start and 
end positions. The response is on the form

    HTTP/1.1 206 Partial Content<cr><lf>
    Server: Gnutella<cr><lf>
    Content-Type: audio/mpeg<cr><lf>
    Content-Length: 133318<cr><lf>
    Content-Range: bytes 4932766-5066083/5332732<cr><lf>
    <cr><lf>

The Connection header tells the remote host if the connection should 
be closed when the transfer is finished or not. "Connection: close" 
means that the connection MUST be closed after the transfer. 
"Connection: Keep-Alive" or no Connection header means the connection
MUST be kept open. The client MAY then issue another request for 
another range or another file. The request MAY be sent before the 
previous transfer is finished. Persistent Connections is described in
section 8.1 of RFC 2616. 

Headers unknown to the servant MUST be quietly ignored.

Servants SHOULD NOT attempt to download multiple files from the same 
source at once. Files SHOULD be locally queued instead.

Servants are also RECOMENDED to use and understand the http extension
described in HUGE. (see Appendix 1)


4.2 Firewalled servants

It is not always possible to establish a direct connection to a 
Gnutella servant in an attempt to initiate a file download. The 
servant may, for example, be behind a firewall that does not permit
incoming connections to its Gnutella port. If a direct connection 
cannot be established, the servant attempting the file download may 
request that the servant sharing the file "push" the file instead. A 
servant can request a file push by routing a Push request back to the
servant that sent the QueryHit message describing the target file.
The servant that is the target of the Push request (identified by the
Servant Identifier field of the Push message) SHOULD, upon receipt
of the Push message, attempt to establish a new TCP/IP connection 
to the requesting servant (identified by the IP Address and Port 
fields of the Push message). If this direct connection cannot be 
established, then it is likely that the servant that issued the Push 
request is itself behind a firewall. In this case, file transfer 
cannot take place by the means of what is described in this document.

If a direct connection can be established from the firewalled servant
to the servant that initiated the Push request, the firewalled 
servant should immediately send the following:

    GIV <File Index>:<Servant Identifier>/<File Name><lf><lf>

Where <File Index> and <Servant Identifier> are the values of the 
File Index and Servant Identifier fields respectively from the Push 
request received, and <File Name> is the name of the file in the 
local file table whose file index number is <File Index>. The File 
Name MAY be url/uri encoded. The servant that receives the GIV (the
servant that wants to receive a file) SHOULD ignore the File Index 
and File Name, and request the file it wants to download. The 
servant that sent the GIV MUST allow the client to request any 
file, and not just the one specified in the Push message.  The GET 
request and the remainder of the file download process is identical 
to that described in the section 4.1 (Normal File Transfer) above.

The <Servant Identifier> is formatted as hexadecimal, and must
be read case-insensitively.  For instance:

    GIV 36:809BC12168A1852CFF5D7A785833F600/Foo.txt<lf><lf>
    GIV 124:d51dff817f895598ff0065537c09d503/Bar.html<lf><lf>

If the TCP connection is lost during a Push initiated file transfer, 
it is strongly RECOMMENDED that the servant who initiated the TCP 
connection (the servant providing the file) attempt to re-connect. 
That is important, since the servant receiving the file might not be
able to get another Push message to the servant providing the file.


4.3 Busy Servants

Servants whose upload bandwidth is already saturated with transfers 
MAY reject a download request by returning the 503 response code. 
Servants MAY simply have a fixed number of available upload slots, 
but SHOULD use a system that utilizes upload bandwidth better. 
Allowing new downloads as long as 20% of total upload bandwidth is 
unused is one possibility.

Busy servants receiving a Push message SHOULD connect to the host 
requesting a push, and return the 503 Busy code when the remote host 
has requested the file. 

Servants MAY try requesting a download again when the servant 
providing the file returns the busy code, but MUST not do so more 
often than once per minute and file source. That means a Servant 
MUST NOT open new connections to a remote host more than once per 
minute. Servants SHOULD prevent other servants breaking the above 
rule from increasing their chanses to downlaoad a file. This can for 
example be archived by refusing any connection attempts from a 
particular host if a download request has been denies less than 50 
seconds ago, or by adding hosts that request too often to a ban list.

Servants MAY use queuing systems to allow downloaders to stand in 
queue to download a file, but that is outside the scope of this 
document.

If a transfer is interrupted, the serving servant SHOULD keep the 
allocated slot/bandwidth reserved for at least one minute. The 
downloader would then be allowed to reconnect and resume the 
transfer.


4.4 Sharing

Servants that are able to download files MUST also be able to share 
files with others. Servants SHOULD encourage users to share files.

Servants SHOULD attempt to prevent programs that are not able to 
share files from downloading files. This means that servant SHOULD 
not allow uploads to web browsers and download accelerators. The 
User-Agent http header tells what program the remote host is running.
Many servants return a html page instead, telling the user how 
Gnutella works, and where to get a servant.

Servants MUST NOT give precedence to other users using the same 
servant. They MUST answer Query messages and accept file download
requests using the same rules for all servants. Servants MAY,
however, attempt to block servants that do not follow the rules in 
this protocol in way that seriously hurts others experience of the
Gnutella network.

Servants SHOULD, by default, share the directory where downloaded 
files are placed. Servants SHOULD also share new downloaded files 
without waiting for the servant to be restarted. Servants SHOULD 
avoid changing the index numbers of shared files.

Servants MUST NOT share partially downloaded (incomplete) files as if
they were complete. This is often done by using a separate directory 
for incomplete downloads. When the download finishes, the file is 
move to the downloads directory (that should be shared). Partial 
files MAY be shared in a way the makes it clear to other servants 
that the file is incomplete. 


5 Security Considerations

[TODO: This section is very incomplete. Any suggestions are welcome.]


5.1 Threats against individual Gnutella participants

[TODO: Write about threats against individual Gnutella participants Such as flooding, 
fake files, DoS, etc. Flooding hostcache with faked pongs]
[TODO: How one can protect oneself and other gnet users]

Inexperienced users might share sensitive information, such as cookie
or password files, on the Gnutella network. Servants SHOULD warn 
users who try share such information.

Malicious file, such as viruses and trojans, might be shared on the 
Gnutella network by malicious or unexpecting users. Servants SHOULD 
encourage users to scan downloaded files for viruses etc. but this is
outside the scope of the Gnutella protocol.


5.2 Threats against the Gnutella network

[TODO: Write about threats against the Gnutella network
Such as query flooding, fake files, DoS, fake random pongs etc.]
[TODO: What a servant should do to protect the network]

[TODO: A solution is fair sharing of bandwidth between connections and message types]

5.3 Threats against third parties

[TODO: Write about threats against third parties. Such (D)DoS, etc.]
[TODO: How to avoid]

Would it be possible to use the power of the Gnutella network to 
attack internet hosts? That issue will be discussed in this section.

The ways of doing so an attacker might try is:

    * Responding to Ping messages with Pong messages containing the
      IP address and port of the target host. This would cause other
      Gnutella servants to attempt to connect to the target host, 
      thinking it is a Gnutella host.
      [TODO: Can this really be an effective attack? Would the target receive that many connection attempts?]

    * Responding to Query messages with Query Hit messages containing 
      the IP address and port of the target host. 
      [TODO: This might actually be an issue. How about recommending servants to attempt to drop faked QHs somehow?]

    * Responding to Query Hit messages with Push messages containing 
      the IP address and port of the target host. This would cause 
      the servant receiving the Push message to attempt a connection
      to the target host. It would, however, not be more than one 
      connection per Push message, so it could not be used to launch 
      large Denial-of-Service attacks. 
      [TODO: Is that correct, or is it a real threat?]


6 Credits

The authors would like to thank:
    [TODO fill]

New features mentioned in this document, not present in the original 
0.4 Gnutella specification document, published by the now defunct 
Clip2 company should be credited thusly:

    0.6 Handshaking Protocol            LimeWire LLC
    X-Try Header                        Mike Green
    Bye Packet                          Raphael Manfredi
    Pong Caching                        LimeWire LLC and others
    EQHD Block                          Free Peers Inc.
    GGEP                                Jason Thomas
    HUGE                                Gordon Mohr
    Ultrapeers                          LimeWire LLC
    Query Routing Protocol              LimeWire LLC
    XML queries                         LimeWire LLC
    Negotiation of Gnet Compression     Raphael Manfredi

    [TODO missing?]


Appendix 1: HUGE (Hash/URN Gnutella Extensions)

HUGE is used to provide capability for hashes (numbers uniquely 
identifying files) and other urn:s to Gnutella. HUGE SHOULD be 
implemented inside GGEP (Section 2.3), but can also be used as a 
stand-alone extension block. When inside GGEP, the GGEP extension-
identifier for HUGE info is 'u'. When used as a stand-alone 
extension, HUGE blocks start with "urn:". There MAY be multiple HUGE 
blocks in one Gnutella message (separated by 0x1C bytes).

HUGE also extends the direct file transfer between hosts, to allow
communication or urn:s, and to build "download meshes" that inform
servants of other locations of a file.

The HUGE documentation is available at:
http://groups.yahoo.com/group/the_gdf/files/Proposals/HUGE

Servants are RECOMMENDED to implement HUGE.


Appendix 2: XML

XML blocks can be used to issue rich queries and to include metadata 
about files in query hit messages. XML SHOULD be implemented inside 
GGEP, but can also be used as a stand-alone extension block. 

If there is a "{deflate}" before the XML block it is compressed using
the defalte algorithm. "{plaintext}" or no such prefix means 
uncompressed plaintext XML. Any other prefix mean the XML block is 
compressed using an algorithm that is not known at this time. 

XML blocks can be recognized by them starting with "<" or "{".
(Do not rely on "<?xml", because this string is optional and can be 
omitted).


XML is made to be extendable and compatible between different XML 
schemas, so the exact format does not need to be specified here.
The only XML system currently in use (02-04-13), if developed by 
LimeWire and available in two parts at:
http://www.limewire.com/index.jsp/metainfo_searches
http://www.limewire.com/developer/MetaProposal2.htm


Appendix 3: Finding a Gnutella host

In order to connect to the Gnutella network, a servant must find an
initial host to start its first connection to. Then, more hosts will 
supplied using Pong messages and X-Try headers. 

There are several permanent Gnutella hosts whose purpose is to 
provide a list of Gnutella hosts to any Gnutella servant connecting 
to them.  These are often called hostcaches. A list of known public 
hostcaches is available at:
http://groups.yahoo.com/group/the_gdf/database?method=reportRows&tbl=5
(Requires GDF membership)

To reduce the vulnerability, each Gnutella servant with a fairly 
large user base using hostcaches SHOULD have its own public 
hostcache.

There are several open source hostscache programs available. See for 
example:
http://groups.yahoo.com/group/the_gdf/message/1375

Since hostscaches can be overloaded, or go down, a servant MUST NOT 
be dependent on them. A popular technique is to keep a list of a few 
hundred hosts on the disc. The probability that at least one of 
them are running is very high, even when the list is several weeks 
old.


Appendix 4: When to open or accept new Gnutella connections

The number of Gnutella connections that is suitable depends on the 
speed the internet connection. Many servants selects a fixed number 
of connections based on the connection speed selected by the user. A
better way would be to automatically regulate the number of 
connections to make sure a suitable amount of bandwidth is used the 
Gnutella network traffic. 

A common technique is to open new outgoing connections until 2 or 3 
connections are running, and then wait for incoming connections to 
fill the remaining slots. If the host cannot accept incoming 
connections for some reason, the servant will have to open new 
connections until it has enough.

When opening a new connection a servants must select which address to
connect to. They SHOULD prefer hosts from Pong messages that are new 
(since it is more probable that the host can still accept incoming 
connections) and from Pong with a high Hops count (since it will 
reduce the amount of cycles on the network.

Servants MAY prefer other servants from the same vendor, or servants 
that have certain features, but MUST always make sure that network 
is not fragmented. It MUST always be possible for servants that 
follow this protocol to search and download files form each other. If
a servants is not full on connections it SHOULD NOT deny an incoming 
connection request unless the remote host is missing a very important
feature.


Appendix 5: Gnutella network traffic compression

Compression of the message traffic over a Gnutella connection is 
OPTIONAL. The following assumes the "deflate" scheme is used, but any
compression algorithm MAY be used.

Compression is meant to be done on a per-connection basis.  The 
"deflate" scheme is handled by the www.zlib.org library (the deflate/
inflate routines). This means there will be a compression dictionary 
and history per connection on both ends, meaning a good compression 
ability (compared to compressing each message individually).

However, this stream compression of the traffic means that we need to
individually compress the same packet on each connection, using a 
dedicated compressing state maintained per connection.

Because compression algorithms don't necessarily produce output when 
fed input (e.g. if you feed them with "aaaaa", they'll wait for the 
next char, and will do so until it's a "b" for instance), it is 
necessary to periodically direct them to flush output.  However, this
comes at a cost, because once flushed, data is inflatable by the 
decompressor, and this is done at the expense of sending the 
necessary dictionary information.

To negotiate compression, we're making full use of the 3-way 
handshaking. The idea is that decompression is fast, so it's OK to be
sent compressed data, but the compressing side must decide based on 
the resources it has available whether it will compress or not.

Basically, the side supporting decompression will say:

   Accept-Encoding: deflate<cr><lf>

Note that although we only specify "deflate" here, the servant MAY
advertise the set of various compression algorithms it knows,
subsequent items being separated by a ",".

And to accept compression, the other side acknowledges by sending:

   Content-Encoding: deflate<cr><lf>

The servant just picks the compression scheme it supports amongst
the ones advertised by the remote end in the Accept-Encoding line.
The Content-Encoding MUST contain only one value.

This also means that compression settings is asymmetric: a node can
send compressed data but receive uncompressed data.

Here's an example where both nodes support compression, comments
starting with "--", and ending <cr><lf> removed for clarity:

   GNUTELLA CONNECT/0.6
   Accept-Encoding: deflate       -- OK for reception of compressed data

       GNUTELLA/0.6 200 OK
       Accept-Encoding: deflate   -- I can also receive compressed data
       Content-Encoding: deflate  -- And I will send compressed data

   GNUTELLA/0.6 200 OK
   Content-Encoding: deflate      -- OK, will also compress data

   <compressed stream of Gnet messages>

Here's an example where compression will only be made on the 
transmission side of the first node (A is the node initiating the 
handshake, B is the node replying):

   GNUTELLA CONNECT/0.6
   Accept-Encoding: deflate      -- OK for reception of compressed data

       GNUTELLA/0.6 200 OK
       Accept-Encoding: deflate  -- I can also receive compressed data
                                 -- I refuse to compress data, sorry

   GNUTELLA/0.6 200 OK
   Content-Encoding: deflate     -- OK, I will compress data sent
                                 -- But I will receive uncompressed data

   <flow from A->B is compressed, flow from B->A is not>

Even though GGEP payloads (see Section 2.3) can be compressed, and 
this information is visible in the GGEP header, it is not advisable 
to decompress those payloads before sending them to the compressing 
layer.  The deflate algorithm does not expand already-compressed data
by a large factor and emits them as clearly marked non-compressible 
data (the overhead is limited to roughly 0.1%). If connection 
compression is widely used on the Gnutella network, individual GGEP 
extensions SHOULD NOT be compressed.


[TODO: Some spelling errors.]
[TODO: Many of the docs referred to are still drafts]

[TODO: RFCs have 72 char lines, and a 3 char left margin for most text blocks.
I intend to make lines max 69 chars, and text blocks will be indented 3 chars.]
[TODO: Break up into pages (like RFCs), but not before the final release (Or convert to XML)]

[HUGE should perhaps be integrated]
[The XML spec should not be integrated. We just specify where there is XML. The XML format that limewire uses is not a part of Gnutella. Anyone can use any XML format.]