Network Working Group T. Klingberg Request for Comments: NNNN R. Manfredi Category: Informational June 2002 Gnutella 0.6 Status of this Memo This is a draft. Copyright Notice Copyright (C) 2002, Tor Klingberg & Raphael Manfredi All Rights Reserved. Permission is granted to make verbatim copies of this document, provided the Copyright Notice is preserved. Rights of Free Implementation The authors of the various proposals that make up this document grant the rights to anyone to freely implement those proposals. Gnutella is an Open Protocol, where the specifications are public and free of any patent. Table of Contents 1 Introduction 1.1 Purpose 1.2 Requirements 1.3 Terminology 1.4 Extending the protocol 2 Protocol Definition 2.1 Initiating a Connection 2.2 Gnutella Messages 2.2.1 Message Header 2.2.2 Ping (0x00) 2.2.3 Pong (0x01) 2.2.4 Use of Ping and Pong messages 2.2.4.1 A simple pong caching scheme 2.2.4.2 Other pong caching schemes 2.2.5 Query (0x80) 2.2.6 Query Hit 2.2.7 Use of Query and Query Hit 2.2.7.1 Forwarding and routing of Query and Query Hit messages 2.2.7.2 When and how to send new Query messages. 2.2.7.3 When and how to respond with Query Hit messages. 2.2.8 Push (0x40) 2.2.9 Bye (0x02) 2.3 GGEP Extension blocks 2.3.1 GGEP Format 2.3.2 Creating Extension IDs 3 Protocol Usage 3.1 Flow Control 3.2 Network Structure 3.2.1 Query Routing Protocol (unfinished) 4 File Transfer 4.1 Normal File Transfer 4.2 Firewalled servants 4.3 Busy Servants 4.4 Sharing 5 Security Considerations 5.1 Threats against individual Gnutella participants 5.2 Threats against the Gnutella network 5.3 Threats against third parties 6 Credits Appendix 1 HUGE (Hash/URN Gnutella Extensions) Appendix 2 XML Appendix 3 Finding a Gnutella host Appendix 4 When to open or accept new Gnutella connections Appendix 5 Gnutella network traffic compression 1 Introduction 1.1 Purpose Gnutella is a decentralized peer-to-peer system. It allows the participants to share resources from their system for others to see and get, and locate resources shared by others on the network. Resources can be anything: mappings to other resources, cryptographic keys, files of any type, meta-information on keyable resources, etc. However, the semantics for locating and handling resources other than plain files are not specified in this document. Each participant launches a Gnutella program, which will seek out other Gnutella nodes to connect to. This set of connected nodes carries the Gnutella traffic, which is essentially made of queries, replies to those queries, and also other control messages to facilitate the discovery of other nodes. Users interact with the nodes by supplying them with the list of resources they wish to share on the network, can enter searches for other's resources, will hopefully get results from those searches, and can then select those resources amongst the results: if those resources are files, for instance, they can download them. But one can imagine other types of resources that, once fetched, will bring more than their content value. Resource data exchanges between nodes are negotiated using the standard HTTP protocol. The Gnutella network is only used to locate the nodes sharing those resources. This document is intended for readers with a fair knowledge of network programming, but do not require any previous Gnutella experience. Still, other implementations of this protocol will give useful information about implementation techniques that is not included in this document. A list of Gnutella programs can be found at http://www.gnutelliums.com 1.2 Requirements The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [34]. 1.2.1 The Gnutella Development Forum (the GDF) The Gnutella Development Forum is a good place to find more Gnutella documentation, proposals about changes and extensions and to discuss Gnutella development with other developers. The message archive is also a good source for information about the protocol and its implementation. Some of the links in this document requires membership in the Gnutella Development Forum. Everyone is, of course, allowed to become a member. The GDF is located at http://groups.yahoo.com/group/the_gdf There are many other forums for discussing Gnutella development as well. 1.3 Terminology Servant A program participating in the Gnutella network is called a servant. The words "peer", "node" and "host" have similar meanings, but refers to a network participant rather than a program. When a servant have a clear client or server role the words "client" or "server" may be used. The word "client" is sometimes used as a synonym for servant. Some other documents use the word "servent" instead of servant, as a contraction of "SERVer" and "cliENT". Message Messages are the entity in which information is transmitted over the network. Sometimes the word "packet" is used with the same meaning. Some other documents use the word "descriptor" GUID Globally Unique IDentifier. This is a 16-byte long value made of random bytes, whose purpose it is to identify servants and messages. This identification is not a signature, just a way to identify network entities in a unique manner. 1.4 Extending the protocol This document is the definition of the Gnutella 0.6 protocol. Servants MAY extend the protocol or even change parts of it (for example by compressing or encrypting the messages), but servants MUST always stay compatible with servants that follow this specification. If a servant, for example, wants to compress the Gnutella messages, it MUST first make sure the remote host of a connection can decompress the stream (during handshake), and otherwise leave the messages uncompressed. Servants MAY chose not to accept a connection with a servant that does not support a feature, but MUST always make sure that the Gnutella network is not split into separate networks. Separate networks for special purposes are, of course, allowed but then it is no longer the Gnutella network, but another network. This protocol also allows for extensions inside many messages. Such extensions can pass through servants that do not know about the extension to reach servants that do. 2 Protocol Definition The Gnutella protocol defines the way in which servants communicate over the network. It consists of a set of messages used for communicating data between servants and a set of rules governing the inter-servant exchange of messages. Currently, the following messages are defined: Ping Used to actively discover hosts on the network. A servant receiving a Ping message is expected to respond with one or more Pong messages. Pong The response to a Ping. Includes the address of a connected Gnutella servant, the listening port of that servant, and information regarding the amount of data it is making available to the network. Query The primary mechanism for searching the distributed network. A servant receiving a Query message will respond with a Query Hit if a match is found against its local data set. QueryHit The response to a Query. This message provides the recipient with enough information to acquire the data matching the corresponding Query. Push A mechanism that allows a firewalled servant to contribute file-based data to the network. Bye An optional message used to inform the remote host that you are closing the connection, and your reason for doing so. 2.1 Initiating a Connection A Gnutella servant connects itself to the network by establishing a connection with another servant currently on the network. Techniques for finding the first host are described in Appendix 3. Once the first connection is established, the addresses of more hosts will be supplied over the network. The default Gnutella port is 6346, but servants MAY use any unused port. If the desired port is used (probably by another Gnutella servant) the servant SHOULD attempt to listen on another port. This listening port is advertised by the servant through the Pong messages. Techniques and rules for how to select what other Gnutella hosts to connect to and when to accept connection requests can be found in Appendix 4. Once the address of another servant on the network is obtained, a TCP/IP connection to the servant is created, and a handshaking sequence is initiated. The client is the host initiating the connection and the server is the host receiving it. "" refers to ASCII character 13 (carriage return), and "" to 10 (new line). 1. The client establishes a TCP connection with the server. 2. The client sends "GNUTELLA CONNECT/0.6". 3. The client sends all capability headers--except for vendor-specific headers--each terminated by "", with an extra "" at the end. 4. The server responds with "GNUTELLA/0.6 200 ". SHOULD be "OK", but servants SHOULD just look for the "200" code. 5. The server sends all its headers, in the same format as in (3). 6. The client sends "GNUTELLA/0.6 200 OK, as in (4) if after parsing the server's headers, it still wishes to connect. Otherwise, it needs to reply with an error code and close the connection. 7. The client sends any vendor-specific headers as needed, in the same format as (3). 8. Both client and server send binary messages at will, using the information gained in (3) and (5). All headers SHOULD be registered with the GDF database at http://groups.yahoo.com/group/the_gdf/database?method=reportRows&tbl=9 (Requires GDF membership) Headers follow the standards described in RFC822 and RFC2616. Each header is made of a field name, followed by a colon, and then the value. Each line ends with the sequence, and the end of the headers is marked by a single line. Each line normally starts a new header, unless it begins with a space or an horizontal tab (ASCII codes 32 and 9 in decimal, respectively), in which case it continues the preceding header line. The extra spaces and tabs may be collapsed into a single space as far as the header value goes. For instance: First-Field: this is the value of the first field Second-Field: this is the value of the second field The header above is made of two fields, "First-Field" and "Second- Field" whose values are respectively "this is the value of the first field" and "this is the value of the second field" (leading spaces of the continuation were collapsed). Note that the leading space between the ":" ending the field name and the start of the value string does not count. Multiple header lines with the same field name are identical to one header line where all the values of the fields would be separated by ",". This means: Field: first Field: second is strictly equivalent to saying: Field: first,second In other words, order matters in that case. Here is a sample interaction between a client and a server. Data sent from client to server is shown on the left; data sent from server to client is shown on the right. Client Server ----------------------------------------------------------- GNUTELLA CONNECT/0.6 User-Agent: BearShare Query-Routing: 0.2 Pong-Caching: 0.1 GGEP: 0.5 GNUTELLA/0.6 200 OK User-Agent: BearShare Query-Routing: 0.1 Pong-Caching: 0.1 GGEP: 0.5 Private-Data: 5ef89a GNUTELLA/0.6 200 OK Private-Data: a04fce [binary messages] [binary messages] A few notes about the responses: first, the client (server) SHOULD disconnect if receiving any response other than "200" at step 4 (6). There is no need to define these error codes now. Second, servants SHOULD ignore higher version numbers in steps (2), (4), and (6). For example, it is perfectly legal for a future client to connect to a server and send "GNUTELLA CONNECT/0.7". The server SHOULD respond with "GNUTELLA/0.7 200 OK" if it supports the 0.7 protocol, or "GNUTELLA/0.6 200 OK" otherwise. A few notes about the headers: servants SHOULD use standard HTTP headers whenever appropriate. For example, servants SHOULD use the standard "User-Agent" header rather than make up a "Servant-Vendor" header. However, it is perfectly legal to add new headers (e.g., "Query-Routing") when no appropriate HTTP header exists, as long as they follow HTTP syntax. Headers unknown to the servant MUST be ignored. Some older servants will initiate the handshake by sending "GNUTELLA CONNECT/0.4". The server SHOULD then reply with "GNUTELLA OK" followed by binary messages, if it can accept the connection. Servants MAY retry using the 0.4 connect string if the 0.6 connection attempt were rejected. No handshaking headers can be used in 0.4 handshaking. When rejecting a connection, a servant MUST, if possible, provide the remote host with a list of other Gnutella hosts, so it can try connecting to them. This SHOULD be done using the X-Try header. An X-Try header can look like: X-Try:1.2.3.4:1234,3.4.5.6:3456 There MAY be a space after the colon and after each comma. There MAY be multiple X-Try headers in one header set. The header MAY end with an extra comma. The header MAY be formatted on several lines using continuations. Each item in the X-Try header gives the IP address of a servant and its listening port number. This is sometimes referred to as being a "connection pong". If the server sending the X-Try implements Pong-Caching, then the connection pongs being sent must be fresh ones. The normal status code for rejecting a connection because the servant is busy is "503 " followed by "Busy" or another description string. 2.2 Gnutella Messages Once a servant has connected successfully to the network, it communicates with other servants by sending and receiving Gnutella protocol messages. Each message is preceded by a Message Header with the byte structure given below. Note 1: One IP packet may contain several Gnutella messages, and one Gnutella message may be split up on multiple IP-packets. This means one can never assume a Gnutella message ends when the chunk of data read from the socket ends. Note 2: All fields in the following structures are in little-endian byte order unless otherwise specified. Note 3: All IP addresses in the following structures are in IPv4 format. For example, the IPv4 byte array 0xD0 0x11 0x32 0x04 byte 0 byte 1 byte 2 byte 3 represents the dotted address 208.17.50.4. 2.2.1 Message Header The message header is 23 bytes divided into the following fields. Bytes: Description: 0-15 Message ID/GUID (Globally Unique ID) 16 Payload Type 17 TTL (Time To Live) 18 Hops 19-22 Payload Length Message ID A 16-byte string (GUID) uniquely identifying the message on the network. Servants SHOULD store all 1's (0xff) in byte 8 of the GUID. (Bytes are numbered 0-15, inclusive.) This serves to tag the GUID as being from a modern servant. Servants SHOULD initially store all 0's in byte 15 of the GUID. This is reserved for future use. The other bytes SHOULD have random values. Payload Indicates the type of message Type 0x00 = Ping 0x01 = Pong 0x02 = Bye 0x40 = Push 0x80 = Query 0x81 = Query Hit Other Gnutella messages can be used, but if so the servant MUST first make sure that the remote host supports this new message type. This can be done using handshaking headers. TTL Time To Live. The number of times the message will be forwarded by Gnutella servants before it is removed from the network. Each servant will decrement the TTL before passing it on to another servant. When the TTL reaches 0, the message will no longer be forwarded (and MUST not). Hops The number of times the message has been forwarded. As a message is passed from servant to servant, the TTL and Hops fields of the header must satisfy the following condition: TTL(0) = TTL(i) + Hops(i) Where TTL(i) and Hops(i) are the value of the TTL and Hops fields of the message, and TTL(0) is maximum number of hops a message will travel (usually 7). Payload The length of the message immediately following Length this header. The next message header is located exactly this number of bytes from the end of this header i.e. there are no gaps or pad bytes in the Gnutella data stream. Messages SHOULD NOT be larger than 4 kB. The Payload Length field is the only reliable way for a servant to find the beginning of the next message in the input stream. Therefore, servants SHOULD rigorously validate the Payload Length field for each message received. If a servant becomes out of synch with its input stream, it SHOULD close the connection associated with the stream since the upstream servant is either generating, or forwarding, invalid messages. Abuse of the TTL field in broadcasted messages (Query) will lead to an unnecessary amount of network traffic and poor network performance. Therefore, servants SHOULD carefully check the TTL fields of received query messages and lower them as necessary. Assuming the servant's maximum admissible Query message life is 7 hops, then if TTL + Hops > 7, TTL SHOULD be decreased so that TTL + Hops = 7. Broadcasted messages with very high TTL values (>15) SHOULD be dropped. Immediately following the message header, is a payload consisting of one of the following messages. 2.2.2 Ping (0x00) Ping messages MAY contain a GGEP extension block (see Section 2.3), but no other payload. 2.2.3 Pong (0x01) Pong messages contains information about a Gnutella host. The message has the following fields Bytes: Description: 0-1 Port number. The port number on which the responding host can accept incoming connections. 2-5 IP Address. The IP address of the responding host. Note: This field is in big-endian format. 6-9 Number of shared files. The number of files that the servant with the given IP address and port is sharing on the network. 10-13 Number of kilobytes shared. The number of kilobytes of data that the servant with the given IP address and port is sharing on the network. 14- OPTIONAL GGEP extension block. (see Section 2.3) Pong messages are only sent in response to an incoming Ping message. It is valid for more than one Pong message to be sent in response to a single Ping message. This enables host caches to send cached servant address information in response to a Ping request. The Message ID of a Pong message MUST be the Message ID of the Ping message it is sent in reply to. The fields specifying the number of shared files and the number of kilobytes shared was intended to allow one to measure the amount of data available on the network. With a very large Gnutella network, and minimized Ping and Pong message traffic, this can no longer be done. Still, these fields SHOULD be filled out correctly. 2.2.4 Use of Ping and Pong messages In early versions Gnutella, Ping messages were broadcasted over the network. Pong messages were then routed back to the originator of the Ping message the same way as Query Hits messages are routed (se section 2.2.7). That system consumed a lot of network bandwidth, so modern Gnutella servants cache Pong messages, or use other means of minimizing the bandwidth used by Ping and Pong messages. There are different systems for handling Ping and Pong messages, but what they have in common is: * When a Ping message is received (TTL>1 and it was at least one second since another Ping was received on that connection), a servant MUST, if possible, respond with a number of Pong Messages. These pongs MUST have the same message ID as the incoming ping, and a TTL no lower than the hops value of the ping. The number of pongs returned may vary, but 10 is a reasonable number. Servants that are able to accept incoming Gnutella SHOULD reply to these Ping messages. * The pongs sent SHOULD have a good quality. That includes high probability that they are connectable and a good spread of hosts from across the network * The bandwidth used by Ping and Pong messages SHOULD be minimized. Servants MUST never output very high quantities of Ping and Pong messages. * An incoming Ping message with TTL = 1 and Hops = 0 or 1 is used to probe the remote host of a connection, and MUST always be replied to with a pong having information about the host who received the ping. * An incoming Ping message with TTL = 2 and Hops = 0 is a "Crawler Ping" used to scan the network. It and SHOULD be replied to with pongs containing information about the host receiving the ping and all other hosts it is connected to. The information about neighbour nodes can be provided either by creating pongs on their behalf, or by forwarding the ping to them, and forward the pongs returned to the crawler. Servants fulfilling these requirements MUST provide a the header "Pong-Caching: 0.1" (or a higher number if a later version is used) during the handshake. That allows other nodes to know if pong caching in any form is supported. Note that this applies to servants do not really cache pong messages as well, as long as the rules above applies. Servants are strongly RECOMMENDED to follow the rules above, and provide the Pong-Caching header. When storing or forwarding Pong messages, any GGEP payload SHOULD be included. When sending a Ping message, one cannot know if it will reach only the neighbour host, or many hosts on the network. It depends on what system for handling Ping and Pong messages other servants are using. Servants MUST NOT make assumptions of how far a Ping message (and its payload) will reach. 2.2.4.1 A simple pong caching scheme This is one system for handling Ping and Pong messages. There are others available (see sect. 2.2.4.2), and any system that abides to the rules in sect. 2.2.4 is ok. For each connection an array of Pong Messages are stored. 10 may be a good number. When a pong comes in, it overwrites the oldest stored pong in array of he connection the pong came from. The information that must be stored for each pong is: * IP Address * Port number * Number of files shared * Number of kilobytes shared * GGEP extension block (if present) * Hops value, i.e. how far away on the network the host using the stored address is When a Ping message, called P, is received over connection C, and it has been at least one second since last time a ping was received over C, the servant will return a number of pongs (10 for example) from its stored pongs. The pongs will be pick from all connections except from C, since it would be no good sending pongs back where they came from. A servant should also return a pong with information about itself, if it can accept incoming connections. The outgoing pong will have the same message ID as P, not the message ID it had when the pong was received. The Hops is set to the stored hops value + 1, and TTL so that TTL+Hops=7. If the TTL is less than P's Hops value, the current stored pong will not be sent. This also means that pongs whose Hops value already is 7 will not be propagated any further. Exactly how to select which of the stored pongs to send in response to an incoming ping is up to each servant. A good idea is to pick pongs from different connections and with varying stored Hops values. To keep the cache fresh, a ping (TTL=7, Hops=0) is sent over all connections at small interval (like every 3 seconds). This look like very often, but remember that the neighbour servants will just respond with pongs from its own cache. The short time ensures that pongs are always fresh. To neighbour hosts who has not indicated that they support pong caching (using the Pong-Caching handshaking header), one ping per minute might be a better number. Incoming pings with TTL=1 and Hops=0 or 1 (see above section 2.2.4) is replied to with a single pong containing information about the local host. Pings with TTL=2 and Hops=0 are replied to with one pong about the local host, and one about each other host the local host is connected to. Information about the neighbour hosts is retrieved when a new connection is started by sending a TTL=1, Hops=0 ping and storing the pong returned. This can be done using handshaking headers instead. The bandwidth used by this scheme is very limited. Assuming a ping is sent every 3 seconds and that 10 pongs are returned to every ping. Since a (without extensions) is 23 bytes and a pong (without extensions) is 37 bytes, the amount of bandwidth used per connection is (23+10*37)/3 = 131 bytes/sec/connection. If extensions are used in ping and/or pong messages, the bandwidth usage will increase, but will still be kept on an acceptable level. If the bandwidth usage must re decreased further, the interval between update pings could be increased. 2.2.4.2 Other pong caching schemes A slightly more advanced scheme for pong caching is available at http://www.limewire.com/index.jsp/pingpong A different, but compatible scheme can be found at http://groups.yahoo.com/group/the_gdf/files/Proposals/PONG/Variants/ pingreduce.txt Other schemes might have been created after this was written. 2.2.5 Query (0x80) Since Query messages are broadcasted to many nodes, the total size of the message SHOULD not be larger than 256 bytes. Servants MAY drop Query messages larger that 256 bytes, and SHOULD drop Query messages larger than 4 kB. A Query message has the following fields: Bytes: Description: 0-1 Minimum Speed. The minimum speed (in kb/second) of servants that should respond to this message. A servant receiving a Query message with a Minimum Speed field of n kb/s SHOULD only respond with a Query Hit if it is able to communicate at a speed >= n kb/s. 2- Search Criteria. This field is terminated by a NUL (0x00). See section 2.2.7.3 for rules and information on how to interpret the Search Criteria Rest OPTIONAL extensions block. The rest of the query message is used for extensions to the original query format. The allowed extension types are GGEP, HUGE and XML (see Section 2.3 and Appendixes 1 and 2). If two or more of these extension types exist together, they are separated by a 0x1C (file separator) byte. Since GGEP blocks can contain 0x1C bytes, the GGEP block, if present, MUST be located after any HUGE and XML blocks. The type of each block can be determined by looking for the prefixes "urn:" for a HUGE block, "<" or "{" for XML and 0xC3 for GGEP. The extension block SHOULD NOT be followed by a null (0x00) byte, but some servants wrongly do that. 2.2.6 Query Hit Query Hit messages has the following fields: Bytes: Description: 0 Number of Hits. The number of query hits in the result set (see below). 1-2 Port. The port number on which the responding host can accept incoming HTTP file requests. This is usually the same port as is used for Gnutella network traffic, but any port MAY be used. 3-6 IP Address. The IP address of the responding host. Note: This field is in big-endian format. 7-10 Speed The speed (in kb/second) of the responding host. 11- Result Set. A set of responses to the corresponding Query. This set contains Number_of_Hits elements, each with the following structure: Bytes: Description: 0-3 File Index. A number, assigned by the responding host, which is used to uniquely identify the file matching the corresponding query. 4-7 File Size. The size (in bytes) of the file whose index is File_Index. 8- File Name. The name of the file whose index is File_Index. Terminated by a null (i.e. 0x00) x Extensions block. Allowed extension types are HUGE, GGEP and plain text metadata. This field is terminated by a null (0x00), even if there are no extensions (resulting in a double null). Also, the extensions block itself MUST NOT contain any null bytes. If two or more of these extension types exist together, they are separated by a 0x1C (file separator) byte. Since GGEP blocks can contain 0x1C bytes, the GGEP block, if present, MUST be located after any HUGE and plan text blocks. The type of each block can be determined by looking for the prefixes "urn:" for a HUGE block, 0xC3 for GGEP and anything else is probably plain text metadata. Plain text metadata is intended to be displayed directly to the user. It was first invented by Gnotella (a now discontinued Gnutella servant) to tag MP3 files. Examples: "192 kbps 44 kHz 3:23" "120 kbps(VBR) 44kHz 3:55" (variable bitrate) Other plan text formats MAY be used. x RECOMMENDED extra block. This block is not required, but strongly recommended. It is sometimes called EQHD, or (incorrectly) just QHD. It has the following format: Bytes: 0-3 Vendor Code. Four case-insensitive characters representing a vendor code. For example "LIME" for LimeWire. See registered codes and register yours at http://groups.yahoo.com/group/the_gdf/database? method=reportRows&tbl=6 (Requires GDF membership) 4 Open Data Size. Contains the length (in bytes) of the Open Data field. Set to 2 in most current implementations, and 4 in those that support XML metadata outside GGEP (see Section 2.3 and Appendix 2). The Open Data area MAY be larger to allow future extensions. x Open Data. Contains two 1-byte flags fields with the following layout and in the specified order: bit: Description: 7,6 Reserved for future use 5 flagGGEP 4 flagUploadSpeed 3 flagHaveUploaded 2 flagBusy 1 Reserved for future use 0 flagPush The first flag byte can be viewed as an "enabler" for the flags in the second byte, the "setter". Only those bits that were enabled must be considered by the servant as being valid. This logic is reversed for flagPush, which is set in the first byte and enabled in the second. The enabling byte allows you to know which flags are supported by a given servant. Bits 5,4,3,2 in the first byte MUST be set if and only if the corresponding flag in the second byte is meaningful. Bit 0 in the second byte MUST be set if and only if the corresponding flag in the second byte is meaningful. Yes, the order is reversed for this flag. flagGGEP is set is set if and only if the private data block (see below) contains a GGEP block. flagUploadSpeed is set if and only if the Speed field of the QueryHit message contains the highest average transfer rate (in kbps) of the last 10 uploads. Otherwise Speed field contains the hosts total upload speed as set by the user, and therefore less reliable. flagHaveUploaded is set if and only if the servant has successfully uploaded at least one file. flagBusy is set if and only if the all of the servant's upload slots are currently full. flagPush is set if and only if the servant is firewalled or cannot accept incoming TCP connections for any other reason. The reserved flags MUST not be set, unless they are used for a future extension. If XML metadata (Appendix 2) is included in the current Query Hit message, the following 2 bytes of Open Data area will contain the size of the XML block. The XML block itself is placed in the private area (see below). x Private Data. Undocumented vendor-specific data. This field continues till the servant Identifier, which uses the last 16 bytes of the message. If the flagGGEP in the open data block is set, this block contains a GGEP (see Section 2.3) extension block. The GGEP block starts with a 0xC3 byte. Any data before or after the GGEP block is vendor-specific data, and MUST be ignored, if not recognized. Servants are NOT RECOMMENDED to use the private data area for vendor specific data. Servants SHOULD use GGEP extensions instead. If the Open Data area indicates an XML block is will also be placed in the private area (see Appendix 2). Assuming that the two bytes in the Open Data area specifies an XML block of m bytes, that block can be found by extracting the last m bytes of the private area. Both GGEP and XML can exist in the same Private Data area, but XML SHOULD be implemented inside GGEP. [TODO: How about the nul after the XMP block? What is it good for?] Last 16 Servant Identifier. A 16-byte string uniquely identifying the responding servant on the network. This SHOULD be constant for all Query Hit messages emitted by a servant and is typically some function of the servant's network address. The servant Identifier is mainly used for routing the Push Message (see below). 2.2.7 Use of Query and Query Hit 2.2.7.1 Forwarding and routing of Query and Query Hit messages A servant SHOULD forward incoming Query messages to all of its directly connected servants, except the one that delivered the incoming Query. Servants using Flow control or Ultrapeers (sections 3.1 and 3.2) will not always forward every Query over every connection. A servant MUST decrement a message header's TTL field, and increment its Hops field, before it forwards the message to any directly connected servant. If, after decrementing the header's TTL field, the TTL field is found to be zero, the message MUST NOT be forwarded along any connection. A servant receiving a message with the same Payload Message and Message ID as one it has received before, MUST discard the message. It means the message has already been seen. QueryHit messages MUST only be sent along the same path that carried the incoming Query message. This ensures that only those servants that routed the Query message will see the QueryHit message in response. A servant that receives a QueryHit message with Message ID = n, but has not seen a Query message with Message ID = n SHOULD remove the QueryHit message from the network. 2.2.7.2 When and how to send new Query messages. Query messages are usually sent when the user initiates a search. A servant MAY also create Queries automatically, to find more locations of a resource for example. If doing so the servant MUST be very careful not overload the network. A servant SHOULD not send more than one automatic query per hour. Servants SHOULD NOT allow the user to create a large amount of queries by repeatedly clicking on a button. Servants SHOULD watch queries originating from its neighbours (Hops=0) If those queries are too frequent, are duplicates or indicate bad servants behavior in any other way, the servants SHOULD drop those queries or even close the connection. The TTL value of a new query created by a servant SHOULD NOT be higher than 7, and MUST NOT be higher than 10. The hops value MUST be set to 0. 2.2.7.3 When and how to respond with Query Hit messages. When a servant receives an incoming Query message it SHOULD match the Search Criteria of the query against its local shared files. The Search Criteria is text, and it has never been specified which charset that text was encoded with. Therefore, servants MUST assume it is pure ASCII only. If any byte with the 7th bit set (high bit) is found, then either there is a GGEP extension specifying the encoding used, or the servant SHOULD guess the proper encoding. Most likely, it will be ISO-latin-1 or UTF-8. Exactly how to interpret the Search Criteria is not specified either, but here are some guidelines for interoperability between servants: The Search Criteria is a string of keywords. A servant SHOULD only respond with files that has all the keywords. It is RECOMMENDED to break up the words on any non-alphanumeric characters (anything but letters and numbers). A space is the standard separator between words. Servants MAY also require that all matching terms be present in the same number and order as in the query. Regular expressions are not supported and common regexp "meta- characters" such as "*" or "." will either stand for themselves or be ignored. The matching SHOULD be case insensitive. Empty queries or queries containing only 1-letter words SHOULD be ignored. GGEP extensions MAY be used to provide details on how to parse the Search Criteria (such as specifying that regular expressions matching should be used), but a servant can never be sure other servants will understand the GGEP extension. Servants MAY ignore queries whose Search Criteria is shorter than a chosen length. The reason is to ignore too broad searches. Query messages with TTL=1, hops=0 and Search Criteria=" " (four spaces) are used to index all files a host is sharing. Servants SHOULD reply to such queries with all its shared files. Multiple Query Hit messages SHOULD be used if sharing many files. Allowed reasons not to respond to index queries include privacy and bandwidth. Query Hit messages MUST have the same Message ID as the Query message it is sent in reply to. The TTL SHOULD be set to at least the hops value of the corresponding query plus 2, to allow the Query Hit to take a longer route back, if necessary. The TTL value MUST be at least the hops value of the corresponding query, and the initial hops value of the Query Hit message MUST (as usual) be set to 0. Some servants use a TTL of (2 * Query_TTL + 2) in their replies to be sure that the reply will reach its destination. Replies with high TTL level SHOULD be allowed to pass through. 2.2.8 Push (0x40) A Push message has the following fields: Bytes: Description: 0-15 Servant Identifier. The 16-byte string uniquely identifying the servant on the network who is being requested to push the file with index File_Index. The servant initiating the push request MUST set this field to the Servant_Identifier returned in the corresponding QueryHit message. This is used to route the Push message to the sender of the Query Hit message. 16-19 File Index. The index uniquely identifying the file to be pushed from the target servant. The servant initiating the push request MUST set this field to the value of one of the File_Index fields from the Result Set in the corresponding QueryHit message. 20-23 IP Address. The IP address of the host to which the file with File_Index should be pushed. This field is in big-endian format. 24-25 Port. The port number the receiver of this message should push to. 26- OPTIONAL GGEP extension block. (see Section 2.3) A servant may send a Push message if it receives a QueryHit message from a servant that doesn't support incoming connections. This might occur when the servant sending the QueryHit message is behind a firewall. When a servant receives a Push message, it SHOULD act upon the push request if and only if the servant_Identifier field contains the value of its servant identifier. The Message_Id field in the Message Header of the Push message SHOULD not contain the same value as that of the associated QueryHit message, but SHOULD contain a new value generated by the servant's Message_Id generation algorithm. Push messages are forwarded back to the originator of the Query Hits message using the Servant Identifier value. This means multiple Push messages can have the same Servant Identifier. Push messages MUST only be considered as duplicates if the Message ID in the header is the same. Since Push messages are not broadcasted, duplicate messages should be very rare. 2.2.9 Bye (0x02) The Bye message is an OPTIONAL message used to inform the servant you are connected to that you are closing the connection. Servants supporting the Bye message MUST indicate that by sending the following header in the handshaking sequence: Bye-Packet: 0.1 Servants MUST NOT send Bye messages to hosts that has not indicated support using the above header. Future versions will be backwards compatible, so Bye messages MAY also be sent to hosts providing the above header with a later version number. A Bye packet MUST be sent with TTL=1 (to avoid accidental propagation by an unaware servant), and hops=0 (of course). A servant receiving a Bye message MUST close he connection immediately. The servant that sent the packet MUST wait a few seconds for the remote host to close the connection before closing it. Other data MUST NOT be sent after the Bye message. Make sure any send queues are cleared. The servant that sent by Bye message MAY also call shutdown() with 'how' set to 1 after sending the Bye message, partially closing the connection. Doing a full close() immediately after sending the Bye messages would prevent the remote host from possibly seeing the Bye message. After sending the Bye message, and during the "grace period" when we don't immediately close the connection, the servant MUST read all incoming messages, and drop them unless they are Query Hits or Push, which MAY still be forwarded (it would be nice to the network). The connection will be closed as soon as the servant gets an EOF condition when reading, or when the "grace period" expires. A Bye message has the following fields: Bytes: Description: 0-1 Code. The presence of the Code allows for automated processing of the message, and the regular SMTP classification of error code ranges should apply. Of particular interests are the 200..299, 400..499 and 500..599 ranges. Here is the general classification ("User" here refers to the remote node that we are disconnecting from): 2xx The User did nothing wrong, but the servant chose to close the connection: it is either exiting normally (200), or the local manager of the servant requested an explicit close of the connection (201). 4xx The User did something wrong, as far as the servant is concerned. It can send packets deemed too big (400), too many duplicate messages (401), relay improper queries (402), relay messages deemed excessively long- lived [hop+TTL > max] (403), send too many unknown messages (404), the node reached its inactivity timeout (405), it failed to reply to a ping with TTL=1 (406), or it is not sharing enough (407). 5xx The servant noticed an error, but it is an "internal" one. It can be an I/O error or other bad error (500), a protocol desynchronization (501), the send queue became full (502). 2- NULL-terminated Description String. The format of the String is the following ( refers to "\r" and to "\n"): Error message, as descriptive as possible or optionally, something more qualified, with HTTP-like headers giving out more information: Error message, as descriptive as possible Server: some server/version X-Gnutella-XXX: some specific Gnutella header for instance telling the host about alternate nodes it could connect to The presence of a at the end of message indicates that HTTP-like headers are present. The absence of any indicates that the short error message form was used. Unless circumstances making that impossible (urgent disconnection due to a memory fault), the HTTP-like headers version SHOULD be used, with at least a Server: header, allowing better tracing and debugging. For further information about the Bye message, please refer to the original documentation located at: http://groups.yahoo.com/group/the_gdf/files/Proposals/BYE/ 2.3 GGEP Extension blocks The Gnutella Generic Extension Protocol (GGEP) allows arbitary extensions in Gnutella messages. A GGEP block is a framework for other extensions. If you wish to implement a new extension to a packet, you MUST do so inside GGEP. Some extensions that were invented before GGEP (XML metadata for example) are allowed to existoutside GGEP. Servants are RECOMMENDED to implement GGEP. However, all servants MUST pass on GGEP extension blocks inside Gnutella messages. servants that have support the forwarding of all packets that contain GGEP extensions (whether or not they can process them), MUST include a new header in the Gnutella 0.6 connection handshake indicating this support. This will allow other servants to know what types of packets this servant can accept. The format of this header is GGEP : majorversion.minorversion As the current version of GGEP is 0.5 when this was written the header would be GGEP: 0.5 Servants SHOULD remove any GGEP blocks from Ping, Pong and Push messages before sending those messages to hosts that have not indicated GGEP support. For the original GGEP documentation see http://groups.yahoo.com/group/the_gdf/files/Proposals/GGEP/ 2.3.1 GGEP Format A GGEP block always starts with a magic byte used to help distinguish GGEP extensions from legacy data which may exist. It must be set to the value 0xC3. When a GGEP block is used between the nulls in a result in a Query Hits message, it is not allowed to contain any null bytes. This requires some special tricks in the field format. The magic byte is followed by any number of extensions. They SHOULD be processed in the order in which they appear. The following is the format of each extension: Bytes used: Field Name: 1 Flags 1-15 ID 1-3 Data Length x Extension Data Flags: These are options which describe the encoding of the extension header and data. Bit Name 7 Last Extension. When set, this is the last extension in the GGEP block. 6 Encoding. The value contained in this field dictates the type of encoding which should be applied to the extension data (after possible compression). 0 = There is no encoding on the data. 1 = The data is encoded using the COBS scheme. Details about the COBS encoding scheme can be found at http://www.acm.org/sigcomm/sigcomm97/ papers/p062.pdf 5 Compression. The value contained in this field dictates the type of compression that should be applied to the extension data. 0 = The extension data has not been compressed. 1 = The extension data should be decompressed using the deflate algorithm. One should only compress data if doing so will make a material difference in the resulting packet size. Details about the Deflate compression scheme may be found at http://www.gzip.org/zlib/ and http://www.faqs.org/rfcs/rfc1951.html 4 Reserved. This field is currently reserved for future use. It must be set to 0. 3-0 ID Len Value 1-15 can be stored here. Since this will not be zero, it ensures this byte will not be 0x0. ID: The raw binary data in this field is the extension ID. The length of this field can range between 1 and 15 bytes, and is determined by the Flags field. See section 2.3.2 below on suggestions and rules for creating extension IDs. No byte in the extension header may be 0x0. Data Length: This is the length of the raw extension data. Please note that most Gnutella clients will drop messages, and possibly connections if the message size is larger than a certain threshold (which varies according to message type). Please pay attention to these limits when creating and bundling new extensions. This field uses an encoding technique that ensures that 0x0 is never the value of any byte. Steps were also taken to ensure that the encoding is compact. In this technique, a length field is the concatenation of length chunks. The format of each length chunk (which contains 6 bits of length info) is described in bit level below: Format: 76543210 MLxxxxxx M = 1 if there is another length chunk in the sequence, else 0 L = 1 if this is the last length chunk in the sequence, else 0 xxxxxx = 6 bits of data. 01aaaaaa ==> aaaaaa (2^6 values = 0-63) 10bbbbbb 01aaaaaa ==> bbbbbbaaaaaa (2^12 values = 0-4095) 10ccccccc 10bbbbbb 01aaaaaa ==> ccccccbbbbbbaaaaaa (2^18 values = 0-262143) Boundary Cases: 0 = 01 000000b = 0x40 63 = 01 111111b = 0x7f 64 = 10 000001 01 000000b = 0x8140 4095 = 10 111111 01 111111b = 0xbf7f 4096 = 10 000001 10 000000 01 000000b = 0x818040 262143 = 10 111111 10 111111 01 111111b = 0xbfbf7f As you see, when the bits are concatenated, the number is in big endian format. Extension Data The actual extension data. The format of this field varies between extensions. A servant that does not recognize and extension will not be able to parse the Extension data, but since the length of this field is specified by Data Length, it can still skip to the next extension. Note that extensions MAY be empty. 2.3.2 Creating Extension IDs The Extension ID field in the GGEP header is a binary field consisting of between 1 and 15 bytes. It cannot contain the byte 0x0, and one must be able to compare IDs with a simple binary comparison. Aside from those rules, GGEP does not mandate any particular format, but does encourage the creation of short IDs that are free from conflicts. One should also note that Extension IDs are meant to be consumed by machines. Still, the following rules apply. GDF Registered Extensions: Any Extension ID of less than 4 bytes MUST be stored in the appropriate GDF database. Any Extension ID of less than 3 bytes must also be approved by the GDF. The format of the extension data must also be registered. VendorID Extensions This simple technique allows for the creation of ExtensionIDs based upon uses the following format VendorID.BinaryID VendorID for a Gnutella servant is a 4 byte value that has been registered in the GDF Peer Codes database. In the QueryHit Descriptor, this case is case insensitive. With ExtensionIDs, the case matters, as one must be able to perform a binary comparison on the ID. This means an ExtensionID of "SWAP.1" and "swap.1" are different, but both "belong" the vendor who ones the code "SWAP." This technique may be good for experimental and strictly vendor- specific extensions, but should be avoided for extension that may be useful for other vendors as well. Marking an extension by a vendor ID makes it harder for other vendors to use the extension in their servants. Extension implementers SHOULD publish the ID, format, and expected data size for their extensions in the GDF database called "GGEP Extensions." located at http://groups.yahoo.com/group/the_gdf/database?method=reportRows&tbl=10 (Requires GDF membership) 3 Protocol Usage Apart from the protocol definition in section 2, there are also some guidelines on how to use the protocol. These are not absolutely necessary to participate in the network, but very important for an effective network. 3.1 Flow Control It is very important that all servants have a system for regulating the data that passes through a connection. The most simple way is to close a connection if it gets overloaded. A better way is to drop broadcasted packets to reduce the amount of bandwidth used. A much better way is to do the following: Implement an output queue, listing pending outgoing messages in FIFO order. As long as the queue is less than, say, 25% of its max size (in bytes queued, not in amount of messages), do nothing. If the queue gets filled above 50%, enter flow-control mode. You stay in flow-control mode (FC mode for short) as long as the queue does not drop below 25%. This is called "hysteresis". The queue size SHOULD be at least 150% of the maximum admissible message size. In FC mode, all incoming queries on the connection are dropped. The rationale is that we would not want to queue back potentially large results for this connection since it has a throughput problem. Messages to be sent to the node (i.e. queued on the output queue) are prioritized: * For broadcasted messages, the more hops the packet has traveled, the less prioritary it is. Or the less hops, the more prioritary. This means your own queries are the most prioritary (hops = 0). * For replies (query hits), the more hops the packet has traveled, the more prioritary it is. This is to maximize network usefulness. The packet was relayed by many hosts, so it should not be dropped or the bandwidth it used would become truly wasted. * Individual messages are prioritized thusly, from the most prioritary to the least: Push, Query Hit, Pong, Query, Ping. The Bye message being special, it is always sent (i.e. the queue cannot be in FC mode since it needs to be cleared before sending Bye). Normally, all messages are accepted. However, when the message to enqueue would make the queue fill to more than 100% of its maximum size, any queued message less prioritary in the queue is dropped. If enough room could be made, enqueue the packet. Otherwise, if the message is a Query, a Pong or a Ping, drop it. If not, send a Bye 502 (Send Queue Overflow) message. Other known flow-control algorithms: SACHRIFC is a simple, but still very effective flow control system that drops less important packets first. It can be found at: http://groups.yahoo.com/group/the_gdf/message/5726 A more advanced flow control system can be found at: http://www.grouter.net/gnutella/ 3.2 Network Structure [TODO: Ultrapeer is so important that the information required to implement it will be included here.] Originally, all Gnutella nodes were connected to each other randomly. It worked fine for users with broadband connections, but not for users with slow modems. That problem can be solved by organizing the network in a more structured form. The Ultrapeer system has been found effective for this purpose. It is a scheme to have a hierarchical Gnutella network by categorizing the nodes on the network as client-nodes and super-nodes. A client-node keeps only a small number of connections open, and that is to super- nodes. A super-node acts as a proxy to the Gnutella network for the client nodes connected to it. This has an effect of making the Gnutella network scale, by reducing the number of nodes on the network involved in message handling and routing, as well as reducing the actual traffic among them. A super-node only forwards a query to a leaf-node if it believes to leaf-node can answer it, and leaf-nodes never relay queries between super-nodes. Super-peers are connected to each other and to "normal" Gnutella hosts (hosts that do not implement the Ultrapeer system). 3.2.1 Query Routing Protocol The Query Routing Protocol (QRP for short) is an essential part of the Ultrapeer specification: it governs how the Ultrapeer will filter queries and only forward those to the leaf nodes most likely to have a match. This is done without even knowing the resource names, by looking the query words through a big hash table, that is sent by the leaf node to its Ultrapeer. The aim of the QRP is to avoid forwarding a query that cannot match, it is not to forward only those queries that will match. The overall operation goes thusly: * At the leaf node level: + Break all the resource names into individual words. A word is made of a consecutive sequence of letters and digits. + Hash each word with a well-known hash function and insert a "present" flag in the corresponding hash table slot. Note that this hash table is a big array, and we don't store the key, only the fact that a key ended up filling some slot. All words are lower-cased and all accents are removed from them, i.e. "déjà" is transformed into "deja", so that only ASCII characters remain. Only those words that are made of at least 3 letters are retained. + All words are re-hashed with their trailing 1, 2, or 3 letters removed, provided the word length after such trimming is at least 3 letter long. This is a simple attempt to remove plural from words. Optionally, nodes can chop off more letters from the end, provided that each hashed word is at least 3 character long. + The "boolean vector" built at later stage is optionally compressed, broken up in small messages, and sent mixed with regular Gnet traffic to the ultrapeer. * At the Ultrapeer level: + Until the whole "boolean vector" is received from a leaf node, all queries are forwarded to that node. + When the "boolean vector" is fully received, it is going to be used as the Query Routing table for that leaf node: queries are broken into individual words, all accentuated letters are removed. + For each leaf node with a Query Routing table: . Each word is then hashed and looked up in the Query Routing table. . Depending on the query matching rules (see 2.2.7.3), either ALL the words will be required to be found in the Query Routing, or only some of them, to declare a Query Routing Hit. . Only those queries that were declared a Hit at the previous stage will be forwarded to a given leaf node. The remaining sections define the hashing function, the mechanism used to build up the "boolean vector" and compress it, the protocol to transmit the vector to the Ultrapeer, and finally give operating hints for the table sizing. [TODO: finish the QRP description --RAM] The Ultrapeer specification can be found at: http://www.limewire.com/developer/Ultrapeers.html The Query Routing Protocol (QRP) used in Ultrapeer can be found at: http://www.limewire.com/developer/query_routing/keyword%20routing.htm It is RECOMMENDED that servants implement the Ultrapeer system, or another system for decreasing the bandwidth load on modem users. 4 File Transfer 4.1 Normal File Transfer Once a servant receives a QueryHit message, it may initiate the direct download of one of the files described by the message's Result Set. Files are downloaded out-of-network i.e. a direct connection between the source and target servant is established in order to perform the data transfer. File data is never transferred over the Gnutella network. The file download protocol is HTTP. It is RECOMMENDED to use HTTP 1.1 (RFC 2616), but HTTP 1.0 (RFC 1945) can be used instead. The full specifications are available in those RFCs. The following includes only the basic things. The following examples assumes that HTTP 1.1 is used. The servant initiating the download sends a request string on the following form to the target server: GET /get// HTTP/1.1 User-Agent: Gnutella Host: 123.123.123.123:6346 Connection: Keep-Alive Range: bytes=0- where and are one of the File Index/File Name pairs from a QueryHit message's Result Set. For example, if the Result Set from a QueryHit message contained the entry File Index: 2468 File Size: 4356789 File Name: Foobar.mp3 then a download request for the file described by this entry would be initiated as follows: GET /get/2468/Foobar.mp3 HTTP/1.1 User-Agent: Gnutella Host: 123.123.123.123:6346 Connection: Keep-Alive Range: bytes=0- Servants MUST encode the filename in GET requests according the standard URL/URI encoding rules. Servants MUST accept URL-encoded GET requests. Since some old servants does not support encoding, servants SHOULD accept non-encoded requests and MAY try a non-encoded requests if a 404 Not Found error is returned for the initial request. The Host header is required by HTTP 1.1 and specifies what address you have connected to. It is usually not used by the receiving servant, but its presence is required by the protocol. The allowable values of the User-Agent string are defined by the HTTP standard. Servant developers cannot make any assumptions about the value here. The use of 'Gnutella' is for illustration purposes only. The server receiving this download request responds with HTTP 1.1 compliant headers such as HTTP/1.1 200 OK Server: Gnutella Content-type: application/binary Content-length: 4356789 The file data then follows and should be read up to, and including, the number of bytes specified in the Content-length provided in the server's HTTP response. Note: Servants SHOULD use HTTP version 1.1 for file transfer, but some support only HTTP version 1.0. Servants MUST accept incoming HTTP/1.0 requests, and SHOULD retry with HTTP/1.0 if the remote host is not HTTP/1.1 compliant. Though it is strongly RECOMMENDED to have full HTTP/1.1 support, some servants do not. The most important features for Gnutella, range requests and Persistent Connections MUST be supported. Some old servants, however, do not. Range requests are on the form GET /get/2468/Foobar.mp3 HTTP/1.1 User-Agent: Gnutella Host: 123.123.123.123:6346 Connection: Keep-Alive Range: bytes=4932766-5066083 Note that the Range header does not have to specify both start and end positions. The response is on the form HTTP/1.1 206 Partial Content Server: Gnutella Content-Type: audio/mpeg Content-Length: 133318 Content-Range: bytes 4932766-5066083/5332732 The Connection header tells the remote host if the connection should be closed when the transfer is finished or not. "Connection: close" means that the connection MUST be closed after the transfer. "Connection: Keep-Alive" or no Connection header means the connection MUST be kept open. The client MAY then issue another request for another range or another file. The request MAY be sent before the previous transfer is finished. Persistent Connections is described in section 8.1 of RFC 2616. Headers unknown to the servant MUST be quietly ignored. Servants SHOULD NOT attempt to download multiple files from the same source at once. Files SHOULD be locally queued instead. Servants are also RECOMENDED to use and understand the http extension described in HUGE. (see Appendix 1) 4.2 Firewalled servants It is not always possible to establish a direct connection to a Gnutella servant in an attempt to initiate a file download. The servant may, for example, be behind a firewall that does not permit incoming connections to its Gnutella port. If a direct connection cannot be established, the servant attempting the file download may request that the servant sharing the file "push" the file instead. A servant can request a file push by routing a Push request back to the servant that sent the QueryHit message describing the target file. The servant that is the target of the Push request (identified by the Servant Identifier field of the Push message) SHOULD, upon receipt of the Push message, attempt to establish a new TCP/IP connection to the requesting servant (identified by the IP Address and Port fields of the Push message). If this direct connection cannot be established, then it is likely that the servant that issued the Push request is itself behind a firewall. In this case, file transfer cannot take place by the means of what is described in this document. If a direct connection can be established from the firewalled servant to the servant that initiated the Push request, the firewalled servant should immediately send the following: GIV :/ Where and are the values of the File Index and Servant Identifier fields respectively from the Push request received, and is the name of the file in the local file table whose file index number is . The File Name MAY be url/uri encoded. The servant that receives the GIV (the servant that wants to receive a file) SHOULD ignore the File Index and File Name, and request the file it wants to download. The servant that sent the GIV MUST allow the client to request any file, and not just the one specified in the Push message. The GET request and the remainder of the file download process is identical to that described in the section 4.1 (Normal File Transfer) above. The is formatted as hexadecimal, and must be read case-insensitively. For instance: GIV 36:809BC12168A1852CFF5D7A785833F600/Foo.txt GIV 124:d51dff817f895598ff0065537c09d503/Bar.html If the TCP connection is lost during a Push initiated file transfer, it is strongly RECOMMENDED that the servant who initiated the TCP connection (the servant providing the file) attempt to re-connect. That is important, since the servant receiving the file might not be able to get another Push message to the servant providing the file. 4.3 Busy Servants Servants whose upload bandwidth is already saturated with transfers MAY reject a download request by returning the 503 response code. Servants MAY simply have a fixed number of available upload slots, but SHOULD use a system that utilizes upload bandwidth better. Allowing new downloads as long as 20% of total upload bandwidth is unused is one possibility. Busy servants receiving a Push message SHOULD connect to the host requesting a push, and return the 503 Busy code when the remote host has requested the file. Servants MAY try requesting a download again when the servant providing the file returns the busy code, but MUST not do so more often than once per minute and file source. That means a Servant MUST NOT open new connections to a remote host more than once per minute. Servants SHOULD prevent other servants breaking the above rule from increasing their chanses to downlaoad a file. This can for example be archived by refusing any connection attempts from a particular host if a download request has been denies less than 50 seconds ago, or by adding hosts that request too often to a ban list. Servants MAY use queuing systems to allow downloaders to stand in queue to download a file, but that is outside the scope of this document. If a transfer is interrupted, the serving servant SHOULD keep the allocated slot/bandwidth reserved for at least one minute. The downloader would then be allowed to reconnect and resume the transfer. 4.4 Sharing Servants that are able to download files MUST also be able to share files with others. Servants SHOULD encourage users to share files. Servants SHOULD attempt to prevent programs that are not able to share files from downloading files. This means that servant SHOULD not allow uploads to web browsers and download accelerators. The User-Agent http header tells what program the remote host is running. Many servants return a html page instead, telling the user how Gnutella works, and where to get a servant. Servants MUST NOT give precedence to other users using the same servant. They MUST answer Query messages and accept file download requests using the same rules for all servants. Servants MAY, however, attempt to block servants that do not follow the rules in this protocol in way that seriously hurts others experience of the Gnutella network. Servants SHOULD, by default, share the directory where downloaded files are placed. Servants SHOULD also share new downloaded files without waiting for the servant to be restarted. Servants SHOULD avoid changing the index numbers of shared files. Servants MUST NOT share partially downloaded (incomplete) files as if they were complete. This is often done by using a separate directory for incomplete downloads. When the download finishes, the file is move to the downloads directory (that should be shared). Partial files MAY be shared in a way the makes it clear to other servants that the file is incomplete. 5 Security Considerations [TODO: This section is very incomplete. Any suggestions are welcome.] 5.1 Threats against individual Gnutella participants [TODO: Write about threats against individual Gnutella participants Such as flooding, fake files, DoS, etc. Flooding hostcache with faked pongs] [TODO: How one can protect oneself and other gnet users] Inexperienced users might share sensitive information, such as cookie or password files, on the Gnutella network. Servants SHOULD warn users who try share such information. Malicious file, such as viruses and trojans, might be shared on the Gnutella network by malicious or unexpecting users. Servants SHOULD encourage users to scan downloaded files for viruses etc. but this is outside the scope of the Gnutella protocol. 5.2 Threats against the Gnutella network [TODO: Write about threats against the Gnutella network Such as query flooding, fake files, DoS, fake random pongs etc.] [TODO: What a servant should do to protect the network] [TODO: A solution is fair sharing of bandwidth between connections and message types] 5.3 Threats against third parties [TODO: Write about threats against third parties. Such (D)DoS, etc.] [TODO: How to avoid] Would it be possible to use the power of the Gnutella network to attack internet hosts? That issue will be discussed in this section. The ways of doing so an attacker might try is: * Responding to Ping messages with Pong messages containing the IP address and port of the target host. This would cause other Gnutella servants to attempt to connect to the target host, thinking it is a Gnutella host. [TODO: Can this really be an effective attack? Would the target receive that many connection attempts?] * Responding to Query messages with Query Hit messages containing the IP address and port of the target host. [TODO: This might actually be an issue. How about recommending servants to attempt to drop faked QHs somehow?] * Responding to Query Hit messages with Push messages containing the IP address and port of the target host. This would cause the servant receiving the Push message to attempt a connection to the target host. It would, however, not be more than one connection per Push message, so it could not be used to launch large Denial-of-Service attacks. [TODO: Is that correct, or is it a real threat?] 6 Credits The authors would like to thank: [TODO fill] New features mentioned in this document, not present in the original 0.4 Gnutella specification document, published by the now defunct Clip2 company should be credited thusly: 0.6 Handshaking Protocol LimeWire LLC X-Try Header Mike Green Bye Packet Raphael Manfredi Pong Caching LimeWire LLC and others EQHD Block Free Peers Inc. GGEP Jason Thomas HUGE Gordon Mohr Ultrapeers LimeWire LLC Query Routing Protocol LimeWire LLC XML queries LimeWire LLC Negotiation of Gnet Compression Raphael Manfredi [TODO missing?] Appendix 1: HUGE (Hash/URN Gnutella Extensions) HUGE is used to provide capability for hashes (numbers uniquely identifying files) and other urn:s to Gnutella. HUGE SHOULD be implemented inside GGEP (Section 2.3), but can also be used as a stand-alone extension block. When inside GGEP, the GGEP extension- identifier for HUGE info is 'u'. When used as a stand-alone extension, HUGE blocks start with "urn:". There MAY be multiple HUGE blocks in one Gnutella message (separated by 0x1C bytes). HUGE also extends the direct file transfer between hosts, to allow communication or urn:s, and to build "download meshes" that inform servants of other locations of a file. The HUGE documentation is available at: http://groups.yahoo.com/group/the_gdf/files/Proposals/HUGE Servants are RECOMMENDED to implement HUGE. Appendix 2: XML XML blocks can be used to issue rich queries and to include metadata about files in query hit messages. XML SHOULD be implemented inside GGEP, but can also be used as a stand-alone extension block. If there is a "{deflate}" before the XML block it is compressed using the defalte algorithm. "{plaintext}" or no such prefix means uncompressed plaintext XML. Any other prefix mean the XML block is compressed using an algorithm that is not known at this time. XML blocks can be recognized by them starting with "<" or "{". (Do not rely on " Note that although we only specify "deflate" here, the servant MAY advertise the set of various compression algorithms it knows, subsequent items being separated by a ",". And to accept compression, the other side acknowledges by sending: Content-Encoding: deflate The servant just picks the compression scheme it supports amongst the ones advertised by the remote end in the Accept-Encoding line. The Content-Encoding MUST contain only one value. This also means that compression settings is asymmetric: a node can send compressed data but receive uncompressed data. Here's an example where both nodes support compression, comments starting with "--", and ending removed for clarity: GNUTELLA CONNECT/0.6 Accept-Encoding: deflate -- OK for reception of compressed data GNUTELLA/0.6 200 OK Accept-Encoding: deflate -- I can also receive compressed data Content-Encoding: deflate -- And I will send compressed data GNUTELLA/0.6 200 OK Content-Encoding: deflate -- OK, will also compress data Here's an example where compression will only be made on the transmission side of the first node (A is the node initiating the handshake, B is the node replying): GNUTELLA CONNECT/0.6 Accept-Encoding: deflate -- OK for reception of compressed data GNUTELLA/0.6 200 OK Accept-Encoding: deflate -- I can also receive compressed data -- I refuse to compress data, sorry GNUTELLA/0.6 200 OK Content-Encoding: deflate -- OK, I will compress data sent -- But I will receive uncompressed data B is compressed, flow from B->A is not> Even though GGEP payloads (see Section 2.3) can be compressed, and this information is visible in the GGEP header, it is not advisable to decompress those payloads before sending them to the compressing layer. The deflate algorithm does not expand already-compressed data by a large factor and emits them as clearly marked non-compressible data (the overhead is limited to roughly 0.1%). If connection compression is widely used on the Gnutella network, individual GGEP extensions SHOULD NOT be compressed. [TODO: Some spelling errors.] [TODO: Many of the docs referred to are still drafts] [TODO: RFCs have 72 char lines, and a 3 char left margin for most text blocks. I intend to make lines max 69 chars, and text blocks will be indented 3 chars.] [TODO: Break up into pages (like RFCs), but not before the final release (Or convert to XML)] [HUGE should perhaps be integrated] [The XML spec should not be integrated. We just specify where there is XML. The XML format that limewire uses is not a part of Gnutella. Anyone can use any XML format.]