
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>The Banana Protocol</title>
<link href="stylesheet-unprocessed.css" type="text/css" rel="stylesheet" />
</head>

<body>
<h1>The Banana Protocol</h1>

<em>NOTE! This is all preliminary and is more an exercise in semiconscious
protocol design than anything else. Do not believe this document. This
sentence is lying. So there.</em>

<h2>Banana tokens</h2>

<p>At the lowest layer, the wire transport takes the form of Tokens. These
all take the shape of header/type-byte/body.</p>

<ul>
 <li>Header: zero or more bytes, all of which have the high bit clear (they
 range in value from 0 to 127). They form a little-endian base-128 number,
 so 1 is represented as 0x01, 128 is represented as 0x00 0x01, 130 as 0x02
 0x01, etc. 0 can be represented by any string of 0x00 bytes, including an
 empty string. The maximum legal header length is 64 bytes, so it has a
 maximum value of 2**(64*7)-1. Not all tokens have headers.</li>

 <li>Type Byte: the high bit is set to distinguish it from the header bytes
 that precede it (it has a value from 128 to 255). The Type Byte determines
 how to interpret both the header and the body. All valid type bytes are
 listed below.</li>

 <li>Body: zero or more arbitrary bytes, length is specified by the
 header. Not all tokens have bodies.</li>
</ul>

<p>Tokens are described below as [header-TOKEN-body], where either
<q>header</q> or <q>body</q> may be empty. For example, [len-LIST-empty]
indicates that the length is put into the header, <q>LIST</q> is the token
being used, and the body is empty.</p>
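
<p>To make the header encoding concrete, here is a small sketch (not part of
the Banana implementation itself; the function names are inventions for this
document) of how a header value is converted to and from its little-endian
base-128 form:</p>

<pre class="python">
def encode_header(value):
    # little-endian base-128: each byte carries 7 bits, high bit clear
    assert value >= 0
    chars = []
    while value:
        chars.append(chr(value % 128))
        value = value >> 7
    return "".join(chars)   # zero may be encoded as the empty string

def decode_header(data):
    # returns (value, header_length); data[header_length] is the type
    # byte (the first byte with the high bit set)
    value = 0
    for i in range(len(data)):
        b = ord(data[i])
        if b >= 0x80:
            return value, i
        value = value + b * (128 ** i)
    raise ValueError("ran out of bytes before finding a type byte")
</pre>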

<p>The possible Token types are:</p>

<ul>
  <li>
  <code>0x80: LIST (old): [len-LIST-empty]</code>

  <p>This token marks the beginning of a list with LEN elements. It acts as
  the <q>open parenthesis</q>, and the matching <q>close parenthesis</q> is
  implicit, based upon the length of the list. It will be followed by LEN
  things, which may be tokens like INTs or STRINGS, or which may be
  sublists. Banana keeps a list stack to handle nested sublists.</p>

  <p>This token (and the notion of length-prefixed lists in general) is from
  oldbanana. In newbanana it is only used during the initial dialect
  negotiation (so that oldbanana peers can be detected). Newbanana requires
  that LIST(old) tokens be followed exclusively by strings and have a rather
  limited allowable length (say, 640 dialects long).</p>
  </li>

  <li>
  <code>0x81: INT: [value-INT-empty]</code>

  <p>This token defines a single positive integer. The protocol defines its
  range as [0, 2**31), so the largest legal value is 2**31-1. The
  recipient is responsible for choosing an appropriate local type to hold the
  number. For Python, if the value represented by the incoming base-128
  digits grows larger than a regular Python IntType can accommodate, the
  receiving system will use a LongType or a BigNum as necessary.</p>

  <p>Anything larger than this range must be sent with a LONGINT token
  instead.</p>

  <p>(oldbanana compatibility note: a python implementation can accept
  anything in the range [0, 2**448), limited by the 64-byte maximum header
  size).</p>
  
  <p>The range was chosen to allow INT values to always fit in C's s_int32_t
  type, so an implementation that doesn't have a useful bignum type can
  simply reject LONGINT tokens.</p>
  </li>

  <li>
  <code>0x82: STRING [len-STRING-chars]</code>

  <p>This token defines a string. To be precise, it defines a sequence of
  bytes. The length is a base-128-encoded integer. The type byte is followed
  by LEN bytes of data which make up the string. LEN is required to be
  shorter than 640k: this is intended to reduce the amount of memory that
  can be consumed on the receiving end before user code gets to decide
  whether to accept the data or not.</p>
  </li>

  <li>
  <code>0x83: NEG: [value-NEG-empty]</code>

  <p>This token defines a negative integer. It is identical to the
  <code>INT</code> tag except that the results are negated before storage.
  The range is defined as [-2**31, 0), again to make an implementation using
  s_int32_t easier. Any numbers smaller (more negative) than this range must
  be sent with a LONGNEG token.</p>

  <p>Implementations should be tolerant when receiving a <q>negative zero</q>,
  turning it into 0, even though they should never send such a thing.</p>

  <p>Note that NEG can represent a number (-2**31) whose absolute value
  (2**31) is one larger than the greatest number that INT can represent
  (2**31-1).</p>
  </li>

  <li>
  <code>0x84: FLOAT [empty-FLOAT-value]</code>

  <p>This token defines a floating-point number. There is no header, and the
  type byte is followed by 8 bytes which are a 64-bit IEEE <q>double</q>, as
  defined by <code class="python">struct.pack("!d", num)</code>.</p>
  </li>

  <li>
  <p><code>0x85: OLDLONGINT: [value-OLDLONGINT-empty]</code></p>
  <p><code>0x86: OLDLONGNEG: [value-OLDLONGNEG-empty]</code></p>

  <p>These were used by oldbanana to represent large numbers. Their size was
  limited by the number of bytes in the header (max 64), so they can
  represent [0, 2**448).</p>
  </li>

  <li>
  <code>0x87: VOCAB: [index-VOCAB-empty]</code>
  
  <p>This defines a tokenized string. Banana keeps a mapping of common
  strings, each one is assigned a small integer. These strings can be sent
  compressed as a two-byte (index, VOCAB) sequence. They are delivered to
  Jelly as plain strings with no indication that they were compressed for
  transit.</p>

  <p>The strings in this mapping are populated by the sender when it sends a
  special <q>vocab</q> OPEN sequence. The intention is that this mapping
  will be sent just once when the connection is first established, but a
  sufficiently ambitious sender could use this to implement adaptive forward
  compression.</p>
  </li>

  <li>
  <p><code>0x88: OPEN: [[num]-OPEN-empty]</code></p>
  <p><code>0x89: CLOSE: [[num]-CLOSE-empty]</code></p>

  <p>These tokens are the newbanana parenthesis markers. They carry an
  optional number in their header: if present, the number counts the
  appearance of OPEN tokens in the stream, starting at 0 for the first OPEN
  used for a given connection and incrementing by 1 for each subsequent
  OPEN. The matching CLOSE token must contain an identical number. These
  numbers are solely for debugging and may be omitted. They may be removed
  from the protocol once development has been completed.</p>

  <p>In contrast to oldbanana (with the LIST token), newbanana does not use
  length-prefixed lists. Instead it relies upon the Banana layer to track
  OPEN/CLOSE tokens.</p>

  <p>OPEN markers are followed by the <q>Open Index</q> tuple: one or more
  tokens to indicate what kind of new sub-expression is being started. The
  first token must be a string (either STRING or VOCAB), the rest may be
  strings or other primitive tokens. The recipient decides when the Open
  Index has finished and the body has begun.</p>
  </li>

  <li>
  <p><code>0x8A: ABORT: [[num]-ABORT-empty]</code></p>

  <p>This token indicates that something has gone wrong on the sender side,
  and that the resulting object must not be handed upwards in the unslicer
  stack. It may be impossible or inconvenient for the sender to stop sending
  the tokens associated with the unfortunate object, so the receiver must be
  prepared to silently drop all further tokens up to the matching STOP
  marker. The STOP token must always follow eventually: this is just a
  courtesy notice.</p>

  <p>The number, if present, will be the same one used by the OPEN
  token.</p>
  </li>

  <li>
  <p><code>0x8B: LONGINT: [len-LONGINT-bytes]</code></p>
  <p><code>0x8C: LONGNEG: [len-LONGNEG-bytes]</code></p>

  <p>These are processed like STRING tokens, but the bytes form a base-256
  encoded number, most-significant-byte first (note that this may require
  several passes and some intermediate storage). The size is (barely) limited
  by the length field: with up to 2**448-1 body bytes, the theoretical range
  is [0, 256**(2**448-1)), but the receiver can impose whatever length limit
  they wish.</p>

  <p>LONGNEG is handled exactly like LONGINT but the number is negated
  first.</p>
  </li>

  <li>
  <p><code>0x8D: ERROR [len-ERROR-chars]</code></p>

  <p>This token defines a string of ASCII characters which hold an error
  message. When a severe protocol violation occurs, the offended side will
  emit an ERROR token and then close the transport. The side which receives
  the ERROR token should put the message in a developer-readable logfile and
  close the transport as well.</p>

  <p>The ERROR token is formatted exactly like the STRING token, except that
  it is defined to be encoded in ASCII (the STRING token does not claim to
  be encoded in any particular character set, nor does it necessarily
  represent human-readable characters).</p>

  <p>The ERROR token is limited to 1000 characters.</p>
  </li>

  <li>
  <p><code>0x8E: PING [[num]-PING-empty]</code></p>
  <p><code>0x8F: PONG [[num]-PONG-empty]</code></p>

  <p>These tokens have no semantic value, but are used to implement
  connection timeouts and keepalives. When one side receives a PING message,
  it should immediately queue a PONG message on the return stream. The
  optional number can be used to associate a PONG with the PING that prompted
  it: if present, it must be duplicated in the response.</p>

  <p>Other than generating a PONG, these tokens are ignored by both ends.
  They are not delivered to higher levels. They may appear in the middle of
  an OPEN sequence without affecting it.</p>

  <p>The intended use is that each side is configured with two timers: the
  idle timer and the disconnect timer. The idle timer specifies how long the
  inbound connection is allowed to remain quiet before poking it. If no data
  has been received for this long, a PING is sent to provoke some kind of
  traffic. The disconnect timer specifies how long the inbound connection is
  allowed to remain quiet before concluding that the other end is dead and
  thus terminating the connection.</p>
  <p>These messages can also be used to estimate the connection's round-trip
  time (including the depth of the transmit/receive queues at either end).
  Just send a PING with a unique number, and measure the time until the
  corresponding PONG is seen.</p>
  </li>

</ul>
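
<p>As a (non-normative) illustration of the framing described above, the
following sketch builds a few complete tokens using the type bytes from the
list and the <code>encode_header</code> helper shown earlier. The function
names and the <code>write</code> callable are inventions for the example:</p>

<pre class="python">
import struct

INT, STRING, FLOAT = chr(0x81), chr(0x82), chr(0x84)

def send_int(write, value):
    # [value-INT-empty]; the protocol restricts value to [0, 2**31)
    assert value >= 0 and not value >> 31
    write(encode_header(value) + INT)

def send_string(write, s):
    # [len-STRING-chars]; LEN must stay under 640k
    write(encode_header(len(s)) + STRING + s)

def send_float(write, num):
    # [empty-FLOAT-value]: no header, just 8 bytes of IEEE double
    write(FLOAT + struct.pack("!d", num))
</pre>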

<p>TODO: Add TRUE, FALSE, and NONE tokens. (maybe? These are currently
handled as OPEN sequences)</p>


<h2>Serialization</h2>

<p>When serializing an object, it is useful to view it as a directed graph.
The root object is the one you start with, any objects it refers to are
children of that root. Those children may point back to other objects that
have already been serialized, or which will be serialized later.</p>

<p>Banana, like pickle and other serialization schemes, does a depth-first
traversal of this graph. Serialization is begun on each node before going
down into the child nodes. Banana tracks previously-handled nodes and
replaces them with numbered <code>reference</code> tokens to break loops in
the graph.</p>


<h3>Banana Slicers</h3>

<p>A <em>Banana Slicer</em> is responsible for serializing a single user
object: it <q>slices</q> that object into a series of smaller pieces, either
fundamental Banana tokens or other Sliceable objects. On the receiving end,
there is a corresponding <em>Banana Unslicer</em> which accepts the incoming
tokens and re-creates the user object. There are different kinds of Slicers
and Unslicers for lists, tuples, dictionaries, etc. Classes can provide
their own Slicers if they want more control over the serialization
process.</p>

<p>In general, there is a Slicer object for each act of serialization of a
given object (although this is not strictly necessary). This allows the
Slicer to contain state about the serialization process, which enables
producer/consumer -style pauses, and slicer-controlled streaming
serialization. The entire context is stored in a small tuple (which includes
the Slicer), so it can be set aside for a while. In the future, this will
allow interleaved serialization of multiple objects (doing context switching
on the wire), to do things like priority queues and avoid head-of-line
blocking.</p>

<p>The most common pattern is to have the Slicer be the <code>ISlicer</code>
Adapter for the object, in which case it gets a new Slicer each time it is
serialized. Classes which do not need to store a lot of state can instead
use a single long-lived Slicer per object, presumably through some adapter
tricks. It is also valid to have the serialized object be its own
Slicer.</p>

<p>The Slicer has other duties (described below), but the main one is to
implement the <code>slice</code> method, which should return a sequence or
an iterable which yields the Open Index Tokens, followed by the body tokens.
(Note that the Slicer should not include the OPEN or CLOSE tokens: those are
supplied by the SendBanana wrapping code). Any item which is a fundamental
type (int, string, float) will be sent as a banana token, anything else will
be handled by recursion (with a new Slicer).</p>

<p>Most subclasses of <code>BaseSlicer</code> implement a companion method
named <code>sliceBody</code>, which supplies just the body tokens. (This
makes the code a bit easier to follow). <code>sliceBody</code> is usually
just a <q>return [token, token]</q>, or a series of <code>yield</code>
statements, one per token. However, classes which wish to have more control
over the process can implement <code>sliceBody</code> or even
<code>slice</code> differently.</p>



<pre class="python">
class ThingySlicer(slicer.BaseSlicer):
    opentype = ('thingy',)
    trackReferences = True

    def sliceBody(self, streamable, banana):
        return [self.obj.attr1, self.obj.attr2]
</pre>

<p>If <q>attr1</q> and <q>attr2</q> are integers, the preceding Slicer would
create a token sequence like: OPEN STRING(thingy) 13 16 CLOSE. If
<q>attr2</q> were actually another Thingy instance, it might produce OPEN
STRING(thingy) 13 OPEN STRING(thingy) 19 18 CLOSE CLOSE. </p>

<p>Doing this with a generator gives the same basic results but avoids the
temporary buffer, which can be important when sending large amounts of data.
The following Slicer could be combined with a concatenating Unslicer to
implement the old FilePager class without the extra round-trip
inefficiencies.</p>

<pre class="python">
class DemandSlicer(slicer.BaseSlicer):
    opentype = ('demandy',)
    trackReferences = True

    def sliceBody(self, streamable, banana):
        # read the file a chunk at a time, emitting each chunk as a
        # separate string token
        f = open("data", "rb")
        while True:
            chunk = f.read(2048)
            if not chunk:
                break
            yield chunk
        f.close()
</pre>

<p>The SendBanana code controls the pacing: if the transport is full, it has
the option of pausing the generator until the receiving end has caught up.
It also has the option of pulling tokens out of the Slicer anyway, and
buffering them in memory. This may be necessary to achieve serialization
coherency, discussed below.</p>

<p>If the <q>streamable</q> flag is set, then the <em>slicer</em> gets to
control the pacing too: it is allowed to yield a Deferred where it would
normally provide a regular token. This tells Banana that serialization needs
to wait for a while (perhaps we are streaming data from another source which
has run dry, or we are trying to implement some kind of rate limiting).
Banana will wait until the Deferred fires before attempting to retrieve
another token. If the <q>streamable</q> flag is <em>not</em> set, then a
parent Slicer has decided that it is unwilling to allow streaming (perhaps
it needs to serialize a coherent state, and a pause for streaming would
allow that state to change before it was completely serialized). The Slicer
is not allowed to return a Deferred when streaming is disabled.</p>

<pre class="python">
class URLGetterSlicer(slicer.BaseSlicer):
    opentype = ('urldata',)
    trackReferences = True

    def gotPage(self, page):
        self.page = page

    def sliceBody(self, streamable, banana):
        yield self.url
        d = client.getPage(self.url)  # assumes: from twisted.web import client
        d.addCallback(self.gotPage)
        yield d
        # here we hover in limbo until it fires
        yield self.page
</pre>

<p>(the code is a bit kludgy because generators have no way to pass data
back out of the <q>yield</q> statement).</p>

<p>The Slicer can also raise a <q>Violation</q> exception, in which case the
slicer will be aborted: no further tokens will be pulled from it. This
causes an ABORT token to be sent over the wire, followed immediately by a
CLOSE token. The dead Slicer's parent is notified with a
<code>childAborted</code> method, then the Banana continues to extract
tokens from the parent as if the child had finished normally. (TODO: we need
a convenient way for the parent to indicate that it wishes to give up too,
such as raising a Violation from within <code>childAborted</code>).</p>
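
<p>For example, a Slicer which discovers mid-stream that its object should
not be sent can bail out this way. This is only a sketch: the opentype, the
size limit, and the availability of the <code>Violation</code> exception in
the enclosing namespace are assumptions:</p>

<pre class="python">
class GuardedSlicer(slicer.BaseSlicer):
    opentype = ('guarded',)
    trackReferences = True

    def sliceBody(self, streamable, banana):
        yield self.obj.name
        if len(self.obj.data) > 1000000:
            # aborts this Slicer: Banana emits ABORT then CLOSE, and the
            # parent is notified through childAborted
            raise Violation("refusing to send oversized payload")
        yield self.obj.data
</pre>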


<h3>Serialization Coherency</h3>

<p>Streaming serialization means the object is serialized a little bit at a
time, never consuming too much memory at once. The tradeoff is that, by
doing other useful work in between, our object may change state while it is
being serialized. In oldbanana this process was uninterruptible, so
coherency was not an issue. In newbanana it is optional. Some objects may
have more trouble with this than others, so Banana provides Slicers with a
means to influence the process.</p>

<p>Banana makes certain promises about what takes place between successive
<q>yield</q> statements, when the Slicer gives up control to Banana. The
most conservative approach is to:</p>

<ul>
  <li>disable the RootSlicer's <q>streamable</q> flag to tell all Slicers
  that they should not return Deferreds: this avoids loss of control due
  to child Slicers giving it away</li>

  <li>set the SendBanana policy to buffer data in memory rather than do a
  .pauseProducing: this removes pauses due to the output channel filling
  up</li>

  <li>return a list from <code>slice</code> (or <code>sliceBody</code>)
  instead of using a generator: this fixes the object contents at a single
  point in time. (you can also create a list at the beginning of that
  routine and then yield pieces of it, which has exactly the same
  effect)</li>
</ul>
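
<p>The last point can be illustrated with a sketch (the opentype and the
<code>.items</code> attribute are invented for the example): the object's
state is copied in a single step, so later mutations cannot leak into the
token stream.</p>

<pre class="python">
class SnapshotSlicer(slicer.BaseSlicer):
    opentype = ('snapshot',)
    trackReferences = True

    def sliceBody(self, streamable, banana):
        # capture the state at one point in time, then yield the copy
        state = list(self.obj.items)
        for item in state:
            yield item
</pre>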

<p>Slicers aren't supposed to do anything which changes the state observed
by other Slicers: if this is really the case then it is safe to use a
generator. A parent Slicer which yields a non-primitive object will give up
control to the child Slicer needed to handle that object, but that child
should do its business and finish quickly, so there should be no way for the
parent object's state to change in the meantime. </p>

<p>If the SendBanana is allowed to give up control (.pauseProducing), then
arbitrary code will get to run in between <q>yield</q> calls, possibly
changing the state being accessed by those yields. Likewise child Slicers
might give up control, threatening the coherency of one of their parents.
Slicers can invoke <code>banana.inhibitStreaming()</code> (TODO: need a
better name) to inhibit streaming, which will cause all child serialization
to occur immediately, buffering as much data in memory as necessary to
complete the operation without giving up control.</p>

<p>Coherency issues are a new area for Banana, so expect new tools and
techniques to be developed which allow the programmer to make sensible
tradeoffs.</p>



<h3>The Slicer Stack</h3>

<!-- directions are inconsistent: the RootSlicer is the parent, but lives at
the bottom of the stack. I think of delegation as going "upwards" to your
parent (like upcalls), so I describe it that way, but that "up" is at odds
with the stack's "bottom" -->

<p>The serialization context is stored in a <q>SendBanana</q> object, which
is one of the two halves of the Banana object (a subclass of Protocol). This
holds a stack of Banana Slicers, one per object currently being serialized
(i.e. one per node in the path from the root object to the object currently
being serialized).</p>

<p>For example, suppose a class instance is being serialized, and this class
chose to use a dictionary to hold its instance state. That dictionary holds
a list of numbers in one of its values. While the list of numbers is being
serialized, the Slicer Stack would hold: the RootSlicer, an InstanceSlicer,
a DictSlicer, and finally a ListSlicer.</p>

<p>The stack is used to determine two things:</p>

<ul>
  <li> How to handle a child object: which Slicer should be used, or if a
       Violation should be raised</li>
  <li> How to track object references, to break cycles in the object graph</li>
</ul>

<p>When a new object needs to be sent, it is first submitted to the top-most
Slicer (to its <code>slicerForObject</code> method), which is responsible
for either returning a suitable Slicer or raising a Violation exception (if
the object is rejected by a security policy). Most Slicers will just
delegate this method up to the RootSlicer, but Slicers which wish to pass
judgement upon enclosed objects (or modify the Slicer selected) can do
something else. Unserializable objects will raise an exception here.</p>
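
<p>A sketch of a Slicer which passes judgement on its children by overriding
this method (the class names and the delegation through
<code>self.parent</code> are assumptions about the surrounding code):</p>

<pre class="python">
class NoSecretsSlicer(slicer.BaseSlicer):
    opentype = ('nosecrets',)
    trackReferences = True

    def slicerForObject(self, obj):
        # veto one particular child type; everything else is delegated
        # upwards, eventually reaching the RootSlicer
        if isinstance(obj, SecretThing):   # hypothetical class
            raise Violation("SecretThing may not be serialized here")
        return self.parent.slicerForObject(obj)
</pre>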

<p>Once the new Slicer is obtained, the OPEN token is emitted, which
provides the <q>openID</q> number (just an implicit count of how many OPEN
tokens have been sent over the wire). This is where we break cycles in the
object graph: before serializing the object, we record a reference to it
(the openID), and any time we encounter the object again, we send the
reference number instead of a new copy. This reference number is tracked in
the SlicerStack, by handing the number/object pair to the top-most Slicer's
<code>registerReference</code> method. Most Slicers will delegate this up to
the RootSlicer, but again they can perform additional registrations or
consume the request entirely. This is used in PB to provide <q>scoped
references</q>, where (for example) a list <em>should</em> be sent twice if
it occurs in two separate method calls. In this case the CallSlicer (which
sits above the PBRootSlicer) does its own registration.</p>

<p>The <code>slicerForObject</code> process is responsible for catching the
second time the object is sent. It looks in the same mapping created by
<code>registerReference</code> and returns a <code>ReferenceSlicer</code>
instead of the usual one.</p>
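
<p>For example (a sketch of the resulting token stream; the exact
<q>reference</q> opentype name is an assumption), a list that appears twice
in the same serialization is only sent once:</p>

<pre class="python">
shared = ["a", "b"]
obj = [shared, shared]
# token stream, roughly:
#   OPEN(0) 'list'
#     OPEN(1) 'list' STRING(a) STRING(b) CLOSE(1)
#     OPEN(2) 'reference' INT(1) CLOSE(2)     # points back at openID 1
#   CLOSE(0)
</pre>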

<p>The <code>RootSlicer</code>, which sits at the bottom of the stack, is a
special case. It is never pushed or popped, and implements most of the
policy for the whole Banana process. The RootSlicer can also be interpreted
as a <q>root object</q>, if you imagine that any given user object being
serialized is somehow a child of the overall serialization context. In PB,
for example, the root object would be related to the connection and needs to
track things like which remotely-invokable objects are available.</p>

<p>The default RootSlicer implements the following behavior:</p>

<ul>
  <li>Allow any object which can be serialized to be sent</li>

  <li>Use its <code>.slicerTable</code> to get a Slicer for an object. If
  that fails, adapt the object to ISlicer</li>

  <li>Record object references in its <code>.references</code> dict</li>
</ul>

<p>The <code>RootSlicer</code> class only does <q>safe</q> serialization:
basic types and whatever you've registered an ISlicer adapter for. The
<code>TrustingRootSlicer</code> uses that .slicerTable mapping to serialize
unsafe things (arbitrary instances, classes, etc), which is suitable for
local storage instead of network communication (i.e. when you want to use
banana as a pickle replacement).</p>

<p>TODO: The idea is to let other serialization contexts do other things.
For example, the final tokens could go to the parent Slicer for handling
instead of straight to the Protocol, which would provide more control over
turning the tokens into bytes and sending over a wire, saving to a file,
etc.</p>

<p>Finally, the stack can be queried to find out what path leads from the
root object to the one currently being serialized. If something goes wrong
in the serialization process (an exception is thrown), this path can make it
much easier to find out <em>when</em> the trouble happened, as opposed to
merely where. Knowing that the <q>.oops</q> method of your FooObject failed
during serialization isn't very useful when you have 500 FooObjects inside
your data structure and you need to know whether it was
<code>bar.thisfoo</code> or <code>bar.thatfoo</code> which caused the
problem. To this end, each Slicer has a <code>.describe</code> method which
is supposed to return a short string that explains how to get to the child
node currently being processed. When an error occurs, these strings are
concatenated together and put into the failure object.</p>



<h2>Deserialization</h2>

<p>The other half of the Banana class is the <code>ReceiveBanana</code>,
which accepts incoming tokens and turns them into objects. It is organized
just like the <code>SendBanana</code>, with a stack of <q>Banana
Unslicer</q> objects, each of which assembles tokens or child objects into a
larger one. Each Unslicer receives the tokens emitted by the matching Slicer
on the sending side. The whole stack is used to create new Unslicers,
enforce restrictions upon what objects will be accepted, and manage object
references.</p>

<p>Each Unslicer accepts tokens that turn into an object of some sort. They
pass this object up to their parent Unslicer. Eventually a finished object
is given to the <code>RootUnslicer</code>, which decides what to do with it.
When the Banana is being used for data storage (like pickle), the root will
just deliver the object to the caller. When Banana is used in PB, the actual
work is done by some intermediate objects like the
<code>CallUnslicer</code>, which is responsible for a single method
invocation.</p>

<p>The <code>ReceiveBanana</code> itself is responsible for pulling
well-formed tokens off the incoming data stream, tracking OPEN and CLOSE
tokens, maintaining synchronization with the transmitted token stream, and
discarding tokens when the receiving Unslicers have rejected one of the
inbound objects. Unslicer methods may raise Violation exceptions: these are
caught by the Unbanana and cause the object currently being unserialized to
fail: its parent gets an UnbananaFailure instead of the dict or list or
instance that it would normally have received.</p>

<p>OPEN tokens are followed by a short list of tokens called the
<q>opentype</q> to indicate what kind of object is being started. This is
looked up in the UnbananaRegistry just like object types are looked up in
the BananaRegistry (TODO: need sensible adapter-based registration scheme
for unslicing). The new Unslicer is pushed onto the stack.</p>

<p><q>ABORT</q> tokens indicate that something went wrong on the sending
side and that the current object is to be aborted. It causes the receiver to
discard all tokens until the CLOSE token which closes the current node. This
is implemented with a simple counter of how many levels of discarding we
have left to do.</p>

<p><q>CLOSE</q> tokens finish the current node. The Unslicer will pass its
completed object up to the <q>receiveChild</q> method of its parent.</p>

<h3>Open Index tokens: the Opentype</h3>

<p>OPEN tokens are followed by an arbitrary list of other tokens which are
used to determine which UnslicerFactory should be invoked to create the new
Unslicer. Basic Python types are designated with a simple string, like (OPEN
<q>list</q>) or (OPEN <q>dict</q>), but instances are serialized with two
strings (OPEN <q>instance</q> <q>classname</q>), and various exotic PB
objects like method calls may involve a list of strings and numbers (OPEN
<q>call</q> reqID objID methodname). The unbanana code works with the
unslicer stack to apply constraints to these indexing tokens and finally
obtain the new Unslicer when enough indexing tokens have been received.</p>

<p>The reason for assembling this <q>opentype</q> list before creating the
Unslicer (instead of using a generic InstanceUnslicer which switches
behavior depending upon its first received token) is to support classes or
PB methods which wish to push custom Unslicers to handle their
deserialization process. For example, a class could push a
StreamingFileUnslicer that accepts a series of string tokens and appends
their contents to a file on disk. This Unslicer could reduce memory
consumption (by only holding one chunk at a time) and update some kind of
progress indicator as the data arrives. This particular feature was provided
by the old StringPager utility, but custom Unslicers offer more flexibility
and better efficiency (no additional round-trips).</p>

<p>(note: none of this affects the serialization side: those Slicers emit
both their indexing tokens and their state tokens. It is only the receiving
side where the index tokens are handled by a different piece of code than
the content tokens).</p>

<p>In yet greater detail:</p>

<ul>

  <li>Each OPEN sequence is divided into an <q>Index phase</q> and a
  <q>Contents phase</q>. The first one (or two or three) tokens are the
  Index Tokens and the rest are the Body Tokens. The sequence ends with a
  CLOSE token.</li>

  <li>Banana.inOpen is a boolean which indicates that we are in the Index
  Phase. It is set to True when the OPEN token is received and returns to
  False after the new Unslicer has been pushed.</li>

  <li>Banana.opentype is a list of Index Tokens that are being accumulated.
  It is cleared each time .inOpen is set to True. The tuple form of opentype
  is passed to Unslicer.doOpen, Constraint.checkOpentype, and used as a key in
  the RootUnslicer.openRegistry dictionary. Each Unslicer type is indexed by
  an opentype tuple.</li>

</ul>

<p>If .inOpen is True, each new token type will be passed (through
Banana.getLimit and top.openerCheckToken) to the opener's .openerCheckToken
method, along with the current opentype tuple. The opener gets to decide if
the token is acceptable (possibly raising a Violation exception). Note that
the opener does not maintain state about what phase the decoding process is
in, so it may want to condition its response upon the length of the
opentype.</p>

<p>After each index token is complete, it is appended to .opentype, then the
list is passed (through Banana.handleOpen, top.doOpen, and top.open) to the
opener's .open method. This can either return an Unslicer (which will finish
the index phase: all further tokens will be sent to the new Unslicer),
return None (to continue the index phase), raise a Violation (which causes
an UnbananaFailure to be passed to the current top unslicer), or raise
another exception (which causes the connection to be abandoned).</p>
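
<p>A sketch of what an opener's <code>.open</code> method might look like
(the opentypes, the Unslicer classes, and the exact base class are
illustrative assumptions):</p>

<pre class="python">
class ExampleRootUnslicer(slicer.RootUnslicer):
    def open(self, opentype):
        if opentype == ('list',):
            return ListUnslicer()       # enough index tokens: end the phase
        if opentype == ('instance',):
            return None                 # wait for the classname token
        if opentype[0] == 'instance' and len(opentype) == 2:
            return InstanceUnslicer(opentype[1])
        raise Violation("unknown opentype %s" % (opentype,))
</pre>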

<h3>Unslicer Lifecycle</h3>

<p>Each Unslicer has access to the following attributes:</p>

<ul>
  <li><code>.parent</code>: This is set by the ReceiveBanana before
  <code>.start</code> is invoked, and provides a reference to the Unslicer
  responsible for the containing object. You can follow <code>.parent</code>
  all the way up the object graph to the single <code>RootUnslicer</code>
  object for this connection. It is appropriate to invoke
  <code>openerCheckToken</code> and <code>open</code> on your parent.</li>

  <li><code>.protocol</code>: This is set by the ReceiveBanana before
  <code>.start</code> is invoked, and provides access to the Banana object
  which maintains the connection on which this object is being received. It
  is appropriate to examine the <code>.debugReceive</code> attribute on the
  protocol. It is also appropriate to invoke <code>.setObject</code> on it
  to register references for shared containers (like lists).</li>

  <li><code>.openCount</code>: This is set by the ReceiveBanana before
  <code>.start</code> is invoked, and contains the optional OPEN-count for
  this object, an implicit sequence number incremented for each OPEN token
  seen on the wire. During protocol development and testing the OPEN tokens
  may include an explicit OPEN-count value, but usually it is left out of
  the packet. If present, it is used by Banana.handleClose to assert that
  the CLOSE token is associated with the right OPEN token. Unslicers will
  not normally have a use for it.</li>

  <li><code>.count</code>: This is provided as the <q>count</q> argument to
  <code>.start</code>, and contains the <q>object counter</q> for this
  object. This is incremented for each new object which is created by the
  receive Banana code. This is similar to (but not always the same as) the
  OPEN-count. Containers should call <code>self.protocol.setObject</code> to
  register a Deferred during <code>start</code>, then call it again in
  <code>receiveClose</code> with the real (finished) object. It is sometimes
  also included in a debug message.</li>

  <li><code>.broker</code>: PB objects are given .broker, which is exactly
  equal to the .protocol attribute. The synonym exists because it makes
  several PB routines easier to read.</li>

</ul>

<p>Each Unslicer handles a single <q>OPEN sequence</q>, which starts with an
OPEN token and ends with a CLOSE token.</p>


<h4>Creation</h4>

<p>Acceptance of the OPEN token simply sets a flag to indicate that we are
in the Index Phase. (The OPEN token might not be accepted: it is submitted
to checkToken for approval first, as described below). During the Index
Phase, all tokens are appended to the current <code>opentype</code> list and
handed as a tuple to the top-most Unslicer's <code>doOpen</code> method.
This method can do one of the following things:</p>

<ul>
  <li>Return a new Unslicer object. It does this when there are enough index
  tokens to specify a new Unslicer. The new child is pushed on top of the
  Unslicer stack (Banana.receiveStack) and initialized by calling the
  <code>start</code> method described below. This ends the Index Phase.</li>

  <li>Return None. This indicates that more index tokens are required. The
  Banana protocol object simply remains in the Index Phase and continues to
  accumulate index tokens.</li>

  <li>Raise a Violation. If the open type is unrecognized, then a Violation
  is a good way to indicate it.</li>
</ul>

<p>When a new Unslicer object is pushed on the top of the stack, it has its
<code>.start</code> method called, in which it has an opportunity to create
whatever internal state is necessary to record the incoming content tokens.
Each created object will have a separate Unslicer instance. The start method
can run normally, or raise a Violation exception.</p>

<p><code>.start</code> is distinct from the Unslicer's constructor function
to minimize the parameter-passing requirements for doOpen() and friends. It
is also conceivable that keeping arguments out of <code>__init__</code>
would make it easier to use adapters in this context, although it is not
clear why that might be useful on the Unslicing side. TODO: consider merging
<code>.start</code> into the constructor.</p>

<p>This Unslicer is responsible for all incoming tokens until either 1: it
pushes a new one on the stack, or 2: it receives a CLOSE token.</p>


<h4>checkToken</h4>

<p>Each token starts with a length sequence, up to 64 bytes which are turned
into an integer. This is followed by a single type byte, distinguished from
the length bytes by having the high bit set (the type byte is always 0x80 or
greater). When the typebyte is received, the topmost Unslicer is asked about
its suitability by calling the <code>.checkToken</code> method. (note that
CLOSE and ABORT tokens are always legal, and are not submitted to
checkToken). Both the typebyte and the header's numeric value are passed to
this method, which is expected to do one of the following:</p>

<ul>
  <li>Return None to indicate that the token and the header value are
  acceptable.</li>

  <li>Raise a <code>Violation</code> exception to reject the token or the
  header value. This will cause the remainder of the current OPEN sequence
  to be discarded (all tokens through the matching CLOSE token). Unslicers
  should raise this if their constraints will not accept the incoming
  object: for example a constraint which is expecting a series of integers
  can accept INT/NEG/LONGINT/LONGNEG tokens and reject
  OPEN/STRING/VOCAB/FLOAT tokens. They should also raise this if the header
  indicates, e.g., a STRING which is longer than the constraint is willing
  to accept, or a LONGINT/LONGNEG which is too large. The topmost Unslicer
  (the same one which raised Violation) will receive (through its
  <code>.receiveChild</code> method) an UnbananaFailure object which
  encapsulates the reason for the rejection.</li>
</ul>

<p>If the token sequence is in the <q>index phase</q> (i.e. it is just after
an OPEN token and a new Unslicer has not yet been pushed), then instead of
<code>.checkToken</code> the top unslicer is sent
<code>.openerCheckToken</code>. This method behaves just like checkToken,
but in addition to the type byte it is also given the opentype list (which
is built out of all the index tokens received during this index phase).</p>

<h4>receiveChild</h4>

<p>If the type byte is accepted, and the size limit is obeyed, then the rest
of the token is read and a finished (primitive) object is created: a string
or number (TODO: maybe add boolean and None). This object is handed to the
topmost Unslicer's <code>.receiveChild</code> method, where again it has
a few options:</p>

<ul>
  <li>Run normally: if the object is acceptable, it should append or record
  it somehow.</li>

  <li>Raise Violation, just like checkToken.</li>

  <li>Invoke <code>self.abort</code>, which does
  <code>protocol.abandonUnslicer</code>.</li>
</ul>

<p>If the child is handed an UnbananaFailure object, and it wishes to pass
it upwards to its parent, then <code>self.abort</code> is the appropriate
thing to do. Raising a Violation will accomplish the same thing, but with a
new UnbananaFailure that describes the exception raised here instead of the
one raised by a child object. It is bad to both call <code>abort</code> and
raise an exception.</p>

<h4>Finishing</h4>

<p>When the CLOSE token arrives, the Unslicer will have its
<code>.receiveClose</code> method called. This method is expected to do one
of the following:</p>

<ul>
  <li>Return an object: this object is the finished result of the
  deserialization process. It will be passed to <code>.receiveChild</code>
  of the parent Unslicer.</li>

  <li>Return a Deferred: this indicates that the object cannot be created
  yet (tuples that contain references to an enclosing tuple, for example).
  The Deferred will be fired (with the object) when it completes.</li>

  <li>Raise Violation</li>
</ul>

<p>After receiveClose has finished, the child is told to clean up by calling
its <code>.finish</code> method. This can complete normally or raise a
Violation.</p>

<p>Then, the old top-most Unslicer is popped from the stack and discarded.
Its parent is now the new top-most Unslicer, and the newly-unserialized
object is given to it with the <code>.receiveChild</code> method. Note that
this method is used to deliver both primitive objects (from raw tokens)
<em>and</em> composite objects (from other Unslicers).</p>
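
<p>A minimal Unslicer which ties these lifecycle methods together might look
like the following sketch (the <q>thingy</q> opentype, the token constants,
and the <code>Thingy</code> class are assumptions for the example):</p>

<pre class="python">
class ThingyUnslicer(slicer.BaseUnslicer):
    opentype = ('thingy',)

    def start(self, count):
        self.attrs = []

    def checkToken(self, typebyte, size):
        # only integer attribute tokens are acceptable
        if typebyte not in (tokens.INT, tokens.NEG):
            raise Violation("Thingy attributes must be integers")

    def receiveChild(self, obj):
        self.attrs.append(obj)

    def receiveClose(self):
        # the finished object is handed to our parent's receiveChild
        return Thingy(*self.attrs)
</pre>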


<h3>Error Handling</h3>

<p>Schemas are enforced by Constraint objects which are given an opportunity
to pass judgement on each incoming token. When they do not like something
they are given, they respond by raising a <code>Violation</code> exception.
The Violation exception is sometimes created with an argument that describes
the reason for the rejection, but frequently it is just a bare exception.
Most Violations are raised by the <code>checkOpentype</code> and
<code>checkObject</code> methods of the various classes in
<code>schema.py</code>.</p>

<p>Violations which occur in an Unslicer can be confined to a single
sub-tree of the object graph. The object being deserialized (and all of its
children) is abandoned, and all remaining tokens for that object are
discarded. However, the parent object (to which the abandoned object would
have been given) gets to decide what happens next: it can either fail
itself, or absorb the failure (much like an exception handler can choose to
re-raise the exception or eat it).</p>

<p>When a Violation occurs, it is wrapped in an <code>UnbananaFailure</code>
object (just like Deferreds wrap exceptions in Failure objects). The
UnbananaFailure behaves like a regular
<code>twisted.python.failure.Failure</code> object, except that it has an
attribute named <code>.where</code> which indicates the object-graph pathname
where the problem occurred.</p>

<p>The Unslicer which caused the Violation is given a chance to do cleanup
or error-reporting by invoking its <code>reportViolation</code> method. It
is given the UnbananaFailure so it can modify or copy it. The default
implementation simply returns the UnbananaFailure
it was given, but it is also allowed to return a different one. It must
return an UnbananaFailure: it cannot ignore the Violation by returning None.
This method should not raise any exceptions: doing so will cause the
connection to be dropped.</p>

<p>The UnbananaFailure returned by <code>reportViolation</code> is passed up
the Unslicer stack in lieu of an actual object. Most Unslicers have code in
their <code>receiveChild</code> methods to detect an UnbananaFailure and
trigger an abort (<code>propagateUnbananaFailures</code>), which causes all
further tokens of the sub-tree to be discarded. The connection is not
dropped. Unslicers which partition their children's sub-graphs (like the
PBRootUnslicer, for which each child is a separate operation) can simply
ignore the UnbananaFailure, or respond to it by sending an error message to
the other end.</p>

<p>Other exceptions may occur during deserialization. These indicate coding
errors or severe protocol violations and cause the connection to be dropped
(they are not caught by the Banana code and thus propagate all the way up to
the reactor, which drops the socket). The exception is logged on the local
side with <code>log.err</code>, but the remote end will not be told any
reason for the disconnection. The banana code uses the BananaError exception
to indicate protocol violations, but others may be encountered.</p>

<p>The Banana object can also choose to respond to Violations by terminating
the connection. For example, the <code>.hangupOnLengthViolation</code> flag
causes string-too-long violations to be raised directly instead of being
handled, which will cause the connection to be dropped (as it occurs in the
dataReceived method).</p>


<h3>Example</h3>

<p>The serialized form of <code class="python">["foo",(1,2)]</code> is the
following token sequence: OPEN STRING(list) STRING(foo) OPEN STRING(tuple)
INT(1) INT(2) CLOSE CLOSE. In practice, the STRING(list) would really be
something like VOCAB(7), likewise the STRING(tuple) might be VOCAB(8). Here
we walk through how this sequence is processed.</p>

<p>The initial Unslicer stack consists of the single RootUnslicer
<code>rootun</code>.</p>

<pre>
OPEN
  rootun.checkToken(OPEN) : must not raise Violation
  enter index phase

VOCAB(7)  (equivalent to STRING(list))
  rootun.openerCheckToken(VOCAB, ()) : must not raise Violation
  VOCAB token is looked up in .incomingVocabulary, turned into "list"
  rootun.doOpen(("list",)) : looks in UnslicerRegistry, returns ListUnslicer
  exit index phase
  the ListUnslicer is pushed on the stack
  listun.start()

STRING(foo)
  listun.checkToken(STRING, 3) : must return None
  string is assembled
  listun.receiveChild("foo") : appends to list

OPEN
  listun.checkToken(OPEN) : must not raise Violation
  enter index phase

VOCAB(8)  (equivalent to STRING(tuple))
  listun.openerCheckToken(VOCAB, ()) : must not raise Violation
  VOCAB token is looked up, turned into "tuple"
  listun.doOpen(("tuple",)) : delegates through:
                                 BaseUnslicer.open
                                 self.opener (usually the RootUnslicer)
                                 self.opener.open(("tuple",))
                              returns TupleUnslicer
  exit index phase
  TupleUnslicer is pushed on the stack
  tupleun.start()

INT(1)
  tupleun.checkToken(INT) : must not raise Violation
  integer is assembled
  tupleun.receiveChild(1) : appends to list

INT(2)
  tupleun.checkToken(INT) : must not raise Violation
  integer is assembled
  tupleun.receiveChild(2) : appends to list

CLOSE
  tupleun.receiveClose() : creates and returns the tuple (1,2)
                           (could also return a Deferred)
  TupleUnslicer is popped from the stack and discarded
  listun.receiveChild((1,2))

CLOSE
  listun.receiveClose() : creates and returns the list ["foo", (1,2)]
  ListUnslicer is popped from the stack and discarded
  rootun.receiveChild(["foo", (1,2)])
</pre>


<h2>Other Issues</h2>


<h3>Deferred Object Recreation: The Trouble With Tuples</h3>

<p>Types and classes are roughly classified into containers and
non-containers. The containers are further divided into mutable and
immutable. Some examples of immutable containers are tuples and bound
methods. Lists and dicts are mutable containers. Ints and strings are
non-containers. Non-containers are always leaf nodes in the object
graph.</p>

<p>During unserialization, objects are in one of three states: uncreated,
referenceable (but not complete), and complete. Only mutable containers can
be referenceable but not complete: immutable containers have no intermediate
referenceable state.</p>

<p>Mutable containers (like lists) are referenceable but not complete during
traversal of their child nodes. This means those children can reference the
list without trouble.</p>

<p>Immutable containers (like tuples) present challenges when unserializing.
The object cannot be created until all its components are referenceable.
While it is guaranteed that these component objects will be complete before
the graph traversal exits the current node, the child nodes are allowed to
reference the current node during that traversal. The classic example is the
graph created by the following Python fragment:</p>

<pre class="python">
a = ([],)
a[0].append((a,))
</pre>

<p>To handle these cases, the TupleUnslicer installs a Deferred into the
object table when it begins unserializing (in the .start method). When the
tuple is finally complete, the object table is updated and the Deferred is
fired with the new tuple.</p>

<p>Containers (both mutable and immutable) are required to pay attention to
the types of their incoming children and notice when they receive Deferreds
instead of normal objects. These containers are not complete (in the sense
described above) until those Deferreds have been replaced with referenceable
objects. When the container receives the Deferred, it should attach a
callback to it which will perform the replacement. In addition, immutable
containers should check after each update to see if all the Deferreds have
been cleared, and if so, complete their own object (and fire their own
Deferreds so any containers <em>they</em> are a child of may be updated
and/or completed).</p>
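
<p>A sketch of how a TupleUnslicer might implement this scheme, using the
<code>self.protocol.setObject</code> registration described earlier (this is
illustrative only, not the actual implementation):</p>

<pre class="python">
from twisted.internet.defer import Deferred

class TupleUnslicer(slicer.BaseUnslicer):
    opentype = ('tuple',)

    def start(self, count):
        self.count = count
        self.list = []
        self.finished = False
        self.waiting = 0
        # register a Deferred so children that refer back to this tuple
        # can be given something now and the real tuple later
        self.deferred = Deferred()
        self.protocol.setObject(count, self.deferred)

    def receiveChild(self, obj):
        if isinstance(obj, Deferred):
            # an incomplete child (e.g. an inner tuple): substitute it
            # into our list when it finally becomes available
            self.waiting += 1
            obj.addCallback(self.update, len(self.list))
        self.list.append(obj)

    def update(self, child, index):
        self.list[index] = child
        self.waiting -= 1
        if self.finished and not self.waiting:
            self.complete()
        return child

    def receiveClose(self):
        self.finished = True
        if self.waiting:
            return self.deferred    # tell our parent to wait too
        return self.complete()

    def complete(self):
        t = tuple(self.list)
        self.protocol.setObject(self.count, t)
        self.deferred.callback(t)
        return t
</pre>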

<p>TODO: it would be really handy to have the RootUnslicer do Deferred
Accounting: each time a Deferred is installed instead of a real object, add
its graph-path to a list. When the Deferred fires and the object becomes
available, remove it. If deserialization completes and there are still
Deferreds hanging around, flag an error that points to the culprits instead
of returning a broken object.</p>

<h3>Security Model</h3>

<p>Having the whole Slicer stack get a chance to pass judgement on the
outbound object is very flexible. There are optimizations possible because
most Slicers don't care: perhaps a separate stack for the
ones that want to participate, or a chained delegation function. The
important thing is to make sure that exception cases don't leave a
<q>taster</q> stranded on the stack when the object that put it there has
gone away.</p>

<p>On the receiving side, the top Unslicer gets to make a decision about the
token before its body has arrived (limiting memory exposure to no more than
65 bytes). In addition, each Unslicer receives component tokens one at a
time. This lets you catch the dangerous data before it gets turned into an
object. However, tokens are a pretty low-level place to do security checks.
It might be more useful to have some kind of <q>instance taster stack</q>,
with tasters that are asked specifically about (class,state) pairs and
whether they should be turned into objects or not.</p>

<p>Because the Unslicers receive their data one token at a time, things like
InstanceUnslicer can perform security checks one attribute at a time.
<q>traits</q>-style attribute constraints (see the Chaco project or the
PyCon-2003 presentation for details) can be implemented by having a
per-class dictionary of tests that attribute values must pass before they
will be accepted. The instance will only be created if all attributes fit
the constraints. The idea is to catch violations before any code is run on
the receiving side. Typical checks would be things like <q>.foo must be a
number</q>, <q>.bar must not be an instance</q>, <q>.baz must implement the
IBazzer interface</q>.</p>

<p>TODO: the rest of this section is somewhat out of date.</p>

<p>Using the stack instead of a single Taster object means that the rules
can be changed depending upon the context of the object being processed. A
class that is valid as the first argument to a method call may not be valid
as the second argument, or inside a list provided as the first argument. The
PBMethodArgumentsUnslicer could change the way its .taste method behaves as
its state machine progresses through the argument list.</p>

<p>There are several different ways to implement this Taster stack:</p>

<ul>
  <li> Each object in the Unslicer stack gets to raise an exception if they
  don't like what they see: unanimous consent is required to let the token or
  object pass</li>
  
  <li> The top-most unslicer is asked, and it has the option of asking the
  next slice down. It might not, allowing local <q>I'm sure this is safe</q>
  classes to override higher-level paranoia.</li>

  <li> Unslicer objects may add and remove Taster objects on a separate
  stack. This is undoubtedly faster but must be done carefully to make sure
  Tasters and Unslicers stay in sync.</li>
</ul>

<p>Of course, all this holds true for the sending side as well. A Slicer
could enforce a policy that no objects of type Foo will be sent while it is
on the stack.</p>

<p>It is anticipated that something like the current Jellyable/Unjellyable
classes will be created to offer control over the Slicer/Unslicers used to
handle instances of that class.</p>

<p>One eventual goal is to allow PB to implement E-like argument
constraints.</p>


<h3>Streaming Slices</h3>

<p>The big change from the old Jelly scheme is that now
serialization/unserialization is done in a more streaming format. Individual
tokens are the basic unit of information. The basic tokens are just numbers
and strings: anything more complicated (starting at lists) involves
composites of other tokens.</p>

<p>Producer/Consumer-oriented serialization means that large objects which
can't fit into the socket buffers should not consume lots of memory, sitting
around in a serialized state with nowhere to go. This must be balanced
against the confusion caused by time-distributed serialization. PB method
calls must retain their current in-order execution, and it must not be
possible to interleave serialized state (big mess). One interesting
possibility is to allow multiple parallel SlicerStacks, with a
context-switch token to let the receiving end know when they should switch
to a different UnslicerStack. This would allow cleanly interleaved streams
at the token level. <q>Head-of-line blocking</q> is when a large request
prevents a smaller (quicker) one from getting through: grocery stores
attempt to relieve this frustration by grouping customers together by
expected service time (the express lane). Parallel stacks would allow the
sender to establish policies on immediacy versus minimizing context
switches.</p>

<h3>CBanana, CBananaRun, RunBananaRun</h3>

<p>Another goal of the Jelly+Banana-&gt;JustBanana change is the hope of
writing Slicers and Unslicers in C. The CBanana module should have C objects
(structs with function pointers) that can be looked up in a registry table
and run to turn python objects into tokens and vice versa. This ought to be
faster than running python code to implement the slices, at the cost of less
flexibility. It would be nice if the resulting tokens could be sent directly
to the socket at the C level without surfacing into python; barring this it
is probably a good idea to accumulate the tokens into a large buffer so the
code can do a few large writes instead of a gazillion small ones.</p>

<p>It ought to be possible to mix C and Python slices here: if the C code
doesn't find the slice in the table, it can fall back to calling a python
method that does a lookup in an extensible registry.</p>

<h2>Beyond Banana</h2>

<p>Random notes and wild speculations: take everything beyond here with
<em>two</em> grains of salt</p>

<h3>Oldbanana usage</h3>

<p>The oldbanana usage model has the layer above banana written in one of
two ways. The simple form is to use the <code
class="python">banana.encode</code> and <code
class="python">banana.decode</code> functions to turn an object into a
bytestream. This is used by twisted.spread.publish . The more flexible model
is to subclass Banana. The largest example of this technique is, of course,
twisted.spread.pb.Broker, but others which use it are twisted.trial.remote
and twisted.scripts.conch (which appears to use it over unix-domain
sockets).</p>
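
<p>For example (assuming the classic <code>twisted.spread</code> modules;
jelly is needed first to flatten arbitrary objects into the s-expressions
that oldbanana understands):</p>

<pre class="python">
from twisted.spread import banana, jelly

data = banana.encode(jelly.jelly(["hello", (1, 2)]))   # object to bytestream
obj = jelly.unjelly(banana.decode(data))               # bytestream to object
</pre>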

<p>Banana itself is a Protocol. The Banana subclass would generally override
the <code>expressionReceived</code> method, which receives s-expressions
(lists of lists). These are processed to figure out what method should be
called, etc (processing which only has to deal with strings, numbers, and
lists). Then the serialized arguments are sent through Unjelly to produce
actual objects.</p>

<p>On output, the subclass usually calls <code>self.sendEncoded</code> with
some set of objects. In the case of PB, the arguments to the remote method
are turned into s-expressions with jelly, then combined with the method
meta-data (object ID, method name, etc), then the whole request is sent to
<code>sendEncoded</code>.</p>

<h3>Newbanana</h3>

<p>Newbanana moves the Jelly functionality into a stack of Banana Slicers,
and the lowest-level token-to-bytestream conversion into the new Banana
object. Instead of overriding <code>expressionReceived</code>, users can
push a different root Unslicer to get more control over the receive
process.</p>

<p>Currently, Slicers call Banana.sendOpen/sendToken/sendClose/sendAbort,
which then creates bytes and does transport.write.</p>

<p>To move this into C, the transport should get to call
CUnbanana.receiveToken. There should be CBananaUnslicers. Probably a
parent.addMe(self) instead of banana.stack.append(self), maybe addMeC for
the C unslicer.</p>

<p>The Banana object is a Protocol, and has a dataReceived method (maybe in
some C form, data could move directly from a CTransport to a CProtocol). It
parses tokens and hands them to its Unslicer stack. The root Unslicer is
probably created at connectionEstablished time. Subclasses of Banana could
use different RootUnslicer objects, or the users might be responsible for
setting up the root unslicer.</p>

<p>The Banana object is also created with a RootSlicer. Banana.writeToken
serializes the token and does transport.write (a C form could have CSlicer
objects which hand tokens to a little CBanana which then hands bytes off to
a CTransport).</p>

<p>Doing the bytestream-to-Token conversion in C loses a lot of utility when
the conversion is done a token at a time. It made more sense when a whole
mess of s-lists were converted at once.</p>

<p>All Slicers currently have a Banana pointer: maybe they should have a
transport pointer instead? The Banana pointer is needed to get to the top of
the stack.</p>

<p>We want to be able to unserialize lists/tuples/dicts/strings/ints
(<q>basic types</q>) without surfacing into python, and to deliver the
completed object to a python function.</p>

<h3>Streaming Methods</h3>

<p>It would be neat if a PB method could indicate that it would like to
receive its arguments in a streaming fashion. This would involve calling the
method early (as soon as the objectID and method name were known), then
somehow feeding objects to it as they arrive. The object could return a
handler or consumer sub-object which would be fed as tokens arrive over the
wire. This consumer should have a way to enforce a constraint on its
input.</p>

<p>This consumer object sounds a lot like an Unslicer, so maybe the method
schema should indicate that the method would like to be called right
away so it can return an Unslicer to be pushed on the stack. That Unslicer
could do whatever it wanted with the incoming tokens, and could enforce
constraints with the usual checkToken/doOpen/receiveChild/receiveClose
methods.</p>
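
<p>In very rough terms (the <code>checkToken</code>,
<code>receiveChild</code>, and <code>receiveClose</code> method names come
from the paragraph above; everything else here is hypothetical), such a
consumer might look like:</p>

<pre class="python">
class StreamingArgsUnslicer:
    # hypothetical sketch of a streaming-method consumer
    def __init__(self, consumer):
        self.consumer = consumer  # application object fed as tokens arrive

    def checkToken(self, typebyte, size):
        # enforce a constraint on each incoming token; a real
        # implementation would reject unacceptable tokens here
        pass

    def receiveChild(self, obj):
        # hand each completed argument to the application as it
        # arrives, instead of waiting for the whole call
        self.consumer.feed(obj)

    def receiveClose(self):
        # all arguments have arrived
        self.consumer.finish()
</pre>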

<p>On the sending side, it would be neat to let a callRemote() invocation
provide a Producer or a generator that will supply data as the network
buffer becomes available. This could involve pushing a Slicer. Slicers are
generators.</p>



<h2>Common token sequences</h2>

<p>Any given Banana instance has a way to map objects to the Open Index
tuples needed to represent them, and a similar map from such tuples to
incoming object factories. These maps give rise to various <q>classes</q> of
objects, depending upon how widespread any particular object type is. A List
is a fairly common type of object, something you would expect to find
implemented in pretty much any high-level language, so you would expect a
Banana implementation in that language to be capable of accepting an (OPEN,
'list') sequence. However, a Failure object (found in
<code>twisted.python.failure</code>, providing an asynchronous-friendly way
of reporting Python exceptions) is both Python- and Twisted-specific. Is it
reasonable for one program to emit an (OPEN, 'failure') sequence and expect
another speaker of the generic <q>Banana</q> protocol to understand it?</p>

<p>This level of compatibility is (somewhat arbitrarily) named <q>dialect
compatibility</q>. The set of acceptable sequences will depend upon many
things: the language in which the program at each end of the wire is
implemented, the nature of the higher-level software that is using Banana at
that moment (PB is one such layer), and application-specific registrations
that have been performed by the time the sequence is received (the set of
<code>pb.Copyable</code> sequences that can be received without error will
depend upon which <code>RemoteCopyable</code> class definitions and
<code>registerRemoteCopy</code> calls have been made).</p>

<p>Ideally, when two Banana instances first establish a connection, they
will go through a negotiation phase where they come to an agreement on what
will be sent across the wire. There are two goals to this negotiation:</p>

<ol>
  <li>least-surprise: if one side cannot handle a construct which the other
  side might emit at some point in the future, it would be nice to know
  about it up front rather than encountering a Violation or
  connection-dropping BananaError later down the line. This could be
  described as the <q>strong-typing</q> argument. It is important to note
  that different arguments (both for and against strong typing) may exist
  when talking about remote interfaces rather than local ones.</li>

  <li>adaptability: if one side cannot handle a newer construct, it may be
  possible for the other side to back down to some simpler variation without
  too much loss of data.</li>
</ol>

<p>Dialect negotiation is very much still an active area of
development.</p>


<h3>Base Python Types</h3>

<p>The basic python types are considered <q>safe</q>: the code which is
invoked by their receipt is well-understood and there is no way to cause
unsafe behavior during unserialization. Resource consumption attacks are
mitigated by Constraints imposed by the receiving schema.</p>

<p>Note that the OPEN(dict) slicer is implemented with code that sorts the
list of keys before serializing them. It does this to provide deterministic
behavior and make testing easier.</p>

<table border="" width="">
  <tr><td>IntType, LongType (small+)</td><td>INT(value)</td></tr>
  <tr><td>IntType, LongType (small-)</td><td>NEG(value)</td></tr>
  <tr><td>IntType, LongType (large+)</td><td>LONGINT(value)</td></tr>
  <tr><td>IntType, LongType (large-)</td><td>LONGNEG(value)</td></tr>
  <tr><td>FloatType</td><td>FLOAT(value)</td></tr>
  <tr><td>StringType</td><td>STRING(value)</td></tr>
  <tr><td>StringType (tokenized)</td><td>VOCAB(tokennum)</td></tr>
  <tr><td>UnicodeType</td>
      <td>OPEN(unicode) STRING(str.encode('UTF-8')) CLOSE</td></tr>
  <tr><td>ListType</td><td>OPEN(list) elem.. CLOSE</td></tr>
  <tr><td>TupleType</td><td>OPEN(tuple) elem.. CLOSE</td></tr>
  <tr><td>DictType, DictionaryType</td>
      <td>OPEN(dict) (key,value).. CLOSE</td></tr>
  <tr><td>NoneType</td><td>OPEN(none) CLOSE</td></tr>
  <tr><td>BooleanType</td><td>OPEN(boolean) INT(0/1) CLOSE</td></tr>
</table>
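
<p>As a worked example of the table above, the Python value
<code class="python">{'b': [2, u'hi'], 'a': 1}</code> would be serialized
(with the dictionary keys sorted, as noted earlier) roughly as:</p>

<pre>
OPEN(dict)
  STRING('a') INT(1)
  STRING('b') OPEN(list)
                INT(2)
                OPEN(unicode) STRING('hi') CLOSE
              CLOSE
CLOSE
</pre>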

<h3>Extended (unsafe) Python Types</h3>

<p>To serialize arbitrary python object graphs (including instances)
requires that we allow more types in. This begins to get dangerous: with
complex graphs of inter-dependent objects, instances may need to be used (by
the objects that reference them) before they are fully initialized. A schema
can be used
to make assertions about what object types live where, but in general the
contents of those objects are difficult to constrain.</p>

<p>For this reason, these types should only be used in places where you
trust the creator of the serialized stream (the same places where you would
be willing to use the standard Pickle module). Saving application state to
disk and reading it back at startup time is one example.</p>

<table border="" width="">
  <tr><td colspan="2">Extended (unsafe) Python Types</td></tr>

  <tr><td>InstanceType</td><td>OPEN(instance) STRING(reflect.qual(class))
    (attr,value).. CLOSE</td></tr>
  <tr><td>ModuleType</td><td>OPEN(module) STRING(__name__) CLOSE</td></tr>
  <tr><td>ClassType</td>
      <td>OPEN(class) STRING(reflect.qual(class)) CLOSE</td></tr>
  <tr><td>MethodType</td>
      <td>OPEN(method) STRING(__name__) im_self im_class CLOSE</td></tr>
  <tr><td>FunctionType</td>
      <td>OPEN(function) STRING(module.__name__) CLOSE</td></tr>
</table>
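
<p>For example, an instance of a hypothetical <code>mymodule.Point</code>
class with attributes <code>x=3</code> and <code>y=4</code> would be
serialized along these lines:</p>

<pre>
OPEN(instance)
  STRING('mymodule.Point')
  STRING('x') INT(3)
  STRING('y') INT(4)
CLOSE
</pre>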


<h3>PB Sequences</h3>

<p>See the <a href="pb.xhtml">PB document</a> for details.</p>


<h3>Unhandled types</h3>

<p>The following types are not handled by any slicer, and will raise a
KeyError if one is referenced by an object being sliced. This technically
imposes a limit upon the kinds of objects that can be serialized, even by an
<q>unsafe</q> serializer, but in practice it is not really an issue, as many
of these objects have no meaning outside the program invocation which
created them.</p>

<p>Types that might be nice to have:</p>
<ul>
  <li>ComplexType</li>
  <li>SliceType</li>
  <li>TypeType</li>
  <li>XRangeType</li>
</ul>

<p>Types that aren't really that useful:</p>
<ul>
  <li>BufferType</li>
  <li>BuiltinFunctionType</li>
  <li>BuiltinMethodType</li>
  <li>CodeType</li>
  <li>DictProxyType</li>
  <li>EllipsisType</li>
  <li>NotImplementedType</li>
  <li>UnboundMethodType</li>
</ul>

<p>Types that are meaningless outside the creator:</p>
<ul>
  <li>TracebackType</li>
  <li>FileType</li>
  <li>FrameType</li>
  <li>GeneratorType</li>
  <li>LambdaType</li>
</ul>

<h3>Unhandled (but don't worry about it) types</h3>

<p><code>ObjectType</code> is the root class of all other types. All objects
are known by some other type in addition to <code>ObjectType</code>, so the
fact that it is not handled explicitly does not matter.</p>

<p><code>StringTypes</code> is simply a list of <code>StringType</code> and
<code>UnicodeType</code>, so it does not need to be explicitly handled
either.</p>

<h3>Internal types</h3>

<p>The following sequences are internal.</p>

<p>The OPEN(vocab) sequence is used to update the forward compression
token-to-string table used by the VOCAB token. It is followed by a series of
number/string pairs. All numbers that appear in VOCAB tokens must be
associated with a string by appearing in the most recent OPEN(vocab)
sequence.</p>

<table border="" width="">
  <tr><td colspan="2">internal types</td></tr>
  <tr><td>vocab dict</td><td>OPEN(vocab) (num,string).. CLOSE</td></tr>
</table>
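
<p>For example, a sender might establish a small table and then use it for
compression (the exact representation of the numbers inside the pairs is an
assumption here):</p>

<pre>
OPEN(vocab)
  INT(0) STRING('list')
  INT(1) STRING('dict')
CLOSE
</pre>

<p>after which VOCAB(0) may be sent in place of STRING('list'), and VOCAB(1)
in place of STRING('dict').</p>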

</body> </html>