13.7. 跨模块设计决策

在理想环境中,每个重要的设计决策都将封装在一个类中。不幸的是,真实的系统不可避免地最终会影响到多个类的设计决策。例如,网络协议的设计将影响发送方和接收方,并且它们可以在不同的地方实现。跨模块决策通常是复杂而微妙的,并且会导致许多错误,因此,为它们提供良好的文档至关重要。

跨模块文档的最大挑战是找到一个放置它的位置,以便开发人员自然地发现它。有时,放置此类文档的中心位置很明显。例如,RAMCloud 存储系统定义一个状态值,每个请求均返回该状态值以指示成功或失败。为新的错误状况添加状态需要修改许多不同的文件(一个文件将状态值映射到异常,另一个文件为每个状态提供人类可读的消息,依此类推)。幸运的是,添加新的状态值(即 Status 枚举的声明)时,开发人员必须去一个明显的地方。我们通过在该枚举中添加注释来标识所有其他必须修改的地方,从而利用了这一点:在理想环境中,每个重要的设计决策都将封装在一个类中。不幸的是,真实的系统不可避免地最终会影响到多个类的设计决策。例如,网络协议的设计将影响发送方和接收方,并且它们可以在不同的地方实现。跨模块决策通常是复杂而微妙的,并且会导致许多错误,因此,为它们提供良好的文档至关重要。

typedef enum Status {
    STATUS_OK = 0,
    STATUS_UNKNOWN_TABLET                = 1,
    STATUS_WRONG_VERSION                 = 2,
    ...
    STATUS_INDEX_DOESNT_EXIST            = 29,
    STATUS_INVALID_PARAMETER             = 30,
    STATUS_MAX_VALUE                     = 30,
    // Note: if you add a new status value you must make the following
    // additional updates:
    // (1)  Modify STATUS_MAX_VALUE to have a value equal to the
    //      largest defined status value, and make sure its definition
    //      is the last one in the list. STATUS_MAX_VALUE is used
    //      primarily for testing.
    // (2)  Add new entries in the tables "messages" and "symbols" in
    //      Status.cc.
    // (3)  Add a new exception class to ClientException.h
    // (4)  Add a new "case" to ClientException::throwException to map
    //      from the status value to a status-specific ClientException
    //      subclass.
    // (5)  In the Java bindings, add a static class for the exception
    //      to ClientException.java
    // (6)  Add a case for the status of the exception to throw the
    //      exception in ClientException.java
    // (7)  Add the exception to the Status enum in Status.java, making
    //      sure the status is in the correct position corresponding to
    //      its status code.
}

新状态值将添加到现有列表的末尾,因此注释也将放置在最有可能出现的末尾。

不幸的是,在许多情况下,并没有一个明显的中心位置来放置跨模块文档。RAMCloud 存储系统中的一个例子是处理僵尸服务器的代码,僵尸服务器是系统认为已经崩溃但实际上仍在运行的服务器。中和 zombie server 需要几个不同模块中的代码,这些代码都相互依赖。没有一段代码明显是放置文档的中心位置。一种可能性是在每个依赖文档的位置复制文档的部分。然而,这是令人尴尬的,并且随着系统的发展,很难使这样的文档保持最新。或者,文档可以位于需要它的位置之一,但是在这种情况下,开发人员不太可能看到文档或者知道在哪里查找它。

我最近一直在尝试一种方法,该方法将跨模块问题记录在一个名为 designNotes 的中央文件中。该文件分为清楚标记的部分,每个主要主题一个。例如,以下是该文件的摘录:

...
Zombies
-------
A zombie is a server that is considered dead by the rest of the
cluster; any data stored on the server has been recovered and will
be managed by other servers. However, if a zombie is not actually
dead (e.g., it was just disconnected from the other servers for a
while) two forms of inconsistency can arise:
* A zombie server must not serve read requests once replacement servers have taken over; otherwise it may return stale data that does not reflect writes accepted by the replacement servers.
* The zombie server must not accept write requests once replacement servers have begun replaying its log during recovery; if it does, these writes may be lost (the new values may not be stored on the replacement servers and thus will not be returned by reads).
RAMCloud uses two techniques to neutralize zombies. First,
...

然后,在与这些问题之一相关的任何代码段中,都有一条简短的注释引用了 designNotes 文件:

// See "Zombies" in designNotes.

使用这种方法,文档只有一个副本,因此开发人员在需要时可以相对容易地找到它。但是,这样做的缺点是,文档离它依赖的任何代码段都不近,因此随着系统的发展,可能难以保持最新。

In a perfect world, every important design decision would be encapsulated within a single class. Unfortunately, real systems inevitably end up with design decisions that affect multiple classes. For example, the design of a network protocol will affect both the sender and the receiver, and these may be implemented in different places. Cross-module decisions are often complex and subtle, and they account for many bugs, so good documentation for them is crucial.

The biggest challenge with cross-module documentation is finding a place to put it where it will naturally be discovered by developers. Sometimes there is an obvious central place to put such documentation. For example, the RAMCloud storage system defines a Status value, which is returned by each request to indicate success or failure. Adding a Status for a new error condition requires modifying many different files (one file maps Status values to exceptions, another provides a human-readable message for each Status, and so on). Fortunately, there is one obvious place where developers will have to go when adding a new status value, which is the declaration of the Status enum. We took advantage of this by adding comments in that enum to identify all of the other places that must also be modified:

typedef enum Status {
    STATUS_OK = 0,
    STATUS_UNKNOWN_TABLET                = 1,
    STATUS_WRONG_VERSION                 = 2,
    ...
    STATUS_INDEX_DOESNT_EXIST            = 29,
    STATUS_INVALID_PARAMETER             = 30,
    STATUS_MAX_VALUE                     = 30,
    // Note: if you add a new status value you must make the following
    // additional updates:
    // (1)  Modify STATUS_MAX_VALUE to have a value equal to the
    //      largest defined status value, and make sure its definition
    //      is the last one in the list. STATUS_MAX_VALUE is used
    //      primarily for testing.
    // (2)  Add new entries in the tables "messages" and "symbols" in
    //      Status.cc.
    // (3)  Add a new exception class to ClientException.h
    // (4)  Add a new "case" to ClientException::throwException to map
    //      from the status value to a status-specific ClientException
    //      subclass.
    // (5)  In the Java bindings, add a static class for the exception
    //      to ClientException.java
    // (6)  Add a case for the status of the exception to throw the
    //      exception in ClientException.java
    // (7)  Add the exception to the Status enum in Status.java, making
    //      sure the status is in the correct position corresponding to
    //      its status code.
}

New status values will be added at the end of the existing list, so the comments are also placed at the end, where they are most likely to be seen.

Unfortunately, in many cases there is not an obvious central place to put cross-module documentation. One example from the RAMCloud storage system was the code for dealing with zombie servers, which are servers that the system believes have crashed, but in fact are still running. Neutralizing zombie servers required code in several different modules, and these pieces of code all depend on each other. None of the pieces of code is an obvious central place to put documentation. One possibility is to duplicate parts of the documentation in each location that depends on it. However, this is awkward, and it is difficult to keep such documentation up to date as the system evolves. Alternatively, the documentation can be located in one of the places where it is needed, but in this case it’s unlikely that developers will see the documentation or know where to look for it.

I have recently been experimenting with an approach where cross-module issues are documented in a central file called designNotes. The file is divided up into clearly labeled sections, one for each major topic. For example, here is an excerpt from the file:

...
Zombies
-------
A zombie is a server that is considered dead by the rest of the
cluster; any data stored on the server has been recovered and will
be managed by other servers. However, if a zombie is not actually
dead (e.g., it was just disconnected from the other servers for a
while) two forms of inconsistency can arise:
* A zombie server must not serve read requests once replacement servers have taken over; otherwise it may return stale data that does not reflect writes accepted by the replacement servers.
* The zombie server must not accept write requests once replacement servers have begun replaying its log during recovery; if it does, these writes may be lost (the new values may not be stored on the replacement servers and thus will not be returned by reads).
RAMCloud uses two techniques to neutralize zombies. First,
...

Then, in any piece of code that relates to one of these issues there is a short comment referring to the designNotes file:

// See "Zombies" in designNotes.

With this approach, there is only a single copy of the documentation and it is relatively easy for developers to find it when they need it. However, this has the disadvantage that the documentation is not near any of the pieces of code that depend on it, so it may be difficult to keep up-to-date as the system evolves.