The first part of the format discovery is 90% complete.
The program can now tokenize the sample packets and sort them into clusters according to their token pattern.
The structure of a token looks like this:

// definition of a node for initial tokenization
struct sToken {
    struct inferProperty* sProperty;
    struct inferSemantic* sSemantic;
    struct formatDistinguisher* sFD;
    struct sToken* next;
};

struct inferProperty {
    char szType[4];        //"s-c/c-s" / "bin" / "txt"
    unsigned char* pValue; //value of token. Will include null and unicode, if any
    bool bVar;             //default = false (0) = token is constant
    bool bDelim;           //default = false (0) = token is not/has no delimiter
    bool bNull;            //for text-token: default = false (0) = text token is not null terminated
    bool bUnicode;         //for text-token: default = false (0) = text token is not unicode (eg: 'A' is '41 00')
};

struct inferSemantic {
    bool bBigEndian; //true (1) = len field is big endian, false (0) = little endian
    int lenSize;     //size of length field. [0: not len][1: byte][2: word (LE), this and next byte][4: dword (LE), this and next 3 bytes]
    int lenStart;    //start of len segment, wrt len field (ie: len field=index 0), in terms of tokens
    int lenEnd;      //end of len segment, wrt len field (ie: len field=index 0), in terms of tokens
    int ip;          //0: not ip
                     //1: hex notation -> 0x0a020304
                     //2: ascii dotted notation -> 10.2.3.4
                     //3: comma notation -> 10,2,3,4
};

struct formatDistinguisher {
    int iFD;              //no of FD for this token (ie: no of rows for pFD)
    unsigned char** pFD;  //values of FD for this token (each row of pFD contains bin/txt for each FD value)
};

This is the information that describes each token.
The fields for the token properties and semantics are either self-explanatory or have already been explained in previous posts.
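
To make the three notations of the ip field concrete, here is a minimal sketch that renders one IPv4 address in each of them. The helper name encode_ip is hypothetical and not part of the program, and the sketch assumes that "hex notation" means the four raw address bytes in network order.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the address `octets` into `out` using the
 * notation numbers of the `ip` field (1: raw bytes, 2: dotted, 3: comma).
 * Returns the number of bytes written (excluding the terminating null for
 * the text notations), or -1 on an unknown notation or a too-small buffer. */
int encode_ip(const unsigned char octets[4], int notation,
              unsigned char *out, size_t outLen)
{
    const char *fmt;
    int n;

    switch (notation) {
    case 1: /* assumption: "hex notation" = 4 raw bytes, eg 0a 02 03 04 */
        if (outLen < 4)
            return -1;
        memcpy(out, octets, 4);
        return 4;
    case 2: fmt = "%u.%u.%u.%u"; break; /* eg "10.2.3.4" */
    case 3: fmt = "%u,%u,%u,%u"; break; /* eg "10,2,3,4" */
    default:
        return -1;                      /* 0 = not an ip token */
    }

    n = snprintf((char *)out, outLen, fmt,
                 (unsigned)octets[0], (unsigned)octets[1],
                 (unsigned)octets[2], (unsigned)octets[3]);
    return (n < 0 || (size_t)n >= outLen) ? -1 : n;
}
```

A replay step that has to swap an IP token could call such a helper with the notation stored in the token's ip field, so that the replacement keeps the same representation as the original.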

I shall elaborate on the Format Distinguisher (FD) here since it has not been explained previously. FDs are values that the protocol uses to carry a special meaning. For example, if a particular token (ie: byte) of a packet has a specific value (eg: 0x32), then the packet is of message type XXX. Another value (eg: 0x54) may mean the packet is of type YYY. It should be noted that an FD is inferred for its own cluster only. This means that if a cluster consists only of message type XXX, then 0x32 will not be identified as an FD. This should not affect our replay later, as we need not change any FD values. The discovery of FD is a simplification of the method in the paper that format discovery is based on, and you may refer to it if interested.
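
The per-cluster behaviour described above can be illustrated with a small sketch. It is not the program's actual inference and not the method of the paper. It only assumes that a single-byte token becomes an FD candidate when it takes more than one distinct value across the packets of a cluster. The function name find_fd_candidates is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: `values` holds the value of the same single-byte
 * token in each of the `nPackets` packets of one cluster. The distinct
 * values are written to `fdOut` (which needs room for 256 bytes) in order
 * of first appearance. Returns the number of FD candidates, or 0 when the
 * token never changes within the cluster. */
size_t find_fd_candidates(const unsigned char *values, size_t nPackets,
                          unsigned char *fdOut)
{
    bool seen[256] = { false };
    size_t distinct = 0;

    for (size_t i = 0; i < nPackets; i++) {
        if (!seen[values[i]]) {
            seen[values[i]] = true;
            fdOut[distinct++] = values[i];
        }
    }

    /* A token with a single value cannot distinguish anything here. */
    return distinct > 1 ? distinct : 0;
}
```

With the values 0x32, 0x54, 0x32 the sketch reports two candidates, while a cluster that only ever contains 0x32 reports none, which matches the XXX-only cluster case above.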

Next, I shall discuss the output, which is the attached .txt file.
The .pcap contains the packets for this cluster.

The first thing to note is the title, which is actually a hash of the token pattern. Storing this information will help us match the replay packet to this format. Each row represents the information for one token, while packets are separated by "-----------------------------". The important information that we need is the inferred length tokens. As seen, the length tokens have been identified correctly. These are:

[bin<->00<->00<->00<->00<->00] [01<->04<->00000004<->00000064<->00] [00000000]
[bin<->00<->00<->00<->00<->00] [00000000] [00000000]
[bin<->00<->00<->00<->00<->00] [01<->02<->00000002<->00000062<->00] [00000000]
[bin<->6c<->01<->00<->00<->00] [01<->01<->00000001<->00000061<->00]

[bin<->ff<->00<->00<->00<->00] [00000000] [00000000]

[bin<->00<->00<->00<->00<->00] [01<->02<->00000026<->00000030<->00] [00000000]
[bin<->30<->01<->00<->00<->00] [01<->01<->00000025<->00000029<->00]


[bin<->00<->00<->00<->00<->00] [01<->04<->00000018<->00000022<->00] [00000000]
[bin<->00<->00<->00<->00<->00] [00000000] [00000000]
[bin<->00<->00<->00<->00<->00] [01<->02<->00000016<->00000020<->00] [00000000]
[bin<->30<->01<->00<->00<->00] [01<->01<->00000015<->00000019<->00] [00000007<->30<->1c<->4a<->42<->4c<->44<->2a]

[bin<->00<->00<->00<->00<->00] [01<->04<->00000005<->00000014<->00] [00000000]
[bin<->00<->00<->00<->00<->00] [00000000] [00000000]
[bin<->00<->00<->00<->00<->00] [01<->02<->00000003<->00000012<->00] [00000000]
[bin<->35<->01<->00<->00<->00] [01<->01<->00000002<->00000011<->00] [00000007<->35<->21<->4f<->47<->51<->49<->2f]

Taking SMB.Trans2-Response.Data-Count as an example, the output shows that, if read as big-endian, the range of tokens that make up this length starts 15 tokens away from the length field and ends 19 tokens away. The inference currently cannot tell whether the field is 1, 2, or 4 bytes wide in the actual protocol. Hence all possible widths are identified. Here, we note that SMB.Trans2-Response.Word-Count is not identified. This is because, within the cluster, the segment of tokens described by Word-Count never changes, so it could not be inferred. Therefore, if some portion of this segment needs to change during replay, we will be unable to change Word-Count accordingly.
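
As an illustration of how such a row could be read back, the sketch below parses the second bracket group of a row into the semantic fields. It assumes that the group lists bBigEndian, lenSize, lenStart, lenEnd and ip in struct order, and that the numbers are zero-padded decimals, as the reading of 00000015 and 00000019 above suggests. The function name parse_semantic_group is hypothetical, and the struct is repeated only to keep the example self-contained.

```c
#include <stdbool.h>
#include <stdio.h>

struct inferSemantic {
    bool bBigEndian;
    int lenSize;
    int lenStart;
    int lenEnd;
    int ip;
};

/* Hypothetical sketch: parse a group such as
 * "[01<->01<->00000015<->00000019<->00]" into `out`.
 * Returns false when the group does not hold all five fields,
 * eg the all-zero "[00000000]" printed for tokens without semantics. */
bool parse_semantic_group(const char *group, struct inferSemantic *out)
{
    int bigEndian, lenSize, lenStart, lenEnd, ip;

    /* %d reads the zero-padded numbers as decimal (assumption). */
    if (sscanf(group, "[%d<->%d<->%d<->%d<->%d]",
               &bigEndian, &lenSize, &lenStart, &lenEnd, &ip) != 5)
        return false;

    out->bBigEndian = (bigEndian != 0);
    out->lenSize = lenSize;
    out->lenStart = lenStart;
    out->lenEnd = lenEnd;
    out->ip = ip;
    return true;
}
```

For the Data-Count row discussed above, this yields a big-endian, 1 byte wide length field whose segment runs from token 15 to token 19 relative to the field.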

Lastly, you may have noted that the length field is also identified as an FD. During replay, length field identification will take precedence over FD. Hence, although the length field is also inferred as an FD, we will ignore that. FD is just additional information which may or may not be useful in future, considering that the method for inferring it is simplified.

In the next few weeks, I need to:
1) write a function to summarise all output of this program. You may have realized that the Semantic and FD structures are repeated in subsequent packets, where they are not really needed.

2) solve a memory leak problem in this program =(

3) match the replay packet to a format, and if the length segment changes (eg: due to a shellcode change), update the length field accordingly.

4) from the replay ip, find the IP tokens and change them.
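
For item 3, the length update itself could look like the sketch below. It only shows how a new length might be written into a field of the inferred width and byte order. Matching the replay packet to a format and recomputing the length are not covered. The function name write_len_field is hypothetical and nothing here is taken from the program.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: overwrite the length field that starts at `field`
 * with `value`, using the inferred width (`lenSize` = 1, 2 or 4 bytes)
 * and byte order. Returns false when the width is unsupported or the
 * value does not fit into the field. */
bool write_len_field(unsigned char *field, int lenSize,
                     bool bBigEndian, uint32_t value)
{
    if (lenSize != 1 && lenSize != 2 && lenSize != 4)
        return false;
    if (lenSize < 4 && (value >> (8 * lenSize)) != 0)
        return false; /* new length is too large for this field */

    for (int i = 0; i < lenSize; i++) {
        /* big endian: most significant byte first; little endian: last */
        int shift = bBigEndian ? 8 * (lenSize - 1 - i) : 8 * i;
        field[i] = (unsigned char)((value >> shift) & 0xff);
    }
    return true;
}
```

Since the inference reports every possible width for the same field, a caller would still have to decide which of the 1, 2 or 4 byte candidates to rewrite.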