# librsync TODO

* We have a few functions to do with reading a netint, stashing
  it somewhere, then moving into a different state. Is it worth
  writing generic functions for that, or would it be too confusing?
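
One possible shape for such a generic helper, sketched with invented names (`parse_state_t`, `read_netint_then`) rather than librsync's real internals: consume a big-endian integer from the input, stash it, and advance the caller's state in one step.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch only: librsync's real state machine lives in
 * rs_job_t; these names are invented for illustration. */
typedef enum { STATE_WANT_LEN, STATE_WANT_BODY, STATE_DONE } parse_state_t;

/* Consume `width` bytes from *buf as a big-endian integer, store it in
 * *out, and switch *state to `next`.  Returns 0 if there is not yet
 * enough input (caller should come back with more), 1 on success. */
static int read_netint_then(const uint8_t **buf, size_t *avail,
                            int width, uint32_t *out,
                            parse_state_t *state, parse_state_t next)
{
    if (*avail < (size_t)width)
        return 0;               /* not enough input buffered yet */
    uint32_t v = 0;
    for (int i = 0; i < width; i++)
        v = (v << 8) | (*buf)[i];
    *buf += width;
    *avail -= width;
    *out = v;
    *state = next;
    return 1;
}
```

Whether this is clearer than writing each transition out by hand is exactly the open question above.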

* Optimisations and code cleanups:

  scoop.c: Scoop needs a major refactor. Perhaps the API needs
  tweaking?

  rsync.h: rs_buffers_s and rs_buffers_t should be one typedef?

  mdfour.c: This code has a different API from the RSA code in libmd
  and is coupled with librsync in unhealthy ways (trace?). Recommend
  changing to the RSA API?

* Just how useful is rs_job_drive anyway?

* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name. I think people expect that it will be
  something that behaves like a FILE* or C++ stream, and it really
  does not. Also, the structure does not behave as an object: it's
  really just a shorthand for passing values in to the encoding
  routines, and so does not have a lot of identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much we
  really consumed.

  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that need
  them, which are relatively deeply embedded. I guess we could just
  stick them into the job structure, which is becoming a kind of
  catch-all "environment" for poor C programmers.
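
As a rough illustration of that calling convention, here is a stub with the proposed signature. The body just copies bytes through where the real function would run the streaming state machine; `rs_job_iter_sketch`, the opaque job typedef, and the two-value `rs_result` enum are stand-ins invented for this sketch.

```c
#include <stddef.h>
#include <string.h>

typedef struct rs_job rs_job_t;               /* opaque for this sketch */
typedef enum { RS_DONE = 0, RS_BLOCKED = 1 } rs_result;

/* Proposed convention: in_len and out_len are in/out parameters,
 * updated on return to report how much was actually consumed and
 * produced.  This stub only copies input to output. */
static rs_result rs_job_iter_sketch(rs_job_t *job,
                                    const char *in_buf, size_t *in_len,
                                    int in_is_ending,
                                    char *out_buf, size_t *out_len)
{
    (void)job;
    size_t n = *in_len < *out_len ? *in_len : *out_len;
    int consumed_all = (n == *in_len);

    memcpy(out_buf, in_buf, n);
    *in_len = n;                              /* bytes consumed */
    *out_len = n;                             /* bytes produced */
    return (in_is_ending && consumed_all) ? RS_DONE : RS_BLOCKED;
}
```

The caller's loop then needs no extra structure at all: it just re-reads the two lengths after each call to decide how to refill and drain its buffers.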

* Meta-programming

  * Plot lengths of each function

  * Some kind of statistics on delta each day

* Encoding format

  * Include a version in the signature and difference fields

  * Remember to update them if we ever ship a buggy version (nah!) so
    that other parties can know not to trust the encoded data.
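
A versioned header could be as small as a magic word plus an explicit format-version netint, so a reader can refuse data produced by a known-buggy release. The magic value, version numbering, and blacklist idea below are invented for this sketch, not librsync's real on-wire constants.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SIG_MAGIC 0x72735631u    /* invented magic, not librsync's */

static size_t put_netint(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
    return 4;
}

static uint32_t get_netint(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8) | p[3];
}

/* Write magic + format version; returns the header length. */
static size_t write_sig_header(uint8_t *buf, uint32_t version)
{
    size_t n = put_netint(buf, SKETCH_SIG_MAGIC);
    n += put_netint(buf + n, version);
    return n;
}

/* Returns the version, or 0 if the magic is wrong or the version is a
 * (hypothetical) known-buggy release we should not trust. */
static uint32_t check_sig_header(const uint8_t *buf, uint32_t bad_version)
{
    if (get_netint(buf) != SKETCH_SIG_MAGIC)
        return 0;
    uint32_t v = get_netint(buf + 4);
    return v == bad_version ? 0 : v;
}
```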

* Abstract encoding

  In fact, we can vary several different variables:

  * what signature format are we using?

  * what command protocol are we using?

  * what search algorithm are we using?

  * what implementation version are we?

  Some are more likely to change than others. We need a chart
  showing which source files depend on which variable.

* Encoding algorithm

  * Self-referential copy commands

    Suppose we have a file with repeating blocks. The gdiff format
    allows for COPY commands to extend into the *output* file so that
    they can easily point this out. By doing this, they get
    compression as well as differencing.

    It'd be pretty simple to implement this, I think: as we produce
    output, we'd also generate checksums (using the search block
    size), and add them to the sum set. Then matches will fall out
    automatically, although we might have to specially allow for
    short blocks.

    However, I don't see many files which have repeated 1kB chunks,
    so I don't know if it would be worthwhile.
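
A toy version of that idea, with a deliberately simplistic checksum and a fixed-size table (librsync's real sums are rolling weak + strong hashes): each block emitted to the output is noted in the sum set, so later identical input can be encoded as a COPY pointing back into the output.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 4                 /* toy block size */
#define MAX_BLOCKS 64

struct out_sums {
    uint32_t sum[MAX_BLOCKS];
    size_t offset[MAX_BLOCKS];
    size_t count;
};

/* Trivial stand-in for a real weak checksum. */
static uint32_t weak_sum(const uint8_t *p, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++)
        s = s * 31 + p[i];
    return s;
}

/* Record the checksum of a block just emitted at `offset`. */
static void note_output_block(struct out_sums *t, const uint8_t *block,
                              size_t offset)
{
    if (t->count < MAX_BLOCKS) {
        t->sum[t->count] = weak_sum(block, BLOCK);
        t->offset[t->count] = offset;
        t->count++;
    }
}

/* Return the output offset of a previously emitted identical block,
 * or (size_t)-1 if none matches. */
static size_t find_self_copy(const struct out_sums *t, const uint8_t *out,
                             const uint8_t *block)
{
    uint32_t s = weak_sum(block, BLOCK);
    for (size_t i = 0; i < t->count; i++)
        if (t->sum[i] == s
            && memcmp(out + t->offset[i], block, BLOCK) == 0)
            return t->offset[i];
    return (size_t)-1;
}
```

The short-block special case mentioned above would show up here as a final partial block that never enters the table.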

* Support compression of the difference stream. Does this
  belong here, or should it be in the client, with librsync just having
  an interface that lets it cleanly plug in?

  I think if we're going to just do plain gzip, rather than
  rsync-gzip, then it might as well be external.

  rsync-gzip: preload with the omitted text so as to get better
  compression. Abo thinks this gets significantly better
  compression. On the other hand we have to import and maintain
  our own zlib fork, at least until we can persuade the upstream to
  take the necessary patch. Can that be done?

  abo says:

    It does get better compression, but at a price. I actually
    think that getting the code to a point where a feature like
    this can be easily added or removed is more important than the
    feature itself. Having generic pre- and post-processing layers
    for hit/miss data would be useful. I would not like to see it
    added at all if it tangled and complicated the code.

    It also doesn't require a modified zlib... pysync uses the
    standard zlib to do it by compressing the data, then throwing
    it away. I don't know how much benefit the rsync modifications
    to zlib actually bring, but if I was implementing it I would
    stick to a stock zlib until it proved significantly better to
    go with the fork.

* Licensing

  Will the GNU Lesser GPL work? Specifically, will it be a problem
  in distributing this with Mozilla or Apache?

* Testing

  * Just more testing in general.

  * Test broken pipes and check that IO errors are handled properly.

  * Test files >2GB, >4GB. Presumably these must be done in streams
    so that the disk requirements to run the test suite are not too
    ridiculous. I wonder if it will take too long to run these
    tests? Probably, but perhaps we can afford to run just one
    carefully-chosen test.

  * Fuzz instruction streams. <https://code.google.com/p/american-fuzzy-lop/>?

  * Generate random data; do random mutations.

  * Tests should fail if they can't find their inputs, or have zero
    inputs: at present they tend to succeed by default.

* Security audit

  * If this code is to read differences or sums from random machines
    on the network, then it's a security boundary. Make sure that
    corrupt input data can't make the program crash or misbehave.
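
A hedged sketch of the kind of check such an audit should confirm exists everywhere lengths arrive from the network: a (hypothetical) COPY command's offset and length must be validated against the real basis-file size before any read, phrased so the arithmetic itself cannot overflow.

```c
#include <stdint.h>

/* Validate an untrusted COPY command against the basis file's size.
 * Returns 1 if the request is safe to serve, 0 otherwise.  The exact
 * rules (e.g. rejecting zero-length copies) are illustrative. */
static int copy_cmd_is_safe(uint64_t offset, uint64_t length,
                            uint64_t basis_size)
{
    if (length == 0)
        return 0;                       /* degenerate command: reject */
    if (offset > basis_size)
        return 0;                       /* starts beyond the basis file */
    if (length > basis_size - offset)
        return 0;                       /* ends beyond the basis file;
                                           subtraction order avoids
                                           offset + length overflow */
    return 1;
}
```

The key design point is comparing `length` against `basis_size - offset` rather than computing `offset + length`, which corrupt input could wrap around.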