The Icon Bar: Programming: explicit data format
 
  explicit data format
  (16:51 16/5/2002)
  davidb (09:18 17/5/2002)
    johnstlr (14:58 15/6/2002)
      Hertzsprung (13:09 17/5/2002)
        johnstlr (14:58 15/6/2002)
  [mentat] (14:58 15/6/2002)
    Phlamethrower (14:58 15/6/2002)
      Hertzsprung (18:33 16/5/2002)
        [mentat] (21:56 16/5/2002)
  Loris (14:58 15/6/2002)
    johnstlr (17:36 17/5/2002)
      Loris (14:58 15/6/2002)
        johnstlr (14:58 15/6/2002)
          Loris (14:58 15/6/2002)
            johnstlr (14:58 15/6/2002)
              davidb (17:20 21/5/2002)
    Matthew (21:10 17/5/2002)
      Phlamethrower (19:50 18/5/2002)
        Loris (09:57 20/5/2002)
    Phlamethrower (14:58 15/6/2002)
      Loris (14:58 15/6/2002)
 
Loris Message #5054, posted at 16:51, 16/5/2002
Unregistered user I've been thinking recently about the problem of data formats becoming obsolete, and after a while unreadable. I reckon it would be possible to create a new, extendable, open format which would describe the data it contained fairly explicitly, so even if the program was lost the data could be recovered.
Before I waffle on further, does anyone think it is a) a good idea, and b) want to discuss it?
 
Hertzsprung Message #5057, posted at 18:33, 16/5/2002, in reply to message #5056
Unregistered user Err - XML?!!?! That's it basically. XML. Rules. OK.
 
[mentat] Message #5058, posted at 21:56, 16/5/2002, in reply to message #5057
Whilst I see your point, XML only fits the bill because (some of - can't be a**ed with semantics) the data is stored as text, and I don't think that's quite what Loris was getting at...(?)
 
davidb Message #5059, posted at 09:18, 17/5/2002, in reply to message #5054
Unregistered user
I've been thinking recently about the problem of data formats becoming obsolete, and after a while unreadable. I reckon it would be possible to create a new, extendable, open format which would describe the data it contained fairly explicitly, so even if the program was lost the data could be recovered.

I've been thinking about format specifications along the lines of something like regular expressions.

The binary formats of the fundamental elements in the file are defined in named declarations, which are then used to describe more complex structures.

The problem is defining the meaning of these structures: how do you describe that the file contains an image, audio data or a series of page definitions?
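
A rough sketch of that idea, using Python's struct module: primitive elements get named declarations, and composite structures are described purely in terms of those names. The names "u8", "i32le", "point" and "box" are made up for this illustration and come from no real specification. As noted above, this only captures layout, not what the data means.

```python
import struct

# Hypothetical named declarations for the fundamental elements.
# Values are struct format codes; the names are invented for this sketch.
PRIMITIVES = {
    "u8": "B",       # unsigned 8-bit integer
    "u16le": "<H",   # unsigned 16-bit integer, little-endian
    "i32le": "<i",   # signed 32-bit integer, little-endian
}

# Hypothetical composite structures, described only by reference
# to previously declared names.
STRUCTURES = {
    "point": [("x", "i32le"), ("y", "i32le")],
    "box": [("low", "point"), ("high", "point")],
}


def decode(name, data, offset=0):
    """Decode the declaration `name` from `data`; return (value, next_offset)."""
    if name in PRIMITIVES:
        fmt = PRIMITIVES[name]
        (value,) = struct.unpack_from(fmt, data, offset)
        return value, offset + struct.calcsize(fmt)
    fields = {}
    for field_name, type_name in STRUCTURES[name]:
        fields[field_name], offset = decode(type_name, data, offset)
    return fields, offset


# Four signed 32-bit integers laid out as a "box".
raw = struct.pack("<4i", 0, -16, 640, 480)
box, end = decode("box", raw)
print(box, end)
```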

Before I waffle on further, does anyone think it is a) a good idea, and b) want to discuss it?

a) Yes, as long as you can decipher each format to start with.

b) I'm interested in this sort of discussion because I'm interested in writing tools to recover data in old formats.

 
Hertzsprung Message #5061, posted at 13:09, 17/5/2002, in reply to message #5060
Unregistered user I still believe XML can do what you want. Though XML is text-based, you can embed binary data, I believe. XSD is the document definition for XML. However, XML technologies are still very young and I am having trouble finding the necessary (free) tools on my PC to do what I want with XML.
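
One common way to carry binary data inside XML is to base64-encode it into an element's text. A minimal sketch using only the Python standard library follows; the element name "blob" and the "encoding" attribute are invented for the example and are not part of any XML standard.

```python
import base64
import xml.etree.ElementTree as ET

payload = bytes([0x00, 0xFF, 0x10, 0x80])  # arbitrary binary sample

# Element and attribute names are hypothetical, chosen for this sketch.
root = ET.Element("record")
blob = ET.SubElement(root, "blob", encoding="base64")
blob.text = base64.b64encode(payload).decode("ascii")

xml_text = ET.tostring(root, encoding="unicode")

# Round trip: parse the text back and recover the original bytes.
parsed = ET.fromstring(xml_text)
recovered = base64.b64decode(parsed.find("blob").text)
print(xml_text)
```

The cost is size: base64 output is roughly a third larger than the raw bytes.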
 
johnstlr Message #5065, posted at 17:36, 17/5/2002, in reply to message #5063
Unregistered user
One problem as I see it with file formats is that they get upgraded somehow. Programs are usually backwards compatible - at least for a while, but sometimes they just stop.

I don't see this as an issue. I've never heard of an application that couldn't read at least the previous version of its data files and all you've got to do is save out in the new format.

The only problem is if someone sends you data that is from an older version of the software. Ok you can't read it but is it your fault? Is it the fault of the software? From a software engineering perspective it's neither your fault nor the software's. Sometimes it's only practical to maintain backwards compatibility for so long and then the effort of doing so outweighs the benefits.


After a while the program you have which analyses it just stops working on your old data. OR you lose the program. Or you want to read a file on a different format (dear to our hearts, this one) etc.

Again all of these problems could be solved by simply documenting the file format. You don't need a gee whiz, hyper intelligent, mega complex file format and analyser to do this.


My idea is that it would be nice to have a format which you could refer back to after (say) 10-50 years or so, and still be able to interpret the data in exactly the way it was meant.

There are several problems with your suggestions

1) You still need to be able to construct an appropriate parser and without one you still can't move between platforms.

2) What if the embedded natural language isn't accurate enough? Also why not have the description outside of the file in a document - it's more efficient.

3) What is the font definition for? What if I'm storing a picture or a movie?

4) A header is only as good as the parser that reads it. Almost all file formats contain a header which describes what follows. The problem comes when the header can't describe a new extension.

As far as I'm aware pretty much all you want to do can be achieved already in XML, but it's a case of industry wide standards taking time to seep through to RISC OS (thanks to Justin Fletcher for making the start)

 
Matthew Message #5068, posted at 21:10, 17/5/2002, in reply to message #5063
Unregistered user
One method would be to use 8*8 characters, using 64 bytes for each value. For example to define A could be: [snip]

This would obviously take up some space, but it is very explicit. It does have problems with defining space, though. (Although the numbers could be switched).

What if the language was something like Japanese?

 
Phlamethrower Message #5069, posted at 19:50, 18/5/2002, in reply to message #5068
Unregistered user
What if the language was something like Japanese?

Which is (partly) why I'm suggesting just quoting an ISO number. However having no alphabet to quote the number in isn't too much of a problem - just include a picture of it instead, like the alphabet would be defined in.

 
Loris Message #5070, posted at 09:57, 20/5/2002, in reply to message #5069
Unregistered user
What if the language was something like Japanese?

Well, first of all, this is not a problem for the English version. Other language implementations would have to be designed in sympathy with their language format.
Japanese is not so much of a problem, anyway. Well the initial header might be slightly larger, but this would be offset by the reduced size of the embedded description.
(I've started to learn Japanese, so I know a little about it.)
Japanese actually is not really the problem - Chinese is the problem. But I'm happy to ignore their difficulties at this point.
Japanese has 3 character sets: Hiragana and Katakana, which are phonetic and have fewer than 100 characters [1], and Kanji which are pictograms - there are many more of these.
The header could get away with defining the Katakana - these would be sufficient to bootstrap the rest.

[1] There are fewer than 50 basic symbols, which can be modified by the addition of a couple of ticks or a small circle to give around 100. The obvious thing to do would be to define the basic 50 and the modifiers, then use them to describe the rest.


Which is (partly) why I'm suggesting just quoting an ISO number. However having no alphabet to quote the number in isn't too much of a problem - just include a picture of it instead, like the alphabet would be defined in.

This is a fantastic idea[2], and would probably be necessary for the Chinese (etc) versions.
However, I've been having some more thoughts on the subject, so:
One problem with using standards is that there probably are not any which have exactly the meanings we require. The text is only part of the story - we would also need control characters for things like newline, start of new component etc.
Of course one could use the standard as far as it went, then modify it using textual descriptions with the unmodified characters ... This is not what I'd propose to do, though.

One would only have to define, say, capital letters (+space, comma etc) explicitly, then the others can just be described using those. This would give a moderate size saving.
Describing some values using text would in any case be necessary for the control codes.

[2] Why didn't I think of that? - so simple yet effective.

 
davidb Message #5074, posted at 17:20, 21/5/2002, in reply to message #5073
Unregistered user
As far as I know, the only true way of making the description completely universal would be to have a mathematical description of the file *gulp*

I'm not sure that it would be regarded as a mathematical description, but the sort of specification that I am interested in is concerned with structures in the file and descriptions of low level information.

For example, the Drawfile format assumes 32 bit signed integers for coordinates, has a standard format for objects (type-length-object), and depends on external information about fonts. This information, and other properties of Drawfiles, could be encapsulated in some way without having to describe what the file actually represents.

A tool which used this information would be a diagnostic tool for someone writing conversion tools rather than a universal file converter.
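
As a rough illustration of such a diagnostic tool, the sketch below walks a stream of type-length-object records and reports what it finds without interpreting the payloads. The layout is an assumption made for the example (a 32-bit signed little-endian type, then a 32-bit signed little-endian length that includes the 8-byte header); it is not the actual Drawfile definition, and the type numbers in the sample are arbitrary.

```python
import struct


def walk_objects(data, offset=0):
    """Yield (offset, type, payload) for each type-length-payload record.

    Assumed layout, for illustration only: 32-bit signed little-endian
    type, 32-bit signed little-endian total length (header included),
    followed by the payload bytes.
    """
    while offset < len(data):
        if offset + 8 > len(data):
            raise ValueError(f"truncated header at offset {offset}")
        obj_type, length = struct.unpack_from("<ii", data, offset)
        if length < 8 or offset + length > len(data):
            raise ValueError(f"bad length {length} at offset {offset}")
        yield offset, obj_type, data[offset + 8 : offset + length]
        offset += length


def make_object(obj_type, payload):
    """Build one record in the assumed layout (helper for the demo)."""
    return struct.pack("<ii", obj_type, 8 + len(payload)) + payload


sample = make_object(2, b"path") + make_object(5, b"") + make_object(9, b"sprite!!")
report = [(off, typ, len(body)) for off, typ, body in walk_objects(sample)]
for off, typ, size in report:
    print(f"offset {off}: type {typ}, {size} payload bytes")
```

Because unknown types are skipped by length rather than parsed, the walker keeps working when a file contains object types it has never seen.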

 
Phlamethrower Message #5064, posted at 14:58, 15/6/2002, in reply to message #5063
Unregistered user *scratches head*

One problem as I see it with file formats is that they get upgraded somehow. Programs are usually backwards compatible - at least for a while, but sometimes they just stop. After a while the program you have which analyses it just stops working on your old data. OR you lose the program. Or you want to read a file on a different format (dear to our hearts, this one) etc.

Surely just writing simple conversion tools isn't too bad?


1) Header - this is to try and minimise problems finding the start of the file, and indicate exactly how bytes are used, etc.
I reckon it should look something like this:

[00][ff][00][01][02][03][04]

where each of those is a byte.
then be followed by a font definition. This will be used in the descriptive section. I'm not sure about this. One method would be to use 8*8 characters, using 64 bytes for each value. For example to define A could be:
00000000
00AAA000
0A000A00
0AAAAA00
0A000A00
0A000A00
0A000A00
00000000
I hope you can forgive my s***ty ascii art skills.
This would obviously take up some space, but it is very explicit. It does have problems with defining space, though. (Although the numbers could be switched).
A more compact solution would give the number followed by 8 bytes of data.
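
To make the size difference concrete, here is a small sketch of the compact variant: each 8*8 glyph is packed one bit per pixel into 8 bytes and prefixed with its character number, giving 9 bytes instead of 64. The bit order (leftmost pixel in the most significant bit) and the one-byte character number are assumptions for the example.

```python
# The 8*8 "A" from the description above; any character other than
# "0" is treated as a set pixel.
GLYPH_A = [
    "00000000",
    "00AAA000",
    "0A000A00",
    "0AAAAA00",
    "0A000A00",
    "0A000A00",
    "0A000A00",
    "00000000",
]


def pack_glyph(rows):
    """Pack eight 8-character rows into 8 bytes, one bit per pixel.

    Assumed convention: the leftmost pixel is the most significant bit.
    """
    if len(rows) != 8 or any(len(row) != 8 for row in rows):
        raise ValueError("glyph must be 8 rows of 8 characters")
    packed = bytearray()
    for row in rows:
        byte = 0
        for ch in row:
            byte = (byte << 1) | (ch != "0")
        packed.append(byte)
    return bytes(packed)


def unpack_glyph(packed):
    """Expand 8 packed bytes back into eight rows of '0' and 'A'."""
    return [
        "".join("A" if byte & (0x80 >> bit) else "0" for bit in range(8))
        for byte in packed
    ]


def font_record(code, rows):
    """Character number (one byte) followed by the 8 packed bytes."""
    return bytes([code]) + pack_glyph(rows)


record = font_record(ord("A"), GLYPH_A)
print(record.hex())
```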

What about quoting a simple ISO code? Even if they do get superseded at some point there's still bound to be a full definition of each alphabet somewhere.

2) Descriptive text.
The first bit would describe the file format itself and how it worked.
The next bits would describe how to interpret each of the following tagged data fields. Each of these descriptions contains:
i) A tag number - this is specific to this document.
ii) A textual string; this is used by programs to decide whether they can display/use that data type.
iii) A version number, with major and minor parts
iv) An English (or whatever) description of how to use the data in the tagged field. This is necessarily very explicit and precise. All terms needed are described.

So a tag number is the identifier for each chunk of data? e.g. tag 1, tag 2, tag 3, etc.?

3) The data you want to store. Each of these sections has:
i) a tag number
ii) a length to the next section
iii) the data

Sounds OK, but it might be worthwhile changing it to a chunk format & chunk ID. That way you can have a listing for each chunk format in section 2, rather than a listing for every single chunk.
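
A minimal sketch of how sections 2 and 3 could fit together, assuming one descriptor per chunk format and a tag on every chunk that refers back to it. The field widths, the "length" meaning payload length rather than distance to the next section, and names such as "plain-text" and "mystery-grid" are all assumptions for illustration; the descriptors are kept in memory here rather than serialised.

```python
import struct
from dataclasses import dataclass


@dataclass
class Descriptor:
    tag: int          # tag number, specific to this document
    name: str         # string a program matches against
    major: int        # major version
    minor: int        # minor version
    description: str  # natural-language explanation for human readers


def pack_chunk(tag, payload):
    """Tag, payload length (both 32-bit little-endian), then the payload."""
    return struct.pack("<II", tag, len(payload)) + payload


def read_chunks(data):
    """Yield (tag, payload) for each chunk in `data`."""
    offset = 0
    while offset < len(data):
        tag, length = struct.unpack_from("<II", data, offset)
        offset += 8
        yield tag, data[offset : offset + length]
        offset += length


def usable_tags(descriptors, supported):
    """Return the tags whose name is known and whose major version matches."""
    return {d.tag for d in descriptors if supported.get(d.name) == d.major}


descriptors = [
    Descriptor(1, "plain-text", 1, 0, "Each byte is one ASCII character."),
    Descriptor(2, "mystery-grid", 3, 2, "Rows of samples; see the full text."),
]
body = pack_chunk(1, b"hello") + pack_chunk(2, b"\x01\x02") + pack_chunk(1, b"world")

# What this hypothetical program understands: name -> major version.
supported = {"plain-text": 1}
known = usable_tags(descriptors, supported)

texts = [payload for tag, payload in read_chunks(body) if tag in known]
skipped = [tag for tag, _ in read_chunks(body) if tag not in known]
print(texts, skipped)
```

Chunks with an unrecognised name are skipped by length, so the program still reads what it can and leaves the rest for a human armed with the description.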

That is my concept.
Some of the stuff comes from the Drawfile format, some I made up myself.
Does it suck donkeys?
I am interested in all suggestions.

I suppose it would work, but might not catch on for day-to-day use. Small, compact files might have their size doubled or tripled due to all the extra data they have to carry, and the format does nothing to actually allow programs to read old formats - it just acts as a description of that file format, so you may still have the problem of one piece of software bringing out a new format and ignoring its older versions.

One thing might be to 'componentise' the data descriptions, to allow reference to other sections and prevent all of them requiring the same repetitive definition of primitives.

Yes, that's a good idea as I've (sort of) already suggested above ;)

 
johnstlr Message #5073, posted at 14:58, 15/6/2002, in reply to message #5072
Unregistered user Ok, I've a few more comments to make, but I think I've got it, hence why I've snipped a lot.

Not necessarily. Even open formats can be lost, if they are rare enough. As specialist formats often will be.

Exhibit A (Although this may be to some extent a hardware problem, it proves the point.)

Fair comment. One thing I'm curious about is whether the Domesday project was ever fully documented. If not then it was never a truly "open" format.


Well, this is true, but the additional 'load' would be minimal, and you would really only have to pay for it (in programming) once. One small program could do this job, and pass on the actual data to the program which knew how to deal with it.

Assuming that engineers are willing to reuse this program (believe it or not, while sensible, it's a dangerous thing to assume) I guess you would need an architecture for mapping files onto programs that could process them. However I accept that this is detail outside of your proposal.

Um, yeah. I suspect that you might have missed one of my postings - please check. This is probably the addition of most importance; the header we discussed in some detail is only an introduction for this embedded information to make sure that the reader can use it.

Sorry, I think I was a little confused. If I interpret this correctly the header would "point" to the specification which could be something like

http://www.compuphase.com/flic.htm


Then I suggest you are dealing with basically ephemeral data - not archiving stuff. Therefore you would not be using the format for what it is intended.
However, there would be nothing to stop you serving just the 'actual' data, extracting it from the archive in real time.

No but there's also nothing stopping the person watching the film from saving out the data without the description.

I doubt that any researcher would be so machine-like as to repeatedly translate identical texts. Instead, the header would serve as a good marker that the following information was once considered important, and was intended to be recoverable.
As I have already said, this format is not intended to be a time-capsule, although obviously it would be superior to data without any meta-data. The 'header' is intended to allow you to read the natural language description.

While I understand that the format isn't designed to cope with time-capsule style scenarios the one slight sticking point is that so far it seems to be assumed that natural language descriptions will be in English despite the fact that more people speak Mandarin.

I realise it's a real nit picky point because, currently, most IT development is done either in the Western world (where English dominates) or countries like Japan, but other countries are catching up.

As far as I know, the only true way of making the description completely universal would be to have a mathematical description of the file *gulp*


I've just had a quick look into XML on the web, and I don't think it does what I'm suggesting. One may be able to make a sort of explicit data system using XML, but that is not my concern. After all, one may be able to create a Word document which described its format before getting on to useful stuff, but who would?

But I thought that was effectively what you're advocating. Ok I know you're describing a header here, but aren't the natural language descriptions really just a description of the file format before getting to the useful stuff?


Um, the proposal is an attempt to break this chain. That is why I suggested the structure I did, which basically says:

*snip*

Ok, one final question. I assume the header specification isn't self describing so I guess a specification of it would have to be published. Is there an assumption here that the "header spec" won't get lost or do files carry around a description of the header?


If I gave you a complex XML file, would you be able to reconstruct the data exactly as intended? (Using only the information in the file - I'll accept you know how the information is stored and ASCII.) :)
Is XML really universal - to use your example, would it be a good format to store a movie in?

Technically you could, but I concede the point that a completely self describing XML file is likely to be extremely complex.

You could store a movie in it as XML allows binary data to be embedded. Of course an XML specification of the embedded data is, as I say, likely to be "quite" complex.


I appreciate this discussion, even if it is critical. I'd like to think I've answered every one of your points now. I do wonder whether you see the intent for the format, especially when you talk about XML. I think it fulfills a different niche.

I apologise if I've come over as very critical as that wasn't my intention and I do appreciate that you've taken the time to answer my misconceptions.


Whether the format is worth specifying, let alone implementing is another question.

I think it's an interesting area. Actually there may be some parallels between reflective architectures and what you're describing, but whereas current meta-descriptions of reflective architectures tend to be fairly domain specific you seem to be opening up a much larger "domain".

 
Loris Message #5072, posted at 14:58, 15/6/2002, in reply to message #5071
Unregistered user Since this is getting long, I'm going to prune it fairly hard.

...
Often you want to recover old data but can't. It isn't really anyone's fault, it just happens. The aim is to address that.

But if the file format is open (ie documented) it isn't an issue. If we really want to get really nit picky, if your file is 20yrs old how are you going to recover it if the media it is stored on has degraded? ;)

I've already mentioned this problem, it is outside the scope of the format (and obviously so). The only solution I can think of to this would be to print it out.


1) the program is no longer maintained, and doesn't run now the operating system has changed.

This only applies to closed file formats.

Not necessarily. Even open formats can be lost, if they are rare enough. As specialist formats often will be.

Exhibit A (Although this may be to some extent a hardware problem, it proves the point.)

3) You have a file and don't have any idea even what sort of data it contains.

Again this is a documentation problem.

No it isn't.
If you'd never seen a spritefile, would you know what it was?


Well they could be. In fact they probably are.
But in 20 years time, the documentation is lost and you are screwed.

Again I don't completely agree. If the format is an open standard then the standards body which owns it should have an archive of the documentation. I realise it's not quite as secure as having the information in the file itself but it's unlikely that no one will have a copy.

Well, replace 'not quite as' by 'nothing like as' and you have my stance. And relying on happenstance is what I'm aiming to avoid.

From what I've described I think the parser would actually be really simple. ...

I don't agree that the parser would necessarily be simple - it would actually be more complex than a parser for the same data without the description.

Well, this is true, but the additional 'load' would be minimal, and you would really only have to pay for it (in programming) once. One small program could do this job, and pass on the actual data to the program which knew how to deal with it.

...The embedded description has to be accurate enough. I suppose people will have to judge that themselves. For an open file-format, there can easily be recognised authorities (cf Linus Torvalds) who can try to polish formats. The latter is basically identical, except in that the files may become separated.

This suggests that the description is going to have some formal specification. Does this specification also get carried around in the file because otherwise what happens if the description spec gets lost?

Um, yeah. I suspect that you might have missed one of my postings - please check. This is probably the addition of most importance; the header we discussed in some detail is only an introduction for this embedded information to make sure that the reader can use it.

:) You are storing a movie and you worry about an 8k header?

Yes because it might be a low bitrate movie that I'm serving over a network link which either has a limited amount of bandwidth or I have to pay for the bandwidth that I use.

Then I suggest you are dealing with basically ephemeral data - not archiving stuff. Therefore you would not be using the format for what it is intended.
However, there would be nothing to stop you serving just the 'actual' data, extracting it from the archive in real time.

...


As has already been pointed out, what if you can't read the natural language description? While this is a problem for any specification at least the spec is independent of the file and so you only have to worry about translating it once.

I doubt that any researcher would be so machine-like as to repeatedly translate identical texts. Instead, the header would serve as a good marker that the following information was once considered important, and was intended to be recoverable.
As I have already said, this format is not intended to be a time-capsule, although obviously it would be superior to data without any meta-data. The 'header' is intended to allow you to read the natural language description.

I don't think this is a problem. In this case the program only looks at the strings to find out if it can use the data, then at the tags to find which sections they correspond to. If it can't recognise a name-string, the onus is on the user to recover the data.

Well that's pretty much how XML works too. A document has a DTD version and you can mark fields as compulsory so that if you can't understand those fields you can't understand all the data. Alternatively it's possible to mark fields as not being that important.

I've just had a quick look into XML on the web, and I don't think it does what I'm suggesting. One may be able to make a sort of explicit data system using XML, but that is not my concern. After all, one may be able to create a Word document which described its format before getting on to useful stuff, but who would?


It may well be, I know very little about XML. However I certainly got the impression that for XML you need to basically know what the system is describing before you can do anything with it. If this is the case, it fulfills a different niche entirely.

Surely you have to know what the system is describing before you can do anything with it anyway? Otherwise you'd only see a bunch of binary data. In XML you don't need the DTD either, although it makes validation much easier. Besides, there's nothing to stop you putting comments in the file.

Um, the proposal is an attempt to break this chain. That is why I suggested the structure I did, which basically says:
* I order the information this way (bytes)
* Here is the language (Latin alphabet; English)
* This is how the format works (meta-meta data, formally described in English)
* I contain information in the following formats (meta data on the information, formal description)
* [Information]

Obviously one cannot describe all data-formats known to man, so I included a system by which only relevant meta-data would be included. This means we have meta-meta data. The use of English means that there is no further recursion; intelligent organisms are the assumed readers.

If I gave you a complex XML file, would you be able to reconstruct the data exactly as intended? (Using only the information in the file - I'll accept you know how the information is stored and ASCII.) :)
Is XML really universal - to use your example, would it be a good format to store a movie in?

Believe me I know exactly where you're coming from. About a year or so ago I really could've done with the documentation for the format of Rink's "links" file as I got 95% of the way to writing a dynamic linker and then struggled. If I'd had the file format I probably would've finished it. As it is the linker was condemned to the depths of my HD.

I appreciate this discussion, even if it is critical. I'd like to think I've answered every one of your points now. I do wonder whether you see the intent for the format, especially when you talk about XML. I think it fulfills a different niche.
Whether the format is worth specifying, let alone implementing is another question.

 
johnstlr Message #5071, posted at 14:58, 15/6/2002, in reply to message #5067
Unregistered user
I think this misses the point. The above is all true, but it isn't a good thing. Often you want to recover old data but can't. It isn't really anyone's fault, it just happens. The aim is to address that.

But if the file format is open (ie documented) it isn't an issue. If we really want to get really nit picky, if your file is 20yrs old how are you going to recover it if the media it is stored on has degraded? ;)


The only other problems are :)
1) the program is no longer maintained, and doesn't run now the operating system has changed.

This only applies to closed file formats.


2) The program doesn't (didn't ever) run on your system.

Ditto.


3) You have a file and don't have any idea even what sort of data it contains.

Again this is a documentation problem.


really (1) is the doozy.

True, but only because we're used to fundamentally closed systems. I bet there are configuration files from the very early days of UNIX still in use but then just about all configuration in UNIX is done using plain text files.



Well they could be. In fact they probably are.
But in 20 years time, the documentation is lost and you are screwed.

Again I don't completely agree. If the format is an open standard then the standards body which owns it should have an archive of the documentation. I realise it's not quite as secure as having the information in the file itself but it's unlikely that no one will have a copy.


From what I've described I think the parser would actually be really simple. To move to a new platform is just as easy/difficult as normal if you have the code. If not, it should at least be possible to program, as you know the specs.

I don't agree that the parser would necessarily be simple - it would actually be more complex than a parser for the same data without the description.


The first is a good point. The embedded description has to be accurate enough. I suppose people will have to judge that themselves. For an open file-format, there can easily be recognised authorities (cf Linus Torvalds) who can try to polish formats. The latter is basically identical, except in that the files may become separated.

This suggests that the description is going to have some formal specification. Does this specification also get carried around in the file because otherwise what happens if the description spec gets lost?

It's remarkably easy to get terminology confused, even between people who work in the same field.


:) You are storing a movie and you worry about an 8k header?

Yes because it might be a low bitrate movie that I'm serving over a network link which either has a limited amount of bandwidth or I have to pay for the bandwidth that I use.


The font is to indicate what the alphabet of the embedded natural language is. To be honest, it might be a little superfluous, ASCII being basically fixed, at least for letters, numbers and basic punctuation. If you have a program which understands the file sections, you would never see the embedded text, it would just use the data and ignore the text.
On the other hand, it does have a lot of redundancy, which is the best way of showing that you want to transmit a message.

As has already been pointed out, what if you can't read the natural language description? While this is a problem for any specification at least the spec is independent of the file and so you only have to worry about translating it once.


I don't think this is a problem. In this case the program only looks at the strings to find out if it can use the data, then at the tags to find which sections they correspond to. If it can't recognise a name-string, the onus is on the user to recover the data.

Well that's pretty much how XML works too. A document has a DTD version and you can mark fields as compulsory so that if you can't understand those fields you can't understand all the data. Alternatively it's possible to mark fields as not being that important.

It may well be, I know very little about XML. However I certainly got the impression that for XML you need to basically know what the system is describing before you can do anything with it. If this is the case, it fulfills a different niche entirely.

Surely you have to know what the system is describing before you can do anything with it anyway? Otherwise you'd only see a bunch of binary data. In XML you don't need the DTD either, although it makes validation much easier. Besides, there's nothing to stop you putting comments in the file.

Believe me I know exactly where you're coming from. About a year or so ago I really could've done with the documentation for the format of Rink's "links" file as I got 95% of the way to writing a dynamic linker and then struggled. If I'd had the file format I probably would've finished it. As it is the linker was condemned to the depths of my HD.

 
Loris Message #5067, posted at 14:58, 15/6/2002, in reply to message #5065
Unregistered user
One problem as I see it with file formats is that they get upgraded somehow. Programs are usually backwards compatable - at least for a while, but sometimes they just stop.

I don't see this as an issue. I've never heard of an application that couldn't read at least the pervious version of it's data files and all you've got to do is save out in the new format.

The only problem is if someone sends you data that is from an older version of the software. OK, you can't read it, but is it your fault? Is it the fault of the software? From a software engineering perspective it's neither your fault nor the software's. Sometimes it's only practical to maintain backwards compatibility for so long, and then the effort of doing so outweighs the benefits.

I think this misses the point. The above is all true, but it isn't a good thing. Often you want to recover old data but can't. It isn't really anyone's fault, it just happens. The aim is to address that.

The only other problems are smile :
1) The program is no longer maintained, and doesn't run now the operating system has changed.
2) The program doesn't (and never did) run on your system.
3) You have a file and don't have any idea even what sort of data it contains.
4) There are probably others, let me think on it smile
Really, (1) is the doozy.

After a while the program you have which analyses it just stops working on your old data. OR you lose the program. Or you want to read a file on a different format (dear to our hearts, this one) etc.

Again all of these problems could be solved by simply documenting the file format. You don't need a gee whiz, hyper intelligent, mega complex file format and analyser to do this.

Well they could be. In fact they probably are.
But in 20 years time, the documentation is lost and you are screwed. I'm flattered by your description of the format, but I don't think it is appropriate.

My idea is that it would be nice to have a format which you could refer back to after (say) 10-50 years or so, and still be able to interpret the data in exactly the way it was meant.

There are several problems with your suggestions

Good, now we're talking. cool


1) You still need to be able to construct an appropriate parser and without one you still can't move between platforms.
From what I've described I think the parser would actually be really simple. To move to a new platform is just as easy/difficult as normal if you have the code. If not, it should at least be possible to program, as you know the specs.

2) What if the embedded natural language isn't accurate enough? Also why not have the description outside of the file in a document - it's more efficient.

The first is a good point. The embedded description has to be accurate enough. I suppose people will have to judge that themselves. For an open file-format, there can easily be recognised authorities (cf Linus Torvalds) who can try to polish formats. The latter is basically identical, except in that the files may become separated.

3) What is the font definition for? What if I'm storing a picture or a movie?

smile You are storing a movie and you worry about an 8k header? monkey
The font is to indicate what the alphabet of the embedded natural language is. To be honest, it might be a little superfluous, ASCII being basically fixed, at least for letters, numbers and basic punctuation. If you have a program which understands the file sections, you would never see the embedded text, it would just use the data and ignore the text.
On the other hand, it does have a lot of redundancy, which is the best way of showing that you want to transmit a message.

4) A header is only as good as the parser that reads it. Almost all file formats contain a header which describes what follows. The problem comes when the header can't describe a new extension.
I don't think this is a problem. In this case the program only looks at the strings to find out if it can use the data, then at the tags to find which sections they correspond to. If it can't recognise a name-string, the onus is on the user to recover the data.

As far as I'm aware pretty much all you want to do can be achieved already in XML, but it's a case of industry wide standards taking time to seep through to RISC OS (thanks to Justin Fletcher for making the start)

It may well be, I know very little about XML. However I certainly got the impression that for XML you need to basically know what the system is describing before you can do anything with it. If this is the case, it fulfills a different niche entirely.

 
Loris Message #5066, posted at 14:58, 15/6/2002, in reply to message #5064
Unregistered user

Surely just writing simple conversion tools isn't too bad?

It isn't too bad - provided you have the information. This is really what the format is intended to solve. I think it is actually fairly common. Since I've started thinking about it I've come across 2 examples, without even looking.

<snip - (1) the header>

What about quoting a simple ISO code? Even if they do get superseded at some point there's still bound to be a full definition of each alphabet somewhere.

I suppose it is possible. But you'd need to be very clear about which encoding system. (How? - without using text?) I don't really like the idea, because it means the decipherer would have to find that format before getting started.
Even for the 'big' format, the overhead would be something like 8*8*64 bytes=4k
And it does clearly indicate that we mean business about making it recoverable.
A couple of things I didn't mention:
Letters and numbers and basic punctuation would be ASCII of course. This means a simple scan using a text editor would show it up for the foreseeable future.
Many of the other characters could be control characters to give formatting of the descriptive text. These would of course have to follow the ASCII characters and would be described using them.

<snip - (2) Descriptive text. - except>


i) A tag number - this is specific to this document.
ii) A textual string; this is used by programs to decide whether they can display/use that data type.
iii) A version number, with major and minor parts
iv) An English (or whatever) description of how to use the data in the tagged field. This is necessarily very explicit and precise. All terms needed are described.


So a tag number is the identifier for each chunk of data? e.g. tag 1, tag 2, tag 3, etc.?

A tag would be an identifier for each type of data. So you could have several sections with the same tag number. This is similar to how drawfiles work, except that here they are specific only to the document. The relationship between type and the tag relies on the string in (ii) above. This is to allow anyone to make up their own section format. Provided it follows the rules, it should be interpretable later.

<snip - (3) the data>

Sounds OK, but it might be worthwhile changing it to a chunk format & chunk ID. That way you can have a listing for each chunk format in section 2, rather than a listing for every single chunk.
I think I've addressed this above.
All 'chunks' of the same type would have the same tag number, and not require multiple listings.
It might be worth giving them a chunk number as well... but I can't think of a good reason to ATM.


I suppose it would work, but might not catch on for day-to-day use. Small, compact files might have their size doubled or tripled due to all the extra data they have to carry, and the format does nothing to actually allow programs to read old formats - it just acts as a description of that file format, so you may still have the problem of one piece of software bringing out a new format and ignoring its older versions.
You are right it might not catch on - but that doesn't mean we can't consider it. I'm kind of hoping its open-source for data ethic might make it catch on in the Linux community wink
Regarding file-size - well that isn't really a concern of the format. What would you rather have: 8k of vital data you can't read any more, or a 64k file which you know you could recover the data from? The problem would be reduced for larger sizes, which most of them almost certainly would be - especially if you had multiple small files all combined into the one file.
In any case, this doesn't seem to worry the majority, who are perfectly happy sending bloated, unreadable files around the place.
Regarding your last point - if your new program couldn't read the file, you could write a plugin for it.

I see this file format as being infinitely adaptable. Let me give you a genetical example:
There could be several related section formats for describing DNA. Many of these would not contain any display information at all. There could be a 'packed DNA' format - using just 2 bits for each nucleotide, ones for frequent reference which allow comments etc to be held, comparisons between different bits and many more. (I have to deal with lots of stuff like this on a day to day basis.)
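
To make the 'packed DNA' idea concrete, here is a minimal sketch in Python. The bit assignment (A=0, C=1, G=2, T=3), the most-significant-bits-first order and the choice to store the sequence length separately are all assumptions for illustration. They are exactly the kind of detail the descriptive text would have to spell out.

```python
# Assumed mapping of nucleotides to 2-bit codes (not part of any agreed format).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_dna(seq):
    """Pack a string of A/C/G/T into bytes, four nucleotides per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final group so the first base is always in the top bits.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack_dna(packed, length):
    """Reverse pack_dna; length is needed because the last byte may be padded."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

Seven nucleotides fit in two bytes this way, but without the mapping and the length the bytes are meaningless, which is rather the point of carrying the description in the file.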

One thing might be to 'componentise' the data descriptions, to allow reference to other sections and prevent all of them requiring the same repetitive definition of primitives.

Yes, that's a good idea as I've (sort of) already suggested above wink

Good. Let us consider the implementation.

 
Loris Message #5063, posted at 14:58, 15/6/2002, in reply to message #5054
Unregistered user OK folks, I'm going to expand on my idea, although not in too much detail right now as I should be working. smile

One problem as I see it with file formats is that they get upgraded somehow. Programs are usually backwards compatible - at least for a while, but sometimes they just stop. After a while the program you have which analyses it just stops working on your old data. OR you lose the program. Or you want to read a file on a different format (dear to our hearts, this one) etc.

So.
My idea is that it would be nice to have a format which you could refer back to after (say) 10-50 years or so, and still be able to interpret the data in exactly the way it was meant.
I should mention here that changes in hardware access etc are outside the scope of this file format - if you want that, I guess printouts are required.
Also, it is not intended to be for time capsules etc where civilisations rise and fall and language changes.

So, my idea is basically that there is a human-readable (i.e. English) description of each of the subsections, stored in the file. Before you all say "Duh!", let me explain myself.

The file has these sections:
1) Header - this is to try and minimise problems finding the start of the file, and indicate exactly how bytes are used, etc.
I reckon it should look something like this:

[00][ff][00][01][02][03][04]
where each of those is a byte.
This would then be followed by a font definition, which will be used in the descriptive section. I'm not sure about this. One method would be to use 8*8 characters, using 64 bytes for each value. For example, the definition of A could be:
00000000
00AAA000
0A000A00
0AAAAA00
0A000A00
0A000A00
0A000A00
00000000
I hope you can forgive my s***ty ascii art skills.
This would obviously take up some space, but it is very explicit. It does have problems with defining space, though. (Although the numbers could be switched).
A more compact solution would give the number followed by 8 bytes of data.
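
As a rough sketch of both encodings in Python: the seven magic bytes and the glyph rows are taken from the post above, but everything else is an assumption, in particular that a set pixel in the verbose form is stored as the character's own code, and that the compact form packs one bit per pixel with the leftmost pixel in the top bit.

```python
# Magic bytes as written in the post; their interpretation is left open there.
MAGIC = bytes([0x00, 0xFF, 0x00, 0x01, 0x02, 0x03, 0x04])

# The 8*8 picture of 'A' from the post, one string per row.
GLYPH_A = [
    "00000000",
    "00AAA000",
    "0A000A00",
    "0AAAAA00",
    "0A000A00",
    "0A000A00",
    "0A000A00",
    "00000000",
]

def glyph_verbose(rows, code):
    """64 bytes: one byte per pixel, the character code where set, 0 elsewhere."""
    return bytes(code if ch != "0" else 0 for row in rows for ch in row)

def glyph_compact(rows, code):
    """9 bytes: the character code, then one byte per row, one bit per pixel."""
    packed = bytes(
        sum(1 << (7 - i) for i, ch in enumerate(row) if ch != "0")
        for row in rows
    )
    return bytes([code]) + packed

verbose = glyph_verbose(GLYPH_A, ord("A"))
compact = glyph_compact(GLYPH_A, ord("A"))
```

The verbose form costs 64 bytes per character against 9 for the compact one, so the choice is really between being obvious in a hex dump and being small.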

2) Descriptive text.
The first bit would describe the file format itself and how it worked.
The next bits would describe how to interpret each of the following tagged data fields. Each of these descriptions contains:
i) A tag number - this is specific to this document.
ii) A textual string; this is used by programs to decide whether they can display/use that data type.
iii) A version number, with major and minor parts
iv) An English (or whatever) description of how to use the data in the tagged field. This is necessarily very explicit and precise. All terms needed are described.
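
One way to picture a single description entry, as a sketch in Python. The class name, field names and the two example entries are hypothetical. Only the four parts (i) to (iv) come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Descriptor:
    tag: int           # (i) tag number, meaningful only inside this document
    name: str          # (ii) string that programs match against
    version: tuple     # (iii) (major, minor)
    description: str   # (iv) the explicit natural-language specification

# Hypothetical entries; the name-strings and wording are invented for illustration.
descriptors = [
    Descriptor(1, "example.plaintext", (1, 0),
               "Each byte of the data is one ASCII character."),
    Descriptor(2, "example.packed-dna", (1, 2),
               "Each byte holds four nucleotides, two bits each."),
]

# Tags are local to the document, so a reader builds this table per file.
by_tag = {d.tag: d for d in descriptors}
```

A program would only ever compare the name-string and version; the description is there for the person who has no such program.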

3) The data you want to store. Each of these sections has:
i) a tag number
ii) a length to the next section
iii) the data
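
A minimal sketch of how such sections could be written and read back, in Python. The field widths are not fixed anywhere above, so a 16-bit tag and a 32-bit length, both big-endian, are assumed here, with the length counting only the data bytes.

```python
import struct

# Assumed layout: 16-bit tag, 32-bit data length, big-endian, no padding.
SECTION_HEADER = struct.Struct(">HI")

def write_section(tag, data):
    """Return one section: tag, length of data, then the data itself."""
    return SECTION_HEADER.pack(tag, len(data)) + data

def read_sections(blob):
    """Yield (tag, data) pairs, using each length to skip to the next section."""
    offset = 0
    while offset < len(blob):
        tag, length = SECTION_HEADER.unpack_from(blob, offset)
        offset += SECTION_HEADER.size
        yield tag, blob[offset:offset + length]
        offset += length

# Two sections share tag 1: the tag names a type of data, not an individual section.
blob = (write_section(1, b"hello")
        + write_section(7, b"\x00\x01")
        + write_section(1, b"again"))
sections = list(read_sections(blob))
```

Because every section carries its length, a reader can step over any section it does not understand without knowing what is inside it.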


So.
If you get a file, you can:
worst case - write a program to read the file
best case - already have a program to display the file
intermediate case - have a program which will display parts of the file, and be theoretically able to plug in components to read the others.
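
The three cases above can be sketched as a lookup of handlers by name-string. This is Python again, and the function name, the name-strings and the handler table are all hypothetical.

```python
def render(sections, names_by_tag, handlers):
    """Split sections into those a handler could decode and those it could not."""
    decoded, undecoded = [], []
    for tag, data in sections:
        name = names_by_tag.get(tag)    # from the descriptive text in the file
        handler = handlers.get(name)    # from whatever components are installed
        if handler is None:
            undecoded.append((tag, name))
        else:
            decoded.append(handler(data))
    return decoded, undecoded

# Hypothetical file contents and one installed component.
names_by_tag = {1: "example.plaintext", 2: "example.packed-dna"}
handlers = {"example.plaintext": lambda data: data.decode("ascii")}
sections = [(1, b"hello"), (2, b"\x1b"), (1, b"again")]

decoded, undecoded = render(sections, names_by_tag, handlers)
```

With every handler present this is the best case, with none it is the worst case, and here it is the intermediate one: the text sections display and the DNA section is reported by name so someone can plug in or write a reader for it.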

That is my concept.
Some of the stuff comes from the Drawfile format, some I made up myself.
Does it suck donkeys?
I am interested in all suggestions.

One thing might be to 'componentise' the data descriptions, to allow reference to other sections and prevent all of them requiring the same repetitive definition of primitives.

 
johnstlr Message #5062, posted at 14:58, 15/6/2002, in reply to message #5061
Unregistered user Java has a pretty comprehensive set of XML APIs, either as add ons for 1.3 or as standard for 1.4.

If you mean tools such as XML browsers and stuff, well they'll probably appear at some point as the free APIs become more widely spread.

Having said that the beauty of XML being a text format is that you could just use Zap or StrongEd wink

 
johnstlr Message #5060, posted at 14:58, 15/6/2002, in reply to message #5059
Unregistered user
I've been thinking about format specifications along the lines of something like regular expressions.

The binary formats of the fundamental elements in the file are defined in named declarations, then used to describe more complex structures.

Well this is basically what XML allows but you still need a parser and application that can understand the semantics of the XML file.


The problem is defining the meaning of these structures: how do you describe that the file contains an image, audio data or a series of page definitions?

One place to start might be the MPEG standard for describing the contents of MPEG files (MPEG-7 IIRC) although they're trying to describe the data in a form that is directly readable by humans.


a) Yes, as long as you can decipher each format to start with.

As you've already pointed out you need some sort of meta data describing the contents of the file (i.e. an XML DTD) but you still need to understand the meta data to start with.

Is this really a problem though? It's only a problem if the data is stored in a format that is closed (ie not documented anywhere). It's relatively easy to avoid that these days and old file formats aren't going to carry any sort of meta data so you still need to either reverse engineer them or find someone who does know the format.

This is a tricky thing to do. I remember being at a workshop once discussing how to represent network QoS in a generic form. The best we could come up with at the time was "an application defined data structure" because whatever we came up with we found limitations. In comparison to what is being suggested here, network QoS is a pretty limited field smile

I'm not saying it's impossible, just that it might be less effort to simply document file formats. It'd help if it were possible to standardise on file formats - i.e. everyone uses PNG and JPEG instead of sprites or BMPs etc. This isn't going to happen while MS continue to "embrace and extend" standards though.

 
Phlamethrower Message #5056, posted at 14:58, 15/6/2002, in reply to message #5055
Unregistered user a) possibly, even if it may be a bit impossible
b) ok then smile
 
I don't have tourettes you're just a cun Message #5055, posted by [mentat] at 14:58, 15/6/2002, in reply to message #5054
[mentat]Fear is the mind-killer
Posts: 6266
a) potentially
b) maybe

grin

 
