MediaTags specifications clarification
Published 2010-11-30
This was converted from e-mail / html to markdown with html2text.py
Speed, Completeness, and Identification are the three most important qualities of this application.
Speed
- The end product is "MediaBox", a small 600MHz OMAP3530-based device, which will use MediaTags on 30,000+ files.
Completeness
- It should be possible to represent every meta-data tag present in the file.
- It should be able to read tags in an unabstracted manner
- Pure binary tags, such as Album Art, may also be extracted as JSON
  * --extract-binary-tags=base64 -- included in JSON
    Ex: "@art": "Az48tks9cC...."
  * --extract-binary-tags-to=path/to/attachments-dir -- placed into the folder and referenced in JSON
    Ex: "@art": "./path/to/attachments-dir/my song.m4a.@art.jpeg"
  * Binary tag extraction will usually be a post-processing feature and should be off by default
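The two extraction modes above can be sketched roughly as follows. This is only an illustration in Python, not the intended implementation: the function name `encode_binary_tag` and its parameters are hypothetical, and the attachment file name simply mirrors the example above (`<media file name>.<tag>.<extension>`).

```python
import base64
import os


def encode_binary_tag(media_path, tag_name, data, ext, attachments_dir=None):
    """Return the JSON value for one binary tag (hypothetical helper).

    Without attachments_dir the bytes are inlined as base64 text.
    With attachments_dir the bytes are written to that folder and the
    path of the written file is returned instead.
    """
    if attachments_dir is None:
        return base64.b64encode(data).decode("ascii")

    os.makedirs(attachments_dir, exist_ok=True)
    # Naming assumption taken from the example: "<file>.<tag>.<ext>"
    name = "{}.{}.{}".format(os.path.basename(media_path), tag_name, ext)
    out_path = os.path.join(attachments_dir, name)
    with open(out_path, "wb") as fh:
        fh.write(data)
    # Use forward slashes so the reference looks the same in JSON everywhere.
    return out_path.replace(os.sep, "/")
```

Calling it with no `attachments_dir` corresponds to the base64 mode; passing a directory corresponds to the extract-to-folder mode.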
- Both stream and media metadata should be present, but separate.
Identification - Checksums of the "stream" part of a file
- Each tag application should be able to produce a checksum of the stream (data) portion of a file
  * --with-stream-md5sum
  * --with-stream-sha2sum
- The checksum should not be of the file as a whole
- The checksum should not include the tags
- The checksum is probably from the byte offset of the last header tag to the end of the file or first header tag
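For MP3, one possible reading of the last point is: skip a leading ID3v2 tag and a trailing ID3v1 tag, then hash what is left. The sketch below (Python, for illustration only) assumes exactly that and nothing more; it ignores APE and Lyrics3 tags, and the function names are hypothetical. The ID3v2 header layout (10 bytes, 4-byte syncsafe size, optional 10-byte footer) and the 128-byte ID3v1 trailer starting with "TAG" come from the ID3 specifications.

```python
import hashlib


def mp3_stream_bounds(data):
    """Return (start, end) offsets of the audio stream in an MP3 byte string."""
    start, end = 0, len(data)
    # ID3v2: "ID3" + version (2 bytes) + flags (1 byte) + syncsafe size (4 bytes)
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
    # ID3v1: fixed 128-byte trailer beginning with "TAG"
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return min(start, end), end


def stream_sha256(data):
    """Hex SHA-256 of the stream portion only, tags excluded."""
    start, end = mp3_stream_bounds(data)
    return hashlib.sha256(data[start:end]).hexdigest()
```

Two copies of the same audio with different tags should then produce the same digest. On a small device the file would more likely be hashed in chunks between the two offsets rather than loaded whole.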
Conclusions
TagLib - Let's use taglib if
- taglib allows raw access to tags (I believe it does)
- taglib can generate the same detail of information as AtomicParsley
Mutagen - Probably not a good fit
- mutagen is significantly slower than taglib ?
- mutagen does not allow access to all tags, just abstracted normalized ones ?
Libexiv2 - yes
Exiftool - probably not a good fit
- Is it fast? If not, no
- Does it allow access to all tags, or does it abstract them?
Type Detection - GNU file is too slow!
- My tests show that file ./my-song.m4a takes more time than AtomicParsley -t ./my-song.m4a
  * detecting the file type should not take more time than parsing the file!
- Type detection should be very, very simple
  * if the file has an extension, use the extension to determine the type
    * if it doesn't match a known type, ignore it
  * if the file doesn't have an extension (very rare), try matching the first few bytes of the header
    * it's okay to use file for rare cases - little time will be wasted in comparison
- Fail with an error if the file cannot be parsed as expected
  * some media types can have multiple types of tags (id3, m4a, musepack, oggtag?, etc?)
    * try the most likely first (mp3 -> id3)
  * some media types can have embedded tags
    * mp3 -> album art -> jpeg -> exif
    * only parse the intended type
    * don't parse exif data from an mp3
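The extension-first rule with a header fallback can be sketched like this. It is a minimal Python illustration, not a finished detector: the lookup table and the names `EXTENSIONS`, `sniff_header` and `detect_type` are hypothetical, only a handful of well-known signatures are checked, and treating every "ftyp" file as m4a is a simplifying assumption.

```python
import os

# Hypothetical lookup table; a real build would list every supported type.
EXTENSIONS = {
    ".mp3": "mp3",
    ".m4a": "m4a",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".pdf": "pdf",
}


def sniff_header(head):
    """Guess a type from the first bytes of a file, or return None."""
    if head[:3] == b"ID3":  # MP3 that starts with an ID3v2 tag
        return "mp3"
    if head[:3] == b"\xff\xd8\xff":  # JPEG start-of-image marker
        return "jpeg"
    if head[:5] == b"%PDF-":
        return "pdf"
    if head[4:8] == b"ftyp":  # ISO base media file (MP4 family)
        return "m4a"
    return None


def detect_type(path):
    """Use the extension when there is one; sniff the header only otherwise."""
    ext = os.path.splitext(path)[1].lower()
    if ext:
        # An unknown extension yields None: the file is ignored, not sniffed.
        return EXTENSIONS.get(ext)
    with open(path, "rb") as fh:
        return sniff_header(fh.read(12))
```

The file is never opened when it has an extension, which is the point of the rule: detection stays cheaper than parsing.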
What does Unix Filter Class mean?
Priority
While I was waiting, my friend created prototypes for outputting mp3 and m4a media metadata, which I am using for now.
The most important thing that I need right now is to be able to checksum the data stream.
It's okay to rearrange some of the other things if it's better for your workers' workflow,
but I would like the checksum-ing ability first.
Once --literal-tags is done, I'll know better what --normalized-tags should look like.
- Stream (not file) checksums -- { "stream": { "sha256sum": "ae68f......" } }
  * jpegtags --without-metadata --with-sha256sum ./my-file.jpeg
  * mp3tags --without-metadata --with-sha256sum ./my-file.mp3
  * aactags --without-metadata --with-sha256sum ./my-file.m4a
- JPEG Media metadata --literal-tags
  * exivtags ./my-file.jpeg
  * xmptags ./my-file.jpeg
  * iptctags ./my-file.jpeg
- Stream metadata
  * jpegtags
  * aactags
  * mp3tags
- Media --literal-tags
  * m4atags
  * id3tags
- Media --verbose-tags
  * m4atags
  * id3tags
  * exivtags
  * xmptags
  * iptctags
- eBook/pdf tags
  * more information is needed about what is stored and can be extracted
Before --normalized-tags I first want to see the outputs of the stream and meta-data --literal-tags
I've pushed --binary-tags to be a future consideration
General Clarifications
Meta-data organization
I want to make it clear that there are three types of meta data that I am particularly interested in.
Media (tag) metadata
- The tags that universally describe a particular piece of artwork / media
  * Music (id3, m4a): artist, album, track number, rating
  * Images (exif, iptc, xmp): geo location, keywords, aspect ratio, date/time taken, visual similarity metrics
  * Documents (proprietary): author, title / subject, keywords, text body
Stream (data) metadata
- The tags that describe a specific stream of media, but not the artwork / media itself
  * Music (aac, mp3): md5sum, stream type (aac, mp3), quality, bitrate
  * Images (jpeg): md5sum, stream type, quality, width, height, color depth
  * Documents (odxml, msxml, pdf): md5sum, stream type (xml, ms-binary), word count, page count
File (data + tag) metadata - not necessary to analyze at this time
- Tags that describe a set of bytes on disk
  * --with-file-metadata
  * All Types: md5sum, access time, modified time, size, inode count
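One way to keep the three kinds of metadata apart is to give each its own top-level section in the output. The sketch below is a Python illustration under assumptions: the `"stream"` key matches the checksum example earlier, while `"media"`, `"file"`, the field names and both function names are hypothetical. "inode count" is read here as the hard-link count, which may not be what was meant.

```python
import hashlib
import os


def file_metadata(path):
    """Collect metadata about the bytes on disk (tags and stream together)."""
    st = os.stat(path)
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Hash in chunks so large files are never held in memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
    return {
        "md5sum": md5.hexdigest(),
        "atime": st.st_atime,
        "mtime": st.st_mtime,
        "size": st.st_size,
        # Assumption: "inode count" is interpreted as the hard-link count.
        "nlink": st.st_nlink,
    }


def build_report(path, media=None, stream=None, with_file_metadata=False):
    """Keep media, stream and file metadata in separate sections."""
    report = {"media": media or {}, "stream": stream or {}}
    if with_file_metadata:
        report["file"] = file_metadata(path)
    return report
```

The file section is only filled in on request, mirroring the idea that --with-file-metadata is optional and not needed yet.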
Examples
- I have an mp3 and an m4a of the same song.
  * The media metadata will be almost exactly the same
    * the exception is that some tag formats support more options than others
  * The stream metadata will almost always be different
    * an exception may be that the bitrates are the same
  * The file metadata will almost always be different
Text
- Many of the files to be processed will contain Chinese, Japanese and other international characters
  * UTF-8 should be used, not ASCII alone.
  * UTF-16 may also be used.
- --pretty-print should output with pretty whitespacing -- somewhat like JSON.stringify(object, null, "\t")
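As an illustration of both points, here is a small Python sketch that keeps international characters intact and mimics the tab-indented style of JSON.stringify(object, null, "\t"). The helper name `to_json` is hypothetical; `json.dumps` with `ensure_ascii=False` and a string `indent` is standard library behaviour.

```python
import json


def to_json(obj, pretty=False):
    """Serialize tags without escaping non-ASCII characters."""
    # ensure_ascii=False keeps characters such as CJK text as-is instead of
    # \uXXXX escapes; the caller then encodes the string as UTF-8.
    if pretty:
        return json.dumps(obj, ensure_ascii=False, indent="\t")
    return json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
```

Writing the result with `text.encode("utf-8")` gives UTF-8 output; UTF-16 would only change that final encoding step.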
Modularity
The most important parts of the organization are these:
- The library should be modular, I prefer small bits of code that each do one thing well
- It should be easy to build just one feature of the application or incorporate it in another application
- mediatags /my-song.mp3 --with-stream-tags --with-md5sum gives the combined result of
  * id3tags /my-song.mp3
  * mp3tags --with-md5sum /my-song.mp3
- m4atags /my-song.mp3 returns { "error": "no m4a tags found" }
- aactags /my-song.mp3 returns { "error": "no aac stream found" }
- In the future I would like to create a MediaTags plugin for Node.JS
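How the combined result is shaped is not specified above, so the following Python sketch is only one possibility. It assumes each small tool returns either a dictionary of tags or an `{ "error": ... }` object as in the examples, and it places the two results under hypothetical `"media"` and `"stream"` keys, collecting any errors in a hypothetical `"errors"` list.

```python
def combine(media_result, stream_result):
    """Merge the output of a tag reader and a stream reader (hypothetical)."""
    combined = {}
    errors = []
    for section, result in (("media", media_result), ("stream", stream_result)):
        if "error" in result:
            # Keep the message but leave the failed section out entirely.
            errors.append(result["error"])
        else:
            combined[section] = result
    if errors:
        combined["errors"] = errors
    return combined
```

Because each reader stays independent, a failure in one (for example no m4a tags in an mp3) does not prevent the other section from being reported.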
A possible organization
- mediatags - single binary that handles any type of file
  * libmediatags.o
    * libmediatagsid3.c
    * libmediatagsm4a.c
    * libmediatagsexiv.c
    * libmediatagsmp3.c
    * libmediatagsaac.c
    * libmediatagsjpeg.c
    * libmediatagspdf.c
    * libmediatagsdoc.c
    * libmediatagsodt.c
  * id3tags --> mediatags (symlink)
  * m4atags --> mediatags (symlink)
  * etc
- Each lib has a method such as getMediaTags(), getStreamTags()
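The symlink arrangement works because a program can inspect the name it was invoked as and pick the matching module. A minimal Python sketch of that dispatch follows; the real project would presumably do this in C, and the handler functions and the `HANDLERS` table here are placeholders invented for illustration.

```python
import os


def id3_handler(path):
    # Placeholder: a real handler would call the id3 library's getMediaTags().
    return {"tool": "id3tags", "path": path}


def m4a_handler(path):
    # Placeholder: a real handler would call the m4a library's getMediaTags().
    return {"tool": "m4atags", "path": path}


# Hypothetical mapping from invocation name to handler.
HANDLERS = {"id3tags": id3_handler, "m4atags": m4a_handler}


def dispatch(argv):
    """Pick a handler from the name the program was invoked as (argv[0])."""
    name = os.path.basename(argv[0])
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": "unknown tool: " + name}
    return handler(argv[1])
```

Invoked through a symlink named id3tags, `argv[0]` ends in "id3tags" and only the id3 handler runs; the same binary under the name m4atags takes the other branch.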
Future Considerations
Ideas to consider, but not to implement yet.
Binary Tags
- perhaps Google Protobuf?
Streaming Input
- Accept data in chunks over a socket?
By AJ ONeal
