MediaTags specifications clarification
Published 2010-11-30
This was converted from e-mail / html to markdown with html2text.py
Speed, Completeness, and Identification are the three most important qualities of this application.
Speed
- The end product is "MediaBox", a small 600MHz OMAP3530 device, which will use MediaTags on 30,000+ files.
Completeness
- It should be possible to represent every meta-data tag present in the file.
- It should be able to read tags in an unabstracted manner
- pure binary tags, such as Album Art, may also be extracted as JSON
  * --extract-binary-tags=base64 -- included in JSON
    Ex: "@art": "Az48tks9cC...."
  * --extract-binary-tags-to=path/to/attachments-dir -- placed into the folder and referenced in JSON
    Ex: "@art": "./path/to/attachments-dir/my song.m4a.@art.jpeg"
  * Binary tag extraction will usually be a post-processing feature and should be off by default
- Both stream and media metadata should be present, but separate.
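
The two binary-tag modes above could look roughly like the following Python sketch. The function names `embed_binary_tag` and `extract_binary_tag_to` are hypothetical and only illustrate the idea; the "@art" key and the attachment file naming pattern are taken from the examples above.

```python
import base64
import os


def embed_binary_tag(tags, name, data):
    """Return a copy of tags with the binary value inlined as base64 text."""
    out = dict(tags)
    out[name] = base64.b64encode(data).decode("ascii")
    return out


def extract_binary_tag_to(tags, name, data, media_path, attachments_dir, ext):
    """Write the binary value into attachments_dir and reference it by path.

    The file is named "<media file name>.<tag name>.<ext>", following the
    "my song.m4a.@art.jpeg" example (an assumed general pattern).
    """
    os.makedirs(attachments_dir, exist_ok=True)
    filename = "{}.{}.{}".format(os.path.basename(media_path), name, ext)
    target = os.path.join(attachments_dir, filename)
    with open(target, "wb") as fh:
        fh.write(data)
    out = dict(tags)
    out[name] = target
    return out
```

Keeping both modes as separate small functions makes it easy to leave binary extraction off by default and run it later as a post-processing step.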
Identification - Checksums of the "stream" part of a file
- Each tag application should be able to produce a checksum of the stream (data) portion of a file
  * --with-stream-md5sum
  * --with-stream-sha2sum
- The checksum should not be of the file as a whole
- The checksum should not include the tags
- The checksum is probably taken from the byte offset just past the last header tag to the end of the file, or to the first trailing tag if there is one
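
To make the idea concrete, here is a minimal Python sketch for the mp3 case only. It assumes the only tags present are an ID3v2 tag at the start (10-byte header with a 4-byte syncsafe size) and an ID3v1 tag in the last 128 bytes; other tag types (APE, Lyrics3, and so on) are ignored. The function names are hypothetical, and a real tool would read the file in chunks rather than hold it in memory.

```python
import hashlib


def syncsafe_to_int(b):
    """Decode a 4-byte ID3v2 syncsafe integer (7 useful bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def mp3_stream_bounds(data):
    """Return (start, end) offsets of the bytes between the ID3 tags."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        start = 10 + syncsafe_to_int(data[6:10])
        if data[5] & 0x10:  # "footer present" flag adds 10 more bytes
            start += 10
        start = min(start, end)  # guard against truncated files
    # ID3v1 trailer: the last 128 bytes, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return start, end


def stream_checksum(data):
    """Checksum only the stream bytes, in the { "stream": {...} } shape."""
    start, end = mp3_stream_bounds(data)
    digest = hashlib.sha256(data[start:end]).hexdigest()
    return {"stream": {"sha256sum": digest}}
```

With this approach, retagging a file changes the whole-file checksum but leaves the stream checksum untouched, which is what makes it usable for identification.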
Conclusions
TagLib - Let's use taglib if
- taglib allows raw access to tags (I believe it does)
- taglib can generate the same detail of information as AtomicParsley
Mutagen - Probably not a good fit
- mutagen is significantly slower than taglib ?
- mutagen does not allow access to all tags, just abstracted normalized ones ?
Libexiv2 - yes
Exiftool - probably not a good fit
- Is it fast? if not, no
- does it allow access to all tags? or does it abstract them?
Type Detection - GNU file is too slow!
- my tests show that file ./my-song.m4a takes more time than AtomicParsley -t ./my-song.m4a
  * detecting the file type should not take more time than parsing the file!
- type detection should be very very simple
  * if the file has an extension, use the extension to determine the type
    - if it doesn't match a known type, ignore it
  * if the file doesn't have an extension (very rare), try matching the first few bytes of the header
    - it's okay to use file for rare cases - little time will be wasted in comparison.
- fail with error if the file cannot be parsed as expected
- some media types can have multiple types of tags (id3, m4a, musepack, oggtag?, etc?)
  * try the most likely first (mp3 -> id3)
- some media types can have embedded tags
  * mp3 -> album art -> jpeg -> exif
  * only parse the intended type
    - don't parse exif data from an mp3
- What does Unix Filter Class mean?
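
The extension-first rule above can be sketched in a few lines of Python. The lookup tables and the `detect_type` name are hypothetical and deliberately tiny; the magic prefixes shown ("ID3", the JPEG start-of-image bytes, "%PDF") are only a small sample of what a real table would need.

```python
import os

# Hypothetical lookup tables, for illustration only.
KNOWN_EXTENSIONS = {
    ".mp3": "mp3",
    ".m4a": "m4a",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".pdf": "pdf",
}
MAGIC_PREFIXES = [
    (b"ID3", "mp3"),            # mp3 that starts with an ID3v2 tag
    (b"\xff\xd8\xff", "jpeg"),  # JPEG start-of-image marker
    (b"%PDF", "pdf"),
]


def detect_type(path, header=None):
    """Return a type name, or None if the file should be ignored."""
    ext = os.path.splitext(path)[1].lower()
    if ext:
        # Trust the extension; unknown extensions are ignored outright.
        return KNOWN_EXTENSIONS.get(ext)
    # Rare case: no extension, so look at the first few bytes.
    if header is None:
        with open(path, "rb") as fh:
            header = fh.read(16)
    for prefix, kind in MAGIC_PREFIXES:
        if header.startswith(prefix):
            return kind
    return None
```

For the common case this never opens the file at all, so detection cannot cost more than parsing. A caller could still fall back to the file command when `detect_type` returns None for an extensionless file.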
Priority
While I was waiting, my friend created prototypes for outputting mp3 and m4a media metadata, which I am using for now.
The most important thing that I need right now is to be able to checksum the data stream.
It's okay to rearrange some of the other things if it's better for your workers' workflow,
but I would like the checksum-ing ability first.
Once the --literal-tags is done I'll know better what the --normalized-tags should look like
- Stream (not file) checksums -- { "stream": { "sha256sum": "ae68f......" } }
  * jpegtags --without-metadata --with-sha256sum ./my-file.jpeg
  * mp3tags --without-metadata --with-sha256sum ./my-file.mp3
  * aactags --without-metadata --with-sha256sum ./my-file.m4a
- JPEG Media metadata --literal-tags
  * exivtags ./my-file.jpeg
  * xmptags ./my-file.jpeg
  * iptctags ./my-file.jpeg
- Stream metadata
  * jpegtags
  * aactags
  * mp3tags
- Media --literal-tags
  * m4atags
  * id3tags
- Media --verbose-tags
  * m4atags
  * id3tags
  * exivtags
  * xmptags
  * iptctags
- eBook/pdf tags
  * more information is needed about what is stored and can be extracted
Before --normalized-tags I first want to see the outputs of the stream and meta-data --literal-tags
I've pushed --binary-tags to be a future consideration
General Clarifications
Meta-data organization
I want to make it clear that there are three types of meta data that I am particularly interested in.
Media (tag) metadata
- The tags that universally describe a particular piece of artwork / media
  * Music (id3, m4a): artist, album, track number, rating
  * Images (exif, iptc, xmp): geo location, keywords, aspect ratio, date/time taken, visual similarity metrics
  * Documents (proprietary): author, title / subject, keywords, text body
Stream (data) metadata
- The tags that describe a specific stream of media, but not the artwork / media itself
  * Music (aac, mp3): md5sum, stream type (aac, mp3), quality, bitrate
  * Images (jpeg): md5sum, stream type, quality, width, height, color depth
  * Documents (odxml, msxml, pdf): md5sum, stream type (xml, ms-binary), word count, page count
File (data + tag) metadata - not necessary to analyze at this time
- Tags that describe a set of bytes on disk
  * --with-file-metadata
  * All Types: md5sum, access time, modified time, size, inode count
Examples
- I have an mp3 and an m4a of the same song.
  * The media metadata will be almost exactly the same
    - the exception is that some tag formats support more options than others
  * The stream metadata will almost always be different
    - an exception may be that the bitrates are the same
  * The file metadata will almost always be different
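
One way to keep the three kinds of metadata apart is to give each its own top-level key in the output. The Python sketch below is only an illustration: the "stream" key matches the checksum example in the Priority section, while the "media" and "file" keys, the field names, and both function names are assumptions. The inode count is left out of this sketch.

```python
import hashlib
import os


def file_metadata(path):
    """Collect metadata about the bytes on disk, whatever they contain."""
    st = os.stat(path)
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
    return {
        "md5sum": md5.hexdigest(),
        "size": st.st_size,
        "access_time": st.st_atime,
        "modified_time": st.st_mtime,
    }


def build_report(media=None, stream=None, file=None):
    """Keep media, stream and file metadata under separate keys."""
    report = {}
    if media is not None:
        report["media"] = media
    if stream is not None:
        report["stream"] = stream
    if file is not None:
        report["file"] = file
    return report
```

For the mp3 and m4a of the same song, the "media" sections would be nearly identical while the "stream" and "file" sections would differ.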
Text
- Many of the files to be processed will contain Chinese, Japanese and other international characters
  * UTF-8 should be used, not ASCII alone.
  * UTF-16 may also be used.
- --pretty-print should output with pretty whitespacing -- somewhat like JSON.stringify(object, null, "\t")
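
For comparison, a rough Python equivalent of the JSON.stringify call above, which also leaves international characters unescaped so they can be written out as UTF-8. The `pretty_print` name is hypothetical.

```python
import json


def pretty_print(obj):
    """Tab-indented JSON that keeps non-ASCII characters as-is."""
    # ensure_ascii=False stops json from turning "歌" into "\u6b4c".
    return json.dumps(obj, indent="\t", ensure_ascii=False)


def pretty_print_utf8(obj):
    """The same text, encoded as UTF-8 bytes ready to write to a file."""
    return pretty_print(obj).encode("utf-8")
```

Without `ensure_ascii=False` the output would still be valid JSON, but every non-ASCII character would appear as a \u escape, which is harder to read in a terminal.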
Modularity
The most important parts of the organization are these:
- The library should be modular; I prefer small bits of code that each do one thing well
- It should be easy to build just one feature of the application or incorporate it in another application
- mediatags /my-song.mp3 --with-stream-tags --with-md5sum gives the combined result of
  * id3tags /my-song.mp3
  * mp3tags --with-md5sum /my-song.mp3
- m4atags /my-song.mp3 returns { "error": "no m4a tags found" }
- aactags /my-song.mp3 returns { "error": "no aac stream found" }
- In the future I would like to create a MediaTags plugin for Node.JS
A possible organization
- mediatags - single binary that handles any type of file
  * libmediatags.o
    - libmediatagsid3.c
    - libmediatagsm4a.c
    - libmediatagsexiv.c
    - libmediatagsmp3.c
    - libmediatagsaac.c
    - libmediatagsjpeg.c
    - libmediatagspdf.c
    - libmediatagsdoc.c
    - libmediatagsodt.c
- id3tags --> mediatags (symlink)
- m4atags --> mediatags (symlink)
- etc
- Each lib has a method such as getMediaTags(), getStreamTags()
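
The symlink arrangement above implies that the single binary picks its behavior from the name it was invoked as. Here is a minimal Python sketch of that dispatch, written in Python only to show the shape of the logic. The three handlers are crude hypothetical stand-ins for the real per-format libraries, and the output field names are assumptions; only the m4a error message is taken from the examples in the Modularity section.

```python
import os


# Crude stand-ins for the per-format libraries (hypothetical checks).
def id3_media_tags(data):
    if not data.startswith(b"ID3"):
        return {"error": "no id3 tags found"}
    return {"media": {"tag_format": "id3"}}


def m4a_media_tags(data):
    if data[4:8] != b"ftyp":
        return {"error": "no m4a tags found"}
    return {"media": {"tag_format": "m4a"}}


def mp3_stream_tags(data):
    if b"\xff\xfb" not in data:  # one common MPEG-1 Layer III frame header
        return {"error": "no mp3 stream found"}
    return {"stream": {"type": "mp3"}}


HANDLERS = {
    "id3tags": [id3_media_tags],
    "m4atags": [m4a_media_tags],
    "mp3tags": [mp3_stream_tags],
}
HANDLERS["mediatags"] = [id3_media_tags, m4a_media_tags, mp3_stream_tags]


def run(argv0, data):
    """Dispatch on the invoked program name, as a symlinked binary would."""
    name = os.path.basename(argv0)
    handlers = HANDLERS.get(name)
    if handlers is None:
        return {"error": "unknown tool: " + name}
    if len(handlers) == 1:
        return handlers[0](data)  # single tool: report its error directly
    merged = {}
    for handler in handlers:
        result = handler(data)
        if "error" not in result:
            merged.update(result)  # combined result of the tools that matched
    return merged or {"error": "no tags found"}
```

Called as a single-purpose tool, a mismatch is reported as an error; called as mediatags, formats that do not apply are skipped and the remaining results are combined.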
**Future Considerations**
Ideas to consider, but not to implement yet.
Binary Tags
- perhaps Google Protobuf?
**Streaming Input**
- Accept data in chunks over a socket ?
By AJ ONeal