Clarifications on MediaTags
Published 2011-1-13Speed, Completeness, and Identification are the three most important qualities of this application.
Speed
- The end product is "MediaBox", a small 1GHz OMAP3530 which will use MediaTags on 30,000+ files.
Completeness
-
It should be possible to represent every meta-data tag present in the file.
-
It should be able to read tags in an unabstracted manner
-
pure binary tags, such as Album Art, may also be extracted as JSON
-
--extract-binary-tags=bas64 -- included in JSON Ex: "@art": "Az48tks9cC...."
-
--extract-binary-tags-to=path/to/attachments-dir -- placed into the folder and referenced in JSON Ex: "@art": "./path/to/attachments-dir/my song.m4a.@art.jpeg"
-
Binary tag extraction will usually be a post-processing feature and should be off by default
-
-
Both stream and media metadata shoud be present, but separate.
Identification - Checksums of the "stream" part of a file
- Each tag application should be able to produce a checksum of the stream (data) portion of a file
- --with-stream-md5sum
- --with-stream-sha2sum
- The checksum should not be of the file as a whole
- The checksum should not include the tags
- The checksum is probably from the byte offset of the last header tag to the end of the file or first header tag
Conculsions
TagLib - Let's use taglib if
- taglib allows raw access to tags (I believe it does)
- taglib can generate the same detail of information as AtomicParsley
Mutagen - Probably not a good fit
- mutagen is significantly slower than taglib ?
- mutagen does not allow access to all tags, just abstracted normalized ones ?
Libexiv2 - yes
Exiftool - probably not a good fit
- Is it fast? if not, no
- does it allow access to all tags? or does it abstract them?
Type Detection - GNU file
is too slow!
-
my tests show that
file ./my-song.m4a
takes more time thanAtomicParsely -t ./my-song.m4a
- detecting the file type should not take more time than parsing the file!
-
type detection should be very very simple
- if the file has an extension, use the extension to determine the type
- if it doesn't match a known type, ignore it
- if the file has an extension, use the extension to determine the type
-
if the file doesn't have an extension (very rare), try matching the first few bytes of the header
- it's okay to use
file
for rare cases - little time will be wasted in comparison.
- it's okay to use
-
fail with error if the file cannot be parsed as expected
- some media types can have multiple types of tags (id3, m4a, musepak, oggtag?, etc?)
- try the most likely first (mp3 -> id3)
- some media types can have embedded tags
- mp3 -> album art -> jpeg -> exif
- only parse the intended type
- don't parse exif data from an mp3
- some media types can have multiple types of tags (id3, m4a, musepak, oggtag?, etc?)
What does Unix Filter Class mean?
- Incidentally, UNIX filter applications are the ones that you described as taking input data from a Pipe (as in cat abc.mp3 | m4atags --verbose)
Priority
While I was waiting my friend created prototypes for outputting mp3 and m4a media metadata which I am using for now.
The most important thing that I need right now is to be able to checksum the data stream.
It's okay to rearrange some of the other things if it's better for your workers' workflow, but I would like the checksum-ing ability first.
Once the --literal-tags is done I'll know better what the --normalized-tags should look like
-
Stream (not file) checksums
-
jpegtags --without-metadata --with-sha256sum ./my-file.jpeg
-
mp3tags --without-metadata --with-sha256sum ./my-file.mp3
-
aactags --without-metadata --with-sha256sum ./my-file.m4a
{ "stream": { "sha256sum": "ae68f......" } }
-
-
JPEG Media metadata --literal-tags
- exivtags ./my-file.jpeg
- xmptags ./my-file.jpeg
- iptctags ./my-file.jpeg
-
Stream metadata
- jpegtags
- aactags
- mp3tags
-
Media --literal-tags
- m4atags
- id3tags
-
Media --verbose-tags
- m4atags
- id3tags
- exivtags
- xmptags
- iptctags
-
eBook/pdf tags
- more information about what information is stored and can be extracted is needed
Before --normalized-tags I first want to see the outputs of the stream and meta-data --literal-tags I've pushed --binary-tags to be a future consideration
General Clarifications
Meta-data organization
I want to make it clear that there are three types of meta data that I am particularly interested in.
Media (tag) metadata
- The tags that universally describe a particular piece of artwork / media
- Music (id3, m4a): artist, album, track number, rating
- Images (exif, ipic, xmp): geo location, keywords, aspect ratio, date/time taken, visual similarity metrics
- Documents (proprietary): author, title / subject, keywords, text body
Stream (data) metadata
- The tags that describe a specific stream of media, but not the artwork / media itself
- Music (aac, mp3): md5sum, stream type (aac, mp3), quality, bitrate
- Images (jpeg): md5sum, stream type, quality, width, height, color depth
- Documents (odxml, msxml, pdf): md5sum, stream type (xml, ms-binary), word count, page count
File (data + tag) metadata - not necessary to analyze at this time
- Tags that describe a set of bytes on disk
- --with-file-metadata
- All Types: md5sum, access time, modified time, size, inode count
Examples
- I have an mp3 and an m4a of the same song.
-
The media metadata will be almost exactly the same
- the exception is that some of tag formats support more options than others
-
The stream metadata will almost always be different
- an exception may be that the bitrates are the same
-
The file metadata will be almost always be different
-
Text
- Many of the files to be processed will contain Chinese, Japenese and other international characters
- UTF-8 should be used, not ASCII alone.
UTF-16 may also be used-- JSON doesn't support UTF16- --pretty-print should output with pretty whitespacing -- somewhat like JSON.stringify(object, null, "\t")
Modularity
The most important parts of the organization are this
-
The library should be modular, I prefer small bits of code that each do one thing well
-
It should be easy to build just one feature of the application or incorporate it in another application
-
mediatags
/my-song.mp3 --with-stream-tags --with-md5sum
gives the combined result ofid3tags /my-song.mp3
mp3tags --with-md5sum /my-song.mp3
-
m4atags /my-song.mp3
returns{ "error": "no m4a tags found" }
-
aactags /my-song.mp3
returns{ "error": "no aac stream found" }
-
In the future I would like to create a MediaTags plugin for Node.JS
A possible organization
- mediatags - single binary that handles any type of file
- libmediatags.o
- libmediatagsid3.c
- libmediatagsm4a.c
- libmediatagsexiv.c
- libmediatagsmp3.c
- libmediatagsaac.c
- libmediatagsjpeg.c
- libmediatagspdf.c
- libmediatagsdoc.c
- libmediatagsodt.c
- id3tags ---> mediatags (symlink)
- m4atags --> mediatags (symlink)
- etc
- libmediatags.o
- Each lib has a method such as getMediaTags(), getStreamTags()
Future Considerations
Ideas to consider, but not to implement yet.
Binary Tags
- perhaps Google Protobuf?
Streaming Input
- Accept data in chunks over a socket ?
By AJ ONeal
Did I make your day?
Buy me a coffee
(you can learn about the bigger picture I'm working towards on my patreon page )