This was converted from e-mail / html to markdown with html2text.py

Speed, Completeness, and Identification are the three most important qualities of this application.

Speed

  • The end product is "MediaBox", a small 600MHz OMAP3530

which will use MediaTags on 30,000+ files.

Completeness

  • It should be possible to represent every meta-data tag present in the file.
  • It should be able to read tags in an unabstracted manner
  • pure binary tags, such as Album Art, may also be extracted as JSON

    • --extract-binary-tags=bas64 -- included in JSON

Ex: "@art": "Az48tks9cC...."

* --extract-binary-tags-to=path/to/attachments-dir -- placed into the

folder and referenced in JSON

Ex: "@art": "./path/to/attachments-dir/my song.m4a.@art.jpeg"

* Binary tag extraction will usually be a post-processing feature and

should be off by default

  • Both stream and media metadata shoud be present, but separate.

Identification - Checksums of the "stream" part of a file

  • Each tag application should be able to produce a checksum of the stream (data) portion of a file

    • --with-stream-md5sum

    • --with-stream-sha2sum

  • The checksum should not be of the file as a whole

  • The checksum should not include the tags

  • The checksum is probably from the byte offset of the last header tag to the end of the file or first header tag

Conclusions

TagLib - Let's use taglib if

  • taglib allows raw access to tags (I believe it does)

  • taglib can generate the same detail of information as AtomicParsley

Mutagen - Probably not a good fit

  • mutagen is significantly slower than taglib ?

  • mutagen does not allow access to all tags, just abstracted normalized ones ?

Libexiv2 - yes

Exiftool - probably not a good fit

  • Is it fast? if not, no

  • does it allow access to all tags? or does it abstract them?

Type Detection - GNU file is too slow!

  • my tests show that file ./my-song.m4a takes more time than AtomicParsely -t ./my-song.m4a

    • detecting the file type should not take more time than parsing the file!
  • type detection should be very very simple

    • if the file has an extension, use the extension to determine the type

      • if it doesn't match a known type, ignore it
  • if the file doesn't have an extension (very rare), try matching the first few bytes of the header

    • it's okay to use file for rare cases - little time will be wasted in comparison.
  • fail with error if the file cannot be parsed as expected

    • some media types can have multiple types of tags (id3, m4a, musepak, oggtag?, etc?)

      • try the most likely first (mp3 -> id3)
    • some media types can have embedded tags

      • mp3 -> album art -> jpeg -> exif

      • only parse the intended type

      • don't parse exif data from an mp3

What does Unix Filter Class mean?

Priority

While I was waiting my friend created prototypes for outputting mp3 and m4a media metadata which I am using for now.

The most important thing that I need right now is to be able to checksum the data stream.

It's okay to rearrange some of the other things if it's better for your workers' workflow,

but I would like the checksum-ing ability first.

Once the --literal-tags is done I'll know better what the --normalized-tags should look like

  1. Stream (not file) checksums -- { "stream": { "sha256sum": "ae68f......" } }
  2. jpegtags --without-metadata --with-sha256sum ./my-file.jpeg
  3. mp3tags --without-metadata --with-sha256sum ./my-file.mp3
  4. aactags --without-metadata --with-sha256sum ./my-file.m4a
  5. JPEG Media metadata --literal-tags
  6. exivtags ./my-file.jpeg
  7. xmptags ./my-file.jpeg
  8. iptctags ./my-file.jpeg
  9. Stream metadata
  10. jpegtags
  11. aactags
  12. mp3tags
  13. Media --literal-tags
  14. m4atags
  15. id3tags
  16. Media --verbose-tags
  17. m4atags
  18. id3tags
  19. exivtags
  20. xmptags
  21. iptctags
  22. eBook/pdf tags
  23. more information about what information is stored and can be extracted is needed

Before --normalized-tags I first want to see the outputs of the stream and meta-data --literal-tags

I've pushed --binary-tags to be a future consideration

General Clarifications

Meta-data organization

I want to make it clear that there are three types of meta data that I am particularly interested in.

Media (tag) metadata

  • The tags that universally describe a particular piece of artwork / media

    • Music (id3, m4a): artist, album, track number, rating

    • Images (exif, ipic, xmp): geo location, keywords, aspect ratio, date/time taken, visual similarity metrics

    • Documents (proprietary): author, title / subject, keywords, text body

Stream (data) metadata

  • The tags that describe a specific stream of media, but not the artwork / media itself

    • Music (aac, mp3): md5sum, stream type (aac, mp3), quality, bitrate

    • Images (jpeg): md5sum, stream type, quality, width, height, color depth

    • Documents (odxml, msxml, pdf): md5sum, stream type (xml, ms-binary), word count, page count

File (data + tag) metadata - not necessary to analyze at this time

  • Tags that describe a set of bytes on disk

    • --with-file-metadata

    • All Types: md5sum, access time, modified time, size, inode count

Examples

  • I have an mp3 and an m4a of the same song.

    • The media metadata will be almost exactly the same

      • the exception is that some of tag formats support more options than others
* The stream metadata will almost always be different

  * an exception may be that the bitrates are the same

* The file metadata will be almost always be different

Text

  • Many of the files to be processed will contain Chinese, Japenese and other international characters

    • UTF-8 should be used, not ASCII alone.

    • UTF-16 may also be used.

    • --pretty-print should output with pretty whitespacing -- somewhat like JSON.stringify(object, null, "\t")

Modularity

**

**The most important parts of the organization are this

  • The library should be modular, I prefer small bits of code that each do one thing well
  • It should be easy to build just one feature of the application or incorporate it in another application
  • mediatags /my-song.mp3 --with-stream-tags --with-md5sum gives the combined result of

    • id3tags /my-song.mp3

    • mp3tags --with-md5sum /my-song.mp3

  • m4atags /my-song.mp3 returns { "error": "no m4a tags found" }
  • aactags /my-song.mp3 returns { "error": "no aac stream found" }

  • In the future I would like to create a MediaTags plugin for Node.JS

A possible organization

  • mediatags - single binary that handles any type of file

    • libmediatags.o

      • libmediatagsid3.c

      • libmediatagsm4a.c

      • libmediatagsexiv.c

      • libmediatagsmp3.c

      • libmediatagsaac.c

      • libmediatagsjpeg.c

      • libmediatagspdf.c

      • libmediatagsdoc.c

      • libmediatagsodt.c

    • id3tags ---> mediatags (symlink)

    • m4atags --> mediatags (symlink)

    • etc

  • Each lib has a method such as getMediaTags(), getStreamTags()

**Future Considerations

**

Ideas to consider, but not to implement yet.

Binary Tags

  • perhaps Google Protobuf?

**Streaming Input

**

  • Accept data in chunks over a socket ?

By AJ ONeal

If you loved this and want more like it, sign up!


Did I make your day?
Buy me a coffeeBuy me a coffee  

(you can learn about the bigger picture I'm working towards on my patreon page )