This was converted from e-mail / html to markdown with html2text.py

Speed, Completeness, and Identification are the three most important qualities of this application.

Speed

The end product is "MediaBox", a small 600MHz OMAP3530 which will use MediaTags on 30,000+ files.

Completeness

It should be possible to represent every meta-data tag present in the file.

It should be able to read tags in an unabstracted manner

Pure binary tags, such as Album Art, may also be extracted as JSON
- --extract-binary-tags=base64 -- included in JSON
  - Ex: "@art": "Az48tks9cC...."
- --extract-binary-tags-to=path/to/attachments-dir -- placed into the folder and referenced in JSON
  - Ex: "@art": "./path/to/attachments-dir/my song.m4a.@art.jpeg"
- Binary tag extraction will usually be a post-processing feature and should be off by default
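
The two extraction modes could be sketched as follows, in Python for brevity. The function name `render_binary_tag` and its parameters are hypothetical. Only the two flags, the "@art" key and the attachment naming pattern come from the examples above.

```python
import base64
from pathlib import Path
from typing import Optional


def render_binary_tag(media_path: str, key: str, data: bytes, ext: str,
                      attachments_dir: Optional[str] = None) -> str:
    """Return the JSON value for one binary tag.

    With no attachments_dir the bytes are base64-encoded inline
    (the --extract-binary-tags=base64 behaviour). Otherwise they are
    written to <attachments_dir>/<media file name>.<key>.<ext> and the
    path to that file is returned (the --extract-binary-tags-to behaviour).
    """
    if attachments_dir is None:
        return base64.b64encode(data).decode("ascii")
    out_dir = Path(attachments_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{Path(media_path).name}.{key}.{ext}"
    out_path.write_bytes(data)
    return out_path.as_posix()
```

Keeping the binary payload out of the JSON by default matches the point that extraction is a post-processing step: the common path never pays for encoding or writing large blobs.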

Both stream and media metadata should be present, but separate.

Identification - Checksums of the "stream" part of a file

Each tag application should be able to produce a checksum of the stream (data) portion of a file
- --with-stream-md5sum
- --with-stream-sha2sum
The checksum should not be of the file as a whole
The checksum should not include the tags
The checksum probably covers the bytes from the end of the last header tag to the end of the file, or to the first trailing tag if one is present
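
As a rough illustration for mp3 only, the sketch below skips an ID3v2 tag at the start of the file and an ID3v1 tag at the end, then hashes what is left. The function names are hypothetical, and it assumes those are the only tags present. Other tag types (APE, Lyrics3, and so on) are not handled, and the whole file is held in memory for simplicity.

```python
import hashlib


def mp3_stream_bounds(data: bytes) -> tuple:
    """Return (start, end) offsets of the audio stream inside an mp3 file."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 synchsafe size bytes.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)  # 7 useful bits per byte
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
    # ID3v1 trailer: the last 128 bytes, beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return min(start, end), end


def stream_sha256(data: bytes) -> str:
    """Checksum of the stream portion only, ignoring the tags."""
    start, end = mp3_stream_bounds(data)
    return hashlib.sha256(data[start:end]).hexdigest()
```

With this approach, retagging a file leaves its stream checksum unchanged, which is what makes the checksum usable for identification. On the target hardware the hash would more likely be fed in chunks read from disk rather than from one large buffer.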

Conclusions

TagLib - Let's use taglib if

- taglib allows raw access to tags (I believe it does)
- taglib can generate the same detail of information as AtomicParsley

Mutagen - Probably not a good fit

- mutagen is significantly slower than taglib?
- mutagen does not allow access to all tags, just abstracted, normalized ones?

Libexiv2 - yes

Exiftool - probably not a good fit

- Is it fast? If not, no
- Does it allow access to all tags, or does it abstract them?

Type Detection - GNU file is too slow!

my tests show that file ./my-song.m4a takes more time than AtomicParsley -t ./my-song.m4a
- detecting the file type should not take more time than parsing the file!

type detection should be very very simple
- if the file has an extension, use the extension to determine the type
  - if it doesn't match a known type, ignore it
- if the file doesn't have an extension (very rare), try matching the first few bytes of the header
  - it's okay to use file for rare cases - little time will be wasted in comparison.
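
A minimal sketch of that rule, assuming a small hypothetical set of supported types. The lookup tables and the name `detect_type` are illustrative. The header check only recognises an mp3 that starts with an ID3v2 tag, and it treats any ISO media file with an "ftyp" box as m4a, which is a simplification. Falling back to file for whatever is still unknown is left out.

```python
import os
from typing import Optional

# Hypothetical tables; a real build would list every supported type.
KNOWN_EXTENSIONS = {
    ".mp3": "mp3",
    ".m4a": "m4a",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".pdf": "pdf",
}
MAGIC_PREFIXES = [
    (b"ID3", "mp3"),            # mp3 that begins with an ID3v2 tag
    (b"\xff\xd8\xff", "jpeg"),  # JPEG start-of-image marker
    (b"%PDF-", "pdf"),
]


def detect_type(path: str) -> Optional[str]:
    ext = os.path.splitext(path)[1].lower()
    if ext:
        # Extension present: trust it, and ignore unknown extensions.
        return KNOWN_EXTENSIONS.get(ext)
    # No extension (rare): look at the first few bytes of the file.
    with open(path, "rb") as f:
        head = f.read(12)
    for prefix, kind in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return kind
    # ISO base media files (m4a, mp4) carry "ftyp" at byte offset 4.
    if head[4:8] == b"ftyp":
        return "m4a"
    return None
```

In the common case the file is never opened for detection at all, so type detection costs a string comparison rather than a read.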

fail with error if the file cannot be parsed as expected
- some media types can have multiple types of tags (id3, m4a, musepack, oggtag?, etc?)
  - try the most likely first (mp3 -> id3)
- some media types can have embedded tags
  - mp3 -> album art -> jpeg -> exif
  - only parse the intended type
  - don't parse exif data from an mp3

What does Unix Filter Class mean?

Priority

While I was waiting, my friend created prototypes for outputting mp3 and m4a media metadata, which I am using for now.

The most important thing that I need right now is to be able to checksum the data stream.

It's okay to rearrange some of the other things if it's better for your workers' workflow, but I would like the checksum-ing ability first.

Once the --literal-tags is done I'll know better what the --normalized-tags should look like

- Stream (not file) checksums -- { "stream": { "sha256sum": "ae68f......" } }
  - jpegtags --without-metadata --with-sha256sum ./my-file.jpeg
  - mp3tags --without-metadata --with-sha256sum ./my-file.mp3
  - aactags --without-metadata --with-sha256sum ./my-file.m4a
- JPEG Media metadata --literal-tags
  - exivtags ./my-file.jpeg
  - xmptags ./my-file.jpeg
  - iptctags ./my-file.jpeg
- Stream metadata
  - jpegtags
  - aactags
  - mp3tags
- Media --literal-tags
  - m4atags
  - id3tags
- Media --verbose-tags
  - m4atags
  - id3tags
  - exivtags
  - xmptags
  - iptctags
- eBook/pdf tags
  - more information about what information is stored and can be extracted is needed

Before --normalized-tags I first want to see the outputs of the stream and meta-data --literal-tags

I've pushed --binary-tags to be a future consideration

General Clarifications

Meta-data organization

I want to make it clear that there are three types of meta data that I am particularly interested in.

Media (tag) metadata

The tags that universally describe a particular piece of artwork / media
- Music (id3, m4a): artist, album, track number, rating
- Images (exif, iptc, xmp): geo location, keywords, aspect ratio, date/time taken, visual similarity metrics
- Documents (proprietary): author, title / subject, keywords, text body

Stream (data) metadata

The tags that describe a specific stream of media, but not the artwork / media itself
- Music (aac, mp3): md5sum, stream type (aac, mp3), quality, bitrate
- Images (jpeg): md5sum, stream type, quality, width, height, color depth
- Documents (odxml, msxml, pdf): md5sum, stream type (xml, ms-binary), word count, page count

File (data + tag) metadata - not necessary to analyze at this time

Tags that describe a set of bytes on disk
- --with-file-metadata
- All Types: md5sum, access time, modified time, size, inode count

Examples

I have an mp3 and an m4a of the same song.
- The media metadata will be almost exactly the same
  - the exception is that some tag formats support more options than others
- The stream metadata will almost always be different
  - an exception may be that the bitrates are the same
- The file metadata will almost always be different

Text

Many of the files to be processed will contain Chinese, Japanese and other international characters
- UTF-8 should be used, not ASCII alone.
- UTF-16 may also be used.
- --pretty-print should output with pretty whitespacing -- somewhat like JSON.stringify(object, null, "\t")
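
For comparison, here is roughly the same behaviour as JSON.stringify(object, null, "\t"), sketched in Python. The helper name `to_json` and the sample strings are illustrative only.

```python
import json


def to_json(tags: dict, pretty: bool = False) -> str:
    # ensure_ascii=False keeps non-ASCII characters as-is instead of
    # turning them into \uXXXX escapes; indent="\t" gives tab indentation.
    return json.dumps(tags, ensure_ascii=False, indent="\t" if pretty else None)


text = to_json({"title": "日本語のタイトル", "artist": "中文名字"}, pretty=True)
encoded = text.encode("utf-8")  # the bytes to write to stdout or a file
```

The JSON text is built as a Unicode string and only encoded to UTF-8 at the output boundary, so the tag values are never forced through ASCII.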

Modularity

The most important parts of the organization are these:

The library should be modular; I prefer small bits of code that each do one thing well

It should be easy to build just one feature of the application or incorporate it in another application

mediatags /my-song.mp3 --with-stream-tags --with-md5sum gives the combined result of
- id3tags /my-song.mp3
- mp3tags --with-md5sum /my-song.mp3

m4atags /my-song.mp3 returns { "error": "no m4a tags found" }

aactags /my-song.mp3 returns { "error": "no aac stream found" }
In the future I would like to create a MediaTags plugin for Node.JS
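
One way to picture the combined result and the error objects described above is a set of single-purpose readers whose outputs are merged. Everything here is a hypothetical sketch: the reader functions only check for a signature rather than parse anything, the "no id3 tags found" message is modelled on the two error examples, and collecting failures under an "errors" key is an assumption, not part of the spec.

```python
from typing import Callable, Dict

Reader = Callable[[bytes], dict]


def id3_media_tags(data: bytes) -> dict:
    # Placeholder check only: a real reader would parse the ID3 frames.
    if not data.startswith(b"ID3"):
        return {"error": "no id3 tags found"}
    return {"media": {}}


def m4a_media_tags(data: bytes) -> dict:
    # Placeholder check only: m4a tags live in an ISO media container.
    if data[4:8] != b"ftyp":
        return {"error": "no m4a tags found"}
    return {"media": {}}


def combine(data: bytes, readers: Dict[str, Reader]) -> dict:
    """Merge the output of several single-purpose readers into one result."""
    result: dict = {}
    for name, reader in readers.items():
        part = reader(data)
        if "error" in part:
            result.setdefault("errors", {})[name] = part["error"]
        else:
            result.update(part)
    return result
```

Because each reader takes bytes and returns a plain dictionary, any one of them can be built or embedded on its own, and the combined tool is just a loop over the readers.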

A possible organization

mediatags - single binary that handles any type of file
- libmediatags.o
  - libmediatagsid3.c
  - libmediatagsm4a.c
  - libmediatagsexiv.c
  - libmediatagsmp3.c
  - libmediatagsaac.c
  - libmediatagsjpeg.c
  - libmediatagspdf.c
  - libmediatagsdoc.c
  - libmediatagsodt.c
- id3tags --> mediatags (symlink)
- m4atags --> mediatags (symlink)
- etc
Each lib has a method such as getMediaTags(), getStreamTags()
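
The symlink arrangement usually works by having the single binary look at the name it was invoked as. Below is a small sketch of that dispatch step. The `APPLETS` table and `select_modules` are hypothetical names, and the module lists are placeholders rather than a proposal for the final set.

```python
import os

# Hypothetical mapping from the invoked name to the modules to run.
APPLETS = {
    "id3tags": ["id3"],
    "m4atags": ["m4a"],
    "mp3tags": ["mp3"],
    "aactags": ["aac"],
    "mediatags": ["id3", "m4a", "mp3", "aac"],
}


def select_modules(argv0: str) -> list:
    """Pick modules from the name the binary was invoked as."""
    name = os.path.basename(argv0)
    if name not in APPLETS:
        raise SystemExit(f"unknown applet name: {name}")
    return APPLETS[name]
```

A real entry point would pass its own argv[0] to `select_modules`, so that running the id3tags symlink selects only the id3 module while mediatags selects them all.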

Future Considerations

Ideas to consider, but not to implement yet.

Binary Tags

perhaps Google Protobuf?

Streaming Input

Accept data in chunks over a socket ?

By AJ ONeal


MediaTags specifications clarification
