What you need to know about metadata

2019-12-27

In this post I hope to explain what metadata is, and to cover some use-cases and crucial facts.

Metadata is data, about data

If you feel like we're done after the quote, nice for you, but there are more considerations than the subject the data pertains to.

Properties of metadata

Relevant only with given context
Enhancement to core-value data
Possibly missing or indeterminate data

Where does it live?

Where metadata lives often goes unspoken or is left implicit: you don't care where it lives, and it can live in any one of a number of places. This is partly because metadata is a huge area rather than one thing or another. This isn't to say it doesn't matter where it lives, just that there is no consistent best practice.

You might have a TeamConfig concept with a rather large database table of 100+ software options.
You might have a Graph database storing the various connections between people.
You may have a table called settings or options with individual values and data about where and to whom they belong.
You might store JSON documents which represent a thing at a time.
You might simply have a table which stores information about the creation, selection, deletion & modification of other data.

Where these live should really depend on what characteristics you need. I have a strong preference for primary-key accessible and natural-key accessible metadata offline. That's more about how we interact with metadata.

Where it lives should really be about meeting business needs.

When to build?

In 2013 I made my own "metadata" server for my business's core platform, which is half-marketing, half practical-engineering.

I wanted to take load off a database for simple lookups that were not searchable.
Keeping the data in an active-server system made little sense.
The data being served could be treated like a file.
It was in fact desirable that it not be modified within the system.
This simplified storage and retrieval, and took advantage of things like OS filesystem disk-caching.
Horizontal scaling via sub-path routing across servers was desirable for its low cost.
Vertical scaling via RAM and disk space was available if partitioning ever became difficult.
I needed the ability to store data to be retrieved later.
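
A minimal sketch of how those reasons can fit together, not the server described here. It assumes one JSON file per key, refuses to overwrite existing files, and uses the first two characters of a hash as the sub-path that could be routed to a different server or disk. The class name, method names and directory layout are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class FileMetadataStore:
    """Write-once, file-per-key metadata store (hypothetical sketch)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path_for(self, key: str) -> Path:
        # Hash the key so any string becomes a safe filename. The first two
        # hex characters form a sub-path that could be routed elsewhere.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.root / digest[:2] / f"{digest}.json"

    def put(self, key: str, value: dict) -> None:
        path = self._path_for(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Mode "x" raises FileExistsError if the file already exists,
        # so stored data is never modified in place.
        with path.open("x", encoding="utf-8") as handle:
            json.dump(value, handle)

    def get(self, key: str) -> dict:
        # Reads are plain file reads, so the OS disk cache does the caching.
        with self._path_for(key).open("r", encoding="utf-8") as handle:
            return json.load(handle)


# Example usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    store = FileMetadataStore(Path(tmp))
    store.put("person:5472", {"theme": "dark"})
    print(store.get("person:5472"))
```

Because a key always maps to the same path, a reverse proxy or static file server could serve reads without any application code in the way.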

I'm not sure anyone should really design their own metadata servers. 7 years on I'm not entirely sure mine was that much of a success, but it did a job, helped in a few projects, and didn't get in anyone's way. More positively, it didn't invent much, but re-used patterns from elsewhere.

My message here is to know why you are storing metadata, what the benefits are and what you are taking on.

How do you address it?

When talking about where data lives I mentioned my preferred access to metadata: via the primary or natural key that your main, or conventional, datastore uses. The solutions I've dealt with have been a mix of internal and third-party, and a mix of SaaS and internal tooling. Addressing is by far one of the least thought-out aspects of metadata access, but I'm glad that, in 2019, solutions like Elasticsearch and Neo4j have capable HTTP interfaces following patterns I recognise.

Primary and natural keys explained

Primary keys

To prevent duplication, and operations being applied to the wrong records by mistake, databases use a unique piece of information per record called a primary key. Primary keys are the main way to interact with a piece of data and be sure that your operations apply to that data. If I try to add a record about person 5472 and they already exist, the database gives us the benefit of complaining.

Users can take full control of these, but popular techniques use auto-generated machine identifiers such as auto-incrementing IDs or UUIDs. There are other options, but if you can't identify a record using a piece of information about the thing itself, you'll generally nominate a synthetic primary key, which is to say its meaning exists only within your program.
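
To make the "benefit of complaining" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The person table, its columns and the add_person helper are assumptions for illustration only.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (5472, "Ada"))


def add_person(person_id: int, name: str) -> bool:
    """Return True if the row was inserted, False if the key already exists."""
    try:
        conn.execute(
            "INSERT INTO person (id, name) VALUES (?, ?)", (person_id, name)
        )
        return True
    except sqlite3.IntegrityError:
        # The database complains rather than silently duplicating person 5472.
        return False


print(add_person(5472, "Someone else"))

# Synthetic keys: leave the id out and let the database generate one,
# or generate a UUID in the application.
cursor = conn.execute("INSERT INTO person (name) VALUES (?)", ("Grace",))
auto_id = cursor.lastrowid
synthetic_uuid = str(uuid.uuid4())
```

Neither auto_id nor synthetic_uuid says anything about the person; both only have meaning inside the program that issued them.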

Natural keys

Natural keys are pieces of information that apply to records and are part of the record's usable, supplied data. UUIDs and auto-generated primary keys are not natural keys, but synthetic. You could say that for some systems, first name, or first and last name, is a suitable natural key. When I was in school as a child our names were often used as natural keys. As you get older you might get a government ID, and if it's unique to you, and unlikely to change, it might be a good natural key candidate.

Because natural keys are external to our systems, we must take care in nominating them, but can gain significant advantage in using them. Email addresses are a really great example of both the pitfalls and the peaks.
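
A minimal sketch of an email address used as a natural key alongside a synthetic primary key, again with sqlite3. The account table and helper names are hypothetical. Lower-casing the whole address is an assumption that suits many systems, although strictly the part before the @ may be case-sensitive, which is one of the pitfalls.

```python
import sqlite3
from typing import Optional, Tuple


def normalise_email(raw: str) -> str:
    # Assumption: addresses differing only by case or padding are the same.
    return raw.strip().lower()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"       # synthetic key
    " email TEXT NOT NULL UNIQUE,"   # natural key
    " display_name TEXT NOT NULL)"
)


def register(email: str, display_name: str) -> int:
    """Insert an account and return its synthetic id."""
    cursor = conn.execute(
        "INSERT INTO account (email, display_name) VALUES (?, ?)",
        (normalise_email(email), display_name),
    )
    return cursor.lastrowid


def find_by_email(email: str) -> Optional[Tuple[int, str]]:
    """Look an account up by its natural key."""
    return conn.execute(
        "SELECT id, display_name FROM account WHERE email = ?",
        (normalise_email(email),),
    ).fetchone()


account_id = register("Ada@Example.com", "Ada")
print(find_by_email("  ada@example.com "))
```

The peak is that a caller who only knows the email address can still find the record. The pitfall is that people change addresses, so anything stored elsewhere under the old address needs a plan.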

Talking about how you address data can be very important to understanding the performance characteristics of using that data. For example, it might not be a good idea to build an upload system which uses a database to store information about a distributed object store.

Lots of software makes these mistakes. Very prominent examples are found in most popular web frameworks, including Active Storage in Ruby on Rails. But why do I call these mistakes?

Inter-system dependencies

If I predicate my system on the behaviour of your system, I either have to exert control over coordinating both systems and their errors, or accept that changes to your system, including its availability, may break my system.

Worse still, I now have a useless system sitting in the middle, with the sole purpose of turning metadata into data, which is an inversion of intent.

If I have to talk to my database through my web-app to get to files I stored on your service, then it's a very specific, but indirect, bus-route. If my database is deleted, then the actual data, which we treat as metadata, is rendered useless.

If instead I can talk directly to the file-service, I might be able to take advantage of properties of each specific service.

I might choose to include, within my interactions with that specific service, the details needed to avoid unnecessary metadata, interstitial systems and the accidental reliance on metadata to get to my data. For file uploading, direct uploads from an application frontend provide such benefits and others.

Shorter chain of dependency
Can remove provider and storage backend of security access
Faster lookups based upon the known
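
One way to avoid a lookup table between the application and the file service is to derive the storage location from values the caller already knows. This is a minimal sketch under stated assumptions: the base URL, the path layout and both function names are hypothetical, and a real service would also need some form of authorisation on the request, which is not shown.

```python
import hashlib
from urllib.parse import quote

STORAGE_BASE_URL = "https://files.example.com"  # hypothetical file service


def object_key(owner_key: str, filename: str) -> str:
    """Derive the storage key from the owner's natural key and the filename."""
    # Hash the owner key so an email address, say, never appears in the path.
    owner_digest = hashlib.sha256(owner_key.encode("utf-8")).hexdigest()
    safe_name = quote(filename, safe="")
    return f"uploads/{owner_digest[:2]}/{owner_digest}/{safe_name}"


def object_url(owner_key: str, filename: str) -> str:
    """Build the address a frontend could upload to or download from."""
    return f"{STORAGE_BASE_URL}/{object_key(owner_key, filename)}"


print(object_url("ada@example.com", "my report.pdf"))
```

Because the key is a pure function of known values, nothing has to be read from a database to find the file, and losing that database does not strand the files.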

How long does it live for?

I was extremely excited to hear this year that Sky, the UK digital services business, purges hundreds of gigabytes per day. Surely, in the age of data, they are quite mad, and I'm cheering on poor practice?

In fact, my excitement was not at the deletion, but at the removal from active data, as well as several other best practices conveyed in a talk by Raja M Naveed.

By removing data from active data, we shrink and focus our efforts. It also highlights that this is metadata, not core-value data. In much the same way that you don't drive to every shop when you want a new sofa, you would not want to search a database of all chairs ever made. Older chairs could be moved to a metadata store of chairs you used to own, to provide provenance.

Active Data

Active data is data you need to reduce when querying. The strict definition may differ by your choice of technology: SQL has schemas, logical databases and tables; MongoDB has collections.

Amongst the things I like to tout on my CV are sub-second searches on systems with hundreds of gigabytes of data. One of the easiest ways to do this is to not search the entire data-set. Whilst I've not yet worked for a billion-dollar company, I'm fairly certain that physics still applies to them, and that both moving and matching several billion electrons is not the fastest way to get things done.

So the more data you can put outside of active data, the more you can speed up access times. Doing so doesn't need to couple the two systems either, but you do need to know either a primary or natural key before accessing data in other systems.
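
A minimal sketch of moving rows out of active data, using sqlite3 and the chairs from earlier. The table names, columns and cutoff date are assumptions for illustration. In practice the archive might be a different database or service entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chair (
        id INTEGER PRIMARY KEY, name TEXT NOT NULL, owned_until TEXT);
    CREATE TABLE chair_archive (
        id INTEGER PRIMARY KEY, name TEXT NOT NULL, owned_until TEXT);
    """
)
conn.executemany(
    "INSERT INTO chair (id, name, owned_until) VALUES (?, ?, ?)",
    [
        (1, "Office chair", None),            # still owned, stays active
        (2, "Rocking chair", "2015-06-01"),
        (3, "Bar stool", "2018-11-30"),
    ],
)


def archive_chairs(cutoff: str) -> int:
    """Move chairs given up before the cutoff date; return how many moved."""
    where = "owned_until IS NOT NULL AND owned_until < ?"
    conn.execute(
        f"INSERT INTO chair_archive SELECT id, name, owned_until"
        f" FROM chair WHERE {where}",
        (cutoff,),
    )
    cursor = conn.execute(f"DELETE FROM chair WHERE {where}", (cutoff,))
    conn.commit()
    return cursor.rowcount


moved = archive_chairs("2019-01-01")

# Searches now touch only the active table, while an old chair is still
# reachable by its primary key.
old_chair = conn.execute(
    "SELECT name FROM chair_archive WHERE id = ?", (2,)
).fetchone()
```

The primary key is kept unchanged in the archive, which is what lets the two stores stay decoupled: anything that knows the key can go straight to the right place.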

What does it look like?

This is perhaps the most difficult question of all to answer. Much like core-value data, metadata takes many shapes. The most common are often mixed amongst our data.

This is unfortunate because it needlessly bloats the most valuable data, which leads to increases in backup time, scan time and general noise.

As time passes, data volume usually increases too, sometimes by orders of magnitude, which makes separation harder, if not unthinkable. While I applaud those who do not verify every single record after a migration, skipping that verification is not something I do when I own the business or manage the account.

Simple exercises to discern whether something should be metadata or core-data include asking:

What is the outcome if it's missing or incorrect?
Can it be derived from core-data?
How will it be accessed (in relation to something else, or standalone)?

What shape should it be?

Around 2010 I started to notice inconsistencies in digital systems due to changes in data shape.

I started to take a rather radical approach for core-data: to not change existing data-stores, but rather produce new structures and objects to deal with new data. A Person with a date of birth, multiple names and an age would be a different type to a Person with a government ID, a gender, or an income.

When dealing with metadata, I did not want to follow this pattern, as I might have some objects with some data, and some with other data. I was storing notes and appointments addressable to a case, and attributes, alliances and inventory addressable to participants, whose shape would not change, or whose metadata must remain associated.

I came up with a strategy to help, which was using adaptors and version numbers to describe the shapes of each type of data.

If I needed to save the IP address along with a user on a note, then I'd create a new object adaptor to present a single object with an optional IP address, or a known canned value, even for older records. This could cope with unit and precision changes, as well as format changes, in much the same way as a migration, but on a per-record basis.
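
A minimal sketch of adaptors keyed by version number, assuming each stored record carries a version field. The Note shape, the field names and the canned value are hypothetical and are not the original implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

UNKNOWN_IP = "0.0.0.0"  # hypothetical canned value for records without an IP


@dataclass(frozen=True)
class Note:
    """The single shape the rest of the program sees."""
    author: str
    body: str
    ip_address: str


def adapt_v1(record: dict) -> Note:
    # Version 1 records predate IP capture, so present the canned value.
    return Note(record["author"], record["body"], UNKNOWN_IP)


def adapt_v2(record: dict) -> Note:
    # Version 2 records may carry an IP address, but it is still optional.
    ip_address = record.get("ip_address") or UNKNOWN_IP
    return Note(record["author"], record["body"], ip_address)


ADAPTORS: Dict[int, Callable[[dict], Note]] = {1: adapt_v1, 2: adapt_v2}


def load_note(record: dict) -> Note:
    """Pick the adaptor matching the record's version, per record."""
    version = record.get("version", 1)
    if version not in ADAPTORS:
        raise ValueError(f"No adaptor for note version {version!r}")
    return ADAPTORS[version](record)


old = load_note({"version": 1, "author": "ada", "body": "Called client"})
new = load_note(
    {"version": 2, "author": "ada", "body": "Follow-up", "ip_address": "203.0.113.7"}
)
```

Stored records are never rewritten. A unit or format change becomes a new version number and a new adaptor, and older records keep loading through the adaptor that matches them.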

It's been one of my immense frustrations with NoSQL and free-form SQL that they do not treat versioning as being as important as primary keys. Without version numbers, a field that was previously a date becoming a date-time could lead to undesirable results.

Wrapping up

If you're new to metadata, curious, or have inherited a system, I hope this helps. I'm sure I'll be infuriated at having missed something, but there is only so much one post can cover.

By Lewis Cowles