Importing data

Sometimes we pivot. We stop what we are doing as we are doing it and re-focus or completely drop that thing for another, maybe we pick a thing up.

When we pivot, we often need to re-integrate new information. I'm a programmer & architect so I'll focus this on digital, but you may find it more broadly resonates with life.

There is an awesome book by Seth Godin called "The Dip" which I thoroughly recommend everyone reads. It's more broadly about the pre-requisite to this. The decision to go head-down full-steam ahead or stop, pivot. It's about determination and strategies to ensure that you are mindful of not pushing forwards without constant evaluation of where this fits in to a broader strategy.

Know what you want to change

One of the most frustrating things to me as an agent of change and as an agent of delivery are unclear requirements. Perhaps you know internally what you want to change, or perhaps you just know you are in pain and where that pain is coming from. In any case it's essential that you document and clearly lay out what you'd like to be different in terms of outcomes.

  • You are pre-revenue
    You need to start capturing or transforming data to support your story to investors
  • You have an academic application design
    This is proving costly & need to re-orient
  • You keep running into broad-phase data issues
    Perhaps a singular name field needing to extract first & last-name, perhaps no title

There are no free lunches and this is a painful process for some. As someone pivotal to delivery it can feel a bit like baptising a mountain lion getting stakeholders to sign-up to a plan forward. Particularly if it puts their vision on a roadmap.

Make a plan for how to get there

As I joined a call today for the nth time, I re-iterated some points that were causing pain in a migration. We have a plan, it's broad-phase, not in-the-weeds documented to the micrometer.

My own preferences on documentation

I lay somewhere between where we are and strong documentation. My preferences are unimportant until the lack of fulfilment of them causes pain, wastes time. It's not about me, it's about delivery. I'm aware that some people believe documentation is not agile and I'm quite open and honest that I think they are theives of your time and attention. I'm very pragmatic, documentation is not a set of shackles, you can and should iterate on it, it should not be ready for submission to an examining board unless there is a requirement as-such.

Our problem is simple. We have data which is coming from an existing internal system to a newer system to help launch a new product. It might not always live there; we're not building a walking skeleton, but it needs to come in, it needs to fit into a new schema because the questions we are asking are different and there were known design flaws in the old system.

Not getting your own way

I did not get to fix all the things I wanted to in this iteration. Sometimes you can argue until you're blue in the face and a business or group of stakeholders will pull for something you can see the problems in a mile away. I have a simple strategy for this. I complain loudly, attempt to point out the flaws within 15-30 minutes and then consign anything we cannot agree on to another iteration. Pontification is the enemy of delivery.

  • We have ERD diagrams of the high-level concept
  • We have code which broadly works enough for an MVP
  • we have test cases passing.
  • some of us have strong desires, we cannot reach agreement on everything. We can ship first, iterate later!

It's time for an import...

Defining a view of the world

Prior to import you'll need to ensure you have views of what you need. We were quite spartan with this, almost to a point of rigidity. All plans came from an initial ERD I'd done some months prior. Of course we manage our schema's and some of the engineers have tooling to generate an ERD. Theirs is of an application view, not a storage view, it's not source-controllable. It has a lot of unimportant system-specific nuance.

I've worked with third-party data and non-perfect data for a lot of my life. I have strategies to improve data and experience of real-world projects which worked and didn't work. One of the pivotal steps which separates the successful from the failed is how nuanced your world view of data is. In all but the most slap-dash of projects I like to avoid using an application at all when migrating it's data.

Some smells when you need to use an application

  • Presentation format is derived from non-obvious or complex rules that are not suited to ETL
  • You lack direct access to the place(s) where data is stored
  • There are known storage bugs which are overcome by the application

Benefits of using direct storage -> storage migration

  • There are less individual actors
  • There may be tooling you do not need to write or own
  • You'll reduce interstitial (often throwaway) system count
  • Verification is often simpler
  • There are often speed benefits

It's not always possible

One of the things you may find is that you need to defer or rule-out import of certain data. Businesses especially in the age of data tend to become needlessly attached to this data., but as a professional and person who wants to help them, you have to remove the sweets from those you love at times.

My track record with this is hit-and miss you cannot win every situation, only use what you know to try to give the best advice.

  • Try to tie advice to vision.
  • Try to work through real-world cases or analogies.
  • Don't shy away from the reductive if it helps.
  • Bring data if you can.
    This will effectively reduce 10 of 10,000,000 records is an easier pill to swallow than the reverse
  • Try to avoid unilateral decisions

Ultimately one thing businesses need to accept is that we are not magicians; we are engineers, so we should try our best to use a fixed-pool of resources to achieve a goal within a context. If that context changes, people need to know.

Natural keys

At best the word natural can be distracting, can provoke strong reactions for and against. I'm using it because it's a term from database systems, which I hope you'll know.

A natural key (also known as business key[1]) is a type of unique key in a database formed of attributes that exist and are used in the external world outside the database Wikipedia

So our first imports did not go well. Some data which wasn't very complex of-course moved. We had 4 maybe 5 programmers at one time or another working from our initial plans to build an ETL, high code-coverage, the trappings of success. One of the problems was that our test-data was really dependent on foreign-keys.

Because of my history I'd attempted to make use of natural keys where possible, only using foreign keys as a signalling duplicate protection mechanism where natural keys were absent and to provide an extra layer of defence. This made my import scripts look different, but also made them work, and enabled them to produce meaningful errors when they did not. I decided to change others importer scripts to look and behave like mine. This helped, I was glad when the CTO noticed it, sad when syntactic sugar was added which I felt made my own work harder for me to read at-a-glance.

Perhaps a surprising thing I'll say on ETL tasks is this. Don't write them inside of your application if you can help it! Ours was built into a monolithic application because we had a short time-to-deliver and a lot of passionate people, keen to use Ruby as much as possible. If this were a contract job, I'd walk away. There should be no place for fuzzy feelings for a specific technology.

We also had a requirement to make the tools available to the wider business, which is another reason I tried to use natural keys. They just resonate a lot better with people in my experience. "Oh lewis {email} is not importing" gives a quick transferible thing I, or others can scan artefacts for in 3, 6, 12, 18 months once we've all had a while with more interesting problems.

Imports changed meanings entirely to fit requirements better. We decided that each of the 5 main areas for import would succeed or fail at a record level. This gives the property of a document. A resource and it's sub-resources must be created or not. Transactions are a powerful feature of databases which I've found can help to gain success in these matters

  • Open a transaction
  • Try to create all interconnected objects
  • Ensure all relationships between interconnected objects exist
  • If there is success, commit the transaction, log success
  • In all other cases roll-back the transaction & report errors/li>

This helped us to make import documents more easily humanly readable, it meant we could perform set-based workloads for documents with multiple symmetric entries. I won't go into heavy detail, but read up on set-based operations, batching and computation thrashing

TLDR is that you can generally increase ideal case, without impacting worst-case by doing lots of a specific thing at once instead of doing many different things in cycles. Of course measure what is right for you, be pragmatic.

Lying with good cause

I've presented this to this point as-if there were one prior system in this analogous to an existing current project. There are actually multiple systems, following a faux-microservices, distributed monolith. There were decisions made to limit the complexity of this write-up and also import by materializing cross-system representations into a single source of truth.

We used a tool called Chartio which is more about dashboards, analytics and reporting than ETL. All I can offer here is to let your team use what they are comfortable with, but attempt to steer towards tooling made for the job. Chartio has allowed us to annotate data with corrections, impose further filtering and logic and had a read-only connection to some of the data needed. It's enabled interoperability amongst a distributed team where tooling is not standard, but uses interoperable export and import formats (CSV). I offer this not as an excuse, but to add further context to aid understanding.

This also has led to pinning some data in an interstitial format. Remote URI accessible resources are persisted using URI's for things like images. These add time to imports and often can be a source of distraction through network failures, yet more third-party services. The core thing is not to delete any third-party resources and in-fact take regular snapshots of all resources so you don't lose information.

Troubleshooting

Our next set of problems which are as-yet unresolved and in-progress are about bounding data. The business has a clear requirement to bring in all of a certain domain, bounded to a specific few entities. Those entities rely on other entities as is common when coming from an RDBMS system, and this means we need to select the greatest outer-scope first

If you had soccer teams that had played eachother and were in division 1, you'd first need to select soccer teams, not division 1 teams, if teams could play other teams outside of their division, and you wanted to honestly reflect their match history.

Just because a team is not part of division 1, if data that relies on it points to it, you have some choices

  • Don't import the data (can you do it later?)
  • Accept that you'll need to import more than just what you want, but what is connected to it
  • Don't import anything (rare, but giving up is a valid option)

Verification

It may seem strange given that we are not done that we are already in this stage. Essentially any time you iterate you need to verify the outcomes. In my current role we use Slack to communicate, Google Drive for artefacts as well as Chartio dashboards to track and produce data artefacts. I should probably link the Google drive import artefacts, chartio source, and import report logs to JIRA which is the project management tool. In-fact I shall do this tomorrow. Given a non-startup I may do this or make a dedicated resource to do this for me. Some things to avoid

  • Clutter - How can I find what I want?
  • Ambiguity - What is needed to reproduce results?

For other projects I've worked on verification has ranged from check every single record imported to check a few known entries. Some advice I will offer is start by telling the business you need to check everything, but be open to negotiate downwards, getting in writing if you cannot verify the entire data-set that it's acceptable. That writing can be as simple as an IM message from a stakeholder, but people's lives and jobs and the success of product will rest on this keystone.

Wrapping up

Having reduced some import timings, got many more records than the initial import you'll be able to deliver iterative value, showing progress until it's time to launch, deliver and move-on. I hope you've learned some things, this was by no means exhaustive, but covered some points I've had on my mind. I Hope to look back at this and find either I know better methods, or it keeps the methods I do know fresh.