
You decided to finally move your on prem Data Warehouse to the Cloud Data Provider of your choice. How should you best proceed?
Contrary to the first intuition, migrating data and structures (such as tables) is easier than you might think. For example, Google's Database Migration Service can help you a great deal if GCP is your data platform of choice. In any case, you do it once.
Hard part is migrating regular load of your DWH data, day after day, month after month. For that, you will need to convert your existing ETL to the ELT architecture of your choice.
If you have hundreds or thousands ETL jobs, programing them manually in the new technology is absolutely colossal task. It might take years and actually kill your project – since your new technology stack might get outdated even before you are finished. I've seen it happen.
So, can you automate it? Somehow convert old technology integration logic to the same logic in the new technology?
Perhaps your first thought nowadays is using AI. Think again. Do you really want to debug thousands of bugs AI hallucinates, while not really understanding the code it generated? You will need a rule based, deterministic algorithm.
But wait, some of our jobs need refactoring or optimization!
Even so, we would still recommend migrate the jobs as they are now as the first step. Why? The reason is simple – testing. If the jobs are supposed to work in the same way, you can start them on identical environments with identical parameters, and see if the results are identical. Every row, every column. You can even automate such a test (contact me to find out how).
Only when the new technology performs identically as the old one, we would suggest optimization. That is, if that is even needed in the cloud. As the last step, or perhaps a separate project, we would recommend functional / business refactoring. Sure, some of the old code does the old logic – but if you try to do refactoring and migration at once, how will you know what is the cause of a data error? Is it wrongly implemented new business logic, or migration bugs? Doing it sequentially simplifies stuff a great deal.
Let us get back to our deterministic migration algorithm. You can implement it 2 ways. You can parse the integration job in old technology and immediately generate the equivalent in the new technology stuck. But we would recommend using middle layer – just containing the specs of the source job. In such way, we can theoretically use the specs to generate target jobs in multiple technologies, if such need ever arises.
A word about middle layer – it would be great if it was human readable, or if such format could be automatically generated. In such way, the middle layer is more than just specification of integration job. It can serve as documentation as well.
Now when we have architecture of migration logic, how do we implement it? Usually, we will need SQL parser flavored for the source technology. On top of that, we will need to parse non-SQL specifics of the old ETL technology.
If you are an engineer, it is clear to you that developing a formal parser of a programming language is damn hard. But luckily, there are packaged solutions already doing that.
Our very own EPIC Ingenioso supports SQL parsing in a number of flavors. It also supports Oracle Data Integrator on the source side, IBM DataStage as well as stored procedures. On the target side we can do almost everything – dbt, Spark, Matillion, Cloud Data Platform's stored procedures.
Tags
You might also be interested in
View all articles
BLOGMigrate ODI / DataStage to dbt
BLOG

