Hear ye, hear ye, a tale of woe! This is the long, complex, and messy story of how I became a Microsoft Azure Data Factory and Data Flow expert, discovered a handful of crucial shortcomings, fought with Microsoft’s Support team, and lived through a cautionary tale on the strengths and weaknesses of this popular 4GL.1

Project Goals

The concept was simple - take a set of files from one environment, transform them per a given specification, and deposit them in a second environment, securely and with great care. Millions of dollars of fines hung in the balance. A simple little Data Factory implementation was recommended by the client’s internal IT department. The utmost care would be taken to ensure the data was secure in transit, and the whole solution could be backed by a serverless database and typical monitoring systems. We could work with networking teams to ensure data persisted down to a desired network layer.2 How hard could it possibly be to complete this glorified copy-paste?


  +---------------+     SQL   Azure     +--------------+
==|               |      DB   Alerts    |              |===>  To
  |  Storage 1    |       |   |         |  Storage 2   |      Lower
  |               |     +---------+     |              |      Environments
  |   Flat Data   |     |         |     |   Well       |
  |   & XML Meta  |     |  ADF 1  |     |   Organized  |
  |    '----------+=====+--.      | PL2 |   Data       |
  |               | PL1 |  `-DF---+=====+--->          |
  +---------------+     +---------+     +--------------+

    >>>-=======================================================->
                               flow of data

    ADF = Azure Data Factory
    DF  = Data Flow
    PL  = Private Link
    DB  = Database

Initial Solution & Emergent Problems

The Copy Activity within the ADF3 is a potent method for rapidly copying files between storage accounts within Azure. It works well, unless you have a particular use case like ours. Here is the path each individual file had to take, regardless of the networking and systems accomplishing the reading, recording, and copying:

   +--------------+             |                +---------------+
   |  Storage A   |             |                |  Storage B    |
   |              |             |                |               |
   |  Folder*     |             |                |  A/           |
   |   Docs/-     |             |                |   B/          |
   |    asdf.pdf ----------+-------------,       |    C/         |
   |   Meta/      |        |    |        |       |     D/        |
   |    info.xml ----------'    |        '-------+--->  asdf-v2.pdf
   |              |             |                |               |
   |  * 100+      |             |                |               |
   |    of these  |             |                |               |
   +--------------+             |                +---------------+

A data flow4 combined the XML metadata with data from other incoming sources to compute an end path, and each file was moved accordingly. Note that both the filename and the path change on a per-file basis. This was where our troubles began.
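
To make that per-file mapping concrete, here is a minimal sketch of the kind of path computation the data flow performed - illustrative Python rather than the data flow itself, with hypothetical XML element names standing in for the real specification:

import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

def compute_destination(meta_xml: str, source_name: str) -> str:
    """Derive a per-file destination path from the sidecar XML metadata.
    The element names (category, department, version) are hypothetical."""
    meta = ET.fromstring(meta_xml)
    category = meta.findtext("category", default="uncategorized")
    department = meta.findtext("department", default="general")
    version = meta.findtext("version", default="1")

    source = PurePosixPath(source_name)
    # Both the folder and the filename change on a per-file basis,
    # e.g. asdf.pdf + info.xml -> A/B/asdf-v2.pdf
    return f"{category}/{department}/{source.stem}-v{version}{source.suffix}"

print(compute_destination(
    "<meta><category>A</category><department>B</department><version>2</version></meta>",
    "asdf.pdf"))
# => A/B/asdf-v2.pdf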

To accomplish this within an ADF, a per-file Copy Activity must be used inside a ForEach loop to provide separate storage bindings and parameters for each source and destination file. We used a set of parent/child pipelines to break up these copy operations according to the limits of the data factory:

  1. You can only look up 5000 records from a SQL database at a time
  2. You can only handle 100,000 items in a ForEach loop
  3. You cannot nest ForEach loops
  4. HTTP requests have an unchangeable 230 second timeout

Given these limitations, I produced this design, riffing off the metadata-driven copy pattern built into the Data Factory:

  ,------------------------------------------------------------,
1 |  Handle a Folder of Folders to Process & Copy              |
  '--+---------------------------------------------------------'
     |                                                    ^
     |   ,-------------------------------------------,    |
2    +-> |  Handle a Folder of Files                 |----+
         '-+-----------------------------------------'
           |                                      ^
           |   ,-------------------------------,  |
3          +-> |  Copy a Batch of Files        |--+
               '-------------------------------'
                (Run a Copy Activity per file)
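
To put numbers on that diagram, here is a rough sketch of the batching arithmetic the parent/child pipelines had to respect. The limits are the ones listed above; the 300,000-file total is illustrative:

# The ADF limits quoted above; the total file count is illustrative.
LOOKUP_LIMIT = 5000      # max rows a single Lookup activity returns
FOREACH_LIMIT = 100000   # max items a single ForEach loop can iterate

def plan_batches(total_files, batch_size=LOOKUP_LIMIT):
    """Yield (offset, count) pages so each child pipeline's Lookup and
    ForEach stay under the documented limits; the parent pipeline exists
    so no single ForEach ever sees more than FOREACH_LIMIT items."""
    for offset in range(0, total_files, batch_size):
        yield offset, min(batch_size, total_files - offset)

# ForEach loops cannot be nested, so each 5,000-row page becomes its own
# child-pipeline invocation: 300,000 files -> 60 invocations of 5,000 copies.
print(sum(1 for _ in plan_batches(300000)))
# => 60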

On one particular test run, we burned roughly $4000 copying a couple hundred thousand files. This is an absurdly high number. Our task became clear:

  1. Investigate and determine where the money was being burned
  2. Propose a few solutions for resolving the problem
  3. Apply a fix to complete the copying in a cost-effective manner

Microsoft Support Gaslighting

At first I assumed it was me doing something wrong - of course, in the maze of levers and buttons, I had forgotten to switch something, and flipping that unflipped switch would magically prevent the burning of money. The client and I arranged to discuss and troubleshoot the issues with Microsoft Support - which I assumed would be fruitful and useful, given they designed the thing. I brought to the table a thoughtful and minimal description of the problem at hand in an attempt to be respectful of everybody’s time.

Was this support request useful?

No. It was absolutely not.

Not only was I assured by the so-called expert engineer Microsoft brought in that the Data Factory could not possibly be any cheaper, but also that an alternative written as an Azure Function would cost roughly the same.

At first, I believed them. Why shouldn’t I? Here I was, face to face with the first-party vendor, being told that I was using the system correctly and that it would not be cheaper to reimplement it with an alternative copying method. I think I went through all five stages of grief handling this nonsensical response.

Two of the Microsoft resources on the call did end up being helpful in highlighting the true problem. I do greatly appreciate this - but apart from affirming my replacement design choices, very little help was offered in designing around this clear and major flaw in the Data Factory system.

I have been heavily disincentivized from ever contacting Microsoft Support again for technical or system design issues of any kind - but I suppose it was wise to do so in this scenario, in spite of the results.

The Root Cause: Storage Re-Binding

After a thorough technical investigation, the root cause was identified. Because each copy operation pointed at a single file within both filesystems, the Integration Runtime5 had to both rebind and establish a new private link with every copy.

When this constant rebinding is paired with the Copy Activity minimums of 4 DIUs6 and one minute of billed time, each file copy racks up four DIU-minutes despite actually using only a fraction of that. For a couple hundred thousand files, this really adds up, even at only $0.359 CAD per DIU-hour:

(* (* 4              ; 4 DIU-minutes per file copy
      (/ 0.359 60))  ; $0.359 per DIU-hour, divided into minutes
   300000)           ; 300,000 files

;;=>  $7179.99  (billed per client agreement at a fractional rate)

Azure Functions to the Rescue

I was now in the hot-seat - we needed a solid and reliable replacement for this broken mess, and fast! After considering a few options, the program architect, IT, and I agreed on implementing a few Azure Functions to handle the workload. These worked in a pinch, and could be easily triggered and managed from the Data Factory. The simple-as-possible internal architecture of the pipelines and database made this a fairly easy swap. Here’s what the system looked like after:


  +---------------+     SQL   Azure     +--------------+
==|               |      DB   Alerts    |              |===>  To
  |  Storage 1    |       |   |         |  Storage 2   |      Lower
  |               |     +---------+     |              |      Environments
  |   Flat Data   |     |         |     |   Well       |
  |   & XML Meta -----> |  ADF 1  |     |   Organized  |
  |               |     |         |     |   Files      |
  |  Instructions <-----|         |     |              |
  |               |     |     #   |     |       ^      |
  +---------------+     +-----|---+     +-------|------+
          |                   |                 |
          |                   v                 |
          |    +---------------------------+    |
          |    |     Azure Functions       |    |
          +----|     Invoked by ADF        |----+
               +---------------------------+
               Read Instructions & Move Files

   >>>-====================================================->
                       flow of data
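
For illustration, here is a minimal sketch of what one of those functions might look like - assuming a Python HTTP-triggered function performing server-side blob copies. The payload shape, account URL, and the assumption that the source URLs already carry whatever access they need (e.g. a SAS token) are mine, not the production code:

import json

import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical destination account; the real one is bound through app settings.
DEST_ACCOUNT_URL = "https://storage2.blob.core.windows.net"

def main(req: func.HttpRequest) -> func.HttpResponse:
    """Read a batch of copy instructions posted by the Data Factory and
    start a server-side copy for each file."""
    # Expected payload: [{"source_url": ..., "dest_container": ..., "dest_path": ...}, ...]
    instructions = req.get_json()
    service = BlobServiceClient(DEST_ACCOUNT_URL, credential=DefaultAzureCredential())

    for item in instructions:
        dest = service.get_blob_client(item["dest_container"], item["dest_path"])
        # Server-side copy: the function never streams the file contents itself.
        # Completion polling is omitted for brevity.
        dest.start_copy_from_url(item["source_url"])

    return func.HttpResponse(json.dumps({"accepted": len(instructions)}),
                             mimetype="application/json")

The cost difference comes from the billing model: a consumption-plan function is billed for its actual execution time and can move many files per invocation, rather than paying a four-DIU, one-minute minimum for every single file.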

Despite Microsoft Support’s strange and erroneous instructions, this solution ended up being cheaper, more reliable, and easier to scale than the options available within the Data Factory.

Conclusions

Azure Data Factory is a platform with some key strengths and weaknesses that must be well understood before beginning a project with it - and that ought to be documented far more clearly. Now that I grok the system, I would use it again in the future, but only for tasks it is fit to perform. And I would be lying if I said the rapid and easy setup for simple tasks hasn’t saved my butt on a few occasions.

Strengths:

  • Really great for simple stuff once authorized against cloud resources
  • Very fast to set up simple scheduled or triggered movements and deletions
  • Easy to use the Managed Identity to provide it with access to things
  • Copy activity is powerful and reliable - but only when a simple source-to-destination bulk dump is required
  • Data flows are great in general and provide a simple and pragmatic way of handling large mapping operations with multiple source and destination datasets

Weaknesses:

  • The optimal use cases are not made clear in the documentation
  • An overwhelming number of truly strange and obscure limitations
  • Cost can become very high if you have a unique use case
  • Microsoft support does not seem familiar with the product
  • Data flow UI must be tortured to perform more complex transformations
  • No easier to understand than a handful of Azure Functions

  1. 4GL: Fourth Generation Programming Language, as coined by James Martin in 1981. ↩︎

  2. The Purdue Model zscaler.com  ↩︎

  3. Abbreviation for Azure Data Factory azure.microsoft.com  ↩︎

  4. ADF Mapping Data Flow provides the ability to run common data science tasks within an automatically provisioned Spark cluster, with a visual editor to simplify development. microsoft.com  ↩︎

  5. ADF Integration Runtime enables a serverless data factory to access secure client resources. microsoft.com  ↩︎

  6. DIU is the billing unit Microsoft uses for many serverless products, standing for Data Integration Unit. microsoft.com  ↩︎