Iterating across all Excel Sheets dynamically and Converting into CSV files via Azure Data factory / Synapse Pipeline

Related: 1|2

Problem Statement :

As of 30th Aug,2022 ; Azure data factory / Synapse pipeline do not have an out of box functionality to get the list of all sheets within an excel file. In case if there is a need to load/convert all sheets within an excel into some sink, one has to maintain a list of all Sheets manually and iterate it over a for activity but in case if the sheet list are not fixed, one has to write a custom logic in some other Azure Offering and leverage it to iterate within ADF/Synapse pipeline.

Is there a way to avoid the custom logic creation and leverage out of box ADF/ Synapse pipeline functionality to dynamically iterate overall all excel sheets and map it to corresponding sink.

Prerequisites :

  1. Azure Data Factory / Synapse Pipeline

Solution :

  1. Within Excel dataset in ADF/Synapse pipeline, there is a property of Worksheet mode with 2 options : a) Sheet Name b) Sheet Index

As the requirement is to dynamically iterate over all the excel sheets, we would leverage the property of Index as the Sheet names are unknown

2. In my previous blog of Avoidance of Infinite Iterations of Until activity within Azure Data Factory / Synapse Pipelines, we can exit the Until activity in case of any failure. We would leverage this aspect to iterate over all excel sheets based on Sheet index till the point of failure via Until activity and exit it in case of failure ( i.e. whenever we exceed the sheet index range )

3. For our current use case scenario, we have an excel with 2 Sheets and uploaded in Azure blob storage which acts as a source within ADF / Synapse Pipeline.

Create a Source dataset as Azure Blob Storage with Excel file format and create a new dataset parameter called SheetIndex which we would leverage to iterate dynamically within Until activity.

4. The SheetIndex begins with 0 , so we need to create a variable as Counter with default value as 0.

The TempCounter variable ( with default as 0 ) is used for incrementing the Counter variable in every iteration based on the below logic where in:

  1. Assign TempCounter as Counter + 1
  2. Override Counter Variable with TempCounter Variable

5. For our Current Use case, we would be leveraging Azure Blob Storage as Sink with Delimited Text format and for use to append the SheetIndex in every CSV file created, we need to create a dataset parameter Called SheetIndex in the Sink dataset as well.

Dynamic file name creation in sink :

@concat('Test',dataset().SheetIndex,'.csv')

6. Now, the normal flow would be to create a Copy activity to convert the Excel Sheet into CSV

Source Settings:

Sink Settings :

7. Post Copy activity, increment the Counter variable by 1 and follow the same process under an iteration via Until activity till the point of failure

8. The Until activity should iterate till the point of failure i.e the Counter variable value exceeds the Sheet index count (Sheet Index out of Range), post which the Until activity would be exited.

@not(equals(variables('ErrorHandling'),''))

9. Output Logs of Pipeline execution :

2 CSV files got generated each with SheetIndex number appended (0,1 mapping to the 2 different sheets within Excel ABC & DEF)

There are 2 instances of Copy activity success mapping to the conversion of 2 Excel sheets into CSV and failure of the 3rd Copy activity iteration due to index range out of bound.

Note : The Pipeline iteration would be success due to the Until Error handling scenario even if one of the activity has failed.

The JSON for the above Excel Sheet iteration framework is available at this GitHub location.

Published by Nandan Hegde

Microsoft Data MVP |Microsoft Data platform Architect | Blogger | MSFT Community Champion I am a MSFT Data Platform MVP and Business Intelligence and Data Warehouse professional working within the Microsoft data platform eco-system which includes Azure Synapse Analytics ,Azure Data Factory ,Azure SQL Database and Power BI. To help people keep up with this ever-changing landscape, I frequently posts on LinkedIn, Twitter and to his blog on https://datasharkx.wordpress.com. LinkedIn Profile : www.linkedin.com/in/nandan-hegde-4a195a66 GitHUB Profile : https://github.com/NandanHegde15 Twitter Profile : @nandan_hegde15 MSFT MVP Profile : https://mvp.microsoft.com/en-US/MVP/profile/8977819f-95fb-ed11-8f6d-000d3a560942

Leave a comment

Design a site like this with WordPress.com
Get started