Problem Statement :
As of 30th Aug,2022 ; Azure data factory / Synapse pipeline do not have an out of box functionality to get the list of all sheets within an excel file. In case if there is a need to load/convert all sheets within an excel into some sink, one has to maintain a list of all Sheets manually and iterate it over a for activity but in case if the sheet list are not fixed, one has to write a custom logic in some other Azure Offering and leverage it to iterate within ADF/Synapse pipeline.
Is there a way to avoid the custom logic creation and leverage out of box ADF/ Synapse pipeline functionality to dynamically iterate overall all excel sheets and map it to corresponding sink.
Prerequisites :
- Azure Data Factory / Synapse Pipeline
Solution :
- Within Excel dataset in ADF/Synapse pipeline, there is a property of Worksheet mode with 2 options : a) Sheet Name b) Sheet Index
As the requirement is to dynamically iterate over all the excel sheets, we would leverage the property of Index as the Sheet names are unknown
2. In my previous blog of Avoidance of Infinite Iterations of Until activity within Azure Data Factory / Synapse Pipelines, we can exit the Until activity in case of any failure. We would leverage this aspect to iterate over all excel sheets based on Sheet index till the point of failure via Until activity and exit it in case of failure ( i.e. whenever we exceed the sheet index range )
3. For our current use case scenario, we have an excel with 2 Sheets and uploaded in Azure blob storage which acts as a source within ADF / Synapse Pipeline.
Create a Source dataset as Azure Blob Storage with Excel file format and create a new dataset parameter called SheetIndex which we would leverage to iterate dynamically within Until activity.
4. The SheetIndex begins with 0 , so we need to create a variable as Counter with default value as 0.
The TempCounter variable ( with default as 0 ) is used for incrementing the Counter variable in every iteration based on the below logic where in:
- Assign TempCounter as Counter + 1
- Override Counter Variable with TempCounter Variable
5. For our Current Use case, we would be leveraging Azure Blob Storage as Sink with Delimited Text format and for use to append the SheetIndex in every CSV file created, we need to create a dataset parameter Called SheetIndex in the Sink dataset as well.
Dynamic file name creation in sink :
@concat('Test',dataset().SheetIndex,'.csv')
6. Now, the normal flow would be to create a Copy activity to convert the Excel Sheet into CSV
Source Settings:
Sink Settings :
7. Post Copy activity, increment the Counter variable by 1 and follow the same process under an iteration via Until activity till the point of failure
8. The Until activity should iterate till the point of failure i.e the Counter variable value exceeds the Sheet index count (Sheet Index out of Range), post which the Until activity would be exited.
@not(equals(variables('ErrorHandling'),''))
9. Output Logs of Pipeline execution :
2 CSV files got generated each with SheetIndex number appended (0,1 mapping to the 2 different sheets within Excel ABC & DEF)
There are 2 instances of Copy activity success mapping to the conversion of 2 Excel sheets into CSV and failure of the 3rd Copy activity iteration due to index range out of bound.
Note : The Pipeline iteration would be success due to the Until Error handling scenario even if one of the activity has failed.
The JSON for the above Excel Sheet iteration framework is available at this GitHub location.