Get Latest Folder / File Detail from Azure Data Lake Storage via Synapse / Azure Data Factory Pipeline

Problem Statement :

Is it possible to Get Latest Folder / File Detail from Azure Data Lake Storage via Synapse / Azure Data Factory Pipeline without the use of dataflow usage as normal expression builder doesn’t support SORT & MAX functions.

Prerequisites :

  1. Azure Data Lake Storage
  2. Synapse / ADF

Solution :

ADLS Hierarchy : Which contains multiple Folders and files within datasharkx15 Root Folder.

In case of Folder aspect, Final 2 is the folder with the latest modified date and Test2.txt is the file with the latest modified date.

  1. Create 3 variables SampleDate, LatestDate & LatestObjectType and assign a minimum date value ‘1900-01-01’ to the SampleDate which we would use to compare with the Last Modified date of the ADLS object types and store the Greater Date among those in LatestDate variable.

Create a Parameter ObjectType which would contain either Folder or File as the value based on which object type latest name we want to identify.

2. Overall flow :

Using Get Metadata activity, we would get the list of Object names under the Root Folder.

Binary1 Dataset for Root Folder :

3. Using Filter activity, filter the list of Object names which is either Folder or File depending on the Parameter ‘ObjectType’ under the Root Folder.

Items: @activity('GetListOfChildItems').output.childItems

Condition: @equals(item().type,pipeline().parameters.ObjectType)

4. Pass the output of Filter activity to ForEach activity to get Meta Data of individual Objects

@activity('FilterObjectType').output.value

5. Inside ForEach activity, get the Last Modified date of the each iteration Object (Folder / File)

Dataset:

where @dataset().path is a dataset parameter :

whose value would be the ForEach iteration item().name specifying either the Folder name or File Name in every iteration.

6. Using If Condition activity check whether the current Object’s Last Modified of ForEach is greater than the SampleDate variable value.

@greater(ticks(activity('Get Metadata2').output.lastModified),ticks(formatDateTime(variables('SampleDate'))))

7. If the condition is True ( Meaning the current iterations Modified date is greater than the current value of Sample Date ) , then set the LatestDate variable to the objects Last Modified date and Pass the LatestDate variable to the SampleDate variable.

Also update the LatestObjectType variable to the Object Name in that iteration.

LatestDate:

@activity('Get Metadata2').output.lastModified

SampleDate:

LatestObjectType:

@activity('Get Metadata2').output.itemName

8. After looping all the folders/files via ForEach activity, the value of the variable : LatestObjectType would be the name of the folder / file with the latest timestamp.

Outputs:

with ObjectType value=Folder

First Get Meta Data output:

Filter Activity Output :

Final Output :

The Export Template for the above framework is available at this GitHub location.

Note :

In case if the number of folders are significantly high, it would be better to filter on the Get Meta Data scope as below :

Published by Nandan Hegde

Microsoft Data MVP |Microsoft Data platform Architect | Blogger | MSFT Community Champion I am a MSFT Data Platform MVP and Business Intelligence and Data Warehouse professional working within the Microsoft data platform eco-system which includes Azure Synapse Analytics ,Azure Data Factory ,Azure SQL Database and Power BI. To help people keep up with this ever-changing landscape, I frequently posts on LinkedIn, Twitter and to his blog on https://datasharkx.wordpress.com. LinkedIn Profile : www.linkedin.com/in/nandan-hegde-4a195a66 GitHUB Profile : https://github.com/NandanHegde15 Twitter Profile : @nandan_hegde15 MSFT MVP Profile : https://mvp.microsoft.com/en-US/MVP/profile/8977819f-95fb-ed11-8f6d-000d3a560942

Leave a comment

Design a site like this with WordPress.com
Get started