Read and Import Data from PDF File using MSFT Fabric

Problem Statement :

Often, our data doesn’t come in a neat Excel sheet or CSV file, but is buried as a table in a PDF. The challenge with PDF files is that it can have a variety of content ranging from text, tables, images, hyperlinks etc.

There are programming frameworks which have libraries that can be used for extracting data from these files. But this again poses the challenge for a non-technical user.

Is it possible to Read and Import data from PDF file in an easier and an automated way .

Prerequisites :

  1. MSFT Fabric Dataflow Gen 2
  2. Storage Account

Sample PDF :

Solution :

  1. Login to https://app.fabric.microsoft.com/home?experience=data-factory and Select Dataflow Gen 2.

And rename the Dataflow :

2. Click ‘Get Data’ >> More and Select Azure Blobs

3. Create a new connection (In case if there isn’t an existing one) via your organization account and Click ‘Next’.

Note : In this scenario, we would use Account Key authentication.

4. The list of Containers within the Azure blob would be visible as seen below :

Filter for your Container and Select the PDF file whose data we need to extract.

5. The PDF connector detected all the 6 tables in our sample PDF file, so it provides an option to import specific tables. This is a great option for those who are interested in a specific piece of data from the PDF file. If you analyze carefully, it also tells you which table is on which page. The rest of the two options are at the bottom of the list, provides an option to read the entire page. There is 1 table (item) created per page.

Note : There are 5 Pages within the PDF file and 6 Tables distributed across the 5 Pages.

We need to get data from one table, therefore, select Table 001, and it shows the table on the right-hand side page.

6. Do the necessary transformation that you need in the Power Query mode.

In this scenario, we would Convert the first row as headers.

Final State :

The Advanced editor code :

let

Source = AzureStorage.Blobs("https://datasharkxstg.blob.core.windows.net/"),
#"Navigation 1" = Source{[Name = "fabric"]}[Data],
Navigation = #"Navigation 1"{[#"Folder Path" = "https://datasharkxstg.blob.core.windows.net/fabric/", Name = "DataSharkXSamplePDF.pdf"]}[Content],
#"Imported PDF" = Pdf.Tables(Navigation, [Implementation = "1.3"]),
#"Navigation 2" = #"Imported PDF"{[Id = "Table001"]}[Data],
#"Promoted headers" = Table.PromoteHeaders(#"Navigation 2", [PromoteAllScalars = true])
in
#"Promoted headers"

7. Now Select the Sink / Destination settings

As of today only 4 Sinks are supported :

So for our use case, we would Select Lakehouse.

8. Filter the Workspace and Select the Lakehouse.

You can either create a new table on run or map it to an existing one.

In our use case, we would create a new table in destination with the name PDFExtractionSink.

9. One can either append the data or replace the data based on the settings

We would proceed with Replace data scenario.

And Click on “Save settings”.

10. Finally Click on “Publish”

11. The Dataflow begins the 1st execution once it’s published.

Lakehouse output :

12. To Execute again, Click on the Refresh now component of the dataflow.

As we have enabled Replace data at sink, the data would be overwritten.

13. In order to schedule the data extraction, one can either schedule the dataflow refresh frequency via Dataflow Settings:

Or schedule via Fabric Data pipelines.

Published by Nandan Hegde

Microsoft Data MVP |Microsoft Data platform Architect | Blogger | MSFT Community Champion I am a MSFT Data Platform MVP and Business Intelligence and Data Warehouse professional working within the Microsoft data platform eco-system which includes Azure Synapse Analytics ,Azure Data Factory ,Azure SQL Database and Power BI. To help people keep up with this ever-changing landscape, I frequently posts on LinkedIn, Twitter and to his blog on https://datasharkx.wordpress.com. LinkedIn Profile : www.linkedin.com/in/nandan-hegde-4a195a66 GitHUB Profile : https://github.com/NandanHegde15 Twitter Profile : @nandan_hegde15 MSFT MVP Profile : https://mvp.microsoft.com/en-US/MVP/profile/8977819f-95fb-ed11-8f6d-000d3a560942

Leave a comment

Design a site like this with WordPress.com
Get started