Problem Statement :
Power BI REST API : Datasets – Execute Queries has the below limitation :
Is there an automated way to extract data from a Power BI dataset when the result exceeds 15 MB or 100,000 rows?
Prerequisites :
- Power BI Report / Dataset
- Azure Data Factory / Synapse
- Azure SQL Database / SQL Database
- Azure Blob Storage
- Logic App (this can be avoided if we leverage a Web activity / Dataflow in Data Factory / Synapse)
In my previous blog, we leveraged Service Principal authentication to extract data from Power BI; for a change of scenario, this use case leverages a Logic App and AD user authentication to extract data from the dataset.
Solution :
For the security configurations, please follow up to step #1 of the blog Copy Data from Power BI through Azure Data Factory / Synapse Pipeline – Part 3 to enable Managed Identity authentication.
Since we would also be leveraging user ID authentication, grant the user Dataset.Read.All rights on the dataset from which you want to extract the data.
GitHub Code
The pipeline would have 4 variables : RowCount, LowerRange, UpperRange and FinalQuery.
1. To overcome the row / size limitation, we need to get the overall row count (the total number of rows within the dataset), partition the dataset into batch ranges, extract the data partition by partition, and finally merge the partitions into a single file.
2. To get the overall row count, we leverage a Web activity to execute a query against the Power BI dataset.
URL : @concat('https://api.powerbi.com/v1.0/myorg/datasets/',pipeline().parameters.PBIDatasetId,'/executeQueries')
Body :
{"queries":[{"query":"@{pipeline().parameters.RowCountQuery}"}],"serializerSettings":{"includeNulls":true}}
Where the value for the parameter RowCountQuery would be a DAX query to get the row count :
EVALUATE SUMMARIZE(DataSharkX,"Count Row",COUNT(DataSharkX[EmployeeID]))
Below is my Power BI Dataset table :
Output :
3. We now need to extract the row count from the JSON response and assign it to the RowCount variable. The expression below stringifies the single result row, splits it on ':', takes the last piece, and strips the trailing '}' to leave just the count.
Expression :
@replace(string(last(split(string(activity('Row Count').output.results[0].tables[0].rows[0]),':'))),'}','')
Output :
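For reference, a minimal Python sketch of the same call and extraction (token acquisition, the dataset ID and the returned value are illustrative assumptions; the pipeline itself does this with the Web and Set Variable activities):

import requests

# Illustrative placeholders - not real values
dataset_id = "<dataset-id>"
token = "<aad-access-token>"

url = f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries"
body = {
    "queries": [{"query": 'EVALUATE SUMMARIZE(DataSharkX,"Count Row",COUNT(DataSharkX[EmployeeID]))'}],
    "serializerSettings": {"includeNulls": True},
}
resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

# The single result row looks roughly like {"[Count Row]": 250000}; its one
# value is what the split/replace expression in step #3 extracts as well.
row = resp.json()["results"][0]["tables"][0]["rows"][0]
row_count = int(next(iter(row.values())))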
4. For this scenario, we plan to extract the data in batches of 90000 rows, so we have assigned the value 90000 to the PartitionRange pipeline parameter.
Now, based on the overall row count and the partition range, we need to determine the number of iterations / batches in which to extract the data.
For our use case, we used a Lookup activity to execute a SQL query against an Azure SQL Database.
Query value :
-- Note: the doubled at-signs are the Data Factory escape for a literal '@',
-- since this query value contains pipeline expressions.
declare @@initialize int = 1
declare @@partition int = @{pipeline().parameters.PartitionRange}
declare @@count int = @{variables('RowCount')}
-- Integer division: one loop batch per full partition, on top of the
-- seeded first range below.
declare @@maxrun int = ceiling(@@count/@@partition)
declare @@dividendcount int = @@partition
-- Seed the first range: 1;<partition>
declare @@con varchar(max) = (select CONCAT(@@initialize,';',@@dividendcount))
declare @@counter int = 0
while (@@counter<@@maxrun)
begin
set @@initialize = (select @@dividendcount)
set @@dividendcount = (select @@partition + @@dividendcount)
set @@con = (select CONCAT(@@con,'|',@@initialize,';',@@dividendcount))
set @@counter = (select @@counter+1)
end
select @@con AS ConValue
Output :
where we have generated 3 partitions in the ranges :
a) 1;90000
b) 90000;180000
c) 180000;270000
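As a cross-check, a minimal Python sketch of the same partition logic (the 250000 row count is an illustrative assumption consistent with the three ranges above):

# Mirrors the T-SQL above: integer division gives one loop batch per full
# partition, plus the seeded first range of 1;<partition>.
def generate_partitions(row_count: int, partition_range: int) -> str:
    lower, upper = 1, partition_range
    parts = [f"{lower};{upper}"]
    for _ in range(row_count // partition_range):
        lower, upper = upper, upper + partition_range
        parts.append(f"{lower};{upper}")
    return "|".join(parts)

print(generate_partitions(250000, 90000))
# 1;90000|90000;180000|180000;270000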
5. Split the ConValue (output of step #4) on '|' and iterate over it with a For Each activity.
Items :
@split(string(activity('Generate Partitions').output.value[0].ConValue),'|')
6. Within the For Each activity :
Set the values for the variables LowerRange & UpperRange.
A) LowerRange
Expression :
@substring(item(),0,indexOf(item(),';'))
Output :
B) UpperRange
Expression :
@substring(item(),add(indexOf(item(),';'),1),sub(sub(length(item()),indexOf(item(),';')),1))
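A quick sanity check of both expressions against one For Each item (Python, illustrative):

# item() for the second iteration of the For Each activity
item = "90000;180000"
sep = item.index(";")
lower_range = item[:sep]      # @substring(item(),0,indexOf(item(),';'))
upper_range = item[sep + 1:]  # @substring(item(),add(indexOf(item(),';'),1),...)
print(lower_range, upper_range)  # 90000 180000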
C) Generate the Final Query to execute on the Power BI dataset to extract the data in batches.
The DAX query below is the equivalent of the ROW_NUMBER function in T-SQL: we generate a Rank column in the source dataset and, in every iteration, filter the dataset to the rows whose rank falls between the LowerRange and UpperRange. Note that RANK.EQ assigns equal ranks to ties, so this behaves like a true row number only when EmployeeID values are unique.
DEFINE
    VAR Employee1 = SUMMARIZECOLUMNS ( DataSharkX[EmployeeID], DataSharkX[EmployeeName], DataSharkX[Rating], "EID", VALUES ( DataSharkX[EmployeeID] ) )
    VAR EmployeeTable = ADDCOLUMNS ( Employee1, "Rank", RANK.EQ ( [EID], DataSharkX[EmployeeID], ASC ) )
EVALUATE
    FILTER ( EmployeeTable, [Rank] >= @{variables('LowerRange')} && [Rank] < @{variables('UpperRange')} )
Output :
D) Trigger the Logic App synchronously; it will execute this FinalQuery against the Power BI dataset and upload the resulting file to Azure Blob Storage.
7. Logic App Design
The highlighted (SharePoint) section can be ignored in the current context if need be.
a) HTTP Request :
Where Query and Iteration are the input parameters that map to the FinalQuery and LowerRange variable values from ADF / Synapse.
b) Run Query against Dataset
Log in via the user ID that has the necessary access on the Power BI dataset, and select the relevant workspace name and dataset.
c) Create CSV Table
d) Create Blob
Connect to the resource group and provide the storage account details.
And the blob name is the expression below :
concat(utcNow(),triggerBody()?['Iteration'],'pbi.csv')
e) (Optional) Create File in SharePoint
Where the file name is the expression below :
concat(utcNow(),triggerBody()?['Iteration'],'pbi.csv')
f) For the synchronous call from the ADF / Synapse Webhook activity, instead of using a Request/Response action, we use an HTTP action to POST a message to the callBackUri.
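The callback contract itself is small: any POST to the callBackUri releases the waiting Webhook activity. A hedged Python sketch of that final call (in the Logic App this is just an HTTP action; the status/error body shape follows the documented Webhook activity contract and should be verified):

import requests

def complete_webhook(callback_uri: str, succeeded: bool = True) -> None:
    # An empty POST is enough to complete the Webhook activity; a JSON
    # body can additionally report a status or error back to the pipeline.
    body = {"statusCode": 200} if succeeded else {
        "statusCode": 500,
        "error": {"ErrorCode": "BatchFailed", "Message": "Query or upload failed"},
    }
    requests.post(callback_uri, json=body).raise_for_status()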
Output after all iterations :
8. To merge the individual files into a single file, leverage a Copy activity.
Source Setting :
where Source Dataset :
Sink Setting :
where Sink Dataset :
Output :
9. Finally, delete the individual partitioned data files.
Note :
One can leverage a Web activity in place of the Webhook / Logic App to execute the queries as well, but the partition range then needs to be much lower than 100,000, as the Web activity has a smaller response size limit (4 MB at the time of writing).
Error message for a response greater than that size :
Alternatively, one can use a Dataflow activity to execute the queries.
File Output :
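To make the overall batching idea concrete, here is a hedged end-to-end Python sketch of the loop the pipeline performs: one rank-filtered executeQueries call per partition, with the rows accumulated client-side (names, token handling and the batch size are illustrative assumptions; the pipeline itself does this with the activities above):

import requests

def extract_all(dataset_id: str, token: str, row_count: int, batch: int = 90000) -> list:
    url = f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries"
    headers = {"Authorization": f"Bearer {token}"}
    rows = []
    lower = 1
    while lower <= row_count:
        upper = lower + batch
        # Same ranked DAX as step #6, with the range substituted in
        dax = (
            'DEFINE VAR Employee1 = SUMMARIZECOLUMNS(DataSharkX[EmployeeID],'
            'DataSharkX[EmployeeName],DataSharkX[Rating],"EID",VALUES(DataSharkX[EmployeeID])) '
            'VAR EmployeeTable = ADDCOLUMNS(Employee1,"Rank",RANK.EQ([EID],DataSharkX[EmployeeID],ASC)) '
            f"EVALUATE FILTER(EmployeeTable,[Rank]>={lower} && [Rank]<{upper})"
        )
        body = {"queries": [{"query": dax}], "serializerSettings": {"includeNulls": True}}
        resp = requests.post(url, json=body, headers=headers)
        resp.raise_for_status()
        rows.extend(resp.json()["results"][0]["tables"][0]["rows"])
        lower = upper
    return rows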
Comments :
Thanks for the wonderful tutorial, I really appreciate your time and effort in writing this. There is a syntax mistake in LowerRange: it should be @substring(item(),0,indexOf(item(),';')).
Also, instead of using a Logic App, we can trigger a Web activity to fetch the data using the final query and save the output to blob storage (as a JSON file), then use a Copy activity to create a merged CSV or insert into a database. This also helps avoid the JSON-flattening cost of a Dataflow and saves money.
Note – make sure to save only the JSON result rows, using @activity('Extract data').output.results[0].tables[0].rows, otherwise the Copy activity will not be able to read the multiline output.
The lower partition range query is not working.