Welcome back to the clash of the data prep tools. For the second test, I have decided to go for an Alteryx challenge. This will be the format going forward: one week a Preppin' Data challenge, the next an Alteryx weekly challenge.
This challenge starts with a file of data about open orders and a file about shipments. The challenge replicates a real-life business problem: calculating the order status. A link to the challenge can be found here.
The requirements were to flag orders from the shipments file that had been fully shipped and those where the order still has lines to ship. It was also required to flag individual lines as complete or partially shipped.
An order could have more than one line, and a line can be partially shipped on more than one shipment. If an order exists in the open orders file, then the order has one or more lines still to ship.
The two input files can be seen below.
Given that this was an Alteryx challenge, it only seemed fair to start with that tool. The full workflow can be seen below.
The first step taken was to parse the data into separate fields rather than one concatenated field. To do this I used the text-to-columns tool, first splitting on the pipe, then splitting on the commas. I now had four fields for orders and five for shipments.
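For anyone who prefers to see this as code, the same two-stage parse can be sketched in pandas (the sample values and field name below are hypothetical, just to mirror the shape of the input):

```python
import pandas as pd

# Hypothetical sample mirroring the single concatenated field in the inputs
raw = pd.DataFrame({"Field_1": ["1|2,3,4", "5|6,7,8"]})

# First split on the pipe, then split the right-hand part on the commas,
# like chaining two Text to Columns tools
left = raw["Field_1"].str.split("|", expand=True)
orders = pd.concat([left[0], left[1].str.split(",", expand=True)], axis=1)
```

Each input row ends up as four separate columns, matching the four fields for orders.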
At this point, I needed to take the field names from the first row of data. This is something that is very easy to do in Alteryx using the dynamic rename tool. I suddenly realised that this was going to be Tableau Prep's downfall in this challenge, as there is currently no straightforward way to do this. Read on to find out…
Now the data was in the correct format, I could start trying to identify order status. First, orders and shipments were joined on order number and line number. Those that joined were identified as having the line partially complete, as they had shipped but were still open. Those that didn't join were identified as having the line complete. The two streams were then unioned together.
The next join took the unique order numbers from the original open orders stream and the previous join. Those order numbers that joined were flagged as partially complete and those that didn’t join were flagged as complete. The two streams were once again unioned.
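The join logic above can be sketched in pandas. This is a minimal stand-in, not the actual challenge data: the column names and sample rows are assumptions, but the status logic follows the steps described.

```python
import pandas as pd

# Hypothetical stand-ins for the two inputs (column names assumed)
shipments = pd.DataFrame({"Order": [10, 10, 11], "Line": [1, 2, 1]})
open_orders = pd.DataFrame({"Order": [10], "Line": [2]})

# Join shipments to open orders on order + line: a match means the line
# shipped but is still open (partially shipped); no match means complete
lines = shipments.merge(open_orders, on=["Order", "Line"],
                        how="left", indicator=True)
lines["Line Status"] = lines["_merge"].map(
    {"both": "Partial", "left_only": "Complete"})

# Order level: any order still present in the open orders file
# has lines left to ship, so the order itself is only partially shipped
open_set = set(open_orders["Order"])
lines["Order Status"] = lines["Order"].map(
    lambda o: "Partial" if o in open_set else "Complete")
result = lines.drop(columns="_merge")
```

The `indicator=True` flag is doing the work of the join/union pattern: it tells you which rows matched ("both") and which fell out of the join ("left_only"), so the two streams never need to be physically separated and unioned back together.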
Finally, the order numbers were joined back to the stream from the original join to give the final output, which can be seen below. Not an overly complex challenge in terms of the tools used; it just required a bit of thinking to work out the join logic. Now onto Tableau Prep!
Tableau Prep Builder:
This challenge was definitely not something that was out of reach for Prep; however, as already highlighted, I knew the dynamic rename would cause some issues. It's very easy to quickly rename fields in Prep by just typing into the field header, but if you were using a large data set this would not be a great solution. You would also still need to remove the 'header' row from the data. I will show an example of this approach below.
With the recent introduction of scripting in 2019.3, I figured this would potentially be a viable approach, so I decided to give it a go and write my first Python script. My full workflow can be seen below.
The first step I took was very similar to Alteryx: using the custom split function in Prep, I first split on the pipes and then the commas. Easy enough. Next up came the big difference between the two. How could I dynamically rename the field headers from the first row of data? The simple answer is, you can't. Even having used a Python script to achieve this, there is still an element of hard coding, so the solution is not truly dynamic as it is in Alteryx.
One way to get the field headers from the first row of data would be to rename the fields manually. As previously mentioned, this is not a great solution, because with a large data set you'd be doing lots of renaming. You also then have to remove the header row from the data. With no default 'skip 1st row' option as you would have in Alteryx, you have to be slightly more creative with the solution.
Given this was a small data set, I could see that none of the values in the fields exceeded two characters, so I just applied a filter to exclude anything longer than two characters. Again, if you had a larger, more varied data set this may not be a viable option.
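The filter trick is a one-liner in pandas terms (the values below are hypothetical, just to show the idea):

```python
import pandas as pd

# Hypothetical column of split values; the header text is longer than
# two characters while the real data values are not
s = pd.Series(["Order Number", "1", "2", "10"])

# Exclude anything longer than two characters, dropping the header row
kept = s[s.str.len() <= 2]
```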
A solution could be to join the data to itself to generate a row ID, like in this blog, and then remove the first row, however, this is a fairly tedious process for something that should be quite a simple task.
A final idea that I explored was to split the data as usual, then output the files and bring them back in, to allow the data interpreter to work its magic. The problem is, from Prep Builder you can only output to .tde, .hyper or .csv files… and when you read these file types back in, you cannot use the data interpreter. Another dead end.
I then decided to try and write a Python function which would rename the fields. I have hardly touched Python before, aside from a few online videos, so after some googling and a bit of trial and error I came up with the following script.
The three ‘stages’ to the function are:
- Create a variable called header, which contains the first row of data from the data frame.
- Rename the columns in the data frame with the variable ‘header’
- Take the data frame from the second row ([0:] would be from the first row)
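The three stages above can be sketched as follows. This is my reconstruction of the approach rather than the exact script, and it assumes the data arrives as a pandas DataFrame whose first row holds the real field names:

```python
import pandas as pd

def rename_fields(df):
    # Stage 1: take the first row of data as the header values
    header = df.iloc[0]
    # Stage 2: rename the columns using those values
    df.columns = header
    # Stage 3: keep the data from the second row onward
    # (df[0:] would keep the header row in the data as well)
    return df[1:]
```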
Having tested the function in Jupyter Notebook, I now needed to get the script running in Tableau Prep. When you insert a script step into a Prep flow, you need to configure the server you're connecting to: Rserve if you're writing in R, or TabPy for Python users. There are numerous guides to installing TabPy on your machine, but here is a good one from Alex Fridriksson.
Once TabPy was installed and running, I was good to go. When using a script in Prep Builder, you get the configuration pane shown below. First, you must select which server you want to connect to: Rserve if you're running an R script, TabPy if you're running a Python script. You then browse to your saved script and call the function at the bottom of the configuration pane.
I ran the script through Prep Builder but was confused to see an output with an empty data frame, despite no errors being thrown. This was strange, as I had tested the script on a separate data frame directly in Python, so I knew that it worked.
After a bit of digging (lack of paying attention… it does actually say at the bottom of the configuration pane), I found out that when scripting in Prep, if you change your field headers, you need to re-identify them in your script! This was frustrating to discover as I thought I had found a dynamic solution to the renaming, but alas, there was still some hard-coding to do.
To identify the new fields in the script, you need to write a function called get_output_schema and declare the field names and types. My fully functioning script can be seen below. Because there is an extra field in the shipments data, I had to use a slightly different script to account for this additional field.
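A minimal sketch of what such a schema function looks like is below. The field names are assumptions based on the challenge data, not the exact names from my script, and the `prep_*` helpers are normally supplied by Prep itself, so they are stubbed here purely so the sketch runs outside a script step:

```python
import pandas as pd

# Outside Prep these prep_* helpers don't exist, so stub them here;
# inside a Prep script step, Prep supplies the real ones
def prep_string():
    return pd.Series([], dtype=str)

# Prep calls get_output_schema() to learn the field names and types
# the script will return; these names are hypothetical
def get_output_schema():
    return pd.DataFrame({
        "Order Number": prep_string(),
        "Line Number": prep_string(),
        "Line Status": prep_string(),
    })
```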
I now had my renamed fields and was good to carry on with the task. From this point onward, due to the nature of Prep’s joining tool and being able to select join types, the task was actually easier and achieved with fewer steps than in Alteryx.
I chose to take a very similar approach. First, I did a right outer join on order number and line number; those that didn't join were identified as having completed lines and those that did were identified as partial. I then aggregated the order numbers and joined these together, the same as in Alteryx, and flagged orders as complete or partial. The final stage was to join these order numbers back to the original flow, and I had my final output.
Alteryx has to be the winner in this challenge, as there was no way to dynamically rename the fields in Prep. Despite writing a script which made this dynamic in Python itself, there was still an element of hard coding to get the functionality in Prep.
Had dynamic renaming been an option in Prep, the tool would have passed the challenge with flying colours, as the option to select which type of join you'd like to use removes the need for the unions which are required in Alteryx.
Hopefully, this will be a feature that is added to Prep in the near future as having field headers in the first row of data is a common issue. Of course, a lot of the time Tableau’s data interpreter can get around this issue for you, however, this is not always the case, especially if data prep is required before you get to the rename process like in this challenge.
I think it would be useful to have an area where you could save commonly used scripts and quickly call them within Prep Builder, almost like a script repository. There are already some great scripts out there which extend Prep's functionality:
- Josh Milligan (useful introductory scripts): https://vizpainter.com/pythonintableauprep/ & https://vizpainter.com/morepythonscriptsfortableauprep/
- Sarah Battersby (converting Geohash to lat/long): https://community.tableau.com/people/sarah.battersby.0/blog/2019/07/29/converting-geohash-to-latlon-in-tableau-prep-20193-with-tabpy
- Rob Carroll (multi-row formula): https://community.tableau.com/docs/DOC-23322
Overall, a good challenge, and I'm happy to have written my first bit of code and got it working in Tableau Prep Builder.