ETL & Data Enrichment with Spark.NET and ML.NET Automated (Auto) ML
This sample takes a restaurant violation dataset from the NYC Open Data portal and processes it using Spark.NET. The processed data is then used to train a machine learning model that attempts to predict the grade an establishment will receive after an inspection. The model is trained using ML.NET, an open-source, cross-platform machine learning framework. Finally, records for which no grade currently exists are enriched using the trained model to assign an expected grade.
For a detailed write-up, check out the Restaurant Inspections ETL & Data Enrichment with Spark.NET and ML.NET Automated (Auto) ML blog post.
This project was built using Ubuntu 18.04 but should work on Windows and Mac devices.
The dataset used in this solution is the DOHMH New York City Restaurant Inspection Results and comes from the NYC Open Data portal. It is updated daily and contains assigned and pending inspection results and violation citations for restaurants and college cafeterias. The dataset excludes establishments that have gone out of business. Although the dataset contains several columns, only a subset is used in this solution. For a detailed description of the dataset, visit its page on the NYC Open Data portal.
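This solution is made up of several .NET Core applications, including RestaurantInspectionsETL, RestaurantInspectionsTraining, and RestaurantInspectionsEnrichment.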
git clone https://github.com/lqdev/RestaurantInspectionsSparkMLNET.git
Before building the code, update the location of the solution in the RestaurantInspectionsTraining and RestaurantInspectionsEnrichment projects.
Replace the value of solutionDirectory with the path where your solution is saved.
Original:
string solutionDirectory = "/home/lqdev/Development/RestaurantInspectionsSparkMLNET";
New:
string solutionDirectory = "<YOUR-SOLUTION-PATH>/RestaurantInspectionsSparkMLNET";
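If you prefer not to edit the files by hand, the replacement can be scripted. The following is a minimal sketch, not part of the original project: the helper name set_solution_directory is hypothetical, and it assumes the assignment sits on a single line in exactly the form shown above. It also assumes the new path contains no `|` or `&` characters, since those are special to sed here.

```shell
# Hypothetical helper: rewrite the solutionDirectory assignment in one C# file.
#   $1 = path to the source file that defines solutionDirectory
#   $2 = directory that contains the cloned RestaurantInspectionsSparkMLNET folder
set_solution_directory() {
    file="$1"
    parent="$2"
    # Replace whatever string literal is currently assigned to solutionDirectory.
    # -i.bak edits in place and keeps a backup copy next to the file.
    sed -i.bak -E \
        "s|(string solutionDirectory = \")[^\"]*(\";)|\1${parent}/RestaurantInspectionsSparkMLNET\2|" \
        "$file"
}

# Example call. The file location is an assumption; point it at whichever file
# in each project actually defines solutionDirectory.
# set_solution_directory RestaurantInspectionsTraining/Program.cs "$HOME/Development"
```

Run it once for each of the two projects, then check the result with git diff before building.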
dotnet publish -f netcoreapp2.1 -r ubuntu.18.04-x64
dotnet build
dotnet publish -f netcoreapp2.1 -r ubuntu.18.04-x64
From the project directory, run the application with spark-submit.
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish/microsoft-spark-2.4.x-0.4.0.jar dotnet bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish/RestaurantInspectionsETL.dll
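The command above has four parts: the DotnetRunner class that launches the .NET process, the local Spark master, the Microsoft Spark jar from the publish output, and the dotnet command that runs the application DLL. As a rough sketch of how those parts fit together, the hypothetical helper below (spark_submit_cmd is not part of the project) only prints the command for a given publish directory and DLL name, so you can inspect it before running it.

```shell
# Hypothetical helper: print the spark-submit invocation for a published Spark.NET app.
#   $1 = publish directory, $2 = application DLL name
# The runner class and jar file name are the ones used in this README.
spark_submit_cmd() {
    publish_dir="$1"
    app_dll="$2"
    printf '%s\n' "spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local ${publish_dir}/microsoft-spark-2.4.x-0.4.0.jar dotnet ${publish_dir}/${app_dll}"
}

# Print the command for the ETL project; copy the output into the terminal to run it.
spark_submit_cmd bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish RestaurantInspectionsETL.dll
```

If you publish for a different runtime identifier or Spark version, the publish path and jar name will differ, so adjust the arguments and the jar name accordingly.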
dotnet run
Navigate to the publish directory. In this case, it’s bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish.
From the publish directory, run the application with spark-submit.
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.4.0.jar dotnet RestaurantInspectionsEnrichment.dll