Aaand the New NiFi Champion is…



On May 3, 2023, Cloudera kicked off a contest called “Best in Flow” for NiFi developers to compete to build the best data pipelines. This blog is to congratulate our winner and review the top submissions.

On the verge of the release of NiFi 2.0, Cloudera VP of Engineering and NiFi founder Joe Witt, joined by principal committers Mark Payne and Matt Gillman, addressed the worldwide community via a virtual event dubbed “Meet the Committers.” The group discussed NiFi’s origins and the journey to NiFi 2.0 as well as important features in the upcoming release, and surveyed the community about the dev/ops challenges of managing their own nodes. As part of the event, Cloudera kicked off the “Best in Flow” contest. The contest challenged developers to build data pipelines that represent their business use cases using Cloudera DataFlow. DataFlow is a cloud-native data service powered by Apache NiFi, with a streamlined user experience for development and deployment that enables true universal data distribution. For the contest, Cloudera made a sandbox environment available for developers to use DataFlow Public Cloud. We had more than 40 developers active in the environment and many high-quality contest submissions. But in the end there could only be one winner.

Best in Flow champion

So without any further ado, our winner and the new Best in Flow Champion is:

Vince Lombardo! Vince is a Senior Infrastructure Engineer at Wells Fargo, and he developed a cybersecurity pipeline to efficiently collect, process, and make data from an asset polling tool available for database ingestion. Cybersecurity is a common domain for DataFlow deployments because of the need for timely access to data across systems, tools, and protocols. What’s interesting about Vince’s flow is that it cleverly uses “pagination” functionality to continuously distribute up-to-the-minute results from a tool that doesn’t always return a full set of results right away. For more detail on the winning flow, check out Vince’s GitHub page here.

Vince’s winning flow

Vince began by funneling data from six API endpoints of an asset polling tool containing cybersecurity and tech ops data into two discrete data topics. The flow he built differentiates between a test and a true API call before initiating a secure login. The smart part comes next. Because the polling tool can take time to return queries, Vince added a processor that loops, returning the query status each time, until the query is complete. Completeness is estimated by comparing a test result with the “estimated total.” When a near match is detected, the data pull is triggered and then checked again for completeness before being transformed into rows and columns and merged into a batch for database ingestion.

Figure 1: The part of the flow that loops until the Tanium query has completed
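The loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Vince’s actual flow, which is built from NiFi processors. Every name here (`is_near_complete`, `poll_until_complete`, the `received` and `estimated_total` fields, and the 5% tolerance) is a hypothetical stand-in, and the polling tool is simulated with two small functions.

```python
import time
from typing import Callable, Dict, List


def is_near_complete(received: int, estimated_total: int, tolerance: float = 0.05) -> bool:
    """Return True when the received count is within `tolerance` of the estimate."""
    if estimated_total <= 0:
        return False
    return received >= estimated_total * (1.0 - tolerance)


def poll_until_complete(
    get_status: Callable[[], Dict[str, int]],
    fetch_rows: Callable[[], List[dict]],
    max_attempts: int = 10,
    wait_seconds: float = 0.0,
) -> List[dict]:
    """Loop on the status call until results look complete, then pull the data."""
    for _ in range(max_attempts):
        status = get_status()
        if is_near_complete(status["received"], status["estimated_total"]):
            rows = fetch_rows()
            # Second completeness check, this time on the rows actually pulled.
            if is_near_complete(len(rows), status["estimated_total"]):
                return rows
        time.sleep(wait_seconds)
    raise TimeoutError("query did not complete within max_attempts")


# Simulated polling tool (assumption): each status call reports more rows.
_progress = iter([10, 60, 100])


def fake_status() -> Dict[str, int]:
    return {"received": next(_progress), "estimated_total": 100}


def fake_fetch() -> List[dict]:
    return [{"asset_id": i} for i in range(100)]


rows = poll_until_complete(fake_status, fake_fetch)
print(len(rows))
```

In this sketch the first two status checks fall short of the estimate, so the loop keeps polling; the third check matches, the rows are pulled, and the second check confirms the pull before anything is handed on for batching.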

Vince’s flow met all of our criteria and was the clear contest winner. The flow is complete and adheres to NiFi best practices, being both efficient and highly secure. By using pagination, this dataflow ensures a complete result set is available from a data source with highly variable query execution times. It’s deployable, has clear business value, and serves as a great example of universal data distribution in action. Congratulations Vince!

Runner up

Ramakrishna Sanikommu was our runner up. His submission post can be found here. RK built some simple flows to pull streaming data into Google Cloud Storage and Snowflake. Many developers use DataFlow to filter/enrich streams and ingest into cloud data lakes and warehouses, where the ability to process and route anywhere makes DataFlow very effective. RK built multiple flows quickly, first pulling multiple data sources from a Google Pub/Sub topic and merging them into a file for ingestion into GCS. He then built a second flow to execute a Python script and load the data into Snowflake. His flows adhered to best practices and demonstrated some light transformations. RK also made proper use of the DataViewer to view the contents of a queue.

Figure 2: Ramakrishna’s first flow consuming data from Google PubSub and ingesting it into Google Cloud Storage


Figure 3: Ramakrishna’s second flow reading data from Google Cloud Storage and ingesting it into Snowflake

Summary and looking ahead

In less than 10 years since its inception, NiFi has achieved absolutely massive scale, both in terms of popularity and the size of deployments. NiFi’s origins, however, were quite simple. For any two systems to work together, there are quite a few things that need to agree. They must not only speak some common data language but also account for myriad things like relevance, security, priority, authorization, etc. NiFi was built as a kind of Swiss Army Knife to quickly connect different systems and coordinate dataflows from one to another using an intuitive no-code development canvas.

Since acquiring the company primarily responsible for maintaining the NiFi code base in 2015, Cloudera has continued to pour resources into the open source project, which now boasts more than 500 contributors across the globe and thousands of active community members in Slack. NiFi has evolved considerably, staying ahead of security vulnerabilities and adding connectors with releases every quarter. The “Best in Flow” contest was a lot of fun, and demonstrated the appetite for community around Apache NiFi. Here at Cloudera we’re excited to host future events for NiFi developers, so stay tuned to find out what’s next. To test drive Cloudera DataFlow yourself, click here to request a trial of Cloudera Data Platform in the Public Cloud. campaign/try-cdp-public-cloud.html