Longneck solves data transformation problems to enable efficient data integration for data warehousing or other transformation-intensive data cleaning and ETL activities.

Longneck provides:

  • a simple and efficient data and computational model,
  • an extensible and flexible XML-based transformation language,
  • modular and reusable transformation descriptors,
  • a fast, scalable and robust Java execution engine for both multithreaded single-machine and distributed applications.

Longneck is developed and maintained by the Data Warehousing and Business Intelligence Group of the Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI).

Longneck has been successfully applied to data integration and ETL tasks, including web log processing for web analytics, IT event log processing, data cleaning for customer master data management (enabling efficient entity resolution), and processing web search engine events.


We introduced new facilities for testing transformation operations, blocks, entities and processes, to support unit testing, test-driven development (TDD) and continuous integration. We are currently extending the relevant documentation, examples and applications. We also added a new extension for check digit verification.
We briefly introduced Longneck at the Budapest BI Forum 2013, Open Analytics Day.
(Hungarian slides, English slides).
First open source version of Longneck released!
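
The check digit verification extension mentioned above targets identifiers whose final digit is computed from the preceding ones. As a rough illustration of the idea only, here is a minimal Java sketch of one widely used scheme, the Luhn algorithm. It is not Longneck's implementation or API, the schemes the extension actually supports are not stated here, and the class and method names are hypothetical.

```java
// Hypothetical helper, not part of Longneck: validates a digit string
// with the Luhn algorithm (used, for example, for payment card numbers).
final class LuhnCheck {

    private LuhnCheck() {
    }

    // Returns true if the string consists only of digits and its last digit
    // is a valid Luhn check digit for the preceding digits.
    static boolean isValid(String number) {
        if (number == null || number.length() < 2) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit (check digit) is not doubled
        for (int i = number.length() - 1; i >= 0; i--) {
            char c = number.charAt(i);
            if (c < '0' || c > '9') {
                return false; // reject anything that is not a plain ASCII digit
            }
            int digit = c - '0';
            if (doubleIt) {
                digit *= 2;
                if (digit > 9) {
                    digit -= 9; // same as adding the two digits of the product
                }
            }
            sum += digit;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

A cleaning step could call `LuhnCheck.isValid(value)` on an incoming field and route records that fail the check to an error output instead of loading them.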


Sources are available on GitHub.


Apache License 2.0


While solving various research and industrial problems in data integration and business intelligence around 2005, we noticed the lack of a scalable and easily configurable open source data quality and ETL tool. After trying the available open source tools and some proprietary software, we decided to develop our own framework to fit our needs.

We experimented with various types of ETL architectures and models before we got to Longneck; some of these earlier tools are still in use.

Many new tools have appeared, and older ones have become more capable, since we began working on Longneck. However, we believe that our framework is still unique among ETL and data integration tools, with outstanding scalability and a flexible language.

We decided to make Longneck open source and freely available in 2013.

Coming soon

  • We will soon release a Storm-based engine to enable truly distributed processing, along with throughput measurements.
  • Reading and writing HDFS and Excel files are handy features; we plan to add these data sources and targets.
  • We plan to make further extensions available (e.g. an extension for identifying unique website visitors).
  • We plan to extend our content repository with blocks for useful domains, e.g. IT event logs.
  • Longneck doesn't have a GUI yet; transformation rules are written as XML files. We plan to build a GUI to help construct and apply transformation rules and processes.
  • Finding and eliminating duplicate elements in data sets is an important and hard problem in data quality and data integration projects. We are considering publishing our entity resolution solution as a Longneck extension to support duplicate detection.


For help, contact Csaba Sidló or Gábor Lukács.