Twenty years ago, the Open Source framework was published, launching what has become the most significant trend in software development since then. Whether you call it “free software” or “open source”, it is ultimately about making application and system source code widely available and releasing the software under a license that favors user autonomy.
According to Ovum, open source is already the default option across several big data categories ranging from storage, analytics and applications to machine learning.
In the latest survey from Black Duck Software and North Bridge, 90% of respondents reported that they rely on open source “for improved efficiency, innovation and interoperability,” most commonly because of “freedom from vendor lock-in; competitive features and technical capabilities; ability to customize; and overall quality.”
There are now thousands of successful open source projects that companies must strategically choose from to stay competitive.
While every company must develop its own strategy, and choose the open source projects it feels will fuel its desired business outcomes, there are some projects that we feel are worth strong consideration.
How open source can be your path to business agility
The following are a few of the big data open source projects with the greatest potential to give companies extreme agility and lightning-fast responses to customers, business needs and market challenges.
- Apache Beam is a unified programming model whose name combines the two big data processing modes, Batch and strEAM, because a single model covers both cases. Under the Beam model, you design a data pipeline once and choose from multiple processing frameworks later. The pipeline is portable and flexible, so you can run it as batch or as streaming. Your team gains the agility to reuse data pipelines and to choose the right processing engine for each use case.
- Apache Airflow is ideal for automated, smart scheduling of Beam pipelines to optimize processes and organize projects. Among other useful capabilities, pipelines are configured as code, which makes them dynamic, and metrics are visualized graphically for each DAG (directed acyclic graph) and task instance. If a failure occurs, Airflow can rerun the DAG instance.
- Apache Cassandra is a scalable and nimble multi-master NoSQL database with high availability. It replicates data automatically across multiple nodes and lets you replace failed nodes without shutting anything down. It differs from the traditional RDBMS, and from some other NoSQL databases, in that it has no master-slave structure: all nodes are peers, and the cluster is fault tolerant. This makes it extremely easy to scale out for more computing power without any application downtime.
- Apache CarbonData is an indexed columnar data format for incredibly fast analytics on big data platforms such as Hadoop and Spark. This new kind of file format addresses the problem of serving different analytical query patterns from the same data. With CarbonData, the data format is unified, so you can serve every use case from a single copy of the data and use only the computing power needed, which makes your queries run much faster.
- Apache Spark is one of the most widely used Apache projects and a popular choice for incredibly fast big data processing (cluster computing), with built-in capabilities for real-time data streaming, SQL, machine learning, and graph processing. Spark is optimized to run in memory and enables interactive streaming analytics, so you can combine vast amounts of historical data with live data to make real-time decisions in areas such as fraud detection, predictive analytics, sentiment analysis and next-best offer.
- TensorFlow is an extremely popular open source library for machine intelligence that enables far more advanced analytics at scale. TensorFlow is designed for large-scale distributed training and inference, but it is also flexible enough to support experimentation with new machine learning models and system-level optimizations. It is very readable and well documented, and its already vibrant community is expected to keep growing.
- Docker and Kubernetes are container and automated container management technologies that speed deployments of applications. Using technologies like containers makes your architecture extremely flexible and more portable. Your DevOps process will benefit from increased efficiencies in continuous deployment.
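
To make the "design a pipeline once, run it as batch or stream" idea from the Beam entry more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Apache Beam SDK; the names `only_large`, `add_flag`, `PIPELINE`, `run_batch` and `run_streaming` are hypothetical and exist only for illustration. The assumption is simply that each step consumes records one at a time, so the same step list can be applied to a finite list or to an open-ended source.

```python
from typing import Callable, Iterable, Iterator, List

# A transform consumes records lazily and yields transformed records.
# (Illustrative type only; this is not the Apache Beam API.)
Transform = Callable[[Iterable[dict]], Iterator[dict]]


def only_large(records: Iterable[dict]) -> Iterator[dict]:
    """Keep only records whose amount is at least 100."""
    for record in records:
        if record["amount"] >= 100:
            yield record


def add_flag(records: Iterable[dict]) -> Iterator[dict]:
    """Mark every record that reaches this step as flagged."""
    for record in records:
        yield {**record, "flagged": True}


# The pipeline is defined once, independently of how it will be run.
PIPELINE: List[Transform] = [only_large, add_flag]


def apply_pipeline(source: Iterable[dict], pipeline: List[Transform]) -> Iterator[dict]:
    """Chain the transforms over the source without materializing it."""
    stream: Iterator[dict] = iter(source)
    for transform in pipeline:
        stream = transform(stream)
    return stream


def run_batch(records: List[dict]) -> List[dict]:
    """Batch mode: process a finite data set and return all results."""
    return list(apply_pipeline(records, PIPELINE))


def run_streaming(source: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: yield each result as soon as its input arrives."""
    return apply_pipeline(source, PIPELINE)
```

Calling `run_batch` on a list returns every matching record at once, while iterating over `run_streaming` with a generator as the source produces results as records arrive. A real engine adds distribution, windowing and fault tolerance on top of this basic separation between pipeline definition and execution.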
As impressive as each of these open source projects is individually, it is their collective advances that best illustrate the huge impact the open source community has had on the enterprise and the monumental shift from legacy and proprietary software to open source-based systems, enabling companies of all sizes, across all industries, to increase speed, agility, and data-driven insights at all levels of their organizations.