Forward-looking enterprises today are implementing Big Data Analytics and Machine Learning solutions as an extension of their Business Intelligence solutions.

Apache Hadoop and Apache Spark are the emerging platforms of choice for these new-era analytics applications.

Our consultants have deployed many Big Data Analytics solutions using Apache Spark, providing end-to-end solutions that cover data transformation using ETL techniques, analytics models using Python and Scala, and visualization using tools like D3.js, Tableau, etc.

Big Data Analytics

Data is the new oil. Industry experts across multiple business sectors agree that competitive advantage and business growth can be achieved through Big Data Analytics. Big Data Analytics involves processing data across three aspects, the 3Vs: Velocity, Volume and Variety. Velocity is the speed at which data is acquired. Volume refers to the size of Big Data datasets, which typically run to hundreds of GBs of data. Variety implies that the data will not only be structured like RDBMS tabular data but will also include unstructured or semi-structured data like social media articles, comments, etc.

We at DataTrine use a well-defined analytics framework which can be applied across business segments to quickly create a Big Data Business Dashboard.

Our DataTrine Analytics Framework includes the following key elements:


We can control only those aspects of a business that can be accurately measured. The first element is therefore to determine which aspects of the business problem need to be measured. We define the metrics and ratios to be measured for the business problem. These metrics give a quick snapshot of the business, a “what is happening now” view.


Once the metrics are identified, we determine the data sources and compute the time series of each of the metrics defined. These time series provide insights into the business trends, or “what has happened?”, during the past business cycles at various time scales like days, weeks, months and quarters.
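As a minimal sketch of this step, the snippet below rolls a single metric (revenue) up into time series at daily, weekly and monthly scales. The order records and the `time_series` helper are hypothetical and for illustration only; a production pipeline would read from the actual data sources and would typically run on a distributed engine.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (order date, revenue). Illustrative data only.
orders = [
    (date(2023, 1, 5), 120.0),
    (date(2023, 1, 5), 80.0),
    (date(2023, 1, 20), 200.0),
    (date(2023, 2, 3), 150.0),
    (date(2023, 2, 14), 50.0),
]

def time_series(records, bucket):
    """Sum the metric per time bucket and return the buckets in chronological order."""
    totals = defaultdict(float)
    for day, value in records:
        totals[bucket(day)] += value
    return sorted(totals.items())

# The same metric viewed at three time scales.
daily = time_series(orders, lambda d: d)
weekly = time_series(orders, lambda d: tuple(d.isocalendar())[:2])  # (ISO year, ISO week)
monthly = time_series(orders, lambda d: (d.year, d.month))

for month, revenue in monthly:
    print(month, revenue)
```

Passing the bucketing rule in as a function keeps the aggregation logic identical across time scales, so adding a quarterly view only needs one more bucket function.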

Predictive Analysis

Using statistical algorithms like ARIMA, Logistic Regression, etc., the next step of our analytical framework computes “what will happen next” based on the trend datasets and metrics available.

Visualizations: Our analytical framework also includes strong visualizations for anomaly detection, or “what did not happen”, using visualization tools like heatmaps, control charts and multi-axis radar charts.
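To illustrate the idea behind a control chart, here is a small sketch that derives control limits from a baseline period and flags new observations that fall outside them. The three-sigma rule, the function names and the sample values are assumptions chosen for the example, not a description of any specific DataTrine dashboard.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits: baseline mean plus/minus `sigmas` sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(values, lower, upper):
    """Return (index, value) pairs for observations outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical daily metric values from a stable baseline period.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
lower, upper = control_limits(baseline)

# New observations to check against the limits.
new_points = [101, 99, 120, 100, 85]
anomalies = out_of_control(new_points, lower, upper)
print(lower, upper, anomalies)
```

On a chart, the limits would be drawn as horizontal lines and the flagged points highlighted, which makes an unusual day stand out at a glance.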

Prescriptive analytics: Using statistical algorithms like Random Forests, we compute the impact of the control variables on the outcome variables and generate the implicit business rules operating in the business domain. By analyzing these business rules, we identify the ideal state of the control variables to be managed in order to get the desired business outcome. In other words, we provide “prescriptive analytics” of “what should happen”.

While our Data Analytics team provides effective analytical tools based on the framework described, our data engineers use cutting-edge tools like Apache Hadoop and Apache Spark to implement the data analytics framework and to integrate the data from external systems required to compute the analytics.

Machine Learning

Following the success of many applications at leading technology firms like Google and Facebook, many forward-looking businesses have started to explore machine learning solutions in their own domains to gain competitive advantage. Machine learning is applied to many business problems, with sales, marketing and finance being the key initial business activities. Predicting future business trends, increasing the control efficiency of operations and detecting fraud are some of the key business problems to which DataTrine is applying machine learning techniques for its customers.

Some of the key algorithms which we have been applying frequently are as follows:
a. Clustering and classification algorithms like K-Means and k-nearest neighbors for segmenting customers based on their profile and behavior
b. Time-series and regression algorithms like ARIMA and Logistic Regression for predicting sales volumes, price movements, etc.
c. Natural language processing for mining social media data and internal text documents, for example for sentiment analysis of product performance
d. Deep learning with artificial neural networks for image and audio processing to improve customer support processes
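As an illustration of item (a), the sketch below segments customers with a plain K-Means loop written in standard-library Python. The customer features, the choice of two segments and the deterministic starting centroids are assumptions made for the example; real projects would normally scale the features first and use a library implementation.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain K-Means: assign each point to its nearest centroid, then move each centroid to its cluster mean."""
    centroids = list(points[:k])  # assumption: start from the first k points for repeatability
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical customers: (visits per month, average basket value).
customers = [(2, 20.0), (25, 150.0), (3, 25.0), (1, 15.0), (22, 140.0), (27, 160.0)]
labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

Each label is the segment a customer was assigned to, and each centroid summarizes the typical profile of that segment.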

Apache Hadoop

Big Data Analytics implies processing high-volume data with varied data structures in a very short time. The Map-Reduce architecture, which originated at Google, became the standard for massively distributed parallel processing at Google scale. Apache Hadoop is the industry-standard open source implementation of the Map-Reduce architecture.
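The Map-Reduce idea can be sketched in a few lines of single-machine Python using the classic word-count example. This only imitates the three phases (map, shuffle, reduce) in memory; it does not use Hadoop, and the function names and sample documents are illustrative assumptions. In a real cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

documents = ["big data needs big clusters", "spark and hadoop process big data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be spread across many machines without coordination, which is where the scalability comes from.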

In the Data Analytics workflow, data scientists create data models as solutions for analytics problems using an ensemble of machine learning algorithms and test their performance. Most of the time, the data scientists select only a subset of the data to arrive at an effective analytics model, usually coded in Python or R. It is the work of data engineers to implement the data models in the Map-Reduce architecture to achieve the final solution, which can crunch massive data volumes in seconds to deliver the business insights needed.

Our data engineers are highly skilled in the various distributions of Apache Hadoop and can implement the analytics models within very short durations. Our data engineers can also implement the ETL (Extract, Transform and Load) operations that present the data from external systems to the Map-Reduce engine for computing the result sets.

Apache Spark - A New Layer over Hadoop

Apache Hadoop emerged as the platform of choice for Big Data application development. As more and more applications based on the Map-Reduce approach were developed, many specialized systems were built to address specific use cases like interactive processing, graph processing, streaming, etc., because Hadoop was limited to a batch processing architecture. Examples of such specialized frameworks include GraphLab, Dremel, Impala, etc.

One of the generic frameworks that has gained prominence over the specialized solutions is Apache Spark. The major reason for the rapid success of Apache Spark is that it is a general-purpose data processing engine layered over Hadoop. Apache Spark is a significant step forward for Big Data Analytics because it reduces the complexity of developing Map-Reduce functionality through the introduction of a higher-level abstraction called resilient distributed datasets (RDDs). RDDs give data scientists and application developers a toolkit that they incorporate into their applications to rapidly query, analyze and transform data at scale. Apache Spark can be either a powerful replacement for or a complement to Apache Hadoop, depending on the type of application being developed. Spark’s flexibility makes it well suited to tackling a range of use cases while allowing developers to take advantage of Apache Hadoop for scalability and fault tolerance.
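The idea behind RDDs can be illustrated with a toy model in plain Python. The `LazyDataset` class below is hypothetical and is not Spark's API; it only mimics two properties of the abstraction: transformations are recorded as a lineage rather than executed immediately, and results are computed (or recomputed) from that lineage when an action such as `collect` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations and runs them only on collect()."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original input, never modified
        self._lineage = lineage  # tuple of (kind, function) steps applied so far

    def map(self, fn):
        """Transformation: returns a new dataset; nothing is computed yet."""
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        """Transformation: returns a new dataset; nothing is computed yet."""
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        """Action: replay the recorded lineage over the source and return the results."""
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

calls = []

def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2

numbers = LazyDataset([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = list(calls)  # still empty: no work has been done yet
result = evens_doubled.collect()
print(calls_before_action, result)
```

Because every dataset remembers how it was derived, a lost result can be rebuilt by replaying the lineage, which is the intuition behind the "resilient" part of the name.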

The DataTrine team has implemented many projects on Apache Spark using both Python and Scala. Our team can convert an analytics model to the Apache Spark platform within a very short time because of proven experience in implementing such projects.