Saturday, November 9, 2013

Hadoop in the cloud frees you from the schema doldrums

Fiverr solved some major big data problems by employing a cloud-based service called Xplenty.


By Duane Craig


Fiverr is a marketplace for online services with millions of users in 200 countries across the globe. Every day the company collects millions of rows of semi-structured and unstructured data that reside in several sources, including a MySQL relational database, MongoDB, and Redis. Fiverr also uses web-based services like Google Analytics. All told, its big data is approximately 80% semi-structured traffic data and 20% structured data.

Maximum value

Seeking maximum value from big data, many companies are tempted to try a standard relational database management system (RDBMS). It is convenient because it uses standard SQL, which makes it attractive to analysts. It also integrates well, with broad support among reporting tools and extract, transform, and load (ETL) tools. Then there's the price: MySQL, for example, is free and open source. But because Fiverr uses agile development, a standard RDBMS is very difficult to use, explained Slava Borodovsky, director of business intelligence.

"In agile development there are so many changes with a given product on a daily basis, that using schema-based tools is not efficient. Each time there are new parameters in production, you need to change the schema of your database. That's a painful procedure, especially with a big amount of data," Borodovsky said.

That challenge inspired Fiverr to consider a solution for its big data that would support an open schema, thereby allowing it to make changes on the fly. Hadoop became the focal point, but not without a close assessment of the issues that can arise with it. For example, while Hadoop was deemed optimal for Fiverr's type of data environment and scale, it requires special skills and attention, according to Borodovsky.

"It's a powerful system, but 'regular' business intelligence folks, such as analysts and even developers, can't deal with it," he said. "It requires special programming knowledge like Java, and a very technical orientation. It is something very different from the regular SQL world. In most cases if a company wants to use Hadoop it needs to hire employees with special knowledge and skills, who are also very costly. In addition to headcount, they'll need to create a distributed environment, which also has extra costs."

Fiverr initially sought to address the challenges of Hadoop implementation by using a columnar database to store traffic data. However, there were problems with this, and the company needed a better solution. What if all the benefits of Hadoop were in the cloud, leaving most of the challenges behind? Enter Xplenty, or Hadoop as a service.


Xplenty



Xplenty's GUI enables the user to create complex data flows in just minutes

After signing up, Fiverr implemented Xplenty's solution in a few days and the company started to get positive results very quickly. Xplenty's cloud architecture made it very simple to implement Hadoop for BI needs.

"The biggest surprise was the speed of implementation," said Borodovsky. "It was up and running after a few days. The cloud infrastructure of Xplenty makes the implementation process very easy and only required minimal IT efforts. The biggest challenge had to do with the format that we store our data in, since at that time Xplenty wasn't supporting the JSON format. However, we solved that problem in a day or two. We also made small changes in our data files structure by splitting them into smaller files to increase performance. The implementation process was very transparent and easy."

The company now stores all its traffic data as text files in JSON format and processes it with Xplenty. Fiverr analysts can create Hadoop clusters and run complex analytical tasks with a few clicks, and no technical person is needed to take care of Hadoop maintenance and optimization. This solution keeps Fiverr up to date with changes on the site and lets it respond quickly to requests for new metrics.
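To see why JSON text files sidestep the schema problem, consider a minimal Python sketch. It is not Fiverr's or Xplenty's code. The field names (`user_id`, `page`, `ab_variant`), the sample records, and the helper `count_by_field` are all hypothetical, chosen only to illustrate the idea that each record carries its own fields and the structure is interpreted when the data is read.

```python
import json

# Hypothetical traffic events, one JSON object per line. The "ab_variant"
# field stands in for a new parameter that starts appearing in production
# partway through the log; older records simply do not have it.
RAW_LINES = [
    '{"user_id": 1, "page": "/gigs/logo-design"}',
    '{"user_id": 2, "page": "/gigs/voice-over"}',
    '{"user_id": 3, "page": "/gigs/logo-design", "ab_variant": "B"}',
    '{"user_id": 4, "page": "/gigs/logo-design", "ab_variant": "A"}',
]


def count_by_field(lines, field, default="(not set)"):
    """Count events per value of `field`, tolerating records that lack it."""
    counts = {}
    for line in lines:
        event = json.loads(line)
        # Records written before the field existed fall into the default bucket.
        value = event.get(field, default)
        counts[value] = counts.get(value, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by_field(RAW_LINES, "ab_variant"))
```

In this sketch the new parameter can be analyzed as soon as it shows up in the files. No table is altered and no load job is rewritten; the only change is the field name passed to the function.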

Fiverr was intent on mining its traffic data to do funnel, conversion, and trend analysis. Those complex analytical tasks, along with click-through analysis, typically run on data that is large and semi-structured, such as traffic logs stored in JSON format. The duration of those BI processes, from business request to analytical insight, shrank dramatically.
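For readers unfamiliar with funnel analysis, here is a small Python sketch of the idea. It is an illustration only, not how Xplenty or Fiverr implement it. The step names (`view_gig`, `click_order`, `purchase`), the sample events, and the function `funnel_counts` are hypothetical, and the sketch assumes the events arrive in chronological order.

```python
import json

# Hypothetical funnel: a user must complete the steps in this order.
FUNNEL_STEPS = ["view_gig", "click_order", "purchase"]

# Hypothetical traffic events, one JSON object per line, in time order.
RAW_EVENTS = [
    '{"user_id": 1, "event": "view_gig"}',
    '{"user_id": 2, "event": "view_gig"}',
    '{"user_id": 1, "event": "click_order"}',
    '{"user_id": 3, "event": "view_gig"}',
    '{"user_id": 4, "event": "click_order"}',
    '{"user_id": 2, "event": "click_order"}',
    '{"user_id": 1, "event": "purchase"}',
]


def funnel_counts(lines, steps):
    """Return the number of distinct users who reached each step, in order."""
    progress = {}  # user_id -> number of funnel steps completed so far
    for line in lines:
        event = json.loads(line)
        user = event["user_id"]
        done = progress.get(user, 0)
        # Advance a user only when the event matches the next expected step.
        if done < len(steps) and event.get("event") == steps[done]:
            progress[user] = done + 1
    return [sum(1 for n in progress.values() if n > i) for i in range(len(steps))]


if __name__ == "__main__":
    counts = funnel_counts(RAW_EVENTS, FUNNEL_STEPS)
    for step, count in zip(FUNNEL_STEPS, counts):
        print(f"{step}: {count} users")
```

With the sample data, three users view a gig, two go on to click the order button, and one completes a purchase. User 4 is not counted because that user's click is not preceded by a view. At Fiverr's scale the same logic would run as a distributed job rather than a single loop.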

Less schema manipulation

Using Hadoop, Fiverr doesn't need to change the schema of its database or data warehouse. Schema changes are typically very time-consuming and tie up IT resources, which all too often creates additional bottlenecks in the BI process flow. The company can now start using new parameters right after they go live in production. As an example, Borodovsky cites the process of measuring the performance of a new feature that was added to production.

"In the typical database world we would need to change the structure of our data warehouse and add additional columns to tables to store the new parameters," he explained. "Then we'd need to change the ETL processes that will parse the new parameters and insert them to a table. Next, we'd have to write queries, create reports and analyze the feature."

"This process usually ranges from a day in small companies and start-ups, to a number of days, and even weeks in big companies. With Xplenty's Hadoop solution we can skip the first two steps. We can create a new process in Xplenty with a number of clicks and get the insights very quickly. The average duration of a BI process has changed and is at least two times faster than before, in terms of processes that are related to traffic analysis."

"With Xplenty, we are saving time dealing with data, as it's not necessary to change the schema constantly. We are also independent in terms of IT, where we've saved on headcount resources and can put more attention on analytics and business insights than on technical maintenance of Hadoop. As with many things in IT, finding the right solution sometimes takes a while. We met Xplenty at just the right time."


Instantly provision more cluster nodes to scale up and provide more compute power
