• 4

Looking for a high-performance, distributed, scalable solution for storing tons of log messages. We have multiple concurrent log sources (i.e., servers).

The interesting thing here is that performance is crucial, and we are even willing to lose a small percentage (let's say max 2%) of all of the daily messages if the logging system performs better.

We want to process the log messages daily with an online algorithm, so we do not need any fancy relational database stuff. We just want to run through the data sequentially and calculate some aggregates and trends.

This is what we need:

  • At least 98% of the messages must be stored. It's not a problem to lose a couple of messages.
  • Once a message is stored, it must be reliably stored (durable, the D in ACID), so replication is basically needed.
  • Multiple sources.
  • The messages must be stored sequentially, but exact ordering is not needed (we expect any two messages more than a couple of seconds apart to be in the right order, but messages close to each other can be in arbitrary order).
  • We must be able to process the daily data sequentially (ideally in some reliable way like map-reduce, so machine failures are handled and processing on nodes with failures is restarted)

Any RDBMS is certainly not an option here, as it guarantees many properties that are unnecessary for this task.
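To make the "online algorithm over daily data" requirement concrete, here is a minimal sketch of such a single sequential pass in Python. The log format (timestamp, level, message) is a hypothetical example; a real job would read from the stored files (or run as the map step of a map-reduce job), but the access pattern is the same: one forward scan, a handful of running aggregates.

```python
from collections import Counter

def daily_aggregates(lines):
    """Single sequential pass over one day's log lines; no random access needed."""
    level_counts = Counter()
    total = 0
    for line in lines:
        total += 1
        # assume hypothetical "TIMESTAMP LEVEL message..." records
        parts = line.split(" ", 2)
        if len(parts) >= 2:
            level_counts[parts[1]] += 1
    return total, level_counts

logs = [
    "2011-07-01T10:00:00 INFO started",
    "2011-07-01T10:00:01 ERROR disk full",
    "2011-07-01T10:00:02 INFO ok",
]
total, counts = daily_aggregates(logs)
print(total, counts["ERROR"])  # → 3 1
```

Note that nothing here depends on exact message ordering, which is why the loose ordering guarantee above is acceptable.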

      • 2
    • Are you logging over the network (e.g. syslog)? If so, keep in mind that traditional syslog/UDP can easily lose far more than 2% of the UDP packets.
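The comment's point is visible right at the socket level: UDP is fire-and-forget, so the sender never learns about drops. A minimal illustration (host, port, and message format are placeholders; 514 is just the traditional syslog port):

```python
import socket

def send_log_udp(message, host="127.0.0.1", port=51514):
    """Send one log message over UDP. The return value is the number of
    bytes handed to the OS, NOT the number of bytes actually delivered:
    datagrams dropped by full buffers or the network are silently lost."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        return sock.sendto(message.encode("utf-8"), (host, port))
    finally:
        sock.close()

sent = send_log_udp("<13>myhost app: something happened")
print(sent)
```

This is why a UDP-based pipeline cannot by itself meet the "at least 98% stored" requirement under load; the agent-based systems discussed below add acknowledgments and buffering on top.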

I think you want Flume. It seems to hit most of the points you are looking for: multiple sources, reliability (E2E guarantee), and the ability to write to HDFS (distributed, fault-tolerant storage that integrates into Hadoop for map/reduce).
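For illustration only, a minimal Flume agent definition in the Flume NG properties format (the Flume generation this answer refers to, with its ZooKeeper-backed master, was configured differently; hostnames, ports, and paths below are placeholders):

```properties
# one agent: avro source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# memory channel: fast, but events are lost on agent crash --
# an acceptable trade-off under the ~2% loss budget above
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
```

Swapping the memory channel for a file channel is the knob that moves this toward full durability at a throughput cost, which is exactly the reliability/performance balance discussed in the comments.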

Edit: I'd also like to mention Scribe as another possibility. It's C++ based and written by Facebook, but it seems to have been mostly abandoned upstream. Still, it has a much lower footprint than Flume, especially once you include the footprint of all of Flume's dependencies such as ZooKeeper. And it, too, can write to HDFS.

  • 3
      • 2
    • Brilliant! I really like Flume's failure modes; we will be able to balance between reliability and performance. Exactly what I was looking for.
      • 1
    • Glad to help. Be aware, there are some issues with it: for one, a hard dependency on some of the Hadoop libraries, and it runs on Java, so it carries with it the full weight of a JVM on each machine. But still, nothing else does what it does at the moment, and I've recently had to suck it up because of that fact.
      • 2
    • Just for future viewers' sake, I should also mention Fluentd as another alternative, and that Scribe has been steadily declining in popularity.

A distributed Splunk setup would fit your needs, but it sounds like you have a high volume of log data; Splunk is licensed based on the amount of data indexed per day, so it would not be cheap.

  • 1
    • You are right, we need a system that can handle a very high volume of log data (or will need to in the near future, so we can handle the rapid growth). The company might pay for additional services around an open-source solution (e.g. online support, troubleshooting, feature requests, ...) but certainly won't pay per day for traffic.
    • @yi_H To clarify, you buy a permanent license that allows a certain amount of data to be indexed per day; for instance, I think the ballpark number is around $10,000 for a permanent 1GB-per-day license, with the per-gigabyte cost decreasing as the license gets bigger (I know the prices have changed since my last conversation with their sales team, so take this with a grain of salt).
