QUESTION
Which is best for working with data, Spark or Python? Or something else?
ANSWER

You can use both at the same time! PySpark is a Python front-end for Spark, so you write Python instead of Scala.
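As a rough sketch of what that looks like (assuming you have the pyspark package installed and a local Spark available; the app name and data here are made up for illustration):

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session from Python -- no Scala required.
    spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

    # Build a small DataFrame in-process; in practice you'd read from files or a database.
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )

    # Standard DataFrame operations: filter, then aggregate.
    df.filter(df.age > 30).groupBy().avg("age").show()

    spark.stop()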

The longer answer is that it all depends on scale, and what you're comfortable with.

On the small scale, Excel (or another spreadsheet program) is often the best choice.  Tens of thousands of rows are no problem, and it's very easy to interact with the data in real-time.

The next step up is processing larger amounts of data on a single computer; if you have merely gigabytes of data, this is far simpler than reaching for a distributed/parallel architecture like Apache Spark. Any data set that fits in memory can be handled easily with tools like Pandas (for Python), so millions of rows are no problem on a typical modern computer. Python is a popular choice and I find it works well. R is more popular with statisticians, but I don't recommend it unless you need a specific library that's only available in R. Other languages can be used, but their ecosystems will probably not be as rich.
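For instance, a minimal Pandas sketch of in-memory work on a single machine (the file name and column names here are hypothetical, purely for illustration):

    import pandas as pd

    # Load the whole data set into memory; millions of rows is fine on a typical machine.
    df = pd.read_csv("sales.csv", parse_dates=["order_date"])

    # Typical interactive analysis: filter, group, aggregate.
    monthly = (
        df[df["amount"] > 0]
        .groupby(df["order_date"].dt.to_period("M"))["amount"]
        .sum()
    )
    print(monthly.head())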

At this stage it's more a question of what you're comfortable with than what is really "best". You can do a lot with command-line tools on a Unix system without writing any real code.

Only once the data set is bigger than memory is it worth thinking about a tool like Apache Spark that can do parallel and stream processing. Apache Hadoop is probably its biggest competitor; which is better depends a lot on what you'll be doing. Hadoop may have an edge on big batch jobs that pull a lot of data from storage, while Spark is more oriented around streaming and is better suited to real-time data.
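To give a flavour of the streaming style, here is a sketch using Spark's Structured Streaming from Python. It assumes a Spark installation and a text stream on a local socket (which you could feed with something like nc -lk 9999); real jobs would more likely read from Kafka or files:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    # Read an unbounded stream of text lines from a local socket (demo source only).
    lines = (
        spark.readStream.format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
    )

    # Running word count over the stream.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously print updated counts to the console as new data arrives.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()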

There is a plethora of other solutions available, but it's probably most useful to get started in big data with either Spark or Hadoop, as they're the de facto standards and again have the richest ecosystems of advice and tools. One that I've been meaning to take a look at is https://github.com/grailbio/reflow, which attempts to automate and parallelize workflows originally created in a more ad-hoc fashion (perhaps at the scale of only millions of rows, rather than billions).
