Awesome article…must read

How to Speed up R Code: An Introduction

Most calculations performed by the average R user are unremarkable in the sense that nowadays, any computer can crush the related code in a matter of seconds. But more and more often, heavy calculations are also performed using R, something especially true in fields such as statistics. The user then faces total execution times that are hard to work with: hours, days, even weeks. In this paper, how to reduce the total execution time of various codes will be shown and typical bottlenecks will be discussed. As a last resort, how to run your code on a cluster of computers (most workplaces have one), in order to use more processing power than an average computer offers, will also be discussed through two examples.

Nice blog covering the important question: How do I become a data scientist? Worth a read.

I asked myself this question a few months ago. Next I thought: What is the definition of Data Science? So the first thing I started to do was read as many posts on the topic as I could get my hands on and also look up definitions of related topics such as Data Mining and Machine Learning. Looking at the discussions and posts around Data Science, it seems to span *everything needed to understand data, to derive something out of data and communicate the finding*. This does not really help to answer the original question, so let's take a closer look.

There have been discussions about whether a Data Scientist is somebody who knows about everything needed to analyze data or if this task has to be done by teams of specialists. This is found in the “Unicorn-Discussion” or the “Venn-Diagram-Thread”. Let's have a look at some Venn Diagrams:

The first…

This post contains a nice geometric intuition about p-values. Worth a read.

I’m going to start this post with a confession: Up until a few days ago, the only thing I knew about p-values was that Randall Munroe didn’t seem to like them. My background is in geometry, not statistics, even though I occasionally try to fake it. But it turns out that a lot of other people don’t like p-values either, such as the journal Basic and Applied Social Psychology which recently banned them. So I decided to do some reading (primarily Wikipedia) and it turns out, like most things in the world of data, there’s some very interesting geometry involved, at least if you know where to look.

This is an interesting post…

I wanted to introduce some practical methods of scraping using Scrapy as well as create a README for the Craigslist scraper code. I thought the documentation was initially a little confusing. I also think everyone who is into data science should definitely learn scraping for a couple of reasons.

1. Perfectly formatted datasets never get presented to you.

2. It expands your breadth of possibilities. No longer are you restricted to online directories of datasets that everyone has been accessing forever.

3. You’re doing something that has never been done before. Well, probably not. I mean if you’re using my scraper, you might be gaining insights into a different city. But overall, this is where I think scraping actually establishes creativity. Last quarter I was hanging around the CSE atrium when different people from a joint statistics and CSE machine learning class were presenting their final projects. It was…

A blog post by Eric Chai

So far on this blog, we’ve seen two very different approaches to constructing models that predict data distributions. With regression, we replaced the original data points with an equation defining a relatively simple shape that approximates the data, then used this to predict one dimension/feature of new points based on the others. With K Nearest Neighbors, we used the data points directly to define a fairly complicated distribution that divided the data space into two (or more) classes, so that we could predict the classes of new data points. Today, we’ll combine elements of both. We’re going to stick with classification, splitting the data space into two classes, but the goal will be to replace the original data with a simplified (linear) model.
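As a rough illustration of the contrast drawn in the excerpt above, the sketch below classifies the same toy points in two ways: a 1-nearest-neighbor rule that keeps every data point, and a linear model that keeps only three numbers. The data, the function names, and the choice of a perceptron update rule are all assumptions made for this example; the original post may well use a different linear method.

```python
# Hypothetical toy data: class 0 sits lower-left, class 1 upper-right.
points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (4.0, 4.5), (5.0, 4.0), (4.5, 5.0)]
labels = [0, 0, 0, 1, 1, 1]


def nearest_neighbor_predict(query):
    """1-NN: keep all the data and return the label of the closest stored point."""
    distances = [(query[0] - x) ** 2 + (query[1] - y) ** 2 for x, y in points]
    return labels[distances.index(min(distances))]


def train_perceptron(epochs=100, learning_rate=0.1):
    """Fit (w1, w2, b) so that w1*x + w2*y + b > 0 means class 1.

    The perceptron rule is an assumed stand-in for "a linear model".
    """
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            predicted = 1 if w1 * x + w2 * y + b > 0 else 0
            error = label - predicted  # -1, 0, or +1
            if error != 0:
                w1 += learning_rate * error * x
                w2 += learning_rate * error * y
                b += learning_rate * error
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w1, w2, b


def linear_predict(weights, query):
    """Classify a new point using only the three fitted numbers."""
    w1, w2, b = weights
    return 1 if w1 * query[0] + w2 * query[1] + b > 0 else 0


weights = train_perceptron()
for query in [(1.2, 1.2), (4.8, 4.8)]:
    print(query, nearest_neighbor_predict(query), linear_predict(weights, query))
```

Once trained, the linear model no longer needs the original points at all, whereas the nearest-neighbor rule must scan them for every query.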

Over the last few weeks, I’ve introduced two classification methods – Support Vector Machines (SVM) and Logistic Regression – that attempt to find a line, plane or hyperplane (depending on the dimension) that separates two classes of data points. This has the advantage over more flexible methods like K Nearest Neighbors that, once the line/plane/hyperplane is found, new data points can be classified very quickly. This simple model is also less susceptible to overfitting. Both SVM and logistic regression work well when such a line/plane/hyperplane exists (when the two classes are *linearly separable*), but for data that forms a curved shape, there won’t be a line/plane/hyperplane that completely separates the two classes. We saw a similar problem with linear regression and were able to address it by replacing the line/plane/hyperplane with a curve, or a higher dimensional curved shape, which required adding parameters to the model. We could…
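To make the idea of linear separability concrete, here is a minimal sketch under stated assumptions: two rings of points (a hypothetical data set), a plain perceptron standing in for "some linear classifier", and a single added feature, x^2 + y^2. None of these choices come from the original post. On the raw coordinates no separating line exists, so training never finishes without mistakes; with the extra feature a separating plane does exist, and that plane corresponds to a circle in the original space.

```python
import math


def ring(radius, count):
    """Return `count` evenly spaced points on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / count),
         radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


# Hypothetical toy data: class 0 on an inner ring, class 1 on an outer ring.
samples = [(p, 0) for p in ring(1.0, 12)] + [(p, 1) for p in ring(3.0, 12)]


def expand(point):
    """Add one squared feature, x^2 + y^2, to the raw coordinates."""
    x, y = point
    return (x, y, x * x + y * y)


def predict(weights, bias, features):
    """Linear rule: class 1 when the weighted sum plus bias is positive."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_perceptron(data, epochs=1000, learning_rate=0.1):
    """Return (weights, bias, converged) for a plain perceptron."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in data:
            error = label - predict(weights, bias, features)
            if error != 0:
                weights = [w + learning_rate * error * f
                           for w, f in zip(weights, features)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the classes are separated
            return weights, bias, True
    return weights, bias, False


raw_weights, raw_bias, raw_converged = train_perceptron(samples)
expanded = [(expand(p), label) for p, label in samples]
exp_weights, exp_bias, exp_converged = train_perceptron(expanded)

print("linear model on raw (x, y):", "separates" if raw_converged else "fails")
print("linear model with x^2 + y^2:", "separates" if exp_converged else "fails")
```

The extra feature costs one additional weight, which is the "adding parameters" trade-off the excerpt describes.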
