{"feed":"No-Free-Hunch","feedTitle":"No Free Hunch","feedLink":"/feed/No-Free-Hunch","catTitle":"Science","catLink":"/cat/science"}

We’re building Kaggle into a platform where you can collaboratively create all of your data science projects. This past quarter, we’ve increased the breadth and scope of work you can build on our platform by launching many new features and expanding computational resources.

It is now possible for you to load private datasets you’re working with, develop complex analyses on them in our cloud-based data science environment, and share the project with collaborators in a reproducible way.

Upload private datasets to Kaggle

We first launched Kaggle Kernels and Datasets as public products, where everything created and shared needed to be public. Last June, we enabled you to create private Kaggle Kernels. This transformed how many of you used Kaggle: 94.4% of kernels created since then have been private.

However, this story has been incomplete: you’ve been limited to running kernels on public data. This prevented you from using Kaggle for your own private projects.

This past quarter, we launched private datasets. This lets you upload private datasets to Kaggle and run Python or R code on them in kernels. You can upload an unlimited number of private datasets, up to a 20GB quota. All new datasets default to private. You can create a dataset by clicking "New Dataset" on or "Upload a Dataset" from the data tab on the...

This is a guest post written by Kaggle Competition Master and  part of a team that achieved 5th position in the 'Planet: Understanding the Amazon from Space' competition, Indra den Bakker. In this post, he shares the journey from Kaggle competition winner to start-up founder focused on tracking deforestation and other forest management insights.

Back in the days, during my studies I was introduced to Kaggle. For the course ‘Data Mining Techniques’ at VU University Amsterdam we had to compete in the competition Personalize Expedia Hotel Searches — ICDM 2013. Me and my fellow team members did okay but we were far from the top scoring teams. However, I immediately wanted to learn more about the field of machine learning and it’s applications.

In the years to come, I competed in several competitions. I never managed to dedicate as much time as I wanted but every single competition was a great learning experience. In 2017, one of the goals that I had set for myself was to pick a Kaggle competition and fully focus on it to get my first golden medal. The competition I picked was Planet: Understanding the Amazon from Space. I already had some...

This infographic series features the speakers from Kaggle's CareerCon 2018 session, "Real Stories from a Panel of Successful Career Switchers". View videos from the event here.

Have you used Kaggle's beta API to download data or make a competition submission? We're pleased to announce version 1.1 of the API which includes new features for easily managing your datasets on Kaggle from the command line.

Read on to learn how to use the API to create and update datasets or check out detailed documentation on our GitHub page.

Create new datasets »

After you follow the installation instructions, it's simple to create a new dataset on Kaggle from files on your local machine:

  1. Create a folder containing the files you want to upload
  2. Run kaggle datasets init -p /path/to/dataset to generate a metadata file
  3. Add your dataset's metadata to the generated file, datapackage.json
  4. Run kaggle datasets create -p /path/to/dataset to create the dataset

Your dataset will be private by default. You can also add a -u flag to make it public when you create it, or navigate to "Settings" > "Sharing" from your dataset's page to make it public or share with collaborators.

Update datasets »

You can also create new versions of existing datasets allowing you to programmatically keep a dataset fresh with the latest data.

  1. Run kaggle...

This infographic series features the speakers from Kaggle's CareerCon 2018 session, "Real Stories from a Panel of Successful Career Switchers". Sign up for the event here.

This infographic series features the speakers from Kaggle's CareerCon 2018 session, "Real Stories from a Panel of Successful Career Switchers". Sign up for the event here.

This post is written by Richard Sproat & Kyle Gorman from Google's Speech & Language Algorithms Team. They hosted the recent, Text Normalization Challenges. Bios below.

Now that the Kaggle Text Normalization Challenges for English and Russian are over, we would once again like to thank the hundreds of teams who participated and submitted results, and congratulate the three teams that won in each challenge.

The purpose of this note is to summarize what we felt we learned from this competition and a few take-away thoughts. We also reveal how our own baseline system (a descendent of the system reported in Sproat & Jaitly 2016) performed on the two tasks.

First some general observations. If there’s one difference that characterizes the English and Russian competitions, it is that the top systems in English involved quite a bit of manual grammar engineering. This took the form of special sets of rules to handle different semiotic classes such as measures, or dates, though, for instance, supervised classifiers were used to identify the appropriate semiotic class for individual tokens. There was quite a bit less of this in Russian and the top solutions there were much more driven by machine-learning solutions,...

As we move into 2018, the monthly Datasets Publishing Awards has concluded. We're pleased to have recognized many publishers of high-quality, original, and impactful datasets. It was only a little over a year ago that we opened up our public Datasets platform to data enthusiasts all over the world to share their work. We've now reached almost 10,000 public datasets, making choosing winners each month a difficult task! These interviews feature the stories and backgrounds of the November and December winners of the prize. This month, we're pleased to highlight:

While the Dataset Publishing Awards are over, you can still win prizes for code contributions to Kaggle Datasets. We're awarding $500 in weekly prizes to authors of high quality kernels on datasets. Click here to learn more »

November Winners: First Place, EEG data from Basic Sensory Task in Schizophrenia by Brian Roach

2017 was a huge year for Kaggle. Aside from joining Google, it also marks the year that our community expanded from being primarily focused on machine learning competitions to a broader data science and machine learning platform. This year our public Datasets platform and Kaggle Kernels both grew ~3x, meaning we now also have a thriving data repository and code sharing environment.  Each of those products are on track to pass competitions on most activity metrics in early 2018.

To give the community more visibility into how Kaggle has changed, we have decided to share our major activity metrics and the commentary around those metrics. And, we’re also giving some visibility into our 2018 plans.

2017 Summary

Active users (unique annual, logged in users) grew to 895K this year up from 471K in 2016 (chart 1). This represents 90% growth for 2017 up from 71% growth in 2016.

While we are still most famous for machine learning competitions, both our public Datasets platform and Kaggle Kernels are on track to be larger drivers of activity on Kaggle in early 2018.

Chart 1: Active users


We launched 41 machine learning competitions this year, up from 33 last...

This article was jointly written by Keshav Dhandhania and Arash Delijani, bios below.

In this article, I’ll talk about Generative Adversarial Networks, or GANs for short. GANs are one of the very few machine learning techniques which has given good performance for generative tasks, or more broadly unsupervised learning. In particular, they have given splendid performance for a variety of image generation related tasks. Yann LeCun, one of the forefathers of deep learning, has called them “the best idea in machine learning in the last 10 years”. Most importantly, the core conceptual ideas associated with a GAN are quite simple to understand (and in-fact, you should have a good idea about them by the time you finish reading this article).

In this article, we’ll explain GANs by applying them to the task of generating images. The following is the outline of this article

  1. A brief review of Deep Learning
  2. The image generation problem
  3. Key issue in generative tasks
  4. Generative Adversarial Networks
  5. Challenges
  6. Further reading
  7. Conclusion

A brief review of Deep Learning

Sketch of a (feed-forward) neural network, with input layer in brown, hidden layers in yellow, and output layer in red.

Let’s begin...

To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

In this competition launched earlier this year, Daimler challenged Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors worked with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms would contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

The dataset contained an anonymized set of variables (8 categorical and 368 binary features), labeled X0, X1,X2…, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.

The dependent variable was the time (in seconds) that the car took to pass testing for each variable. Train and test sets had 4209 rows each.

In this interview, first place winner, gmobaz, shares how he used an approach that proposed important interactions.

Basics What was your backgrounds prior to entering this...

2017 has been an exciting ride for us, and like last year, we'd love to enter the new year sharing and celebrating some of your highlights through stats. There are major machine learning trends, impressive achievements, and fun factoids that all add up to one amazing community. Enjoy!

Public Datasets Platform & Kernels

It became clear this year that Kaggle's grown to be more than just a competitions platform. Our total number of dataset downloaders on our public Datasets platform is very close to meeting the total number of competition dataset downloaders – both around 350,000 data scientists each.


This year, Carvana, a successful online used car startup, challenged the Kaggle community to develop an algorithm that automatically removes the photo studio background. This would allow Carvana to superimpose cars on a variety of backgrounds. In this winner's interview, the first place team of accomplished image processing competitors named Team Best[over]fitting, shares in detail their winning approach.


As it often happens in the competitions, we never met in person, but we knew each other pretty well from the fruitful conversations about Deep Learning held on the Russian-speaking Open Data Science community,

Although we participated as a team, we worked on 3 independent solutions until merging 7 days before the end of the competition. Each of these solutions were in the top 10–Artsiom and Alexander were in 2nd place and Vladimir was in 5th. Our final solution was a simple average of three predictions. You can also see this in the code that we prepared for organizers and released on GitHub–there are 3 separate folders:

This tutorial was originally posted here on Ben's blog, GormAnalysis.

The purpose of this article is to hold your hand through the process of designing and training a neural network. Note that this article is Part 2 of Introduction to Neural Networks. R code for this tutorial is provided here in the Machine Learning Problem Bible.


Description of the problem

We start with a motivational problem. We have a collection of 2×2 grayscale images. We’ve identified each image as having a “stairs” like pattern or not. Here’s a subset of those.

Our goal is to build and train a neural network that can identify whether a new 2×2 image has the stairs pattern.

Description of the network

Our problem is one of binary classification. That means our network could have a single output node that predicts the probability that an incoming image represents stairs. However, we’ll choose to interpret the problem as a multi-class classification problem – one where our output layer has two nodes that represent “probability of stairs” and “probability of something else”. This is unnecessary, but it will give us insight into how we could extend task for more classes. In the future, we may want to classify {“stairs pattern”, “floor pattern”, “ceiling pattern”, or “something else”}.

Our measure of success might be something like accuracy rate, but to implement backpropagation (the fitting procedure) we need...

This tutorial was originally posted here on Ben's blog, GormAnalysis.

Artificial Neural Networks are all the rage. One has to wonder if the catchy name played a role in the model’s own marketing and adoption. I’ve seen business managers giddy to mention that their products use “Artificial Neural Networks” and “Deep Learning”. Would they be so giddy to say their products use “Connected Circles Models” or “Fail and Be Penalized Machines”? But make no mistake – Artificial Neural Networks are the real deal as evident by their success in a number of applications like image recognition, natural language processing, automated trading, and autonomous cars. As a professional data scientist who didn’t fully understand them, I felt embarrassed like a builder without a table saw. Consequently I’ve done my homework and written this article to help others overcome the same hurdles and head scratchers I did in my own (ongoing) learning process.

Note that R code for the examples presented in this article can be found here in the Machine Learning Problem Bible. Additionally, come back for Part 2, to see the details behind designing and coding a neural network from scratch.

We’ll start with a motivational problem. Here we have a collection of grayscale images, each a 2×2 grid of pixels where each pixel has an intensity value between 0 (white) and 255 (black). The goal is to build a model that identifies images with a “stairs” pattern.

This interview features the stories and backgrounds of the October winners of our $10,000 Datasets Publishing Award–Zeeshan-ul-hassan UsmaniEtienne Le Quéré, and Felipe Antunes. If you're inspired to contribute a dataset and compete for next month's prize, check out this page for more details.

First Place, US Mass Shootings - Last 50 Years (1966-2017) by Zeeshan-ul-hassan Usmani Can you tell us a little about your background?

I am a freelance A.I and Data Science consultant. I have a Masters and a Ph.D. in Computer Science from Florida Institute of Technology. I've worked with the United Nations, Farmer's Insurance, Wal-Mart, Best Buy, 1-800-Flowers, Planned Parenthood, Vicrtoria's Secret, MetLife, SAKS Analytics, North Carolina Health Department and some other small companies, governments, and universities in the US, Pakistan, Canada, United Kingdom, Lithuania, China, Bangladesh, Ireland, Sri Lanka and the Middle East. Currently, I am working on a few consulting assignments regarding the government's use of AI in a cyber-connected world. Here are two of my CNN interviews on the power of datasets and who is joining ISIS. I've recently published a book called Kaggle for Beginners. I have one wife, four boys, two cats and a lovely dog.

What motivated you to share this dataset with the...

Today, we’re excited to announce Kaggle’s Data Science for Good program! We’re launching the Data Science for Good program to enable the Kaggle community to come together and make significant contributions to tough social good problems with datasets that don’t necessarily fit the tight constraints of our traditional supervised machine learning competitions.

What does a Data Science for Good Event Look Like?

Data Science for Good events will unite the energy and talent of a diverse community to drive positive impact on data problems posed by non-profit hosts. Kaggle’s Datasets platform will provide a democratized workspace for data scientists to analyze the data and publish their work. The open and collaborative environment will encourage data scientists to build on each other’s work and to push each problem to the limit of what is possible.

The specific objectives for each event will be described by the event hosts. Objectives may range from creating a portfolio of illuminating interactive data visualizations to transparently diagnosing algorithmic bias.

Participants will have a timeline to develop their insights via Python or R code written using Kernels, our hosted Jupyter Notebooks-based workbench. At the close of an event, hosts will select authors of analyses to win cash prizes.

Today we’re pleased to announce a 20x increase to the size limit of datasets you can share on Kaggle Datasets for free! At Kaggle, we’ve seen time and again how open, high quality datasets are the catalysts for scientific progress–and we’re striving to make it easier for anyone in the world to contribute and collaborate with data.

In addition to allowing dataset sizes up to 10 Gb (from 500 Mb), Timo on our Datasets engineering team has worked hard to increase resources in other exciting ways, too. Check it out below.

The increased resources mean that you can more easily:

Also, a reminder that the increased limits are per dataset; as always, you can share any number of data projects with the Kaggle community.

Get started by clicking “New Dataset” from the Datasets page.

Plus, writing and sharing reproducible R and Python analyses on larger datasets on Kaggle is also easier with the recent boost to...

In 2017 we conducted our first ever extra-large, industry-wide survey to captured the state of data science and machine learning.

As the data science field booms, so has our community. In 2017 we hit a new milestone of reaching over 1M registered data scientists from almost every country in the world. Representing many different backgrounds, skill levels, and professions, we were excited to ask our community a wide range of questions about themselves, their skills, and their path to data science. We asked them everything from “what’s your yearly salary?” to “what’s your favorite data science podcasts?” to “what barriers are faced at work?”, letting us piece together key insights about the people and the trends behind the machine learning models.

Without further ado, we’d love to share everything with you. Over 16,000 responses surveys were submitted, with over 6 full months of aggregated time spent completing it (an average response time of more than 16 minutes). Today we’re publicly releasing:

  • This interactive report featuring a few initial insights from the survey. We put this together with the folks from the Polygraph. It includes interactive visualizations so you can easily cut the data to find out exactly...

This interview features the stories and backgrounds of our $10,000 Datasets Publishing Award's September winners–Khuram Zaman, Mitchell J, and Dave Fisher-Hickey. If you're inspired to publish your own datasets on Kaggle and vie for next month's prize, check out this page for more details.

First Place, Religious Texts Used By ISIS by Fifth Tribe (Khuram Zaman) Can you tell us a little about your background?

I’m the CEO of a digital agency called Fifth Tribe based out of 1776 in Crystal City, VA. We do branding, web/mobile application development, and digital marketing. Every few months, we do a company wide hackathon and everyone gets to work on a project and a tech stack of their choosing. I tend to do projects in Python and around data scraping on interesting subjects like violent extremism on digital platforms like twitter.

What motivated you to share this dataset with the community on Kaggle?

I posted a dataset last year (“How ISIS Fanboys Use Twitter” and it generated a lot of interesting insights and opened up a lot of conversations with people from various perspectives (researchers, government officials, businesses, civic leaders, etc). I uploaded the second dataset to build off of the previous dataset....