
This post is written by Richard Sproat & Kyle Gorman from Google's Speech & Language Algorithms Team. They hosted the recent Text Normalization Challenges. Bios below.

Now that the Kaggle Text Normalization Challenges for English and Russian are over, we would once again like to thank the hundreds of teams who participated and submitted results, and congratulate the three teams that won in each challenge.

The purpose of this note is to summarize what we feel we learned from this competition and to offer a few take-away thoughts. We also reveal how our own baseline system (a descendant of the system reported in Sproat & Jaitly 2016) performed on the two tasks.

First, some general observations. If there’s one difference that characterizes the English and Russian competitions, it is that the top systems in English involved quite a bit of manual grammar engineering. This took the form of special sets of rules to handle different semiotic classes such as measures or dates, although supervised classifiers were used, for instance, to identify the appropriate semiotic class for individual tokens. There was quite a bit less of this in Russian, and the top solutions there were much more driven by machine-learning solutions,...

As we move into 2018, the monthly Datasets Publishing Awards have concluded. We're pleased to have recognized many publishers of high-quality, original, and impactful datasets. It was only a little over a year ago that we opened up our public Datasets platform to data enthusiasts all over the world to share their work. We've now reached almost 10,000 public datasets, making choosing winners each month a difficult task! These interviews feature the stories and backgrounds of the November and December winners of the prize. This month, we're pleased to highlight:

While the Dataset Publishing Awards are over, you can still win prizes for code contributions to Kaggle Datasets. We're awarding $500 in weekly prizes to authors of high-quality kernels on datasets. Click here to learn more »

November Winners: First Place, EEG data from Basic Sensory Task in Schizophrenia by Brian Roach

2017 was a huge year for Kaggle. Besides being the year we joined Google, it also marks the year that our community expanded from being primarily focused on machine learning competitions to a broader data science and machine learning platform. This year our public Datasets platform and Kaggle Kernels both grew ~3x, meaning we now also have a thriving data repository and code sharing environment. Each of those products is on track to pass competitions on most activity metrics in early 2018.

To give the community more visibility into how Kaggle has changed, we have decided to share our major activity metrics and the commentary around those metrics. We’re also giving some visibility into our 2018 plans.

2017 Summary

Active users (unique annual logged-in users) grew to 895K this year, up from 471K in 2016 (chart 1). This represents 90% growth for 2017, up from 71% growth in 2016.
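The growth figure quoted above can be checked with a line of arithmetic. The sketch below is illustrative only: `yoy_growth` is a hypothetical helper name, and the inputs are the rounded user counts given in the text (895K and 471K), so the result is approximate.

```python
def yoy_growth(current: float, previous: float) -> float:
    """Return year-over-year growth as a fraction (0.90 means 90%)."""
    return current / previous - 1.0


# Rounded active-user counts quoted in the text.
active_2016 = 471_000
active_2017 = 895_000

growth_2017 = yoy_growth(active_2017, active_2016)
print(f"2017 growth: {growth_2017:.0%}")  # prints: 2017 growth: 90%
```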

While we are still most famous for machine learning competitions, both our public Datasets platform and Kaggle Kernels are on track to be larger drivers of activity on Kaggle in early 2018.

Chart 1: Active users


We launched 41 machine learning competitions this year, up from 33 last...

This article was jointly written by Keshav Dhandhania and Arash Delijani, bios below.

In this article, I’ll talk about Generative Adversarial Networks, or GANs for short. GANs are one of the very few machine learning techniques that have given good performance for generative tasks, or more broadly, unsupervised learning. In particular, they have given splendid performance for a variety of image generation related tasks. Yann LeCun, one of the forefathers of deep learning, has called them “the best idea in machine learning in the last 10 years”. Most importantly, the core conceptual ideas associated with a GAN are quite simple to understand (and in fact, you should have a good idea about them by the time you finish reading this article).

In this article, we’ll explain GANs by applying them to the task of generating images. The following is the outline of this article:

  1. A brief review of Deep Learning
  2. The image generation problem
  3. Key issue in generative tasks
  4. Generative Adversarial Networks
  5. Challenges
  6. Further reading
  7. Conclusion

A brief review of Deep Learning

Sketch of a (feed-forward) neural network, with input layer in brown, hidden layers in yellow, and output layer in red.

Let’s begin...

To ensure the safety and reliability of each and every unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. But optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

In this competition launched earlier this year, Daimler challenged Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors worked with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms would contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

The dataset contained an anonymized set of variables (8 categorical and 368 binary features), labeled X0, X1, X2, …, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.

The dependent variable was the time (in seconds) that the car took to pass testing for each variable. Train and test sets had 4209 rows each.

In this interview, first-place winner gmobaz shares how he used an approach that proposed important interactions.

Basics

What was your background prior to entering this...

2017 has been an exciting ride for us, and like last year, we'd love to enter the new year sharing and celebrating some of your highlights through stats. There are major machine learning trends, impressive achievements, and fun factoids that all add up to one amazing community. Enjoy!

Public Datasets Platform & Kernels

It became clear this year that Kaggle has grown to be more than just a competitions platform. The total number of dataset downloaders on our public Datasets platform is very close to the total number of competition dataset downloaders: around 350,000 data scientists each.


This year, Carvana, a successful online used car startup, challenged the Kaggle community to develop an algorithm that automatically removes the photo studio background. This would allow Carvana to superimpose cars on a variety of backgrounds. In this winner's interview, the first-place team of accomplished image processing competitors, Team Best[over]fitting, shares their winning approach in detail.


As often happens in competitions, we never met in person, but we knew each other pretty well from the fruitful conversations about Deep Learning held on the Russian-speaking Open Data Science community,

Although we participated as a team, we worked on 3 independent solutions until merging 7 days before the end of the competition. Each of these solutions was in the top 10: Artsiom and Alexander were in 2nd place and Vladimir was in 5th. Our final solution was a simple average of three predictions. You can also see this in the code that we prepared for organizers and released on GitHub, where there are 3 separate folders:
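As a rough illustration of what a "simple average of three predictions" means for a segmentation task, here is a minimal sketch. It is not the team's released code: the 2×2 probability maps are made-up values, `average_predictions` is a hypothetical helper, and the 0.5 cut-off used to turn the average into a binary mask is an assumption.

```python
def average_predictions(predictions):
    """Element-wise mean of equally shaped 2-D lists of probabilities."""
    n = len(predictions)
    rows = len(predictions[0])
    cols = len(predictions[0][0])
    return [
        [sum(p[r][c] for p in predictions) / n for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical per-pixel "is car" probabilities from three independent models.
pred_a = [[0.9, 0.2], [0.6, 0.1]]
pred_b = [[0.8, 0.1], [0.3, 0.2]]
pred_c = [[1.0, 0.3], [0.3, 0.0]]

avg = average_predictions([pred_a, pred_b, pred_c])

# Assumed threshold: a pixel counts as foreground if the mean exceeds 0.5.
mask = [[value > 0.5 for value in row] for row in avg]
```

Averaging tends to help when the individual models make different mistakes, which is plausible for solutions developed independently.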

This tutorial was originally posted here on Ben's blog, GormAnalysis.

The purpose of this article is to hold your hand through the process of designing and training a neural network. Note that this article is Part 2 of Introduction to Neural Networks. R code for this tutorial is provided here in the Machine Learning Problem Bible.


Description of the problem

We start with a motivational problem. We have a collection of 2×2 grayscale images. We’ve identified each image as having a “stairs”-like pattern or not. Here’s a subset of those.

Our goal is to build and train a neural network that can identify whether a new 2×2 image has the stairs pattern.

Description of the network

Our problem is one of binary classification. That means our network could have a single output node that predicts the probability that an incoming image represents stairs. However, we’ll choose to interpret the problem as a multi-class classification problem – one where our output layer has two nodes that represent “probability of stairs” and “probability of something else”. This is unnecessary, but it will give us insight into how we could extend the task to more classes. In the future, we may want to classify {“stairs pattern”, “floor pattern”, “ceiling pattern”, or “something else”}.
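To make the two-node output concrete, here is a minimal sketch of one common way to turn two raw output scores into probabilities that sum to 1, the softmax function. Using softmax here is an assumption for illustration, since the excerpt above does not name the output function, and the example scores are made up.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for ("stairs", "something else").
probs = softmax([2.0, 0.5])
p_stairs, p_other = probs
```

The same function works unchanged for four output nodes, which is the kind of extension to more classes that the paragraph above has in mind.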

Our measure of success might be something like accuracy rate, but to implement backpropagation (the fitting procedure) we need...

This tutorial was originally posted here on Ben's blog, GormAnalysis.

Artificial Neural Networks are all the rage. One has to wonder if the catchy name played a role in the model’s own marketing and adoption. I’ve seen business managers giddy to mention that their products use “Artificial Neural Networks” and “Deep Learning”. Would they be so giddy to say their products use “Connected Circles Models” or “Fail and Be Penalized Machines”? But make no mistake – Artificial Neural Networks are the real deal, as evidenced by their success in a number of applications like image recognition, natural language processing, automated trading, and autonomous cars. As a professional data scientist who didn’t fully understand them, I felt embarrassed, like a builder without a table saw. Consequently, I’ve done my homework and written this article to help others overcome the same hurdles and head scratchers I did in my own (ongoing) learning process.

Note that R code for the examples presented in this article can be found here in the Machine Learning Problem Bible. Additionally, come back for Part 2, to see the details behind designing and coding a neural network from scratch.

We’ll start with a motivational problem. Here we have a collection of grayscale images, each a 2×2 grid of pixels where each pixel has an intensity value between 0 (white) and 255 (black). The goal is to build a model that identifies images with a “stairs” pattern.

This interview features the stories and backgrounds of the October winners of our $10,000 Datasets Publishing Award–Zeeshan-ul-hassan Usmani, Etienne Le Quéré, and Felipe Antunes. If you're inspired to contribute a dataset and compete for next month's prize, check out this page for more details.

First Place, US Mass Shootings - Last 50 Years (1966-2017) by Zeeshan-ul-hassan Usmani

Can you tell us a little about your background?

I am a freelance AI and Data Science consultant. I have a Masters and a Ph.D. in Computer Science from Florida Institute of Technology. I've worked with the United Nations, Farmer's Insurance, Wal-Mart, Best Buy, 1-800-Flowers, Planned Parenthood, Victoria's Secret, MetLife, SAKS Analytics, North Carolina Health Department and some other small companies, governments, and universities in the US, Pakistan, Canada, United Kingdom, Lithuania, China, Bangladesh, Ireland, Sri Lanka and the Middle East. Currently, I am working on a few consulting assignments regarding the government's use of AI in a cyber-connected world. Here are two of my CNN interviews on the power of datasets and who is joining ISIS. I've recently published a book called Kaggle for Beginners. I have one wife, four boys, two cats and a lovely dog.

What motivated you to share this dataset with the...

Today, we’re excited to announce Kaggle’s Data Science for Good program! We’re launching the Data Science for Good program to enable the Kaggle community to come together and make significant contributions to tough social good problems with datasets that don’t necessarily fit the tight constraints of our traditional supervised machine learning competitions.

What does a Data Science for Good Event Look Like?

Data Science for Good events will unite the energy and talent of a diverse community to drive positive impact on data problems posed by non-profit hosts. Kaggle’s Datasets platform will provide a democratized workspace for data scientists to analyze the data and publish their work. The open and collaborative environment will encourage data scientists to build on each other’s work and to push each problem to the limit of what is possible.

The specific objectives for each event will be described by the event hosts. Objectives may range from creating a portfolio of illuminating interactive data visualizations to transparently diagnosing algorithmic bias.

Participants will have a timeline to develop their insights via Python or R code written using Kernels, our hosted Jupyter Notebooks-based workbench. At the close of an event, hosts will select authors of analyses to win cash prizes.

Today we’re pleased to announce a 20x increase to the size limit of datasets you can share on Kaggle Datasets for free! At Kaggle, we’ve seen time and again how open, high-quality datasets are the catalysts for scientific progress–and we’re striving to make it easier for anyone in the world to contribute and collaborate with data.

In addition to allowing dataset sizes up to 10 GB (from 500 MB), Timo on our Datasets engineering team has worked hard to increase resources in other exciting ways, too. Check it out below.

The increased resources mean that you can more easily:

Also, a reminder that the increased limits are per dataset; as always, you can share any number of data projects with the Kaggle community.

Get started by clicking “New Dataset” from the Datasets page.

Plus, writing and sharing reproducible R and Python analyses on larger datasets on Kaggle is also easier with the recent boost to...

In 2017 we conducted our first ever extra-large, industry-wide survey to capture the state of data science and machine learning.

As the data science field has boomed, so has our community. In 2017 we hit a new milestone of reaching over 1M registered data scientists from almost every country in the world. Representing many different backgrounds, skill levels, and professions, we were excited to ask our community a wide range of questions about themselves, their skills, and their path to data science. We asked them everything from “what’s your yearly salary?” to “what are your favorite data science podcasts?” to “what barriers are faced at work?”, letting us piece together key insights about the people and the trends behind the machine learning models.

Without further ado, we’d love to share everything with you. Over 16,000 survey responses were submitted, with over 6 full months of aggregated time spent completing them (an average response time of more than 16 minutes). Today we’re publicly releasing:

  • This interactive report featuring a few initial insights from the survey. We put this together with the folks from the Polygraph. It includes interactive visualizations so you can easily cut the data to find out exactly...

This interview features the stories and backgrounds of our $10,000 Datasets Publishing Award's September winners–Khuram Zaman, Mitchell J, and Dave Fisher-Hickey. If you're inspired to publish your own datasets on Kaggle and vie for next month's prize, check out this page for more details.

First Place, Religious Texts Used By ISIS by Fifth Tribe (Khuram Zaman)

Can you tell us a little about your background?

I’m the CEO of a digital agency called Fifth Tribe based out of 1776 in Crystal City, VA. We do branding, web/mobile application development, and digital marketing. Every few months, we do a company-wide hackathon and everyone gets to work on a project and a tech stack of their choosing. I tend to do projects in Python and around data scraping on interesting subjects like violent extremism on digital platforms like Twitter.

What motivated you to share this dataset with the community on Kaggle?

I posted a dataset last year (“How ISIS Fanboys Use Twitter”), and it generated a lot of interesting insights and opened up a lot of conversations with people from various perspectives (researchers, government officials, businesses, civic leaders, etc.). I uploaded the second dataset to build off of the previous dataset....

In our recent Planet: Understanding the Amazon from Space competition, Planet challenged the Kaggle community to label satellite images from the Amazon basin, in order to better track and understand causes of deforestation.

The competition contained over 40,000 training images, each of which could contain multiple labels, generally divided into the following groups:

  • Atmospheric conditions: clear, partly cloudy, cloudy, and haze
  • Common land cover and land use types: rainforest, agriculture, rivers, towns/cities, roads, cultivation, and bare ground
  • Rare land cover and land use types: slash and burn, selective logging, blooming, conventional mining, artisanal mining, and blow down

We recently talked to user bestfitting, the winner of the competition, to learn how he used an ensemble of 11 finely tuned convolutional nets, models of label correlation structure, and a strong focus on avoiding overfitting, to achieve 1st place.

Basics

What was your background prior to entering this challenge?

I majored in computer science and have more than 10 years of experience programming in Java and working on large-scale data processing, machine learning, and deep learning.

Do you have any prior experience or...

Welcome back to Data Science 101! Do you have text data? Do you want to figure out whether the opinions expressed in it are positive or negative? Then you've come to the right place! Today, we're going to get you up to speed on sentiment analysis. By the end of this tutorial you will:

  • Understand what sentiment analysis is and how it works
  • Read text from a dataset & tokenize it
  • Use a sentiment lexicon to analyze the sentiment of texts
  • Visualize the sentiment of text

If you're the hands-on type, you might want to head directly to the notebook for this tutorial. You can fork it and have your very own version of the code to run, modify and experiment with as we go along.

What is sentiment analysis?

Sentiment analysis is the computational task of automatically determining what feelings a writer is expressing in text. Sentiment is often framed as a binary distinction (positive vs. negative), but it can also be more fine-grained, such as identifying the specific emotion an author is expressing (like fear, joy or anger).
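To give a feel for the binary (positive vs. negative) framing, here is a minimal lexicon-based sketch in Python. It is not the code from the tutorial's notebook: the six-word lexicon is a made-up toy, and the function names `tokenize`, `sentiment_score` and `label` are hypothetical.

```python
import re

# Toy lexicon (an assumption for illustration): +1 = positive, -1 = negative.
LEXICON = {
    "good": 1, "great": 1, "love": 1,
    "bad": -1, "terrible": -1, "hate": -1,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, lexicon=LEXICON):
    """Sum the lexicon values of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def label(score):
    """Collapse a numeric score into a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label(sentiment_score("I love this great movie")))  # prints: positive
```

A word-counting approach like this ignores negation and context ("not good" still scores +1), which is one source of the errors any automatic analysis of language will produce.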

Sentiment analysis is used for many applications, especially in business intelligence. Some examples of applications for sentiment analysis include:

  • Analyzing the social media discussion around a certain topic
  • Evaluating survey responses
  • Determining whether product reviews are positive or negative

Sentiment analysis is not perfect, and as with any automatic analysis of language, you will have errors in your results. It also cannot tell you why a writer is feeling a certain way. However, it can be useful to quickly...

Kaggle’s Kernels-focused engineering team has been working hard to make our coding environment one that you want to use for all of your side projects. We’re excited to announce a host of new changes that we believe make Kernels the default place you’ll want to train your competition models, explore open data, and build your data science portfolio. Here’s exactly what’s changed:

Additional Computational Resources (doubled and tripled)
  • Execution time: Now your kernels can run for up to 60 minutes instead of our past 20-minute limit.
  • CPUs: Use up to four CPUs for multithreaded workloads.
  • RAM: Work with twice as much data with 16 GB of RAM available for every kernel.
  • Disk space: Create killer output with 1 GB of disk space.

Code Tips

Code tips catch common mistakes as you work through coding a kernel. They will pop up when you run code with an identifiable error and significantly cut down your troubleshooting time.

Here are some examples of the most common code tips you’ll run into:

Although you specified the "R" language, you might be writing Python code. Was this intentional? If not, start a Python script instead.

Couldn't show a character. Did you happen to load binary data as text?

Did you mean "from bs4 import BeautifulSoup"?

Did you mean "ggplot2"?

Do you mean pandas.DataFrame?

Hidden Cells

You publish public kernels so you can share your data science work to build a portfolio, get feedback, and help others learn. We've added the ability...

Our recent Instacart Market Basket Analysis competition challenged Kagglers to predict which grocery products an Instacart consumer will purchase again and when. Imagine, for example, having milk ready to be added to your cart right when you run out, or knowing that it's time to stock up again on your favorite ice cream.

This focus on understanding temporal behavior patterns makes the problem fairly different from standard item recommendation, where user needs and preferences are often assumed to be relatively constant across short windows of time. Whereas Netflix might be fine assuming you want to watch another movie similar to the one you just watched, it's less clear that you'll want to reorder a fresh batch of almond butter or toilet paper if you bought them yesterday.

We interviewed Kazuki Onodera (aka ONODERA on Kaggle), a data scientist at Yahoo! JAPAN, to understand how he used complex feature engineering, gradient boosted tree models, and special modeling of the competition's F1 evaluation metric to win 2nd place.


What was your background prior to entering this challenge?

I studied Economics in university, and I worked as a consultant in the financial industry for several years. In 2015, I won 2nd place...

For many Kagglers, the academic year is getting started, which means brushing up on coding skills, learning new machine learning techniques, and finding the right datasets for class projects. In this month's Data Notes, we highlight new features like tagging and our pro-tips for finding datasets. Plus, learn how you can share the datasets you've collected or created with the Kaggle community for the opportunity to earn part of $10,000 in prizes each month.

If you want to keep up on the latest in community code, public datasets, and data science news, subscribe to our monthly Data Notes newsletter below (or check out past editions here).


Datasets

New feature! Use tags to discover more data faster

Did you know Kaggle hosts over 70 high-quality datasets about finance? The new tagging feature helps you...

In August, over 350 new datasets were published on Kaggle, in part sparked by our $10,000 Datasets Publishing Award. This interview delves into the stories and background of August's three winners–Ugo Cupcic, Sudalai Rajkumar, and Colin Morris. They answer questions about what stirred them to create their winning datasets and kernel ideas they'd love to see other Kagglers explore.

If you're inspired to publish your own datasets on Kaggle, know that the Dataset Publishing Award now recurs monthly and we'd love for you to participate. Check out this page for more details.

First Place, Grasping Dataset by Ugo Cupcic

Can you tell us a little about your background?

I’m the Chief Technical Architect at the Shadow Robot company. I joined Shadow in 2009 after working briefly for a consulting firm in London. I joined Shadow as a software engineer when it was still a very small company. I then evolved as the needs of the company diversified while growing - senior software engineer, head of software and finally CTA. My background is in bio-informatics (!) and AI. Feel free to connect on LinkedIn if you want to know more.