Weather Patterns Using Hadoop and R – Part 2

The first part of this post looked at data mining of world weather patterns such as cloud location and cloud coverage levels.

As these features describe aggregate values and ratios they are quite simple to evaluate with Hadoop, R and MapReduce Streaming. However for time dependant analytics, such as how long rainfall happens, additional issues need to be addressed.

Time Dependent Data Mining

Time based weather analytics focuses on weather at any location as time passes. This is done by extracting data using R so that Hadoop sorts data in the shuffle phase by weather station then time. However this is quite inefficient as, for any station, data is scattered across every file in the dataset.

Filtering down to 6 stations spread across the globe significantly reduces the data load for Hadoop’s shuffle phase, while there remains sufficient data to look at world weather and regional patterns.

The 6 stations are chosen across the globe:

  • Arctic and Antarctic – Danmarkshavn,Greenland and Base Belgrano II, an Argentinian run Antarctic weather station.
  • Mid Latitudes – Dublin Airport, Ireland and Christchurch, New Zealand.
  • Tropics – Khartoum, Sudan and La Paz, Bolivia.

The following R script brings this all together to form the Mapper:

# Mapper script
#! /usr/bin/env Rscript
stations <- c(" 4320", #Danmarkshavn
              " 3969", #Dublin Airport
              "62721", #Khartoum
              "85201", #La Paz
              "93890", #Christchurch
              "89674") #Base Belgrano II

#days in standard 365 day year
months <- c(31,28,31,30,31,30,31,31,30,31,30,31)
#days in year
#from 1981 to 1991, inclusive
years <- c(365,365,365,366,365,365,365,366,365,365,365)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0)
     | substr(line,1,2)=="88")
     months[2] <- 29
    } else {
     months[2] <- 28

 #land station synoptic code
#time in number of hours since Jan 1, 1981 00:00
 #precipitation level

Using this script with an identity reducer data mining can be performed from the output in R, outside Hadoop. Note that a map-only job (achieved with the command line argument -D mapred.reduce.tasks=0) won’t shuffle the data, so an identity reducer is necessary for this task.


When Precipitation Happens

As noted in the previous post, nimbostratus always occurs with precipitation. For stratus, 45% of the time this cloud appears precipitation develops before either the stratus disappears or 24 hours have passed. With other cloud types this rate ranges from 5% to 26%.

As mentioned before, high-level cloud types are not precipitation inducing so any data mining results relating to precipitation are meaningless[1].

How Long Precipitation Lasts

Within the 6 chosen stations nimbostratus precipitation lasts for up to 6 days, corresponding with the expectation that precipitation “is generally steady and prolonged”[1]. With stratus, precipitation can also last for up to 6 days, with cirrostratus for up to 4.5 days while for other cloud types the maximum is 3 days.

Cloud Development Patterns

Nimbostratus – follows altocumulus slightly more often than altostratus. May also form following occurrence of cirrus cloud cover and is roughly equally as likely to develop into altocumulus as altostratus.

Cirrus – at the 6 stations often develops into cirrostratus. Cirrostratus is known to form from “spreading and joining of cirrus clouds”[1].

Cumulus – at the chosen stations usually develops into stratocumulus or cumulonimbus. Indeed cumulonimbus is known to form from “upwardly mobile cumulus congestus clouds (thermals)”[1].

Stratocumulus – 31% of occurrences are followed by altocumulus. Indeed
altocumulus can form “by subdivision of a layer of stratocumulus”[2]. 27% of stratocumulus occurrences are followed by cumulus. Indeed “cumulus may originate from … stratocumulus”[3]. Also, 33% of stratocumulus occurrences are followed by stratus. This is line with the expectation that “stratus may develop from stratocumulus”[4].

Stratus – is more likely to develop into stratocumulus than no low cloud cover.


Again, data mining results compare quite well with established cloud facts and world weather satellite data, albeit this part of the study restricts itself to 6 land based locations.

The set of stations can be changed so that the trade-off between accuracy and time running Hadoop is optimal while various data analytics, such as regional weather effects, can be extracted. The main drawback is that this data mining involves re-running the MapReduce job each time the station list is changed.


  1. ^ a b c Common Cloud Names, Shapes, and Altitudes
  2. ^ Altocumulus
  3. ^ Cumulus
  4. ^ Stratus

Weather Patterns Using Hadoop and R – Part 1

In the post Cloud Weather Reporting with R and Hadoop the meaning of synoptic cloud data was not explored. This post looks at cloud patterns and their impact on world weather.

The cloud weather analytics covered mainly comprises aggregate frequency data (e.g. cloud location, coverage and precipitation types) and time series data (e.g. when precipitation happens and for how long). This post focusses on frequency data analytics, while the next post will focus on time dependant analytics.

Frequency Based Analytics

Frequency based properties are evaluated through MapReduce by extracting relevant weather data using the following R script as Mapper:

# Mapper script
#! /usr/bin/env Rscript

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0)
 #output latitude, cloud types and precipitation level
 #latitude is coded as 0 for tropical, 1 for mid-latitudes, 2 for polar regions

The Reducer script from a previous post gives counts for each combination of cloud type and precipitation level, from which most frequency based weather analytics can be derived. For example, stratocumulus frequency is the aggregate count of where the low cloud type code is 4,5 or 8, divided by the total number of reports in the dataset.

Ideally from a data mining perspective, error prone work such as interpretation of WMO codes and further data analytics can be done using an R script and, as this is outside Hadoop, can be tested and amended quickly.


This data mining reveals a few properties of world weather, in addition to a few local weather patterns.

Location of Clouds

Nimbostratus is “common in middle latitudes”[1] – in the dataset 85% of nimbostratus occurrences are mid-latitude. Cumulonimbus is “common in tropics and temperate regions, rare at poles”[1], somewhat reflected in the dataset where 23% of occurrences of cumulonimbus are in tropical regions and 4% are in polar regions.

Likelihood of Precipitation

100% of occurrences of nimbostratus occur with precipitation, as WMO codes define precipitation bearing altostratus or altocumulus to be nimbostratus. Indeed nimbostratus is derived from ‘nimbus’, Latin for rain.

Conversely, high cloud types (cirrus, cirrostratus and cirrocumulus) generate no precipitation[1] so any perceived correlation with precipitation levels is meaningless. For remaining cloud types, precipitation occurrence levels range from 3% for cumulus to 38% for stratus. Indeed cumulus is associated with “generally no precipitation”[1].

Types of Precipitation

Cumulonimbus typically results in “Heavy downpours, hail”[1] – in the dataset 91% of precipitation associated with cumulonimbus is in the thunderstorm range. Indeed in the tropics, cumulonimbus often forms from hot weather through convection during monsoon season.

Conversely, nimbostratus is rarely seen with thunderstorms, more with “moderate to heavy rain or snow”[1] – 54% of nimbostratus occurrences happen with rain. Similarly, stratocumulus is usually accompanied by “occasional light rain, snow”[1] – 47% of cumulonimbus occurrences happen with rain.

Cloud Embedding

Cumulonimbus can be embedded with nimbostratus, and in such instances may result in heavy rain, but this is rare[2]. Indeed, further data mining reveals that only 5.9% of the occurrences of nimbostratus show this embedding.

Cloud Coverage Levels

The most significant cloud coverage comes from cirrus, stratocumulus and cumulus. For the dataset cirrus cloud cover is 22%, stratocumulus cloud cover is 24% and altocumulus cloud cover is 25%. This compares quite well to satellite data of 20 to 25% for cirrus[3], 25% for stratocumulus[2] and 20 to 25% for altocumulus[4].


The data mining results compare quite well with established cloud facts and world weather satellite data, particularly on location and coverage of clouds.

In the next post more complex data mining is presented i.e. time dependant analytics. For instance how to determine which clouds cause most prolonged rainfall? And how is this analytics done efficiently, given the amount of re-sorting of data involved?


  1. ^ a b c d e f Common Cloud Names, Shapes, and Altitudes
  2. ^ a Cloudwatching
  3. ^ Cirrus Clouds and Climate
  4. ^ Comparisons and analyses of aircraft and satellite observations for wintertime mixed-phase clouds