Chapter 3 Data transformation

Since we had multiple datasets from different sources, covering different time periods and countries, a primary concern was structuring them so that they could be joined easily as needed for our analysis, without losing information to mismatched country names and the like. For example, when we initially joined the colonial history dataset with the modern GDP data, countries that were named differently in the two sources failed to match, which resulted in data loss. Our strategy was to include the iso3 country code in each dataset so that we could join datasets on country code whenever needed.
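The effect of joining on names versus codes can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the code used in this project; the rows, the `gdp_pc` placeholder values, and the helper names `inner_join` and `unmatched` are all assumptions made for the example.

```python
# Hypothetical rows: the country names and gdp_pc values are placeholders,
# not values taken from our datasets.
colonial = [
    {"country": "Ivory Coast", "iso3": "CIV", "colonizer": "France"},
    {"country": "Vietnam", "iso3": "VNM", "colonizer": "France"},
]
gdp = [
    {"country": "Cote d'Ivoire", "iso3": "CIV", "gdp_pc": 1.0},
    {"country": "Vietnam", "iso3": "VNM", "gdp_pc": 2.0},
]


def inner_join(left, right, key):
    """Merge every pair of rows that share the same value for `key`.

    On conflicting column names, the value from `left` is kept.
    """
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**r, **l} for l in left for r in index.get(l[key], [])]


def unmatched(left, right, key):
    """List the key values in `left` that have no partner in `right`."""
    right_keys = {row[key] for row in right}
    return [row[key] for row in left if row[key] not in right_keys]


by_name = inner_join(colonial, gdp, "country")  # silently drops a row
by_code = inner_join(colonial, gdp, "iso3")     # keeps both rows
lost_by_name = unmatched(colonial, gdp, "country")
lost_by_code = unmatched(colonial, gdp, "iso3")
```

Checking the unmatched keys before each join, as `unmatched` does here, is one way to confirm that no rows are dropped silently.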

3.1 Colonial History Data and Adding GDP/HDI/Gini Indicators

The colonial history dataset included several countries that no longer exist or that aren’t usually referred to as separate countries. We examined the dataset and removed several countries for these kinds of reasons.

Countries removed:

  • Small states that ended up merging to form Germany and Italy (Germany and Italy are in the dataset).
  • North and South Vietnam, since they were only separate during the war (Vietnam is in the dataset).
  • Zanzibar. Tanzania is included. In fact, the modern GDP dataset lists Tanzania twice with two different values (one of these may represent Zanzibar), but given the breadth of this analysis it made the most sense to remove Zanzibar and allow Tanzania to be included twice.
  • Czechoslovakia. Both the Czech Republic and Slovakia are included in the dataset, and since the goal is to compare with modern data it makes sense to look at these states as they are currently.
  • Austria-Hungary. Both Austria and Hungary are included separately in the dataset.
  • Yemen (one variant). Yemen had three variants: two colonized by the UK and one colonized by Turkey. We removed one of the UK variants and gave the Yemen colonized by Turkey the country code for modern-day Yemen. Without deep knowledge of Yemen’s political history, we felt this was the best choice we could make for the sake of investigating the impact of colonialism by colonizing country.
  • Korea (unified). This no longer exists and both North and South Korea are included in the dataset.

Since the dataset with the current per capita GDP, Gini coefficient, and HDI already included iso3 codes, merging it with the cleaned colonial history dataset was simple at this point.

3.2 Angus Maddison Dataset

The Maddison dataset comes from a project that has continued to develop Maddison’s work since his death. These numbers are the most up-to-date versions and incorporate recent historical work. The links to the sources provide detailed documentation and explanation of the changes and their significance.

The dataset had information on England starting from 1200, while information on the UK began around 1700 (for some years, values were listed for both England and the UK). We dealt with this by creating an England dataset that merges the UK and England information, with a column for each of rgdpnapc and cgdppc (see the data section for an explanation of these values; we ended up using only cgdppc). If data existed for the UK, that is what is included; otherwise, the information on England is included if it exists. This seemed a reasonable way to look at the change in GDP over time in England/UK, especially given that the figures are per capita, so an increase in population from encompassing more area would be accounted for.
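The rule of preferring the UK value and falling back to England can be sketched as follows. This is an illustrative plain-Python version, not the code used in this project; the years and values are placeholders and `prefer_uk` is a hypothetical helper name.

```python
# Hypothetical per-year cgdppc values; None marks a missing observation.
england = {1200: 10.0, 1700: 20.0, 1750: 25.0}
uk = {1700: 21.0, 1750: None, 1800: 30.0}


def prefer_uk(uk_series, england_series):
    """For each year, keep the UK value if present, else the England value.

    Years with no value in either series are left out of the result.
    """
    merged = {}
    for year in sorted(set(uk_series) | set(england_series)):
        value = uk_series.get(year)
        if value is None:
            value = england_series.get(year)
        if value is not None:
            merged[year] = value
    return merged


combined = prefer_uk(uk, england)
```

In this sketch, 1700 takes the UK value because both series have one, 1750 falls back to England because the UK value is missing, and 1200 and 1800 each come from the only series that covers them.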

3.3 Colonial Transformation Dataset

We used the countrycode R library again to match the ISO 3-character codes with those used in the colonial dataset. We then cleaned the column names for ease of use. They were originally strings more than 35 characters long that described the indicators and their ranges in detail, and we replaced them with abbreviations. “PT” indicators are markers of political transformation, “ST” of social transformation, “ET” of economic transformation, and “CT” of overall colonial transformation. A description of the indicators at large is available on the source website, and a summary of the variable encoding is included on our GitHub in /src.
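The renaming step amounts to applying a lookup table from long names to abbreviations. The sketch below shows the idea in plain Python; it is not the code used in this project, and the long column names, the abbreviations, and the values in it are invented for illustration. The actual encoding is the one summarized in /src.

```python
# Hypothetical long column names and abbreviations, for illustration only.
ABBREVIATIONS = {
    "Hypothetical political transformation indicator (0-100)": "PT_example",
    "Hypothetical social transformation indicator (0-100)": "ST_example",
}


def rename_columns(row, mapping):
    """Rename the keys of one record; unmapped keys such as iso3 are kept."""
    return {mapping.get(name, name): value for name, value in row.items()}


# Placeholder record: the indicator values are not real data.
row = {
    "iso3": "VNM",
    "Hypothetical political transformation indicator (0-100)": 0.0,
    "Hypothetical social transformation indicator (0-100)": 0.0,
}
renamed = rename_columns(row, ABBREVIATIONS)
```

Keeping the mapping in a single table makes it straightforward to publish alongside the data, so that each abbreviation can be traced back to its full description.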