Introduction
For this two-part project you will be working with a medium-size dataset (about 800K records) of education statistical indicators obtained from the World Bank. The dataset consists of two file:1
2f308b71c-00be-4519-9b0c-0d3100b75092_Data.csv
f308b71c-00be-4519-9b0c-0d3100b75092_Definition and Source.csv
both in csv or comma-separated value format. The first file contains the data, with each record having five fields, described in the first line of this file as:1
Country,CountryCode,SeriesCode,2015
The next 762,000 lines contain records of this form, while the last five lines look like:
1 | Data from database: Education Statistics - All Indicators,,,, |
and can safely be ignored (these lines are either blank - hence the rows of commas - or contain data provenance information).
The second file contains information about the data contained in the first file. Each line is a record consisting of four fields, described in the first line of this file as:1
Code, Indicator Name, Long definition, Source
Each of these lines can be quite long, especially (and not surprisingly) the field called “Long definition.” Using the elipsis (“…”) to shorten the line, a sample field (the 10th line of the file to be precise) looks like:1
UIS.NERA.3, "Adjusted...", "Total...", UNESCO Institute for Statistics
Aside from the textual descriptions, the most interesting field here is the first, which by design contains interesting information about the type of data represented, and which is also appears as the third field in the data file. The key idea is that records in the data file that share the same “Series Code” (or simply “Code” as its called in the definitions file file) are by definition directly comparable. So, for example, if I am interested in the adjusted net enrollment rate for females in primary school (SE.PRM.TENR.FE) across different countries or populations, I could compare all the records from the data file with this “Series Code” (there are 254 of these).
Reading Data
Your first task is to read these data into Python and construct an appropriate representation of these records. You will write two functions to achieve this end.
First, write a function
1 | defreadDefinitions(filename): |
which opens file filename of the second type described above, that is:1
2Code, Indicator Name, Long definition, Source
and returns a dictionary D with entries of the form:
Code: [ Indicator Name, Long definition, Source ]
Reading csv files can be tricky, because it is quite possible that some fields may contain explicit commas embedded in a single field (see, e.g., the “Long definition” field). In such cases, these extraneous commas are protected by enclosing the field in quotes. For example, consider the following line taken from a hypothetical csv file representing Olympic gold medalists:1
Fencing, "Garozzo, Daniele", ITA, 2016
Here, the correct interpretation is a single record (or line) containing four fields (and not five) because the comma in the gold medalist’s name is not semantically equivalent to the other three commas in the line. Fortunately, because this is such a common issue, Python provides a library for reading csv files while honoring the semantics of commas embedded in quoted fields. To use the library, you must first:1
import csv
You will need to read up on how to use the csv library: more information can be found here:
https://docs.python.org/3.5/library/csv.html
Important: do not attempt to split() the line, as I can guarantee this will fail. To be successful, you will need to use the csv library. Also, be wary of the first line in the csv file, which is a “header” and should not be included in the dictionary as data.
The second function you should write:
1 | defreadData(filename): |
should open a file filename of the first type above, that is:1
Country, CountryCode, SeriesCode, 2015
and return a tuple of two dictionaries, C and V . The first dictionary, C, should have entries of the form:1
Country : Country
and therefore should look like, e.g.,1
{'USA': 'United States', 'ITA': 'Italy', 'PRT': 'Portugal', ...}
while the second dictionary, V , in the tuple should have entries of the form:1
Series Code: {Country Code: 2015 [YR2015], ...}
and thus look like, e.g.,1
{'UIS.LR.AG15T99.GPI': {'ARE': '1.02945005893707', ...}, ...}
Again, be wary of the fact that the first line in this csv file is also a “header” and should not be include in the dictionary as data, but nor should the last five lines of this file, which represent summary information and not real data.
In constructing the readData() function, it is important not to include entries where the corresponding value is missing (missing values in this data file are indicated by a string, ‘..’); there will be many fewer values than the 762000 rows in the data file might imply. To get to the heart of the matter, you will write a third function:
1 | defmakeProfiles(C, V): |
which will return a dictionary, P, indexed by country code with values consisting of the number of datapoints present in V for that particular country. Thus part of P produced by your first implementation of makeProfiles() might look like:1
{'MAR': 307, 'IBT': 0, 'WSM': 293, 'URY': 269, ...}
Indicating that there is no data about ‘IBT’ present in this dataset. You will note that ‘IBT’ is defined in C as ‘IDA & IBRD total’ rather than some country name; IDA and IBRD stand for International Development Association and International Bank for Reconstruction and Development, respectively, two branches of the World Bank from which these data were obtained. Given that no data is provided, your complete version of makeProfiles() should alter C to remove the entry for IBT and any other similar code that has no data associated with it. Thus the value of P produced by your final version of makeProfiles() might look like:1
{'SMR': 14, 'UMC': 28, 'ARG': 263, ...}
should not contain any entries with 0 values, and should also modify C to remove those entries. If your code is like mine, final versions of both P and C should contain 241 entries.
Finally, you will implement a plotting function:
1 | defplotProfile(P): |
that will reproduce the figure shown here below.
Note how all of the country codes are arranged alphabetically along the x axis, with the y values corresponding to the number of datapoints for that particular country code. You will need to study the matplotlib pyplot documentation:
http://matplotlib.org/api/pyplot_api.html
to learn how to make your code work.
Once this code is complete, we’ll turn our attention to a more interesting analysis of the data.