Can We Predict if We Can Build a Bridge in Africa?

Agongora
4 min read · Nov 19, 2020

For my Labs project I worked with a non-profit called Bridges to Prosperity. They build bridges in Rwanda to help provide access to resources. The problem they were facing: just to know whether it's technically feasible to build a bridge at a site (not which area should have priority; strictly, can we build here), it would take one of their engineers a week to review the site. That is very costly because there are ~1,500 sites, so reviewing them all would take roughly 20 years. Our job was to clean the data, try to build a model that would predict whether or not it was technically feasible to build a bridge at a given site, and relay that information to the front end in a consumable way. Going into this project I was concerned about falling short or doing something wrong, because the work feels very important, but I'm happy to say I overcame that and did a pretty good job.

The user story we went with was, "As a user, I'd like to input information about a bridge site and get back a prediction plus a probability on whether or not it's technically feasible to build at that site." Breaking this user story down into tasks was pretty straightforward: clean the data, try several models, check metrics, hyperparameter tune, choose a final model, incorporate that model into the DS API, and deploy the API.

What I Did

There was a column called comments that held the data collected in 2013/2014. Back then the data they collected was different and in a different format, but they wanted to keep it, so they threw it into the comments column and asked us to parse it and incorporate it into the main dataset. I tried to find a clean delimiter and use string manipulation to parse out the correct info, but that only worked for about half of the rows, so after 3 days of trying I decided on using regex. I wish I had done this sooner, but I was wary of it because I didn't know regex before then; I had to learn it. With regex it became much easier (given some testing) to pull out the information I needed. After parsing that column I incorporated overlapping data wherever a NaN was present in the main dataset; otherwise the most current data took precedence. I also added new columns for data that was present in the comments column but didn't correspond to a specific existing column.
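As a rough sketch of that merge logic (the column names, sample strings, and the `span:` pattern here are illustrative, not the project's actual schema), the regex parse plus "newest data wins" fill might look like:

```python
import re
import pandas as pd

# Hypothetical sample: a main column with gaps and a free-text
# comments column carrying the old 2013/2014 data.
df = pd.DataFrame({
    "span_m": [None, 40.0],
    "comments": ["span: 35 m, flood duration: 3 days",
                 "span: 50 m"],
})

def parse_span(comment):
    """Pull a numeric span out of the comment text, if present."""
    m = re.search(r"span:\s*(\d+(?:\.\d+)?)", comment or "")
    return float(m.group(1)) if m else None

parsed = df["comments"].apply(parse_span)
# The most current data takes precedence: only fill where the
# main column is NaN.
df["span_m"] = df["span_m"].combine_first(parsed)
print(df["span_m"].tolist())  # [35.0, 40.0]
```

`combine_first` is what makes the precedence rule one line: existing values win, parsed values only fill the holes.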

Next came cleaning the rest of the columns. I started by dropping all columns that were ≥40% empty. I decided to fix the crossing method column first. It looked like there were only a couple of main crossing methods, but many of the crossing methods that meant the same thing were cased differently, spelled incorrectly, or had some other niche error. So I started binning them into a few main categories, first by lowercasing all the letters and then by searching for keywords. This handled most of them, but there were a few cases that I just mapped manually in a dictionary.
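That binning logic can be sketched like this; the category names, keyword lists, and the misspelling in the manual map are all made up for illustration:

```python
# Niche misspellings that keyword search misses, mapped by hand.
MANUAL_MAP = {"brdge": "bridge"}

# Keyword lists per category (illustrative, not the real taxonomy).
CATEGORIES = {
    "bridge": ["bridge", "timber"],
    "boat": ["boat", "canoe"],
    "none": ["none", "swim", "wade"],
}

def bin_crossing_method(raw):
    """Lowercase, check the manual map, then fall back to keywords."""
    value = str(raw).lower().strip()
    if value in MANUAL_MAP:
        return MANUAL_MAP[value]
    for category, keywords in CATEGORIES.items():
        if any(kw in value for kw in keywords):
            return category
    return "other"

print(bin_crossing_method("Timber Bridge"))    # bridge
print(bin_crossing_method("they swim across")) # none
print(bin_crossing_method("brdge"))            # bridge
```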

Next came cleaning distance to reliable all-weather crossing point. I noticed that a lot of rows were of the following form: a number or numeral, then "km" or "kilo", so in regex: (\d.*|\w+.*)(km|kilo). I had to repeat this for meters, but divide by 1000 so the units would be in kilometers. One problem: some strings were in units of time. What I had to search for there was more like (\d.*|\w+.*)(hour|hr), then multiply by an average km/hr or km/min. At the very end, all I had to do was change word numbers ("one") to numerals (1). Fixing flood duration was basically the same thing, but I had to adjust units so everything was in days.
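Putting those steps together, a hedged sketch of the normalization (the word-number map, the exact pattern, and the assumed 5 km/h average speed are all illustrative choices, not the project's actual values):

```python
import re

WORD_NUMS = {"one": 1, "two": 2, "three": 3}  # extend as needed
AVG_KM_PER_HOUR = 5  # assumed average travel speed

def to_km(text):
    """Normalize a messy distance string to kilometres (sketch)."""
    value = str(text).lower()
    # Word numbers ("two") become numerals ("2") first.
    for word, num in WORD_NUMS.items():
        value = value.replace(word, str(num))
    m = re.search(r"(\d+(?:\.\d+)?)\s*(km|kilo|m\b|meter|hour|hr)", value)
    if not m:
        return None
    number, unit = float(m.group(1)), m.group(2)
    if unit in ("hour", "hr"):
        return number * AVG_KM_PER_HOUR  # time -> distance
    if unit in ("m", "meter"):
        return number / 1000             # metres -> kilometres
    return number                        # already kilometres

print(to_km("3 km"))        # 3.0
print(to_km("500 meters"))  # 0.5
print(to_km("two hours"))   # 10.0
```

Flood duration works the same way, just normalizing hours and weeks into days instead.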

I want to talk about crops and jobs for a bit even though I decided to drop them. No amount of regex or string manipulation was working, so I used spaCy to tokenize the strings, then checked how many unique values were in each column. Jobs, even with tokenization, wasn't clean enough to just explode into indicator columns (a 1 wherever a given job was present); there were around 200 unique jobs, which is too many to explode. Crops, on the other hand, was relatively clean: I had around 50 crops at the end and could still bin some of them. The problem was that while checking each of the 50 crops I noticed most appeared fewer than 15 times, so I made the arbitrary rule that a crop had to appear at least 100 times to be included. After applying that rule I had about 2 crops left, and since roughly 90% of those two crops' values were null, I decided keeping crops wasn't worth it. At this point the data was pretty much clean. All I had left to do was impute, KNN impute, and OneHotEncode some columns.
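The frequency cutoff itself is simple to illustrate without spaCy: here a plain whitespace split stands in for tokenization, the crop strings are toy data, and the threshold is scaled down from the project's 100 to fit the example:

```python
from collections import Counter

# Toy crop strings; in the project spaCy tokenized the real text.
rows = ["maize beans", "maize", "cassava", "maize beans"]
counts = Counter(tok for row in rows for tok in row.split())

MIN_OCCURRENCES = 2  # the project used 100; scaled down for toy data
kept = {crop for crop, n in counts.items() if n >= MIN_OCCURRENCES}
print(sorted(kept))  # ['beans', 'maize']
```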

The Model

We had 89 labeled data points and the classes were imbalanced. My pipeline was OneHotEncoder, StandardScaler, and LabelPropagation, to try a semi-supervised technique. This predicted 0 (not feasible) for all ~1,500 points, so something wasn't working. I tried using imbalanced-learn's RandomOverSampler to balance the classes, but after fitting the pipeline with the new data it still only predicted the 0 class.
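A minimal stand-in for that setup on synthetic numeric data (sklearn only: the OneHotEncoder step and imbalanced-learn's RandomOverSampler are omitted to keep the sketch dependency-light, and a knn kernel is used for numerical stability). The key semi-supervised convention is that unlabeled rows get the label -1:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelPropagation

# Tiny synthetic stand-in: mostly unlabeled points with a small,
# imbalanced labeled subset, mimicking 89 labels out of ~1,500 sites.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.full(100, -1)               # -1 marks unlabeled points
y[:8] = [0, 0, 0, 0, 0, 0, 1, 1]   # imbalanced labeled subset

model = make_pipeline(StandardScaler(), LabelPropagation(kernel="knn"))
model.fit(X, y)                    # labels spread through the knn graph
preds = model.predict(X)           # every point gets a 0/1 prediction
```

With so few labels, an all-one-class prediction like the one described above is a real risk: the handful of minority labels can simply be swamped during propagation, and oversampling duplicates of the same few points doesn't add new information to the graph.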

What Does This Mean?

I knew from the start that the project might not pan out because we had so little labeled data. Unless there's another semi-supervised approach to this, I think the non-profit has to build more bridges before predictions are possible. Regardless, I had a lot of fun on the project using regex, imbalanced-learn, and models I had never heard of. It shows me how much there's left to learn in the data science field. It also really showed me how much I like data cleaning: I don't always enjoy the process, but taking a really nasty dataset and making it usable feels good.
