
Project Objective: This project focused on measuring the similarity between two texts. The objective was to write a program that takes two texts as input and uses a metric to determine how similar they are.

Documents that are exactly the same get a score of 1, and documents that have no words in common get a score of 0.

The big challenge with this project was that it had to be done without importing any libraries, using just plain Python.

After building the app, the goal was to deploy it online so that anyone could use it to get the similarity between two texts.

Process: Since the challenge was to accomplish the task without libraries, I had to figure out how to calculate text similarity manually.

The first thing I did was pre-process the sample text. The pre-processing steps, sketched in code below, included:

  • Lowercasing the text
  • Removing punctuation and any character (like emoji) that is not a letter or a number
  • Splitting the text into individual words
  • Removing stop words, using a stop-word text file I created
  • Other standard text pre-processing steps
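A minimal sketch of those steps in plain Python (the stop-word set here is a small stand-in for the project's stop-word file):

    # Stand-in stop words; the actual project loaded these from a text file
    STOP_WORDS = {"the", "a", "an", "and", "is", "in", "of", "to"}

    def preprocess(text):
        # Lowercase the text
        text = text.lower()
        # Keep only letters, digits, and spaces (drops punctuation and emoji)
        text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
        # Split into words and drop stop words
        return [word for word in text.split() if word not in STOP_WORDS]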
The second step was to calculate text similarity. To do this, I computed the Term Frequency-Inverse Document Frequency (TF-IDF) value for each document, then used the TF-IDF vectors to calculate the cosine similarity between the two sample texts.
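As a rough illustration, TF-IDF and cosine similarity can be computed by hand along these lines, using only Python's built-in math module (the exact formulas in the project may have differed slightly):

    import math

    def tf_idf_vectors(doc_a, doc_b):
        # Build one TF-IDF vector per document over the shared vocabulary
        docs = [doc_a, doc_b]
        vocab = sorted(set(doc_a) | set(doc_b))
        vectors = []
        for doc in docs:
            vec = []
            for term in vocab:
                tf = doc.count(term) / len(doc)
                df = sum(1 for d in docs if term in d)
                idf = math.log(len(docs) / df) + 1  # +1 keeps shared terms from zeroing out
                vec.append(tf * idf)
            vectors.append(vec)
        return vectors

    def cosine_similarity(v1, v2):
        # dot(v1, v2) / (|v1| * |v2|)
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

With this setup, two identical documents score 1.0 and documents with no words in common score 0.0, matching the scoring scale described above.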
 
The most difficult part of the project was calculating TF-IDF and the cosine similarity score manually, without a library.
 
After the project was built, I deployed it to Heroku using Docker for the environment.

Result: The result is a deployed POST-method app where a user can input two sample texts and get a similarity score. The score may differ slightly from library implementations because the calculations were done manually without importing a library.

Project Objective: This cross-functional project brought together data science, web development, UX/UI, and iOS teams to build a travel website that helps people plan a trip across the United States. The data science part of the project involved creating models that let users get predictions about potential housing costs and gas prices for the entirety of their trip.

My part in this project was building the data science API for the web development team to use on the back end and front end of the website. Other data scientists on the team built the model; I took the model predictions and made them available in an API format.

Process: I used the FastAPI library to build the API, Docker for the environment, Amazon Web Services (AWS) Elastic Beanstalk to host the API, and AWS Route 53 to set up the domain and HTTPS security certificate for the website.

After the model had been built by the other data scientists, it was pickled and handed to me for integration into the API.
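As a sketch, serving a pickled model with FastAPI can look roughly like this; the file name, route, and model interface here are illustrative assumptions, not the project's actual code:

    import pickle
    from fastapi import FastAPI

    app = FastAPI()

    # Hypothetical file name; the real pickled model came from the DS team
    with open("gas_price_model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.get("/gas-price/{zipcode}")
    def predict_gas_price(zipcode: int):
        # Return the model's predicted gas price for the given zip code
        prediction = model.predict([[zipcode]])
        return {"zipcode": zipcode, "predicted_price": float(prediction[0])}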

After building the API, I gave the API link to the web development and iOS teams to incorporate into their website and mobile application.

Result: We ended up with a data science API that can be used to get gas price predictions by zip code. You can also use your destination city (represented as longitude and latitude) to get a prediction of what Airbnb rooms will cost in the city you are traveling to.

Customers could use this functionality to estimate the cost of their trip, from car expenses to housing expenses.

Project Objective: This multi-class image recognition project aims to build a model that can classify 15 different fruits. These are the steps taken to accomplish that goal.

Process: The Python package split-folders was used to split the image dataset into training, validation, and test sets.
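For example, the split can look something like this (the folder paths and ratios here are illustrative):

    import splitfolders  # the split-folders package

    # Split a folder of class subdirectories into train/val/test sets (80/10/10)
    splitfolders.ratio("fruit_images", output="dataset", seed=42, ratio=(0.8, 0.1, 0.1))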

TensorFlow's ImageDataGenerator was used to import and rescale the training and validation datasets. The dataset has 15 classes representing 15 different fruits.
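A sketch of that import-and-rescale step, assuming hypothetical folder paths and an arbitrary 100x100 image size:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Rescale pixel values from 0-255 to 0-1
    train_datagen = ImageDataGenerator(rescale=1.0 / 255)
    val_datagen = ImageDataGenerator(rescale=1.0 / 255)

    train_generator = train_datagen.flow_from_directory(
        "dataset/train", target_size=(100, 100),
        batch_size=32, class_mode="categorical")
    val_generator = val_datagen.flow_from_directory(
        "dataset/val", target_size=(100, 100),
        batch_size=32, class_mode="categorical")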

The model was built using TensorFlow's Sequential model, with four convolutional and max-pooling layers, 512 neurons in the dense layer, and ReLU and softmax activations.
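A sketch of that architecture; the filter counts are assumptions, but the overall shape (four conv/max-pool blocks, a 512-neuron dense layer, ReLU and softmax) follows the description above:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(100, 100, 3)),
        layers.MaxPooling2D(2, 2),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(2, 2),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D(2, 2),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D(2, 2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(15, activation="softmax"),  # one output per fruit class
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])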

The model was fit on the training and validation datasets using a callback function that terminates training once validation accuracy reaches 98%. The trained model was then saved so it can be reloaded the next time it is needed.
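One way to implement such a callback (a sketch, not necessarily the project's exact code):

    import tensorflow as tf

    class StopAt98(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # Stop training once validation accuracy reaches 98%
            if (logs or {}).get("val_accuracy", 0) >= 0.98:
                self.model.stop_training = True

    history = model.fit(train_generator, validation_data=val_generator,
                        epochs=50, callbacks=[StopAt98()])
    model.save("fruit_model.h5")  # save for reuse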

The trained model was used to predict new fruit images from the test dataset. The loss and accuracy results for the training and validation datasets were graphed to help examine the model for overfitting or underfitting.
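The graphing step can be as simple as plotting the history object returned by model.fit:

    import matplotlib.pyplot as plt

    # Diverging train/validation curves suggest overfitting
    plt.plot(history.history["accuracy"], label="train accuracy")
    plt.plot(history.history["val_accuracy"], label="val accuracy")
    plt.plot(history.history["loss"], label="train loss")
    plt.plot(history.history["val_loss"], label="val loss")
    plt.xlabel("epoch")
    plt.legend()
    plt.show()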

Result: The initial model is not very accurate at predicting the various types of fruit, but there are some things I could do in the future to improve the model and make its predictions more accurate.

The model predictions are as follows:

  1. Apple – Correctly Predicted as Apple
  2. Banana – Correctly Predicted as Bananas
  3. Carambola – Correctly Predicted as Carambola
  4. Guava – Correctly Predicted as Guava
  5. Kiwi – Incorrectly Predicted as Peach
  6. Mango – Correctly Predicted as Mango
  7. Muskmelon – Incorrectly Predicted as Peach
  8. Orange – Incorrectly Predicted as Muskmelon
  9. Peach – Incorrectly Predicted as Persimmon
  10. Pear – Incorrectly Predicted as Peach
  11. Persimmon – Incorrectly Predicted as Pear
  12. Pitaya – Incorrectly Predicted as Persimmon
  13. Plum – Incorrectly Predicted as Apple
  14. Pomegranate – Incorrectly Predicted as Plum
  15. Tomatoes – Incorrectly Predicted as Pomegranate

Project Objective: This project was aimed at implementing the Gaussian naive Bayes classification algorithm from scratch. The primary objective was to create an algorithm that gives the same results as established libraries like scikit-learn.

Process: The first part of the project was creating a “fit method”. The fit method splits the features up by class, where the number of classes is determined by the number of unique values in the target. It then computes the mean and variance of each feature by class.

The second part of the project was implementing the “predict method”. The predict method uses the Gaussian naive Bayes equation and the x_test values to obtain the probability of each feature by class. The final step is to take the product of the feature probabilities and the prior for each class, which gives the final probability of which class the x_test values belong to.
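Put together, a from-scratch implementation along these lines might look like the sketch below (consistent with the description above, but not the project's verbatim code):

    import numpy as np

    class GaussianNaiveBayes:
        def fit(self, X, y):
            # One class per unique value in the target
            self.classes_ = np.unique(y)
            # Per-class mean and variance of each feature, plus the class prior
            self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
            self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_])
            self.priors_ = np.array([(y == c).mean() for c in self.classes_])
            return self

        def predict(self, X):
            preds = []
            for x in X:
                # Gaussian likelihood of each feature value, per class
                likelihood = (np.exp(-((x - self.mean_) ** 2) / (2 * self.var_))
                              / np.sqrt(2 * np.pi * self.var_))
                # Product of feature likelihoods times the prior for each class
                probs = self.priors_ * likelihood.prod(axis=1)
                preds.append(self.classes_[np.argmax(probs)])
            return np.array(preds)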

Result: The project produced a successful implementation of the Gaussian naive Bayes algorithm that gives accurate predictions. The algorithm gave the same results as the official scikit-learn GaussianNB implementation.

Project Objective: The purpose of this project is to predict the insurance premiums of an insurance company's customers, as well as their customer lifetime value.

Process: The process involved first wrangling the data: imputing missing values and encoding categorical values using ordinal encoding and one-hot encoding techniques. Initial data visualization and analysis were also done to get a better feel for the data; this initial visualization led to the discovery of data leakage, which was promptly removed.
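A small sketch of the imputation and encoding steps with pandas (the column names here are hypothetical):

    import pandas as pd

    df = pd.read_csv("insurance.csv")  # hypothetical file

    # Impute missing numeric values with the column median
    df["income"] = df["income"].fillna(df["income"].median())

    # Ordinal encoding for an ordered category
    df["coverage"] = df["coverage"].map({"Basic": 0, "Extended": 1, "Premium": 2})

    # One-hot encoding for an unordered category
    df = pd.get_dummies(df, columns=["state"])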

After pre-processing, the data was run through several regression models to find the model that produced the minimum error.

After the modeling process is complete, the next step is deploying the model as a Flask app.

Result: This project is still ongoing. The final result will be a deployed model showing which factors impact insurance premiums and customer lifetime value.

Project Objective: This was a team project between data scientists, data engineers, front-end engineers, back-end engineers, and marketers. The app rates and ranks Hacker News commenters by the negativity of their comment sentiment. It also lets users search by username to view the comments and sentiment levels of specific users.

Process: My specific job in this team project was building the machine learning component that performed sentiment analysis on Hacker News comments. I used the VADER sentiment analysis model to analyze and rank the Hacker News comment data.

I created two models: the first ranks comments by their negativity level, and the second gives the average negativity score of an author's comments.
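Scoring a single comment with VADER looks roughly like this (the comment text is just an example):

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    # polarity_scores returns neg, neu, pos, and compound scores
    scores = analyzer.polarity_scores("This framework is terrible and the docs are worse.")
    negativity = scores["neg"]  # the value used to rank comments by negativity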

Result: The models I built produced a new table showing the author name, the author's comments, each comment's negativity ranking, and the author's average ranking. This table was then provided to the data engineering team.

The data engineering team stored the data on the back end using a Flask app. An API endpoint to access the rankings, comments, and authors was provided to the web development team for integration into the web app.

Project Objective: This was a team project between data scientists, data engineers, front-end engineers, back-end engineers, and marketers. Using our app, you can search for a specific song and see its audio features displayed in a visually appealing way. The app lets you save your favorite songs, identifies songs with similar audio features, and gives you suggestions based on your favorites.

Process: My specific job in this team project was using Tableau to visualize various Spotify songs and their features.

Result: I created an amazing visualization of Spotify songs, and my team members were able to integrate it into our final app.

Project Objective: The purpose of this project was to analyze and visualize COVID-19 data using Tableau.

Process: The process involved getting the data from GitHub, combining and transforming it to make it ready for analysis, and then analyzing it using Tableau.

Result: I was able to create a visualization and analysis of the COVID-19 data.

Project Objective: How can we predict the life expectancy of a country? What combined factors determine how long a group of people will live? This project seeks to answer the question: what is the impact of immunization-related, mortality, economic, and social factors on life expectancy?

Process: Data wrangling techniques with tools like pandas and Python were used to clean the data.

Models such as linear regression, ridge regression, decision tree regressor, random forest regressor, and gradient boosting regressor were used to build a predictive model from the data.

Data visualization tools such as Plotly and Matplotlib were used to create 2D and 3D graphs visualizing the relationships between the features and the target.

Model interpretation techniques such as feature importances, graphs of linear model coefficients, one- and two-feature partial dependence plots (PDP), and SHapley Additive exPlanations (SHAP) values were used to explain how the best-performing model made its predictions.
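For example, the SHAP step for a tree-based model such as the gradient boosting regressor typically looks like this (the variable names are assumptions):

    import shap

    # 'model' is the fitted tree-based regressor; X_test holds the features
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Summary plot ranking features by their impact on predicted life expectancy
    shap.summary_plot(shap_values, X_test)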

Result: There is no single predictor of life expectancy. The factors with the most impact on life expectancy are the number of deaths from HIV/AIDS, the number of years in school, and the body mass index of the population. Using health, social, and economic variables like these, one can predict the life expectancy of a population. This analysis is most useful for predicting life expectancy at the population level, not the individual level.

Project Objective: The goal of this project was to determine the top 10 states with the highest student loan default rates and the top 10 states with the most money in default, using the 2016 default rate dataset from data.gov.

Accomplishments for this project:

  • Calculated the student loan default rate for all states, including the top 10 states with the highest default rate.
  • Calculated how much money is in default in each state, including the top 10 states with the most money in default.
  • Created a geographical map of the United States showing each state's default rate and amount of money in default.

This Data Science Project was Analyzed Using:

  • Python, pandas, NumPy, and Seaborn
  • matplotlib.pyplot and Plotly Express