Data Science Industrial Projects

I led the following data projects in IBM Plan A100.

1. Product Recommendations for an E-commerce Store (Sep 2015 - Feb 2016)

  • Industry: FMCG (Fast-moving consumer goods)
  • Client: one of the largest FMCG companies in the world
  • Details: Applied app event tracking and market basket analysis to build a product recommendation system that integrated input from the marketing team, addressing the inefficiency of traditional recommendations driven by marketing experience alone.
    • Worked with the IT team to track in-app events, collecting 1 TB of browsing data
    • Used Hadoop to run ETL jobs that generated behavior features for 500,000 members
    • Used R to build a market basket analysis that recommends products at the right time through the right channel
  • Achievement: Activated 46% of inactive customers; converted 10% of customers from trying samples to purchasing
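
As an illustration of the market basket analysis step, here is a minimal sketch in plain Python (the project itself used R). It computes support, confidence and lift for item pairs. The function name `pair_rules`, the `min_support` threshold and the toy baskets are assumptions made for this example, not client data or the production code.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.2):
    """Return pairwise rules (lhs -> rhs) with support, confidence and lift."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate items within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]       # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # > 1 means positive association
            rules.append({"lhs": lhs, "rhs": rhs, "support": support,
                          "confidence": confidence, "lift": lift})
    # Strongest associations first
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


# Hypothetical toy baskets (illustration only)
baskets = [
    ["shampoo", "conditioner"],
    ["shampoo", "conditioner", "soap"],
    ["shampoo", "soap"],
    ["toothpaste", "soap"],
    ["shampoo", "conditioner", "toothpaste"],
]
rules = pair_rules(baskets, min_support=0.4)
```

Rules with a lift above 1 are the candidates for "customers who bought X may also want Y" recommendations.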

2. Targeted Coupon Prediction and Recommendation (Jun 2015 - Aug 2015)

  • Industry: Hotel Industry
  • Client: one of the largest hotel chains in the world
  • Details:
    • Used SQL Server to retrieve data from a previous coupon campaign; cleaned missing values in customer profiles
    • Built a logistic regression model in R to estimate each customer's probability of using a coupon; generated the list of customers to send coupons to
  • Achievement: Increased the coupon hit rate by 40%; presented actionable insights to management
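
The scoring step can be sketched as follows. This is a minimal plain-Python stand-in for the R model, fitted by batch gradient descent. The two features (a scaled stay count and a membership flag), the training rows, the customer IDs and the 0.5 cut-off are all hypothetical and chosen only to show the mechanics.

```python
import math


def predict_proba(w, x):
    """Probability of coupon use for feature vector x (w[0] is the bias)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Hypothetical past campaign: [scaled number of stays, is loyalty member]
X = [[0.1, 0], [0.2, 0], [0.3, 1], [0.4, 0],
     [0.6, 1], [0.7, 0], [0.8, 1], [0.9, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = coupon was used

weights = fit_logistic(X, y)

# Score hypothetical new customers and keep those above the cut-off
customers = {"C001": [0.9, 1], "C002": [0.1, 0], "C003": [0.7, 1]}
scores = {cid: predict_proba(weights, feats) for cid, feats in customers.items()}
mailing_list = sorted((cid for cid, p in scores.items() if p >= 0.5),
                      key=scores.get, reverse=True)
```

Sorting the list by score makes it easy to cap the mailing at a fixed budget by taking only the top entries.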

3. Sentiment Analysis and Customer Insights (Feb 2016 - May 2016)

  • Industry: Fashion Industry
  • Client: a well-known fashion brand worldwide
  • Details: Applied web scraping and sentiment analysis to analyze reviews on an e-commerce website, and delivered an 85-page report evaluating product and branding performance.
    • Used Python to scrape 500,000+ customer reviews on Taobao
    • Led a team of 70 master's students to clean and integrate the data; removed 12,000 fake reviews to capture real customers' opinions
    • Extracted keywords and sentence patterns from the database; used Python to build a sentiment analysis model that classified reviews into KPI aspects with a 90% accuracy rate
  • Achievement: Built a system of 45 KPIs across the Product, Purchase and Service dimensions to evaluate the performance of the client and its competitors. Based on these results, we wrote a report with suggestions on supply chain, dressing experience, product quality and style, presented it to the Vice President of Asia Pacific, and signed the contract.

4. Voice Mining to Maximize Response Rate in Telemarketing (Aug 2015 - Jan 2016)

  • Industry: Banking Industry
  • Client: Citic Bank (a Fortune 500 bank)
  • Details: Applied voice recognition, text mining and regression modeling to convert voice recordings into topics and provide suggestions for maximizing the customer response rate in the telemarketing campaign.
    • Surveyed and interviewed front-line telemarketers to record common sentence patterns
    • Cooperated with IBM Beijing Lab to convert voice to text; used Python to classify the text into 72 topics
    • Identified the factors behind successful sales with a log-linear model; provided selling suggestions to increase the success rate
  • Achievement: Delivered a system that makes conversations between sales staff and customers analyzable by converting voice into structured data, which informed the selling suggestions for increasing the success rate


Kaggle Restaurant Visitation Forecasting - The Kaggle Restaurant Visitation Forecasting competition involved using a dataset of reservation and visitation data to predict the total number of visitors to a restaurant on future dates. For this competition, I used Python to clean and visualize daily visitation data (250K rows) to explore the dataset, then generated 80+ engineered features, e.g. lagged visitor statistics (mean, median, max, min) over the last 14/28/60/120/180 days. I then tried several supervised learning approaches (KNN, gradient boosting and random forest), and ultimately ranked in the top 22% with an RMSE score of 0.480 by ensembling gradient boosting and random forest models.
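
The lagged visitor features described above can be sketched roughly as follows. This is a simplified standard-library version, not the competition code. The function name `lag_features`, the feature naming scheme and the toy series are assumptions. One further assumption: when fewer than `w` days of history exist, the statistics are computed over whatever history is available.

```python
from statistics import mean, median


def lag_features(visitors, day_index, windows=(14, 28, 60, 120, 180)):
    """Summarise visitor counts over the `w` days before `day_index`.

    `visitors` is a chronological list of daily counts for one restaurant.
    Only days strictly before `day_index` are used, so the target day
    never leaks into its own features.
    """
    feats = {}
    for w in windows:
        history = visitors[max(0, day_index - w):day_index]
        if not history:
            continue  # no past data at all for this day
        feats[f"visitors_mean_{w}d"] = mean(history)
        feats[f"visitors_median_{w}d"] = median(history)
        feats[f"visitors_max_{w}d"] = max(history)
        feats[f"visitors_min_{w}d"] = min(history)
    return feats


# Hypothetical series: 30 days with 1, 2, ..., 30 visitors
daily_visitors = list(range(1, 31))

# Features for the day right after the series ends
features = lag_features(daily_visitors, day_index=30, windows=(14, 28))
```

Each window contributes four columns, so the five windows listed above yield 20 of the 80+ engineered features.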

Kaggle Titanic - The Kaggle Titanic competition involved taking a dataset of all the passengers on the Titanic and predicting whether or not each of them survived. The features in the dataset included room location, age, gender, etc. For this competition, I tried several supervised learning approaches (SVMs, KNNs, Decision Trees), and ultimately found that a KNN model (where K = 17) achieved the best accuracy, 78.95%. I used NumPy and scikit-learn to preprocess the data and build the models.
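
For readers unfamiliar with KNN, the idea behind the final model can be sketched in a few lines of plain Python. The actual submission used scikit-learn; this stand-in only shows the voting logic. The feature encoding and the six toy passengers are hypothetical, and the example call uses k=3 because the toy set is far smaller than the real training data, where K = 17 worked best.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=17):
    """Predict the majority label among the k training rows closest to x."""
    # Euclidean distance from x to every training row, nearest first
    neighbours = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy passengers: [is_female, passenger class, age scaled to 0-1]
train_X = [
    [1, 1, 0.30], [1, 2, 0.25], [1, 1, 0.40],
    [0, 3, 0.30], [0, 3, 0.20], [0, 2, 0.50],
]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = survived

pred_a = knn_predict(train_X, train_y, [1, 1, 0.35], k=3)
pred_b = knn_predict(train_X, train_y, [0, 3, 0.25], k=3)
```

Because KNN relies on distances, features should be put on comparable scales before fitting. Here the age column is already scaled to the 0-1 range.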


Data Mining by SPSS Modeler - This tutorial series received 50K views on YouTube. It covers data cleansing, the RFM model, data aggregation, market basket analysis, linear regression, logistic regression, and k-means clustering.