This project was one of my earliest full-cycle analytics experiences, and it taught me how much story lies hidden inside raw data.
I worked with a large New York City Taxi dataset containing millions of trip records, focusing on understanding what factors influence fare pricing and customer payment behavior.
I started by performing data cleaning and exploratory data analysis (EDA) using Python libraries like Pandas and Matplotlib, handling outliers and inconsistencies across distance, duration, and fare features. Once the dataset was structured and validated, I explored correlations between trip distance, fare, and time to identify underlying patterns.
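As a rough illustration of that cleaning pass, here is a minimal sketch on a tiny made-up table. The column names (`trip_distance`, `duration_min`, `fare_amount`), the helper `clean_trips`, and the 1.5 × IQR outlier rule are assumptions for the example, not the dataset's actual schema or the exact rules used in the project.

```python
import pandas as pd

# Hypothetical column names; the real dataset's schema may differ.
FEATURES = ["trip_distance", "duration_min", "fare_amount"]


def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing or non-positive values, then trim outliers with a 1.5*IQR rule."""
    out = df.dropna(subset=FEATURES)
    # Zero or negative distance, duration, or fare is treated as a recording error.
    out = out[(out[FEATURES] > 0).all(axis=1)]
    for col in FEATURES:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = out[out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return out.reset_index(drop=True)


# Made-up sample rows, including a zero distance, a missing duration,
# and an implausibly long trip.
trips = pd.DataFrame({
    "trip_distance": [1.2, 2.5, 0.0, 3.1, 250.0, 1.8, 2.2],
    "duration_min": [8.0, 15.0, 5.0, 20.0, 30.0, None, 12.0],
    "fare_amount": [7.5, 12.0, 4.0, 15.5, 18.0, 9.0, 11.0],
})

cleaned = clean_trips(trips)
# Pairwise Pearson correlations between the cleaned features.
print(cleaned[FEATURES].corr())
```

On millions of rows the same filters apply unchanged; only the thresholds would need tuning against the real distributions.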
To go beyond descriptive analysis, I ran t-tests to evaluate whether average fare amounts differed significantly between cash and card payments. This put the observed payment-behavior patterns on a statistical footing rather than relying on visual comparison alone, and gave some insight into pricing fairness across payment types.
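A comparison like this can be sketched with SciPy's two-sample t-test. The fares below are synthetic, generated only for illustration, and the choice of Welch's variant (`equal_var=False`), the helper name `compare_fares`, and the 0.05 threshold are assumptions; the original analysis may have used a different setup.

```python
import numpy as np
from scipy import stats


def compare_fares(card, cash, alpha=0.05):
    """Welch's two-sample t-test on mean fares; returns (t, p, significant)."""
    t_stat, p_value = stats.ttest_ind(card, cash, equal_var=False)
    return t_stat, p_value, p_value < alpha


# Synthetic fares for illustration only, not real taxi data.
rng = np.random.default_rng(seed=0)
card_fares = rng.normal(loc=14.0, scale=4.0, size=500)
cash_fares = rng.normal(loc=12.0, scale=4.0, size=500)

t_stat, p_value, significant = compare_fares(card_fares, cash_fares)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, significant = {significant}")
```

Welch's variant does not assume equal variances in the two groups, which is usually the safer default when group sizes or spreads differ.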
Next, I developed a predictive regression model using Scikit-Learn, training it to estimate fare amounts from trip characteristics. The model achieved a strong R² score (coefficient of determination), meaning distance and duration explained most of the variance in fare.
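The modeling step might look roughly like the sketch below. It assumes an ordinary linear regression on two features (distance and duration) and uses synthetic data with a made-up fare formula; the actual model type, feature set, and scores in the project are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic trips for illustration: fare depends linearly on distance and duration.
rng = np.random.default_rng(seed=42)
n = 1000
distance = rng.uniform(0.5, 15.0, size=n)                # miles (assumed unit)
duration = distance * 3.0 + rng.normal(0, 2.0, size=n)   # minutes (assumed unit)
fare = 2.5 + 2.0 * distance + 0.4 * duration + rng.normal(0, 1.0, size=n)

X = np.column_stack([distance, duration])
X_train, X_test, y_train, y_test = train_test_split(
    X, fare, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# R² on held-out trips: the share of fare variance the model explains.
r2 = r2_score(y_test, model.predict(X_test))
print(f"coefficients = {model.coef_}, intercept = {model.intercept_:.2f}")
print(f"test R² = {r2:.3f}")
```

Scoring on a held-out split rather than the training data keeps the R² figure from overstating how well the model generalizes.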
The project concluded with visual reports and dashboards highlighting pricing trends, payment ratios, and predictive insights, giving a compact view of both analytical and business outcomes.
Tech Stack
- Languages: Python, SQL
- Libraries: Pandas, Scikit-Learn, Matplotlib, Seaborn
- Tools: Excel, Power BI
- Techniques: EDA, Statistical Analysis, Predictive Modeling
 
Key Learnings
- Translating unstructured raw data into meaningful insights
- Applying inferential statistics to validate assumptions
- Using predictive modeling to connect analysis with real-world impact
- Communicating results through structured visualization and reporting
 






