Exploratory Data Analysis (In-Depth)

We now explore our dataset through several graphs, plots, and analyses. We begin with univariate analysis, where we examine the basic statistics of one feature at a time.

Univariate Analysis

Species

For Species, the only analysis we can do is to identify which species are included in the dataset and how many there are.

The dataset includes 19 different species (excluding 'All Species'). Since we want to exclude 'All Species', we remove those rows from our dataset.
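A minimal sketch of that fix, assuming the data sits in a pandas DataFrame with a `Species` column. The rows below are made up for illustration and are not from our dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
df = pd.DataFrame({
    "Species": ["Milkfish", "Tilapia", "Seaweed", "All Species", "Milkfish"],
    "Volume": [120.0, 80.0, 900.0, 1100.0, 150.0],
})

# Keep every row except the 'All Species' entries.
species_df = df[df["Species"] != "All Species"].reset_index(drop=True)

# List the remaining species and count them.
species_names = sorted(species_df["Species"].unique())
species_count = species_df["Species"].nunique()
print(species_names, species_count)
```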

Geolocation

As for the Geolocation, we have all 17 regions in the Philippines since they were all available from the data source.

Volume

Moving on, we proceed to the Volume feature. Since this is a numerical variable, the describe() method reports a different set of statistics (count, mean, standard deviation, minimum, quartiles, and maximum).

From here, we can see a very large gap between the 75% quantile and the maximum value. The very large standard deviation points the same way: our data "could be" clumped between 1.24 and 278, while a few values go as high as 328,205. To illustrate this, we provide a boxplot below.
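As a rough sketch of how to read these statistics, the snippet below runs describe() on a made-up Series (not our data) and compares the maximum against the 75% quantile. A mean far above the median and a standard deviation larger than the 75% quantile are the same warning signs we saw above.

```python
import pandas as pd

# Made-up values: mostly small, with one extreme observation.
volume = pd.Series(
    [1.2, 5.0, 20.0, 60.0, 150.0, 278.0, 900.0, 328205.0], name="Volume"
)

summary = volume.describe()  # count, mean, std, min, 25%, 50%, 75%, max

# Distance between the upper quartile and the maximum.
gap = summary["max"] - summary["75%"]

print(summary)
print(f"max - Q3 = {gap:,.2f}")
```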

Placeholder

This may not be a very intuitive box plot, but it shows that a substantial number of Volume observations are extreme. It would be interesting to identify which species or time periods contribute to them. Since the outliers compress the box in the graph above and make it hard to read, we provide a second boxplot below with the outliers hidden.
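Hiding outliers in a boxplot usually means dropping the points beyond the whiskers, which by convention sit 1.5 x IQR beyond the quartiles. Here is a minimal sketch of that rule on made-up numbers. We are assuming this convention matches the one used for our figures.

```python
import pandas as pd

# Made-up values with two extreme observations.
volume = pd.Series([1.0, 4.0, 9.0, 15.0, 22.0, 30.0, 41.0, 55.0, 2500.0, 90000.0])

# Quartiles and interquartile range.
q1 = volume.quantile(0.25)
q3 = volume.quantile(0.75)
iqr = q3 - q1

# Conventional whisker limits (assumed: 1.5 x IQR).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (volume < lower) | (volume > upper)
outliers = volume[is_outlier]
inliers = volume[~is_outlier]
print(f"{len(outliers)} outliers, {len(inliers)} inliers")
```

Counting the flagged points per species or per quarter would be one way to see where the extreme values come from.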

Placeholder

The boxplot alone doesn't fully show how the values are distributed, so below is a histogram of aquaculture volume.

Placeholder

As we can see from the histogram above, the distribution is indeed positively skewed: most observations have a low Volume, with a long right tail of very high values.
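The skew can also be checked numerically rather than by eye. The sketch below uses made-up numbers. A positive sample skewness, together with a mean above the median, is what a long right tail looks like.

```python
import pandas as pd

# Made-up values: many small observations and a few very large ones.
volume = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 40, 250], dtype=float)

skewness = volume.skew()  # sample skewness; > 0 indicates a right tail
print(f"skewness = {skewness:.2f}")
print(f"mean = {volume.mean():.2f}, median = {volume.median():.2f}")
```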

Value

The same set of visualizations can be used for the Value feature since we are dealing with a quantitative value.

Compared to the Volume feature, the Value feature takes values of a larger magnitude, but the gap between the 75% quantile and the maximum value is still large. The boxplot below shows this.

Placeholder

There is still a considerable number of "outliers" in our Value column, and a similar question arises: what causes them? Hopefully, we get an answer to this question later. Again, we repeat the boxplot with the outliers removed and add a histogram to show the distribution of the Value column.

Placeholder Placeholder

Similar to the Volume column, the histogram for the Value column is also positively skewed, with counts falling to around 0 or 1 toward the right (eyeballed values). Since we cannot infer anything new from these plots, we proceed to bivariate analysis to examine pairings of features and hopefully answer the questions raised earlier.

Bivariate Analysis

Bivariate analysis investigates the relationship between two variables (Masud, n.d.). For this section, we pair up features, plot graphs for each pairing, and identify relationships and findings based on these plots.

Species - Value

We begin with the Species-Value pairing. Since Species is categorical and Value is quantitative, we use a horizontal bar chart to show the total Value of each species.

Placeholder

We can see that Milkfish contributes the highest total value of all species. We can infer from this that Milkfish accounts for the extreme values in our Value data earlier. Note that we used sum for our grouping in this plot. Let's check whether mean gives a different picture.

Placeholder

On reflection, the mean plot doesn't tell us anything new, since the values are simply "normalized" by the number of records per species. Since there's nothing else to see here, we proceed to the next pairing.
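The two groupings behind these charts can be sketched as follows, on made-up rows. The rankings by sum and by mean agree whenever every species has the same number of records, and can differ otherwise. The toy data is deliberately unbalanced to show the second case.

```python
import pandas as pd

# Made-up rows; the number of records per species is intentionally uneven.
df = pd.DataFrame({
    "Species": ["Milkfish", "Milkfish", "Milkfish", "Tilapia", "Tilapia", "Tiger Prawn"],
    "Value": [500.0, 450.0, 550.0, 300.0, 320.0, 700.0],
})

by_species = df.groupby("Species")["Value"]

# Total and average Value per species, largest first.
total_value = by_species.sum().sort_values(ascending=False)
mean_value = by_species.mean().sort_values(ascending=False)

print(total_value)
print(mean_value)
```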

Species - Volume

We proceed with the Species - Volume pairing. We can also use the same set of plots above since Volume is also a quantitative variable.

Placeholder

Wow, this is something. Milkfish contributes to the highest value but seaweeds contribute to the highest volume. Why could this be?

Seaweeds typically reach harvestable size in around 50 days (~1.6 months) [SEAFDEC], allowing for multiple harvests each year. In contrast, milkfish require 3–4 months to grow in brackishwater ponds and up to 6–8 months in marine pens and cages [FAO]. The faster growth cycle of seaweeds enables more frequent harvests, contributing to higher annual production volumes. Additionally, seaweed farming is less labor- and resource-intensive, as it requires no feed or active water management. It also involves lower capital investment and technical skill, making it more accessible to small-scale coastal producers.

This finding points to seaweed as the likely cause of the Volume outliers seen in the boxplot earlier.

Geolocation - Value

Next up, we have Geolocation vs Value. Here, we identify if there's a difference between the value of aquaculture per region.

Placeholder Placeholder

From this plot, we can see that Central Luzon has a large total and average value compared to other regions. Why is the value of its aquaculture products so high? Manlosa et al. (2021) highlight how saltwater intrusion and environmental changes have driven the conversion of rice paddies into fish farms in Central Luzon, especially in low-lying areas like Pampanga, Bulacan, and Bataan. As a result, these provinces, along with inland areas such as Nueva Ecija and Tarlac, now play a major role in aquaculture production, cultivating species like bangus (milkfish), sugpo (tiger prawn), and tilapia.

Based on the data we gathered, the top three species produced in BARMM are seaweed, P. vannamei (whiteleg shrimp), and oyster. In contrast, the leading species in Central Luzon are tilapia, milkfish, and white shrimp.

As shown in the previous sections, seaweed and P. vannamei rank highest in terms of production volume. However, it was also shown that their mean economic value is relatively low compared to that of milkfish and tilapia. This disparity helps explain why Central Luzon, despite having lower production volumes, achieves a higher overall economic value than BARMM.
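The top-three ranking per region can be obtained with a grouped sum followed by a per-group head. The sketch below uses made-up volumes chosen only to mirror the ordering described above. The column names `Geolocation`, `Species`, and `Volume` follow our feature names, and the numbers themselves are hypothetical.

```python
import pandas as pd

# Made-up volumes; only the ordering is meant to resemble the text.
df = pd.DataFrame({
    "Geolocation": ["BARMM"] * 4 + ["Central Luzon"] * 4,
    "Species": ["Seaweed", "P. vannamei", "Oyster", "Milkfish",
                "Tilapia", "Milkfish", "White shrimp", "Oyster"],
    "Volume": [9000.0, 300.0, 200.0, 100.0, 800.0, 700.0, 400.0, 50.0],
})

# Total volume per region and species.
totals = df.groupby(["Geolocation", "Species"], as_index=False)["Volume"].sum()

# Sort once, then keep the first three rows of each region.
top3 = totals.sort_values("Volume", ascending=False).groupby("Geolocation").head(3)

barmm_top = top3.loc[top3["Geolocation"] == "BARMM", "Species"].tolist()
luzon_top = top3.loc[top3["Geolocation"] == "Central Luzon", "Species"].tolist()
print(barmm_top, luzon_top)
```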

Geolocation - Volume

Next up, we have the Geolocation - Volume pairing. Similar to the plots above, we can do the same set of visuals for this pairing.

Placeholder

Again, the total volume for BARMM is much larger than that of the regions that follow. Location-wise, BARMM lies in the southwestern portion of the Philippines and comprises several islands surrounded by a large body of water. This might be one reason for the "significant" discrepancy in Volume compared to other regions. The gap is especially notable against the Visayas regions, which are also largely surrounded by water.

Value - Year/Quarter

This next pairing may not look like a pair, but the Year and Quarter features can be combined to give us a time series of Value and of Volume (next section). For this part, we plot a line chart for the Value variable.
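One way to combine the two features is to build a quarterly period from them and sum Value per period. This is a sketch on made-up rows. It assumes `Year` holds integers and `Quarter` holds the integers 1 to 4, which may not match how the raw data encodes quarters.

```python
import pandas as pd

# Made-up rows; Quarter is assumed to be an integer from 1 to 4.
df = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2020],
    "Quarter": [1, 2, 1, 2, 1],
    "Value": [100.0, 120.0, 90.0, 130.0, 50.0],
})

# Build labels such as "2020Q1" and convert them to quarterly periods.
labels = df["Year"].astype(str) + "Q" + df["Quarter"].astype(str)
df["Period"] = pd.PeriodIndex(labels, freq="Q")

# Total Value per quarter, in chronological order.
value_by_period = df.groupby("Period")["Value"].sum().sort_index()
print(value_by_period)
```

The resulting Series is ordered in time, so it can be passed directly to a line plot.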

Placeholder

Based on this line chart, we cannot discern any particular trend by eye. A linear regression could test for one, but for now we only provide the visualization.
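For reference, a simple trend check could look like the sketch below: a least-squares line fitted against the quarter index with NumPy. The series is made up, and a nonzero slope on its own says nothing about statistical significance.

```python
import numpy as np

# Made-up quarterly totals, in chronological order.
value = np.array([100.0, 104.0, 98.0, 110.0, 107.0, 115.0, 111.0, 120.0])
t = np.arange(len(value))  # 0, 1, 2, ... one step per quarter

# Degree-1 polynomial fit: value ~ slope * t + intercept.
slope, intercept = np.polyfit(t, value, deg=1)
print(f"slope per quarter = {slope:.3f}, intercept = {intercept:.3f}")
```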

Volume - Year/Quarter

Second to the last is the Volume time series, where we plot a line chart to check whether a particular trend is occurring.

Placeholder

Similar to our findings earlier, there's also no discernible trend in this graph.

Value - Volume

Finally, we have the Value - Volume pair. Both variables are quantitative, so we use a scatter plot.

Placeholder

This is quite an interesting plot. Eyeballing it, there appear to be two clusters, each following a different regression line. We cannot yet conclude that this is the case, but for now we note the two probable clusters.
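One way to probe the two apparent clusters, offered here only as an assumption and not as a confirmed explanation, is to compute the implied unit value (Value divided by Volume) per species. Species with very different unit values would fall along lines with different slopes in a Value - Volume scatter plot. The numbers below are made up.

```python
import pandas as pd

# Made-up rows: one high-volume, low-unit-value species and one the reverse.
df = pd.DataFrame({
    "Species": ["Seaweed"] * 3 + ["Milkfish"] * 3,
    "Volume": [1000.0, 2000.0, 3000.0, 100.0, 200.0, 300.0],
    "Value": [5000.0, 10000.0, 15000.0, 10000.0, 20000.0, 30000.0],
})

# Total Value and Volume per species, then the ratio of the two.
totals = df.groupby("Species")[["Value", "Volume"]].sum()
unit_value = totals["Value"] / totals["Volume"]
print(unit_value.sort_values(ascending=False))
```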

Multivariate Analysis

We now move on to analyses involving multiple features. We won't delve into all groupings but rather focus only on the sets of features that make sense.

Value - Year/Quarter per Species

The first grouping is the value of each species as a time series. We provide below a heatmap showing how the value of each species changes over time.
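The table behind such a heatmap is a pivot with one row per species and one column per period. Below is a minimal sketch on made-up rows. The `Period` column is a hypothetical Year-Quarter label, and missing combinations are filled with 0 here, which is a choice and not something the data dictates.

```python
import pandas as pd

# Made-up rows; Tilapia appears twice in 2020Q1 (e.g. from two regions).
df = pd.DataFrame({
    "Species": ["Milkfish", "Milkfish", "Tilapia", "Tilapia", "Seaweed", "Seaweed"],
    "Period": ["2020Q1", "2020Q2", "2020Q1", "2020Q1", "2020Q1", "2020Q2"],
    "Value": [500.0, 520.0, 300.0, 20.0, 40.0, 45.0],
})

# Rows: species, columns: periods, cells: total Value.
heat = df.pivot_table(
    index="Species", columns="Period", values="Value",
    aggfunc="sum", fill_value=0,
)
print(heat)
```

A table in this shape can be handed to a heatmap plotting function as-is.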

Placeholder

Aside from the heatmap, we can also create multiple boxplots to compare the spread of the distribution for each species.

Placeholder

As we can see from the heatmap above, almost half of the available species have very low value compared to Milkfish, Tilapia, and Tiger Prawn (visually judged). These include mussel, mudfish, oyster, prawns, catfish, and carp, which I think are luxury species that most people don't commonly buy, thus contributing little value. As for the boxplots, we can see that the distribution for Milkfish has a wider spread than those of the other species, which have lower values.

Value - Year/Quarter per Region

Next up is the time series changes in value per region. We again utilize the same plot function above.

Placeholder Placeholder

Similar to our findings earlier, Central Luzon has high aquaculture value. Through visual analysis, we can see that Central Luzon leads in every Year-Quarter combination in terms of value. Among the remaining regions, there's no clear runner-up.

Volume - Year/Quarter per Species

The next few plots below are the `Volume` feature equivalent of the `Value` plots above.

Placeholder Placeholder

Seaweed clearly has the highest volume output among all aquaculture species, likely because it is relatively easy to cultivate, requires minimal maintenance, and is well-suited for small-scale production. Its accessibility allows even low-capital communities to engage in farming, contributing to its widespread adoption and consistently high output.

Volume - Year/Quarter per Region

Placeholder Placeholder

BARMM consistently leads in aquaculture production volume, largely due to extensive seaweed farming, as shown by its high median and wide distribution in the boxplot and sustained dominance in the heatmap. CAR ranks second, with stable yet moderate output likely driven by freshwater and inland aquaculture systems. In contrast, most other regions exhibit lower and less variable production volumes, reflecting smaller-scale or less intensive aquaculture operations. These differences highlight how geography, species specialization, and investment influence regional output.