In this project I aim to create a database filled with information about manga: volume sales, adaptations linked to each manga, and more data that can be used for analysis later.

To do so, I will use pandas DataFrames to clean and analyse the data, web scraping to gather it, and SQL to create and fill my database.

We are going to scrape https://myanimelist.net/ because they have a huge catalogue of both manga and their adaptations, and they also include ratings, which can be useful for the analysis part.

We check if we have duplicates
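A minimal sketch of the check, assuming the scraped list sits in a DataFrame with `title` and `url` columns (the sample rows below are only placeholders):

```python
import pandas as pd

# Hypothetical example: a small frame standing in for the scraped manga list.
df = pd.DataFrame({
    "title": ["Berserk", "Monster", "Berserk"],
    "url": [
        "https://myanimelist.net/manga/2/Berserk",
        "https://myanimelist.net/manga/1/Monster",
        "https://myanimelist.net/manga/2/Berserk",
    ],
})

print(df.duplicated(subset="title").sum())            # how many duplicated rows
df = df.drop_duplicates(subset="title").reset_index(drop=True)
```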

At this step we want to go through each row of the dataframe below and get the information we need
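Walking the rows could look roughly like this (it assumes the `df` from the sketch above, with `title` and `url` columns):

```python
for idx, row in df.iterrows():
    # `title` tells us which manga it is, `url` is the page we will scrape next
    print(idx, row["title"], row["url"])
```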

Let's store it.
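For example, the intermediate list could be written to disk like this (the filename is just a placeholder):

```python
df.to_csv("manga_pages.csv", index=False)
```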

We now have the page for each manga. The goal is to go through each page and gather the data we need. We can try it on one page first and then create a function that we will apply to every row.

Let's try with Berserk #RipMiura :(
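A rough attempt on a single page, assuming `requests` and `BeautifulSoup`; the tag lookups are guesses about the page layout and may need adjusting after inspecting the HTML in the browser:

```python
import requests
from bs4 import BeautifulSoup

url = "https://myanimelist.net/manga/2/Berserk"
headers = {"User-Agent": "Mozilla/5.0"}   # a bare request is more likely to be refused

resp = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

title = soup.find("h1").get_text(strip=True)
# The score is assumed to sit in a span with itemprop="ratingValue"; verify in the browser.
score_tag = soup.find("span", itemprop="ratingValue")
score = float(score_tag.get_text()) if score_tag else None

print(title, score)
```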

Let's change some column types to strings.
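For instance (the column names are only examples):

```python
for col in ["title", "url", "genres"]:
    df[col] = df[col].astype(str)
```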

Let's create the main function so we can apply it to every row. This function will gather the information and store it in a dictionary; from there we just retrieve the info from the dict and transform it before storing it in the dataframe.
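A minimal sketch of such a function, assuming requests/BeautifulSoup and only a couple of fields; the selectors are assumptions about the page markup, and the real function would collect many more fields:

```python
import requests
import numpy as np
from bs4 import BeautifulSoup

def scrape_manga_page(url):
    """Fetch one manga page and return the fields we need as a dictionary."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    info = {}
    info["title"] = soup.find("h1").get_text(strip=True)

    # Some manga have no score yet, so fall back to NaN instead of crashing.
    score_tag = soup.find("span", itemprop="ratingValue")
    info["score"] = float(score_tag.get_text()) if score_tag else np.nan

    return info
```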

Before using the function we create a list to track the webpages where we encounter an error. The errors we encounter are mainly caused by the website thinking we are a bot and refusing the connection. By adding delays we can work around that, but we still have to track the other errors so we can solve them. For example, at some point a manga had no score, so we had to store it as NaN, but we only encountered this case after around 30,000 pages.
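The loop around the function could look like this (the delay, the column names, and the hypothetical `scrape_manga_page` from the sketch above are assumptions):

```python
import time

errors = []   # (row index, url, exception) for pages we will have to revisit

for idx, row in df.iterrows():
    try:
        info = scrape_manga_page(row["url"])
        df.loc[idx, "score"] = info["score"]
    except Exception as exc:
        errors.append((idx, row["url"], exc))
    time.sleep(2)   # small delay so the site is less likely to flag us as a bot
```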

Let's start the web scraping. We converted the name column into str because when we stored np.nan values in the CSV, they were saved as text.

Here we can inspect our error list and rerun the web scraping. If the error was caused by the website blocking our access, it will be solved directly; if not, we can fix our code to handle the case.
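A retry pass over the error list might look like this (same assumptions as the loop above):

```python
still_failing = []

for idx, url, _ in errors:
    try:
        info = scrape_manga_page(url)
        df.loc[idx, "score"] = info["score"]
    except Exception as exc:
        still_failing.append((idx, url, exc))
    time.sleep(2)

errors = still_failing   # whatever is left points at real code issues to fix
```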

Let's store it in a CSV
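For example (the filename is a placeholder):

```python
df.to_csv("manga_scraped.csv", index=False)
```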

We are done with the web scraping part. We will clean this data and transform it according to our needs, but we don't need to gather more information at this step.