This comprehensive guide will explore practical techniques for web scraping Amazon's product categories without encountering blocking issues. Our tool is Playwright, a browser automation library with a Python API that lets developers automate web interactions and extract data from web pages. Playwright offers the flexibility to navigate web pages, interact with elements, and gather information within a headless or visible browser environment. Even better, Playwright works with Chromium, Firefox, and WebKit (the engine behind Safari), enabling you to test your web scraping scripts across different browsers. Moreover, Playwright provides auto-waiting and configurable timeouts, which can help you tackle common web scraping obstacles like slow-loading pages and network errors.
Throughout this tutorial, we will guide you through the stepwise procedure of scraping data related to air fryers from Amazon using Playwright in Python. We will also demonstrate how to save this extracted data as a CSV file. By the end of this tutorial, you will have gained a solid understanding of how to scrape Amazon product categories effectively while avoiding potential roadblocks. Additionally, you'll become proficient in utilizing Playwright to automate web interactions and efficiently extract data.
Amazon boasts an astonishing online inventory exceeding 12 million products. When you factor in the contributions of Marketplace Sellers, this number skyrockets to over 350 million unique products. This vast assortment has solidified Amazon's reputation as the "go-to" destination for online shopping. It's often the first stop for customers seeking to purchase or gather in-depth information about a product. Amazon offers a treasure trove of valuable product data, encompassing everything from prices and product descriptions to images and customer reviews.
Given this wealth of product data and Amazon's immense customer base, it's no surprise that small and large businesses and professionals are keenly interested in harvesting and analyzing this Amazon product data.
In this article, we'll introduce our Amazon scraper and illustrate how you can effectively collect Amazon product information.
Here's a step-by-step guide for using Playwright in Python to scrape air fryer data from Amazon:
Step 1: Install Required Libraries
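First, make sure you have Playwright and the other necessary dependencies installed. You can use pip to install Playwright and pandas (pip install playwright pandas), then download the browser binaries with the 'playwright install' command.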
In this section, we've imported several essential Python modules and libraries to support various operations in our project.
re Module: We're utilizing the 're' module for working with regular expressions. Regular expressions are powerful tools for pattern matching and text manipulation.
random Module: The 'random' module generates random numbers, which makes it handy in scraping for tasks like adding randomized delays between requests.
asyncio Module: We're incorporating the 'asyncio' module to manage asynchronous programming in Python. It is particularly crucial when using Playwright's asynchronous API for web automation.
datetime Module: The 'datetime' module comes into play when we need to work with dates and times. It provides functionality for creating and manipulating date and time objects and formatting them as strings.
pandas Library: We're bringing in the 'pandas' library, a powerful data manipulation and analysis tool. In this tutorial, it stores and manipulates the data retrieved from the web pages we're scraping.
async_playwright: The 'async_playwright' function, imported from 'playwright.async_api', is the entry point for automating browsers with Playwright, an open-source library for automation testing and web scraping that originated in Node.js and also ships official Python bindings.
By including these modules and libraries, we're well-equipped to perform the tasks in our project efficiently.
Together, they serve distinct purposes: pattern matching, random number generation, asynchronous programming control, data manipulation and storage, and browser automation with Playwright.
The second step involves extracting product URLs from the air fryer search results. Product URL extraction refers to gathering and structuring the web links of products listed on a web page or online platform.
Before initiating the scraping of product URLs, it is essential to take into account several considerations to ensure a responsible and efficient approach:
Standardized URL Format: Ensure the collected product URLs adhere to a standardized format, such as "https://www.amazon.in/+product name+/dp/ASIN." This format comprises the website's domain name, the product name without spaces, and the product's unique ASIN (Amazon Standard Identification Number) at the end. This standardized format facilitates data organization and analysis, maintaining URL consistency and clarity.
Filtering for Relevant Data: When extracting data from Amazon for air fryers, it is crucial to filter the information exclusively for them and exclude any accessories often displayed alongside them in search results. Implement filtering criteria based on factors like product category or keywords in the product title or description. This filtering ensures that the retrieved data pertains solely to air fryers, enhancing its relevance and utility.
Handling Pagination: During product URL scraping, you may need to navigate multiple pages by clicking the "Next" button at the bottom of the webpage to access all results. However, there may be instances where clicking the "Next" button fails to load the following page, potentially causing errors in the scraping process. To mitigate such issues, consider implementing error-handling mechanisms, including timeouts, retries, and checks to confirm that the next page has fully loaded before data extraction. These precautions ensure effective and efficient scraping while minimizing errors and respecting the website's resources.
With these considerations in mind, you can ensure a responsible and practical approach to scraping product URLs while maintaining the quality of the collected data.
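The first two considerations can be sketched with plain string handling. This is a minimal illustration, not the original script: the helper names 'canonical_product_url' and 'is_air_fryer', the accessory keyword list, and the assumption that an ASIN is the 10-character code following "/dp/" are all choices made for this example.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumption: an ASIN is the 10-character alphanumeric code right after "/dp/".
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:/|$)")

# Hypothetical keyword list; tune it to the accessories seen in real results.
ACCESSORY_KEYWORDS = ("liner", "parchment", "rack", "accessor")


def canonical_product_url(href: str, domain: str = "https://www.amazon.in") -> Optional[str]:
    """Rebuild a link as <domain>/<product-name>/dp/<ASIN>; None if no ASIN is found."""
    path = urlsplit(href).path  # drops the query string and any scheme/host
    match = ASIN_PATTERN.search(path)
    if match is None:
        return None
    asin = match.group(1)
    # The product-name slug is the path segment just before "/dp/", if present.
    slug = path[: match.start()].strip("/").rsplit("/", 1)[-1]
    if slug:
        return f"{domain}/{slug}/dp/{asin}"
    return f"{domain}/dp/{asin}"


def is_air_fryer(title: str) -> bool:
    """Keep titles that mention an air fryer and none of the accessory keywords."""
    lowered = title.lower()
    return "air fryer" in lowered and not any(k in lowered for k in ACCESSORY_KEYWORDS)
```

Normalizing every link this way also makes duplicates easy to spot, since tracking parameters and "ref" suffixes are discarded.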
In this context, we use the Python function 'get_product_urls' to extract product links from a web page. This function leverages the Playwright library to automate the browser and retrieve the product URLs from an Amazon search results page.
The function performs a sequence of actions. It checks for a "next" button on the page. If one is found, the function clicks on it and invokes itself recursively to extract URLs from the subsequent page. This process continues until all pertinent product URLs have been collected.
Within the function, the steps are: gather the product links on the current page, look for the "next" button, and, if it exists, click it and recurse.
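A rough sketch of such a recursive function is shown here. The two CSS selectors are assumptions that should be checked against the live page, and the 'collected' parameter is added for this example so the recursion can accumulate results. The 'page' argument is expected to be a Playwright async page object.

```python
# Assumed CSS selectors; Amazon's markup changes often, so verify them first.
PRODUCT_LINK_SELECTOR = "a.a-link-normal.s-no-outline"
NEXT_BUTTON_SELECTOR = "a.s-pagination-next"


async def get_product_urls(page, collected=None):
    """Collect product links from the current results page, then recurse."""
    collected = set() if collected is None else collected

    # Gather every link on the current page that looks like a product link.
    for link in await page.query_selector_all(PRODUCT_LINK_SELECTOR):
        href = await link.get_attribute("href")
        if href and "/dp/" in href:
            collected.add(href)

    # No "next" button means this is the last page of results.
    next_button = await page.query_selector(NEXT_BUTTON_SELECTOR)
    if next_button is None:
        return collected

    try:
        await next_button.click()
        # Basic check that results are present; a real script may need a
        # stronger signal that the *new* page has replaced the old one.
        await page.wait_for_selector(PRODUCT_LINK_SELECTOR, timeout=10_000)
    except Exception:
        # If the click or the load fails, stop paginating and keep what we have.
        return collected

    return await get_product_urls(page, collected)
```

Using a set avoids duplicate links when the same product appears on more than one results page.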
In this phase, we determine the attributes we wish to collect from the website: the Product Name, Brand, Number of Reviews, Ratings, MRP, Sale Price, Bestseller rank, Technical Details, and product description ("About this item").
We begin with the Product Name. This involves identifying and retrieving the name of each product from its web page. Product names matter because they give customers a quick overview of the product's identity, features, and intended use. This step aims to pinpoint the elements within a web page that house the product names and extract the textual content from those elements.
To extract product names from web pages, we employ an asynchronous function called 'get_product_name' that works on an individual page object. This function follows a structured process:
It begins by locating the product's title element on the page, using the 'query_selector()' method of the page object along with the appropriate CSS selector.
Once the element is found, the function extracts the element's text content using the 'text_content()' method. The extracted text is stored in the 'product_name' variable for further processing.
When the function encounters difficulties in finding or retrieving the product name for a specific item, it has a mechanism to handle exceptions. In such cases, it assigns the value "Not Available" to the 'product_name' variable. This proactive approach ensures the robustness of our web scraping script, allowing it to continue functioning smoothly even in the face of unexpected errors during the data extraction process.
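The process described above could look roughly like the following. The "#productTitle" selector is an assumption about Amazon's markup, and stripping the surrounding whitespace is an extra step added for this sketch.

```python
async def get_product_name(page):
    """Return the product title text, or "Not Available" if it cannot be read."""
    try:
        # "#productTitle" is an assumed selector for the title element.
        title_elem = await page.query_selector("#productTitle")
        product_name = (await title_elem.text_content()).strip()
    except Exception:
        # Covers a missing element (query_selector returned None) as well as
        # errors raised by Playwright while reading the page.
        product_name = "Not Available"
    return product_name
```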
Scraping Brand Name
In web scraping, capturing the brand name associated with a specific product plays a pivotal role in identifying the manufacturer or company behind the product. The procedure for extracting brand names mirrors that of product names. We begin by seeking pertinent elements on the webpage using a CSS selector and extracting the textual content from those elements.
However, brand information on the page can appear in several different formats. For example, the brand name may be preceded by the text "Brand:" (as in "Brand: 'brand name'") or appear as "Visit the 'brand name' Store." To accurately extract the brand name, it's crucial to filter out these extra elements and isolate the genuine brand name.
We can employ regular expressions or string manipulation functions within our web scraping script to accomplish this. This way, by eliminating extraneous text and focusing solely on the brand name, we can ensure the precision and efficiency of our brand extraction process.
We can employ a function similar to the one used for product name extraction to extract the brand name from web pages. In this case, the function is named 'get_brand_name', and its operation revolves around locating the element containing the brand name via a CSS selector.
When the function successfully locates the element, it extracts the text content from that element using the 'text_content()' method and assigns it to a 'brand_name' variable. It's important to emphasize that the extracted text may include extraneous information such as "Visit," "the," "Store," and "Brand:". These extra elements are eliminated using regular expressions.
By filtering out these unwanted words, we can isolate the genuine brand name, ensuring the accuracy of our data. If the function encounters an exception while locating the brand name element or extracting its text content, it defaults to returning the brand name as "Not Available."
By incorporating this function into our web scraping script, we can effectively obtain the brand names of the products under scrutiny, thereby enhancing our understanding of the manufacturers and companies associated with these products.
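As an illustration of the cleanup step, here is one possible sketch. The "#bylineInfo" selector and the 'clean_brand_text' helper are assumptions made for this example. Rather than deleting the individual words, it matches the two surrounding phrases as a whole, so a brand that itself contains "the" or "Store" is less likely to be damaged.

```python
import re


def clean_brand_text(raw: str) -> str:
    """Strip the "Visit the ... Store" or "Brand: ..." wrapper from the brand text."""
    text = raw.strip()
    store_match = re.fullmatch(r"Visit the (.+) Store", text)
    if store_match:
        return store_match.group(1).strip()
    # Otherwise remove a leading "Brand:" label if there is one.
    return re.sub(r"^Brand:\s*", "", text).strip()


async def get_brand_name(page):
    """Return the cleaned brand name, or "Not Available" on any failure."""
    try:
        # "#bylineInfo" is an assumed selector for the brand element.
        brand_elem = await page.query_selector("#bylineInfo")
        brand_name = clean_brand_text(await brand_elem.text_content())
    except Exception:
        brand_name = "Not Available"
    return brand_name
```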
Similarly, we can apply the same technique to extract other attributes, such as MRP and Sale price, from the web pages.
It's crucial to retrieve the Maximum Retail Price (MRP) from the relevant webpage to make well-informed judgments about the worth of a product. This data holds significant value for retailers and customers, as it empowers them to make informed choices regarding their purchases. The process of extracting the MRP is akin to extracting the product name.
The subsequent phase in our data extraction process involves collecting the star ratings for each product from their respective web pages. These ratings, typically provided by customers on a scale of 1 to 5 stars, offer valuable insights into the product's quality. It's important to note that not all products will have ratings or reviews. The website may indicate that the product is either "New to Amazon" or has "No Reviews." Various factors may contribute to this, including limited availability, low popularity, or the product being new to the market and yet to be reviewed by customers. Nevertheless, extracting star ratings remains vital in assisting customers in making well-informed purchasing decisions.
To extract the star rating of a product from a web page, we utilize the 'get_star_rating' function. Initially, the function locates the star rating element on the page using a CSS selector that points to the element housing the star ratings. This is done with the 'page.wait_for_selector()' method. After locating the element, the function retrieves the inner text content of the element through the 'star_rating_elem.inner_text()' method.
However, if an exception arises while finding the star rating element or extracting its text content, the function employs an alternative approach to check whether the product has no reviews. To do this, it attempts to locate the element with an ID that signifies the absence of reviews using the 'page.query_selector()' method. If this element is found, its text content is assigned to the 'star_rating' variable.
In cases where both of these attempts fail, the function enters the second except block. It sets the star rating to "Not Available" without any further effort to extract rating information. This ensures the user is duly informed about the unavailability of star ratings for the specific product.
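The two-level fallback can be sketched as nested try/except blocks. Both selectors and the 5-second timeout are assumptions for illustration, not values taken from the original script.

```python
# Assumed selectors; confirm them against the live product page.
RATING_SELECTOR = "#acrPopover"
NO_REVIEWS_SELECTOR = "#acrNoReviewText"


async def get_star_rating(page):
    """Return the rating text, the "no reviews" text, or "Not Available"."""
    try:
        # First attempt: wait briefly for the rating element to appear.
        star_rating_elem = await page.wait_for_selector(RATING_SELECTOR, timeout=5_000)
        star_rating = (await star_rating_elem.inner_text()).strip()
    except Exception:
        try:
            # Second attempt: look for the element shown when there are no reviews.
            no_reviews_elem = await page.query_selector(NO_REVIEWS_SELECTOR)
            star_rating = (await no_reviews_elem.inner_text()).strip()
        except Exception:
            # Neither element could be read.
            star_rating = "Not Available"
    return star_rating
```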
Extracting the "About this item" section from product web pages is crucial in furnishing a concise overview of a product's primary features, advantages, and specifications. This data aids potential buyers in comprehending the product's purpose, functionality, and points of distinction compared to similar offerings in the market. It is a valuable resource for consumers to compare various products and assess whether a particular product aligns with their requirements and preferences. Acquiring this information from the product listing is pivotal for making well-informed purchasing decisions and ensuring customer satisfaction.
The 'get_bullet_points' function collects bullet point information from the web page. It begins by attempting to locate the unordered list element that contains the bullet points, using a CSS selector for the 'About this item' element with the corresponding ID. After locating the 'About this item' unordered list element, the function retrieves all the list item elements beneath it using the 'query_selector_all()' method.
The function then iterates through each list item element, gathering its inner text, and appends it to the bullet points list. In cases where an exception arises during the endeavor to find the unordered list element or the list item elements, the function promptly designates the bullet points as an empty list.
Ultimately, the function returns the compiled list of bullet points, ensuring the extracted information is accessible for further use.
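A compact sketch of this function follows. The "#feature-bullets ul" selector is an assumed location for the 'About this item' list, and trimming whitespace from each item is an addition made for this example.

```python
async def get_bullet_points(page):
    """Return the 'About this item' bullet texts as a list (empty on failure)."""
    bullet_points = []
    try:
        # Assumed selector for the unordered list under 'About this item'.
        ul_elem = await page.query_selector("#feature-bullets ul")
        li_elems = await ul_elem.query_selector_all("li")
        for li in li_elems:
            bullet_points.append((await li.inner_text()).strip())
    except Exception:
        # Missing list or unreadable items: fall back to an empty list.
        bullet_points = []
    return bullet_points
```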
Subsequently, we invoke the functions and store the data in an empty list.
This Python script employs an asynchronous "main" function to scrape product data from Amazon web pages. It leverages the Playwright library to launch the Firefox browser and navigate to Amazon's site. Following this, the "extract_product_urls" function is called to extract the URLs of each product on the page, and the result is stored in a list named "product_url." The script proceeds to iterate through each product URL, using the "perform_request_with_retry" function to fetch product pages and extract a range of information, including product name, brand, star rating, review count, MRP, sale price, best sellers rank, technical details, and descriptions.
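The text names "perform_request_with_retry" without showing it, so the following is only a guess at its shape: the signature, the default of three attempts, the 30-second navigation timeout, and the randomized back-off between attempts are all assumptions made for this sketch.

```python
import asyncio
import random


async def perform_request_with_retry(page, url, max_retries=3, base_delay=1.0):
    """Navigate to url, retrying with a growing, jittered delay on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            await page.goto(url, timeout=30_000)  # timeout is in milliseconds
            return
        except Exception:
            if attempt == max_retries:
                # Out of attempts: let the caller see the last error.
                raise
            # Wait longer after each failure, plus a random component so that
            # requests are not spaced at perfectly regular intervals.
            delay = base_delay * attempt + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

Re-raising after the final attempt lets the calling loop decide whether to skip that product or stop the run.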
The gathered data is assembled into tuples and stored in a list called "data." The function also prints progress updates after handling every 10 product URLs and a completion message once all URLs have been processed. Subsequently, the data is transformed into a Pandas DataFrame and saved as a CSV file using the "to_csv" method. Lastly, the browser is closed using the "browser.close()" statement. The "main" function is invoked as an asynchronous coroutine via the "asyncio.run(main())" statement.
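The overall flow might be organized along these lines. To keep the sketch self-contained, the URL collector, the page fetcher, and the per-attribute extractors are passed in as parameters, so the 'main' signature here differs from the zero-argument "main()" described in the text. The helper names 'scrape_products' and 'save_to_csv' and the default CSV file name are hypothetical.

```python
import pandas as pd


async def scrape_products(page, product_urls, fetch, extractors):
    """Visit each URL and return one tuple (url, *extracted values) per product."""
    data = []
    for count, url in enumerate(product_urls, start=1):
        await fetch(page, url)
        values = [await extract(page) for extract in extractors]
        data.append((url, *values))
        if count % 10 == 0:
            print(f"Processed {count} of {len(product_urls)} URLs")
    print("Finished processing all URLs")
    return data


def save_to_csv(data, columns, path):
    """Turn the list of tuples into a DataFrame and write it to a CSV file."""
    df = pd.DataFrame(data, columns=columns)
    df.to_csv(path, index=False)
    return df


async def main(search_url, get_urls, fetch, extractors, columns, csv_path="product_data.csv"):
    # Imported here so the helpers above can be used without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.firefox.launch()
        page = await browser.new_page()
        await fetch(page, search_url)
        product_urls = list(await get_urls(page))
        data = await scrape_products(page, product_urls, fetch, extractors)
        await browser.close()
    return save_to_csv(data, columns, csv_path)
```

With the earlier functions defined, the script would be started with asyncio.run(main(...)), passing the search URL, the URL collector, the retry helper, the list of extractor functions, and one column name for the URL plus one per extractor.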
Conclusion:
This guide provides a stepwise walkthrough for scraping Amazon Air Fryer data with Playwright in Python. We cover all aspects, starting from the initial setup of the Playwright environment and launching a web browser to the subsequent actions of navigating to Amazon's search page and extracting crucial details like product name, brand, star rating, MRP, sale price, best seller rank, technical specifications, and bullet points.
Our instructions are designed to be user-friendly, offering guidance on extracting product URLs, iterating through each URL, and utilizing Pandas to organize the gathered data into a structured dataframe. Leveraging Playwright's cross-browser compatibility and robust error handling, users can streamline the web scraping process and retrieve valuable information from Amazon product listings.
Web scraping can often be laborious and time-intensive, but with Playwright in Python, users can automate these procedures, significantly reducing the time and effort required.
For further details, contact iWeb Data Scraping now! You can also reach us for all your web scraping service and mobile app data scraping needs.