Sometimes you need data for your own purposes, or to extend some functionality, from a website that only provides it in another form. Data analysis, machine learning, building an app that aggregates data from several sites... there are plenty of use cases. But first make sure that scraping does not violate the site's license agreement, terms of service, or guidelines.
1. What you need
- Python (any language is fine; I chose Python because of Beautiful Soup)
- Beautiful Soup 4 – the tool that extracts the data from the HTML
- requests – handles the HTTP(S) side of things
- Selenium – a handy way to get into sites behind a login page
- Google ChromeDriver – needed for Selenium to drive Chrome
- Browser developer tools – read the page source to find the target elements. Dedicated tools exist as well, but simply searching the source works well.
2. Setup and Install
python -m venv env_ws
source env_ws/bin/activate   # on Windows: env_ws\Scripts\activate
pip install beautifulsoup4
pip install requests
pip install selenium
Download the ChromeDriver build that matches your Chrome version from:
https://sites.google.com/chromium.org/driver/downloads
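To confirm everything is wired up, a quick smoke test helps (a minimal sketch; it just opens a headless browser and prints a page title):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)  # should print "Example Domain"
driver.quit()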
3. Login
import getpass

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def login():
    login_name = input("User ID: ")
    login_password = getpass.getpass("Password: ")
    options = Options()
    options.add_argument('--headless')
    # For Chrome (pass the options so headless mode actually takes effect):
    driver = webdriver.Chrome(options=options)
    # On older Selenium versions you may need to point at the driver binary:
    #driver = webdriver.Chrome(executable_path='path/to/chromedriver')
    # For Firefox:
    #driver = webdriver.Firefox()
    # Navigate to the login page
    driver.get(Login_URL)
    # Find the login form elements (the IDs and button text are site-specific;
    # the original button text was Japanese 'Login')
    email_input = driver.find_element(By.ID, "login_email")
    password_input = driver.find_element(By.ID, "login_pass")
    login_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Login')]")
    # Fill in the login credentials
    email_input.send_keys(login_name)
    password_input.send_keys(login_password)
    login_button.click()
    # Wait for the login to complete (change the timeout as needed)
    try:
        WebDriverWait(driver, 10).until(EC.url_contains(Home_URL))
    except TimeoutException:
        print("Login failed")
        driver.quit()
        return None
    # Hand the session cookies back so requests can reuse them
    cookies = driver.get_cookies()
    driver.quit()
    return cookies
Use the getpass module so the password is not echoed to the terminal while you type it.
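Called end to end it looks like this (a short sketch; as written above, login() returns None when the wait times out):

cookies = login()
if cookies is None:
    raise SystemExit("Could not log in; check the credentials and the element IDs.")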
4. Create a soup object from the cookies
import requests
from bs4 import BeautifulSoup

# Transfer the Selenium cookies to a requests session
session = requests.Session()
for cookie in cookies:
    session.cookies.set(cookie['name'], cookie['value'])
response = session.get(Base_URL)
soup = BeautifulSoup(response.content, "html.parser")
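Before parsing, it is worth a sanity check that the session is really authenticated (a sketch; the "logout" marker is an assumption, so use any text that only appears on your site when logged in):

response.raise_for_status()
# Crude check: look for text that only appears when logged in (site-specific assumption)
if "logout" not in response.text.lower():
    raise SystemExit("Session does not look logged in; the cookies may not have transferred.")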
5. Use soup to extract data
Parsing the HTML basically comes down to soup.find_all() and soup.find():
- find_all() for repeating items
- find() for unique items
Important:
- Keep the scope of each search in mind at all times; find() called on a product item only looks inside that item, not the whole page.
- Handle the None case when a target element is not found; calling .text on None raises an AttributeError (see the helper sketch below).
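A small helper keeps the null checks from getting repetitive (a sketch; safe_text is a made-up name, not part of Beautiful Soup):

def safe_text(parent, tag, cls):
    """Return the stripped text of the first match, or None when the tag is absent."""
    element = parent.find(tag, class_=cls)
    return element.text.strip() if element else None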
product_list = []
product_items = soup.find_all("article", class_="product-item")
for product_item in product_items:
    product = {}
    # e.g. id="product-item--12345" -> "12345"
    product_no = product_item['id'].rsplit("--", 1)[-1]
    product['product_no'] = product_no
    product['product_name'] = product_item.find("h3", class_="product-name").text.strip()
    product['product_detail'] = product_item.find("div", class_="product-detail").text.strip()
    # If the name is rendered as <p> rather than <h3>, treat the product as unavailable
    product['product_available'] = product_item.find("p", class_="product-name") is None
    product['product_price'] = product_item.find("dd", class_="product-price").text.strip()
    parse_product(session, Product_Base_URL + product_no, product)
    product_list.append(product)
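parse_product() pulls extra fields from the product's own page. It is not shown here, so this is a minimal sketch under assumed markup (the product-description class is hypothetical):

def parse_product(session, product_url, product):
    """Fetch the product detail page and merge extra fields into the product dict."""
    response = session.get(product_url)
    response.raise_for_status()
    detail_soup = BeautifulSoup(response.content, "html.parser")
    description = detail_soup.find("div", class_="product-description")  # hypothetical class
    product['description'] = description.text.strip() if description else None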
6. Tips on performance
- Only use Selenium for the login. After logging in, pass the cookies to a requests session and close the browser window; a plain HTTP session is far lighter than a full browser.
- Use multiple threads, but be careful not to overload the website (a worker sketch follows the code below).
import threading

threads = []
for i in range(number_of_pages):
    thread = threading.Thread(target=thread_parse_page, args=(i + 1, session))
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()
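thread_parse_page() is the per-page worker, which is not shown either. Here is a sketch under an assumed URL scheme (Page_Base_URL and the ?page= parameter are hypothetical); the sleep is the throttle mentioned above. Also note that sharing one requests.Session across threads is not officially guaranteed to be thread-safe, so if you see odd behavior, build one session per thread from the same cookies.

import time

def thread_parse_page(page_no, session):
    """Fetch one listing page, parse it, and wait a moment so the site is not hammered."""
    response = session.get(f"{Page_Base_URL}?page={page_no}")  # hypothetical URL scheme
    response.raise_for_status()
    page_soup = BeautifulSoup(response.content, "html.parser")
    # ...run the extraction from section 5 against page_soup...
    time.sleep(1)  # crude rate limit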