Alternative dataset - sold prices #25
Comments
This would be so helpful!!!
This is a nice idea - I didn't realize this data was available on the website. Will take a look when I get a chance, or feel free to submit a pull request if you want to have a go. I agree it should probably be a separate class, or at least not have any impact on the current API.
Thanks to both for the responses. The easiest way to find the information is actually on the 'Market info' tab. Please keep me updated if any enhancements along these lines get added. Many thanks
Hello all, I am @alandinbedia's friend. I followed @osmya's input initially to find the list of sold properties. I haven't formatted the history of transactions.

```python
import json

import requests
from bs4 import BeautifulSoup
import pandas as pd


class SoldProperties:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format
        or returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        # NB: house-prices URLs don't match any of the prefixes above, so
        # for this class the check effectively relies on the 200 status.
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content):
        """Scrape data from a single page of search results. Used iteratively
        by the `_get_results` method to scrape data from every page returned
        by the search."""
        soup = BeautifulSoup(request_content, features='lxml')
        # The sold-prices pages embed their data as JSON assigned to
        # `window.__PRELOADED_STATE__` inside a <script> tag:
        start = 'window.__PRELOADED_STATE__ = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')
        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)
        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['results']['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page 1')
        print('- Parsing data from page 1')
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page 1')
            raise
        final_results = properties

        current = page_data['pagination']['current']
        last = page_data['pagination']['last']
        if current == last:
            return pd.DataFrame.from_records(final_results)

        # Scrape the remaining pages (`range` excludes its stop value, so go
        # to `last + 1` to include the final page):
        for page in range(current + 1, last + 1):
            print(f'Scraping page {page}')
            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&page={page}"
            # Make the request:
            print(f'- Downloading data from page {page}')
            status_code, page_content = self._request(p_url)
            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print(f'Failed to access page {page}')
                continue
            # Parse this page's results:
            print(f'- Parsing data from page {page}')
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print(f'Failed to get property data from page {page}')
                continue
            # Append the list of properties.
            final_results += properties

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)
        return property_data_frame


# 1. Adapt the URL here.
# Go to: https://www.rightmove.co.uk/house-prices.html
# Type a region of interest.
# Click on 'List view' so that Rightmove shows the results in the browser.
# Copy the corresponding link here.
url = "https://www.rightmove.co.uk/house-prices/detail.html?country=england&locationIdentifier=REGION%5E70417&searchLocation=London+Fields&radius=0.25"

# 2. Launch the data scraping here.
sold_properties = SoldProperties(url)

# 3. Save the results somewhere.
sold_properties.table.to_csv('sold_properties.csv')
```
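One gap noted above is that each property's sale history comes back as a nested `transactions` list. As a minimal sketch of how it could be flattened into one row per past sale - assuming the table has `address` and `transactions` columns and each transaction is a dict with keys like `displayPrice` and `dateSold` (these field names are assumptions about the `__PRELOADED_STATE__` payload, so check them against your own output):

```python
import pandas as pd


def flatten_transactions(table: pd.DataFrame) -> pd.DataFrame:
    """Return one row per individual sale, keeping the property address.

    Assumes `table` is `SoldProperties(...).table` with an `address`
    column and a `transactions` column holding lists of dicts (assumed
    keys include 'displayPrice' and 'dateSold').
    """
    exploded = (
        table[['address', 'transactions']]
        .explode('transactions')          # one row per transaction dict
        .dropna(subset=['transactions'])  # drop properties with no history
        .reset_index(drop=True)
    )
    # Expand each transaction dict into its own columns:
    details = pd.json_normalize(exploded['transactions'].tolist())
    return exploded.drop(columns='transactions').join(details)


# Usage:
# history = flatten_transactions(sold_properties.table)
# history.to_csv('sold_property_history.csv', index=False)
```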
Since that was not really useful for my purposes, I had a look at the other page and easily re-adapted the class to get the list of properties for sale.

```python
import json

import requests
from bs4 import BeautifulSoup
import pandas as pd


class PropertiesForSale:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format
        or returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content):
        """Scrape data from a single page of search results. Used iteratively
        by the `_get_results` method to scrape data from every page returned
        by the search."""
        soup = BeautifulSoup(request_content, features='lxml')
        # The search pages embed their data as JSON assigned to
        # `window.jsonModel` inside a <script> tag:
        start = 'window.jsonModel = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')
        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)
        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page 1')
        print('- Parsing data from page 1')
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page 1')
            raise
        final_results = properties

        page = 2
        last = int(page_data['pagination']['last'])
        chunk_size = int(page_data['pagination']['next'])

        # Scrape each remaining page. These results are paginated with an
        # `index` query parameter giving the offset of the first result on
        # the page, so page `p` starts at index `(p - 1) * chunk_size`.
        while True:
            next_index = (page - 1) * chunk_size
            if next_index > last:
                print('Finished!')
                break
            print(f'Scraping page {page}')
            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&index={next_index}"
            # Make the request:
            print(f'- Downloading data from page {page}')
            status_code, page_content = self._request(p_url)
            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print(f'Failed to access page {page}')
                page += 1
                continue
            # Parse this page's results:
            print(f'- Parsing data from page {page}')
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print(f'Failed to get property data from page {page}')
                page += 1
                continue
            # Append the list of properties.
            final_results += properties
            # Go to the next page.
            page += 1

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)
        return property_data_frame


# 1. Adapt the URL here.
# Perform a normal property-for-sale search on https://www.rightmove.co.uk
# and copy the link of the results page here.
url = "https://www.rightmove.co.uk/property-for-sale/find.html?searchType=SALE&locationIdentifier=REGION%5E70417&insId=1&radius=0.0&minPrice=&maxPrice=&minBedrooms=&maxBedrooms=&displayPropertyType=&maxDaysSinceAdded=&_includeSSTC=on&sortByPriceDescending=&primaryDisplayPropertyType=&secondaryDisplayPropertyType=&oldDisplayPropertyType=&oldPrimaryDisplayPropertyType=&newHome=&auction=false"

# 2. Launch the data scraping here.
properties_for_sale = PropertiesForSale(url)

# 3. Save the results somewhere.
properties_for_sale.table.to_csv('properties_for_sale.csv')
```

HTH
Thanks for doing this; I'll take a proper look at it when I get the time to add it to the package.
|
@davidok8 hey, I think your first class is incredibly useful, especially as it gives the exact postcode and price history. I think the output could be more streamlined, so I'll work on that and open a PR. @toby-p I am not sure what
Glad to know the first one is useful. The second class merely returns the list of properties not yet sold. True, it does not contain any market information (probably the market history of the area). On the other hand you can find complementary information (GPS location, size in sq ft, addedOrReduced, area in development, etc.). You have to reformat the data...
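For anyone reformatting the for-sale data, a minimal sketch using `pd.json_normalize` to flatten nested dict columns. The exact nested field names (e.g. a `location` dict carrying latitude/longitude) are assumptions about the jsonModel payload, so inspect your own table first:

```python
import pandas as pd

# `properties_for_sale` as created by the PropertiesForSale class above.
df = properties_for_sale.table

# Flatten dict-valued columns (e.g. 'location') into dotted columns
# such as 'location.latitude', leaving scalar columns unchanged:
flat = pd.json_normalize(df.to_dict(orient='records'))
flat.to_csv('properties_for_sale_flat.csv', index=False)
```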
|
@davidok8 check the latest commit in this PR. You can access a processed df by invoking `.processed_data` on a SoldProperties object. Note that some of the code is redundant - I will trim it later. Changes to your class:
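The diff itself is not reproduced in the thread. Purely as an illustration of the shape such a change could take - this is not the PR's actual code, and only the `.processed_data` name comes from the comment above:

```python
import pandas as pd


class SoldPropertiesProcessed(SoldProperties):
    """Illustrative only: the PR adds this to SoldProperties directly."""

    @property
    def processed_data(self) -> pd.DataFrame:
        # Reuse the `flatten_transactions` helper sketched earlier in
        # this thread to tidy the nested sale history.
        return flatten_transactions(self.table)
```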
|
This is extremely useful. Is it possible to include the `get_floorplans` option as in the main class?
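Not implemented above, but one possible approach, mirroring the main class's pattern, is to fetch each listing's detail page and collect its floorplan image links. A rough sketch, assuming each result row carries a `propertyUrl` field and that floorplan images can be spotted by 'floorplan' appearing in the image URL - both are unverified assumptions about the live site:

```python
import requests
from bs4 import BeautifulSoup


def get_floorplan_links(property_url: str) -> list:
    """Fetch one listing page and return candidate floorplan image URLs.

    Both `property_url` (assumed to come from a `propertyUrl` field in
    the scraped results) and the 'floorplan' substring heuristic are
    assumptions - adjust after inspecting a real listing page.
    """
    r = requests.get(property_url)
    if r.status_code != 200:
        return []
    soup = BeautifulSoup(r.content, features='lxml')
    srcs = [img.get('src') for img in soup.find_all('img') if img.get('src')]
    return [s for s in srcs if 'floorplan' in s.lower()]
```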
Toby,
It would be interesting to create an additional class to take data from previously sold properties:
https://www.rightmove.co.uk/house-prices/London-87490.html?soldIn=1&page=1
Thoughts?