
Alternative dataset - sold prices #25

Open
osmya opened this issue Apr 24, 2020 · 11 comments
@osmya

osmya commented Apr 24, 2020

Toby

It would be interesting to create an additional class to take data from previously sold properties:
https://www.rightmove.co.uk/house-prices/London-87490.html?soldIn=1&page=1

Thoughts?
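
For anyone who wants to poke at those pages before a proper class exists, here is a minimal sketch (only an illustration: the soldIn and page query parameters are simply taken from the link above, and the three-page range is arbitrary):

import requests

# Quick sanity check of the paginated sold-prices listing linked above.
base_url = "https://www.rightmove.co.uk/house-prices/London-87490.html"
for page in range(1, 4):  # first few pages only, as an illustration
    response = requests.get(base_url, params={"soldIn": 1, "page": page})
    print(page, response.status_code, len(response.content))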

@alandinbedia

This would be so helpful!!!

@toby-p
Owner

toby-p commented Apr 30, 2020

This is a nice idea - I didn't realize this data was available on the website. Will take a look when I get a chance, or feel free to submit a pull request if you want to have a go. I agree it should probably be a separate class or at least not have any impact on the current API.

@alandinbedia

Thanks to you both for the responses. The easiest way to find the information is actually on the 'Market info' tab.
For example: I do my normal search for 'Sale' properties, and each property's details page has a few tabs, such as description, floorplan, map, etc. The last one is 'Market info', which is sourced from the Land Registry files, so it is very useful for seeing what a house now on the market previously sold for.

Please keep me updated if any enhancements on this could be added.
I am also asking a friend of mine (as I am not a developer) to see if he can figure it out, and I will share if we manage.

Many thanks

@oddkiva

oddkiva commented May 3, 2020

Hello all, I am @alandinbedia 's friend.

I initially followed @osmya 's input to find the list of sold properties. I haven't formatted the history of transactions yet.

import json
import requests

from bs4 import BeautifulSoup

import pandas as pd


class SoldProperties:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.__PRELOADED_STATE__ = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['results']['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page {}'.format(1))
        print('- Parsing data from page {}'.format(1))
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page {}'.format(1))
            raise

        final_results = properties

        current = page_data['pagination']['current']
        last = page_data['pagination']['last']
        if current == last:
            return pd.DataFrame.from_records(final_results)

        # Scrape each page
        for page in range(current + 1, last + 1):
            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&page={page}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                continue

            # Append the list of properties.
            final_results += properties

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        return property_data_frame


# 1. Adapt the URL here
#    Go to: https://www.rightmove.co.uk/house-prices.html
#    Type the region of interest.
#    Click on 'List view' so that Rightmove shows the results in the browser.
#    Copy the corresponding link here.
url = "https://www.rightmove.co.uk/house-prices/detail.html?country=england&locationIdentifier=REGION%5E70417&searchLocation=London+Fields&radius=0.25"

# 2. Launch the data scraping here.
sold_properties = SoldProperties(url)

# 3. Save the results somewhere.
sold_properties.table.to_csv('sold_properties.csv')
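
Since the history of transactions is left unformatted, here is a rough sketch of flattening it (not part of the class; it assumes every row of `table` has a non-empty `transactions` list of dicts with 'displayPrice' and 'dateSold' keys, which is what the __PRELOADED_STATE__ JSON appears to contain):

import pandas as pd

# Sketch: one row per individual sale in the transaction history.
df = sold_properties.table
flat = df.explode('transactions').reset_index(drop=True)
flat = pd.concat(
    [flat.drop(columns=['transactions']),
     pd.json_normalize(flat['transactions'].tolist())],
    axis=1)
flat['displayPrice'] = (flat['displayPrice']
                        .str.replace('£', '', regex=False)
                        .str.replace(',', '', regex=False)
                        .astype(int))
flat['dateSold'] = pd.to_datetime(flat['dateSold'], format='%d %b %Y')
flat.to_csv('sold_transactions.csv', index=False)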

@oddkiva

oddkiva commented May 3, 2020

Since that was not really what was needed, I had a look at the other page and easily re-adapted the class to retrieve the list of properties for sale.

import json
import requests

from bs4 import BeautifulSoup

import pandas as pd


class PropertiesForSale:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.jsonModel = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page {}'.format(1))
        print('- Parsing data from page {}'.format(1))
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page {}'.format(1))
            raise

        final_results = properties

        page = 2
        last = int(page_data['pagination']['last'])
        chunk_size = int(page_data['pagination']['next'])

        # Scrape each page
        while True:
            next_index = (page - 1) * chunk_size
            if next_index > last:
                print('Finished!')
                break

            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&index={page * chunk_size}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                page += 1
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                page += 1
                continue

            # Append the list of properties.
            final_results += properties

            # Go to the next page.
            page += 1

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        return property_data_frame


# 1. Adapt the URL here
#    Go to: https://www.rightmove.co.uk
#    Run a normal property-for-sale search for the region of interest.
#    Copy the URL of the results page here.
url = "https://www.rightmove.co.uk/property-for-sale/find.html?searchType=SALE&locationIdentifier=REGION%5E70417&insId=1&radius=0.0&minPrice=&maxPrice=&minBedrooms=&maxBedrooms=&displayPropertyType=&maxDaysSinceAdded=&_includeSSTC=on&sortByPriceDescending=&primaryDisplayPropertyType=&secondaryDisplayPropertyType=&oldDisplayPropertyType=&oldPrimaryDisplayPropertyType=&newHome=&auction=false"

# 2. Launch the data scraping here.
properties_for_sale = PropertiesForSale(url)

# 3. Save the results somewhere.
properties_for_sale.table.to_csv('properties_for_sale.csv')

HTH

@toby-p
Owner

toby-p commented May 5, 2020

Thanks for doing this, will take a proper look at it when I get the time to add it to the package.

@p2327

p2327 commented May 14, 2020

I am going to look into this now.

@p2327

p2327 commented May 15, 2020

@davidok8 hey, I think your first class is incredibly useful, especially as it gives the exact postcode and price history.

I think the output could be more streamlined, so I'll work on that and open a PR @toby-p

I am not sure what the PropertiesForSale class does?

edit: grammar

@oddkiva

oddkiva commented May 15, 2020

Glad to know the first one is useful.

The second class merely returns the list of properties not yet sold. True, it does not contain any market information (probably the market history of the area).

On the other hand, you can find complementary information (GPS location, size in square feet, addedOrReduced, area in development, etc.). You have to reformat the data...
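
As a rough illustration of that reformatting (a sketch only: it assumes the nested columns such as `location` and `price` hold plain dicts; adjust the column names to whatever the jsonModel actually returns):

import pandas as pd

# Sketch: expand whichever nested dict columns are present into flat
# '<column>.<key>' columns alongside the ordinary fields.
df = properties_for_sale.table
nested_cols = [c for c in ('location', 'price') if c in df.columns]
flat = pd.concat(
    [df.drop(columns=nested_cols)]
    + [pd.json_normalize(df[c].tolist()).add_prefix(f'{c}.') for c in nested_cols],
    axis=1)
flat.to_csv('properties_for_sale_flat.csv', index=False)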


@p2327

p2327 commented May 18, 2020

@davidok8

Check the latest commit in this PR.

You can access a processed df by invoking .processed_data on a SoldProperties object.

Note that some of the code is redundant - I will trim it later.

Changes to your class:

# imports
import ast
import re
import datetime as dt
from datetime import datetime
from lxml import html
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Env
address_pattern = r'([\s\S]+?)([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?[0-9]?\s[0-9]?[A-Za-z][A-Za-z])'
outwardcode_pattern = r'([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?[0-9]?)'

# Helpers
def extract_price(series):
    prices = []
    for entry in series:
        prices.append(int(entry[0]['displayPrice'].strip('£').replace(',', '')))
    return prices


def extract_date(series):
    dates = []
    for entry in series:
        dates.append(datetime.strptime(entry[0]['dateSold'], '%d %b %Y'))
    return dates


def extract_tenure(series):
    tenures = []
    for entry in series:
        tenures.append(entry[0]['tenure'])
    return tenures


def extract_coords(series, lat=False):
    coords = []
    if lat:
        for entry in series:
            coords.append(entry['lat'])
    else:
        for entry in series:
            coords.append(entry['lng'])
    return coords

class SoldProperties:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.__PRELOADED_STATE__ = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['results']['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page {}'.format(1))
        print('- Parsing data from page {}'.format(1))
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page {}'.format(1))
            raise

        final_results = properties

        current = page_data['pagination']['current']
        last = page_data['pagination']['last']
        if current == last:
            return pd.DataFrame.from_records(final_results)

        # Scrape each page
        for page in range(current + 1, last + 1):
            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&page={page}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                continue

            # Append the list of properties.
            final_results += properties

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        def process_data(rawdf):
            df = rawdf.copy()
        
            address = df['address'].str.extract(address_pattern, expand=True).to_numpy()
            outwardcodes = df['address'].str.extract(outwardcode_pattern, expand=True).to_numpy()
            
            df = (df.drop(['address', 'images', 'hasFloorPlan', 'detailUrl'], axis=1)
                    .assign(address=address[:, 0])
                    .assign(postcode=address[:, 1])
                    .assign(outwardcode=outwardcodes[:, 0])
                    #.assign(transactions=df.transactions.apply(ast.literal_eval))
                    #.assign(location=df.location.apply(ast.literal_eval))
                    .assign(last_price=lambda x: extract_price(x.transactions))
                    .assign(sale_date=lambda x: extract_date(x.transactions))
                    .assign(tenure=lambda x: extract_tenure(x.transactions))
                    .assign(lat=lambda x: extract_coords(x.location, lat=True))
                    .assign(lng=lambda x: extract_coords(x.location))
                    .drop(['transactions', 'location'], axis=1)
            )
            return df
     
        #return process_data(property_data_frame)

        return property_data_frame

    @property
    def processed_data(self):
        df = self._results
    
        address = df['address'].str.extract(address_pattern, expand=True).to_numpy()
        outwardcodes = df['address'].str.extract(outwardcode_pattern, expand=True).to_numpy()
        
        df = (df.drop(['address', 'images', 'hasFloorPlan', 'detailUrl'], axis=1)
                .assign(address=address[:, 0])
                .assign(postcode=address[:, 1])
                .assign(outwardcode=outwardcodes[:, 0])
                #.assign(transactions=df.transactions.apply(ast.literal_eval))
                #.assign(location=df.location.apply(ast.literal_eval))
                .assign(last_price=lambda x: extract_price(x.transactions))
                .assign(sale_date=lambda x: extract_date(x.transactions))
                .assign(tenure=lambda x: extract_tenure(x.transactions))
                .assign(lat=lambda x: extract_coords(x.location, lat=True))
                .assign(lng=lambda x: extract_coords(x.location))
                .drop(['transactions', 'location'], axis=1)
                .reindex(columns=['last_price', 
                                'sale_date', 
                                'propertyType',
                                'bedrooms',
                                'bathrooms', 
                                'tenure', 
                                'address', 
                                'postcode', 
                                'outwardcode', 
                                'lat', 
                                'lng'])
        )
        return df
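
For reference, a minimal usage sketch of the modified class (the URL is just the London Fields sold-prices search used earlier in the thread):

# Build the scraper, then pull the cleaned table via the new property.
url = "https://www.rightmove.co.uk/house-prices/detail.html?country=england&locationIdentifier=REGION%5E70417&searchLocation=London+Fields&radius=0.25"
sold_properties = SoldProperties(url)
processed = sold_properties.processed_data  # last_price, sale_date, postcode, lat, lng, ...
processed.to_csv('sold_properties_processed.csv', index=False)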
     

@andrewwilso

This is extremely useful. Is it possible to include the get_floorplans option as in the main class?
