Issue with numbers #11

Open
maweed opened this issue Oct 30, 2018 · 3 comments


maweed commented Oct 30, 2018

Hi,
Thanks a lot for your great work.
I have an issue with numbers: most numbers in the text are converted to dates!
For example:
text2="The meeting will be held at paris Allé 6, 0208 paris. Election 30 of a chairperson in france. page 18 of 20"

Then we get:
[datetime.datetime(2008, 2, 6, 0, 0, tzinfo=), datetime.datetime(1930, 1, 1, 0, 0, tzinfo=), datetime.datetime(2018, 1, 1, 0, 0, tzinfo=), datetime.datetime(1920, 1, 1, 0, 0, tzinfo=)]
As you can see, none of the numbers here should be extracted as dates.
Is there any solution?

Thanks

Owner

DanielJDufour commented Oct 30, 2018

Hi, @maweed . Thank you for posting this issue! :-) I'm definitely open to your thoughts if you have any ideas!

There are a few things you could do to increase the confidence of your results, but unfortunately they are a bit hacky.

Check if Full Year Found in Text

from date_extractor import extract_dates
text="The meeting will be held at paris Allé 6, 0208 paris. Election 30 of a chairperson in france."
dates = extract_dates(text)

# keep a date only if its full four-digit year appears in the text
dates = [date for date in dates if str(date.year) in text]

Check Precision

from date_extractor import extract_dates
text="The meeting will be held at paris Allé 6, 0208 paris. Election 30 of a chairperson in france."
dates = extract_dates(text, return_precision=True)

# drop dates where only a year was matched, with no month or day
dates = [date for date, precision in dates if precision != 'year']
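The full-year check can also be applied to the `(date, precision)` pairs and made slightly stricter by requiring the year to appear as a standalone four-digit number rather than as any substring. The following is only a sketch: `keep_dates_with_literal_year` is a hypothetical helper, not part of date-extractor, and it assumes the pairs have the shape used in the snippet above.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of date-extractor.
# `pairs` is assumed to be a list of (date, precision) tuples, the shape
# used with extract_dates(text, return_precision=True) above.
def keep_dates_with_literal_year(pairs, text):
    # Collect every run of exactly four digits that is not part of a
    # longer number, e.g. "2016" in "05/10/2016" but not "2016" in "120167".
    four_digit_tokens = set(re.findall(r"(?<!\d)\d{4}(?!\d)", text))
    # Keep a pair only if its year is literally written in the text.
    return [
        (date, precision)
        for date, precision in pairs
        if str(date.year) in four_digit_tokens
    ]
```

This still cannot tell which number in the text produced which date. If the text contains 2018 somewhere else, a date inferred from "page 18" would pass the check, so it narrows the results rather than solving the problem.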

Check If White Space Between Year, Month and Day
I would have to write some code inside date-extractor to add this option, but it would filter out cases where the date is found in a string without white space, like 0208.

Have a different set of rules for text versus filenames
I can write code that only accepts no space between the month and year if it is short text like a filename.

Thoughts? What would work for you?

Also, open to pull requests if you want to make a contribution! :-)

Author

maweed commented Oct 31, 2018

Thanks for your efforts to solve my issue, but unfortunately it is still not working for me. The first suggestion doesn't work (because the text may include other dates!), so I wonder if there is some combination to detect whether the precision is year and four digits or not?
For me the function works well for detecting all dates, but I still get extra data (numbers and addresses).
The dates in all my documents can take two forms:

  • only a year (four digits), or
  • a separate day, month (number or name) and year (four digits), like 05/10/2016 or 23 March 2018

I wonder why you don't use annotation, tagging or indicators for what is a date and what is not,
or other approaches like word2vec to find the similarity between dates and determine the real dates.
Thank you again
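Because the two forms described above are so narrow, one option is to match them directly with a regular expression instead of filtering the library's output. This is a standalone sketch that does not use date-extractor. The names `MONTHS`, `DATE_PATTERN` and `find_date_strings` are hypothetical, and it assumes English month names, `/` as the only numeric separator, and bare years that start with 19 or 20.

```python
import re

# Assumption: month names are written out in full, in English.
MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)

# Alternatives are tried in order, so a full date is matched as a whole
# before its year could be picked up by the year-only branch.
DATE_PATTERN = re.compile(
    rf"(?<!\d)\d{{1,2}}/\d{{1,2}}/\d{{4}}(?!\d)"          # 05/10/2016
    rf"|(?<!\d)\d{{1,2}}\s+(?:{MONTHS})\s+\d{{4}}(?!\d)"  # 23 March 2018
    rf"|(?<!\d)(?:19|20)\d{{2}}(?!\d)",                   # 2019 (assumed range)
    re.IGNORECASE,
)

def find_date_strings(text):
    """Return the substrings of `text` that look like one of the two forms."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]
```

The function returns the matched strings rather than `datetime` objects, because 05/10/2016 does not say by itself whether the day or the month comes first. Restricting bare years to 19xx and 20xx is what keeps 0208 out. It would need widening for documents that mention other centuries.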

@DanielJDufour
Owner

@maweed, those are great suggestions. I'm definitely open to any improvements that can be made and pull requests!

Part of the history is that when I initially created this library a few years ago, regex-based parsing was substantially faster than the alternatives. There also wasn't a lot of training data in some of the languages this library supports. That said, times change and maybe it's time for an upgrade :-)
