Word cloud intersection

This experiment tries to define and show the concepts of intersection and difference between two word clouds.

Two distinct documents A and B (the first about infovis and datavis, the second about Human-Computer Interaction) are each used to compute word frequencies. Each word in common between the two documents is then used to create the intersection set, in which each word is assigned the minimum of its two original weights (something reminiscent of a fuzzy intersection operation). The remaining words are used to create the difference sets. For each word in common, the set in which the word had the greater weight also retains it, with the intersection weight subtracted from its original weight. In pseudocode:

for each word in common between A and B
  let a_w be the weight of the word in A
  let b_w be the weight of the word in B
  put the word into the intersection set with weight = min(a_w, b_w)
  if a_w - min(a_w, b_w) > 0
    put the word into the A \ B set with weight = a_w - min(a_w, b_w)
  else if b_w - min(a_w, b_w) > 0
    put the word into the B \ A set with weight = b_w - min(a_w, b_w)

for each remaining word in A
  put the word into the A \ B set with weight = a_w (original weight)

for each remaining word in B
  put the word into the B \ A set with weight = b_w (original weight)
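
For reference, the same weighting logic can be written as a small standalone function. The sketch below is plain JavaScript and is only an illustration of the pseudocode above, not the code actually used here (the gist's own implementation, in the CoffeeScript/JavaScript files below, does the same computation inside the d3 loading callbacks); it assumes hypothetical word counts given as plain objects mapping word to count:

// Sketch only: compute intersection and difference weights from two word-count maps.
// countsA and countsB are assumed to be plain objects such as {data: 12, user: 3}.
function cloudIntersection(countsA, countsB) {
  var intersection = {}, diffA = {}, diffB = {};
  Object.keys(countsA).forEach(function(word) {
    if (word in countsB) {
      var min = Math.min(countsA[word], countsB[word]);
      intersection[word] = min; // fuzzy-like intersection weight
      if (countsA[word] - min > 0) diffA[word] = countsA[word] - min; // A \ B keeps the surplus
      if (countsB[word] - min > 0) diffB[word] = countsB[word] - min; // B \ A keeps the surplus
    } else {
      diffA[word] = countsA[word]; // word appears only in A
    }
  });
  Object.keys(countsB).forEach(function(word) {
    if (!(word in countsA)) diffB[word] = countsB[word]; // word appears only in B
  });
  return {intersection: intersection, diffA: diffA, diffB: diffB};
}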

It can be seen from the three resulting "clouds" that some words can be interpreted as more specific to the infovis-datavis world (e.g. data, visualization, graphic, arts), while others are used more in HCI (e.g. user, computer, system). There is also a fair amount of intersection between the two (e.g. information, design, research, field). Please note that the size of the intersection set is heavily influenced by the choice to remove stopwords.

While not completely satisfactory in its meaningfulness (as is often the case with word clouds), this experiment could lead to an interesting research path. Many different formulas can be used to define a concept of intersection (something that normalizes on document length, or even something inspired by TF-IDF, could be interesting), and many choices are available for representing the results (a round, Venn-like layout could be nice).
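
As one concrete possibility for the first variant, the sketch below (again plain JavaScript, and again only an illustration rather than part of the original experiment) normalizes counts to relative frequencies before intersecting, so that documents of different lengths become comparable:

// Sketch only: length-normalized variant. Converts raw counts to relative frequencies,
// which can then be fed to the same intersection routine sketched above, e.g.
// cloudIntersection(normalize(countsA), normalize(countsB)).
function normalize(counts) {
  var total = Object.keys(counts).reduce(function(sum, w) { return sum + counts[w]; }, 0);
  var freqs = {};
  Object.keys(counts).forEach(function(w) { freqs[w] = counts[w] / total; });
  return freqs;
}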

This example uses a treemap word cloud layout, the useful nlp_compromise JavaScript library, and a precomputed list of English stopwords.

word
a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
word
will
one
two
three
four
five
six
seven
eight
nine
ten
-
a
able
about
above
abst
accordance
according
accordingly
across
act
actually
added
adj
affected
affecting
affects
after
afterwards
again
against
ah
all
almost
alone
along
already
also
although
always
am
among
amongst
an
and
announce
another
any
anybody
anyhow
anymore
anyone
anything
anyway
anyways
anywhere
apparently
approximately
are
aren
arent
aren't
arise
around
as
aside
ask
asking
at
auth
available
away
awfully
b
back
be
became
because
become
becomes
becoming
been
before
beforehand
begin
beginning
beginnings
begins
behind
being
believe
below
beside
besides
between
beyond
biol
both
brief
briefly
but
by
c
ca
came
can
cannot
can't
cause
causes
certain
certainly
co
com
come
comes
contain
containing
contains
could
couldnt
couldn't
d
date
did
didn't
do
does
doesnt
doesn't
doing
done
dont
don't
down
downwards
due
during
e
each
ed
edu
effect
eg
eight
eighty
either
else
elsewhere
end
ending
enough
especially
et
et-al
etc
even
evenly
ever
every
everybody
everyone
everything
everywhere
except
f
far
few
ff
following
follows
for
former
formerly
found
from
further
furthermore
g
gave
get
gets
getting
give
given
gives
giving
go
goes
gone
got
gotten
h
had
happens
hardly
has
hasnt
hasn't
have
havent
haven't
having
he
hed
he'd
he'll
hence
her
here
hereafter
hereby
herein
heres
hereupon
hers
herself
hes
he's
hi
him
himself
his
how
how's
howbeit
however
i
id
i'd
ie
if
i'll
im
i'm
immediate
immediately
in
inc
indeed
index
instead
into
inward
is
isnt
isn't
it
itd
it'd
itll
it'll
its
it's
itself
ive
i've
j
just
k
keep
keeps
kept
know
known
knows
l
largely
last
lately
later
latter
latterly
least
less
lest
let
lets
like
liked
likely
line
little
'll
look
looking
looks
ltd
m
made
mainly
make
makes
many
may
maybe
me
mean
means
meantime
meanwhile
merely
mg
might
million
miss
ml
more
moreover
most
mostly
mr
mrs
much
mug
must
my
myself
n
na
name
namely
nay
nd
near
nearly
necessarily
necessary
need
needs
neither
never
nevertheless
new
next
no
nobody
non
none
nonetheless
noone
nor
normally
nos
not
noted
nothing
now
nowhere
o
obviously
of
off
often
oh
ok
okay
on
once
ones
one's
only
onto
or
ord
other
others
otherwise
ought
our
ours
ourselves
out
outside
over
overall
owing
own
p
part
particular
particularly
past
per
perhaps
please
plus
possible
possibly
potentially
pp
previously
primarily
probably
promptly
put
q
que
quickly
quite
qv
r
rather
rd
re
're
readily
really
recent
recently
ref
refs
regarding
regardless
regards
related
relatively
respectively
resulted
resulting
retweet
rt
s
's
said
same
saw
say
saying
says
sec
seem
seemed
seeming
seems
seen
self
selves
sent
several
shall
she
she'd
she'll
shes
she's
should
shouldn't
showed
shown
showns
shows
significant
significantly
similar
similarly
since
slightly
so
some
somebody
somehow
someone
somethan
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specifically
specified
specify
specifying
still
stop
strongly
sub
substantially
successfully
such
sufficiently
sup
'sup
sure
t
take
taken
taking
tell
tends
th
than
thank
thanks
thanx
that
that'll
thats
that's
that've
the
their
theirs
them
themselves
then
thence
there
thereafter
thereby
thered
there'd
therefore
therein
there'll
thereof
therere
there're
theres
there's
thereto
thereupon
there've
these
they
theyd
they'd
they'll
theyre
they're
theyve
they've
think
thinks
this
those
thou
though
thoughh
thousand
throug
through
throughout
thru
thus
til
to
together
too
took
tooks
toward
towards
tried
tries
truly
try
trying
ts
twice
u
un
under
unfortunately
unless
unlike
unlikely
until
unto
up
upon
ups
us
use
used
useful
usefully
usefulness
uses
using
usually
v
value
various
've
very
via
viz
vol
vols
vs
w
want
wants
was
wasnt
wasn't
way
we
wed
we'd
welcome
we'll
well
went
were
we're
weren't
we've
what
whatever
what'll
whats
what's
when
whence
whenever
where
whereafter
whereas
whereby
wherein
wheres
where's
whereupon
wherever
whether
which
while
whim
who
whod
who'd
whoever
whole
who'll
whom
whomever
whos
who's
whose
why
widely
willing
wish
with
within
without
won't
words
world
would
wouldn't
x
y
yes
yet
you
youd
you'd
youll
you'll
your
you're
youre
yours
yourself
yourselves
youve
you've
z
Human–computer interaction (HCI) researches the design and use of computer technology, focusing particularly on the interfaces between people (users) and computers. Researchers in the field of HCI both observe the ways in which humans interact with computers and design technologies that let humans interact with computers in novel ways.
As a field of research, Human-Computer Interaction is situated at the intersection of computer science, behavioral sciences, design, media studies, and several other fields of study. The term was popularized by Stuart K. Card and Allen Newell of Carnegie Mellon University and Thomas P. Moran of IBM Research in their seminal 1983 book, The Psychology of Human-Computer Interaction, although the authors first used the term in 1980[1] and the first known use was in 1975.[2] The term connotes that, unlike other tools with only limited uses (such as a hammer, useful for driving nails, but not much else), a computer has many uses and this takes place as an open-ended dialog between the user and the computer. The notion of dialog likens human-computer interaction to human-to-human interaction, an analogy the discussion of which is crucial to theoretical considerations in the field.[3][4]
Introduction
Humans interact with computers in many ways; and the interface between humans and the computers they use is crucial to facilitating this interaction. Desktop applications, internet browsers, handheld computers, and computer kiosks make use of the prevalent graphical user interfaces (GUI) of today.[5] Voice user interfaces (VUI) are used for speech recognition and synthesising systems, and the emerging multi-modal and gestalt User Interfaces (GUI) allow humans to engage with embodied character agents in a way that cannot be achieved with other interface paradigms.
The Association for Computing Machinery defines human-computer interaction as "a discipline concerned with the design, evaluation and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them".[5] An important facet of HCI is the securing of user satisfaction (or simply End User Computing Satisfaction). "Because human–computer interaction studies a human and a machine in communication, it draws from supporting knowledge on both the machine and the human side. On the machine side, techniques in computer graphics, operating systems, programming languages, and development environments are relevant. On the human side, communication theory, graphic and industrial design disciplines, linguistics, social sciences, cognitive psychology, social psychology, and human factors such as computer user satisfaction are relevant. And, of course, engineering and design methods are relevant."[5] Due to the multidisciplinary nature of HCI, people with different backgrounds contribute to its success. HCI is also sometimes referred to as human–machine interaction (HMI), man–machine interaction (MMI) or computer–human interaction (CHI).
Poorly designed human-machine interfaces can lead to many unexpected problems. A classic example of this is the Three Mile Island accident, a nuclear meltdown accident, where investigations concluded that the design of the human–machine interface was at least partially responsible for the disaster.[6][7][8] Similarly, accidents in aviation have resulted from manufacturers' decisions to use non-standard flight instrument or throttle quadrant layouts: even though the new designs were proposed to be superior in regards to basic human–machine interaction, pilots had already ingrained the "standard" layout and thus the conceptually good idea actually had undesirable results.
Goals
Human-Computer Interaction studies the ways in which humans make, or do not make, use of computational artifacts, systems and infrastructures. In doing so, much of the research in the field seeks to 'improve' human-computer interaction by improving the 'usability' of computer interfaces.[9] How 'usability' is to be precisely understood, how it relates to other social and cultural values, and when it is, and when it may not be, a desirable property of computer interfaces is increasingly debated.[10][11]
Much of the research in the field of Human-Computer Interaction takes an interest in:
methods for designing novel computer interfaces, thereby optimizing a design for a desired property such as, e.g., learnability or efficiency of use. An example of a design method that has been continuously developed by HCI researchers is Participatory Design.
methods for implementing interfaces, e.g., by means of software tool kits and libraries
methods for evaluating and comparing interfaces with respect to their usability or other desirable properties
methods for studying human computer use and its sociocultural implications more broadly
models and theories of human computer use as well as conceptual frameworks for the design of computer interfaces, such as, e.g., cognitivist user models, Activity Theory or ethnomethodological accounts of human computer use[12]
perspectives that critically reflect upon the values that underlie computational design, computer use and HCI research practice[13]
Visions of what researchers in the field seek to achieve vary. When pursuing a cognitivist perspective, researchers of HCI may seek to align computer interfaces with the mental model that humans have of their activities. When pursuing a post-cognitivist perspective, researchers of HCI may, e.g., seek to align computer interfaces with existing social practices or existing sociocultural values.
Professional practitioners in HCI are usually designers concerned with the practical application of design methodologies to problems in the world. Their work often revolves around designing graphical user interfaces and web interfaces.
Researchers in HCI are interested in developing new design methodologies, experimenting with new devices, prototyping new software systems, exploring new interaction paradigms, and developing models and theories of interaction.
Differences with related fields
HCI differs from human factors and ergonomics as HCI focuses more on users working specifically with computers, rather than other kinds of machines or designed artifacts. There is also a focus in HCI on how to implement the computer software and hardware mechanisms to support human–computer interaction. Thus, human factors is a broader term; HCI could be described as the human factors of computers – although some experts try to differentiate these areas.
HCI also differs from human factors in that there is less of a focus on repetitive work-oriented tasks and procedures, and much less emphasis on physical stress and the physical form or industrial design of the user interface, such as keyboards and mouse devices.
Three areas of study have substantial overlap with HCI even as the focus of inquiry shifts. In the study of personal information management (PIM), human interactions with the computer are placed in a larger informational context – people may work with many forms of information, some computer-based, many not (e.g., whiteboards, notebooks, sticky notes, refrigerator magnets) in order to understand and effect desired changes in their world. In computer-supported cooperative work (CSCW), emphasis is placed on the use of computing systems in support of the collaborative work of a group of people. The principles of human interaction management (HIM) extend the scope of CSCW to an organizational level and can be implemented without use of computers.
Design
Principles
The user interacts directly with hardware for the human input and output such as displays, e.g. through a graphical user interface. The user interacts with the computer over this software interface using the given input and output (I/O) hardware.
Software and hardware must be matched, so that the processing of the user input is fast enough and the latency of the computer output is not disruptive to the workflow.
When evaluating a current user interface, or designing a new user interface, it is important to keep in mind the following experimental design principles:
Early focus on user(s) and task(s): Establish how many users are needed to perform the task(s) and determine who the appropriate users should be; someone who has never used the interface, and will not use the interface in the future, is most likely not a valid user. In addition, define the task(s) the users will be performing and how often the task(s) need to be performed.
Empirical measurement: Test the interface early on with real users who come in contact with the interface on a daily basis. Keep in mind that results may vary with the performance level of the user and may not be an accurate depiction of the typical human-computer interaction. Establish quantitative usability specifics such as: the number of users performing the task(s), the time to complete the task(s), and the number of errors made during the task(s).
Iterative design: After determining the users, tasks, and empirical measurements to include, perform the following iterative design steps:
Design the user interface
Test
Analyze results
Repeat
Repeat the iterative design process until a sensible, user-friendly interface is created.[14]
Methodologies
A number of diverse methodologies outlining techniques for human–computer interaction design have emerged since the rise of the field in the 1980s. Most design methodologies stem from a model for how users, designers, and technical systems interact. Early methodologies, for example, treated users' cognitive processes as predictable and quantifiable and encouraged design practitioners to look to cognitive science results in areas such as memory and attention when designing user interfaces. Modern models tend to focus on a constant feedback and conversation between users, designers, and engineers and push for technical systems to be wrapped around the types of experiences users want to have, rather than wrapping user experience around a completed system.
Activity theory: used in HCI to define and study the context in which human interactions with computers take place. Activity theory provides a framework to reason about actions in these contexts, analytical tools with the format of checklists of items that researchers should consider, and informs design of interactions from an activity-centric perspective.[15]
User-centered design: user-centered design (UCD) is a modern, widely practiced design philosophy rooted in the idea that users must take center-stage in the design of any computer system. Users, designers and technical practitioners work together to articulate the wants, needs and limitations of the user and create a system that addresses these elements. Often, user-centered design projects are informed by ethnographic studies of the environments in which users will be interacting with the system. This practice is similar but not identical to participatory design, which emphasizes the possibility for end-users to contribute actively through shared design sessions and workshops.
Principles of user interface design: these are seven principles of user interface design that may be considered at any time during the design of a user interface in any order: tolerance, simplicity, visibility, affordance, consistency, structure and feedback.[16]
Value sensitive design: Value Sensitive Design (VSD) is a method for building technology that accounts for the values of the people who use the technology directly, as well as those who the technology affects, either directly or indirectly. VSD uses an iterative design process that involves three types of investigations: conceptual, empirical and technical. Conceptual investigations aim at understanding and articulating the various stakeholders of the technology, as well as their values and any value conflicts that might arise for these stakeholders through the use of the technology. Empirical investigations are qualitative or quantitative design research studies used to inform the designers' understanding of the users' values, needs, and practices. Technical investigations can involve either analysis of how people use related technologies, or the design of systems to support values identified in the conceptual and empirical investigations.[17]
Display designs
Displays are human-made artifacts designed to support the perception of relevant system variables and to facilitate further processing of that information. Before a display is designed, the task that the display is intended to support must be defined (e.g. navigating, controlling, decision making, learning, entertaining, etc.). A user or operator must be able to process whatever information that a system generates and displays; therefore, the information must be displayed according to principles in a manner that will support perception, situation awareness, and understanding.
Thirteen principles of display design
Christopher Wickens et al. defined 13 principles of display design in their book An Introduction to Human Factors Engineering.[18]
These principles of human perception and information processing can be utilized to create an effective display design. A reduction in errors, a reduction in required training time, an increase in efficiency, and an increase in user satisfaction are a few of the many potential benefits that can be achieved through utilization of these principles.
Certain principles may not be applicable to different displays or situations. Some principles may seem to be conflicting, and there is no simple solution to say that one principle is more important than another. The principles may be tailored to a specific design or situation. Striking a functional balance among the principles is critical for an effective design.[19]
Perceptual principles
1. Make displays legible (or audible). A display's legibility is critical and necessary for designing a usable display. If the characters or objects being displayed cannot be discerned, then the operator cannot effectively make use of them.
2. Avoid absolute judgment limits. Do not ask the user to determine the level of a variable on the basis of a single sensory variable (e.g. color, size, loudness). These sensory variables can contain many possible levels.
3. Top-down processing. Signals are likely perceived and interpreted in accordance with what is expected based on a user's experience. If a signal is presented contrary to the user's expectation, more physical evidence of that signal may need to be presented to assure that it is understood correctly.
4. Redundancy gain. If a signal is presented more than once, it is more likely that it will be understood correctly. This can be done by presenting the signal in alternative physical forms (e.g. color and shape, voice and print, etc.), as redundancy does not imply repetition. A traffic light is a good example of redundancy, as color and position are redundant.
5. Similarity causes confusion: Use discriminable elements. Signals that appear to be similar will likely be confused. The ratio of similar features to different features causes signals to be similar. For example, A423B9 is more similar to A423B8 than 92 is to 93. Unnecessary similar features should be removed and dissimilar features should be highlighted.
Mental model principles
6. Principle of pictorial realism. A display should look like the variable that it represents (e.g. high temperature on a thermometer shown as a higher vertical level). If there are multiple elements, they can be configured in a manner that looks like it would in the represented environment.
7. Principle of the moving part. Moving elements should move in a pattern and direction compatible with the user's mental model of how it actually moves in the system. For example, the moving element on an altimeter should move upward with increasing altitude.
Principles based on attention
8. Minimizing information access cost. When the user's attention is diverted from one location to another to access necessary information, there is an associated cost in time or effort. A display design should minimize this cost by allowing for frequently accessed sources to be located at the nearest possible position. However, adequate legibility should not be sacrificed to reduce this cost.
9. Proximity compatibility principle. Divided attention between two information sources may be necessary for the completion of one task. These sources must be mentally integrated and are defined to have close mental proximity. Information access costs should be low, which can be achieved in many ways (e.g. proximity, linkage by common colors, patterns, shapes, etc.). However, close display proximity can be harmful by causing too much clutter.
10. Principle of multiple resources. A user can more easily process information across different resources. For example, visual and auditory information can be presented simultaneously rather than presenting all visual or all auditory information.
Memory principles
11. Replace memory with visual information: knowledge in the world. A user should not need to retain important information solely in working memory or retrieve it from long-term memory. A menu, checklist, or another display can aid the user by easing the use of their memory. However, the use of memory may sometimes benefit the user by eliminating the need to reference some type of knowledge in the world (e.g. an expert computer operator would rather use direct commands from memory than refer to a manual). The use of knowledge in a user's head and knowledge in the world must be balanced for an effective design.
12. Principle of predictive aiding. Proactive actions are usually more effective than reactive actions. A display should attempt to eliminate resource-demanding cognitive tasks and replace them with simpler perceptual tasks to reduce the use of the user's mental resources. This will allow the user to not only focus on current conditions, but also think about possible future conditions. An example of a predictive aid is a road sign displaying the distance to a certain destination.
13. Principle of consistency. Old habits from other displays will easily transfer to support processing of new displays if they are designed consistently. A user's long-term memory will trigger actions that are expected to be appropriate. A design must accept this fact and utilize consistency among different displays.
Human–computer interface
Main article: User interface
The human–computer interface can be described as the point of communication between the human user and the computer. The flow of information between the human and computer is defined as the loop of interaction. The loop of interaction has several aspects to it, including:
Task environment: The conditions and goals set upon the user.
Machine environment: The environment that the computer is connected to, e.g. a laptop in a college student's dorm room.
Areas of the interface: Non-overlapping areas involve processes of the human and computer not pertaining to their interaction. Meanwhile, the overlapping areas only concern themselves with the processes pertaining to their interaction.
Input flow: The flow of information that begins in the task environment, when the user has some task that requires using their computer.
Output: The flow of information that originates in the machine environment.
Feedback: Loops through the interface that evaluate, moderate, and confirm processes as they pass from the human through the interface to the computer and back.
Fit: This is the match between the computer design, the user and the task to optimize the human resources needed to accomplish the task.
Current research
Topics in HCI include:
User customization
End-user development studies how ordinary users could routinely tailor applications to their own needs and use this power to invent new applications based on their understanding of their own domains. With their deeper knowledge of their own knowledge domains, users could increasingly be important sources of new applications at the expense of generic systems programmers (with systems expertise but low domain expertise).
Embedded computation
Computation is passing beyond computers into every object for which uses can be found. Embedded systems make the environment alive with little computations and automated processes, from computerized cooking appliances to lighting and plumbing fixtures to window blinds to automobile braking systems to greeting cards. To some extent, this development is already taking place. The expected difference in the future is the addition of networked communications that will allow many of these embedded computations to coordinate with each other and with the user. Human interfaces to these embedded devices will in many cases be very different from those appropriate to workstations.
Augmented reality
A common staple of science fiction, augmented reality refers to the notion of layering relevant information into our vision of the world. Existing projects show real-time statistics to users performing difficult tasks, such as manufacturing. Future work might include augmenting our social interactions by providing additional information about those we converse with.
Social computing
In recent years, there has been an explosion of social science research focusing on interactions as the unit of analysis. Much of this research draws from psychology, social psychology, and sociology. For example, one study found out that people expected a computer with a man's name to cost more than a machine with a woman's name.[20] Other research finds that individuals perceive their interactions with computers more positively than humans, despite behaving the same way towards these machines.[21]
Factors of change
Traditionally, as explained in a journal article discussing user modeling and user-adapted interaction, computer usage was modeled as a human-computer dyad in which the two were connected by a narrow explicit communication channel, such as text-based terminals. Much work has been done to improve the interaction between a computing system and a human. However, as stated in the introduction, there is much room for mishaps and failure. Because of this, human-computer interaction shifted its focus beyond the interface, responding to observations articulated by D. Engelbart: "If ease of use was the only valid criterion, people would stick to tricycles and never try bicycles."[22]
The means by which humans interact with computers continues to evolve rapidly. Human–computer interaction is affected by the forces shaping the nature of future computing. These forces include:
Decreasing hardware costs leading to larger memory and faster systems
Miniaturization of hardware leading to portability
Reduction in power requirements leading to portability
New display technologies leading to the packaging of computational devices in new forms
Specialized hardware leading to new functions
Increased development of network communication and distributed computing
Increasingly widespread use of computers, especially by people who are outside of the computing profession
Increasing innovation in input techniques (e.g., voice, gesture, pen), combined with lowering cost, leading to rapid computerization by people previously left out of the "computer revolution."
Wider social concerns leading to improved access to computers by currently disadvantaged groups
The future for HCI, based on current promising research, is expected[23] to include the following characteristics:
Ubiquitous communication. Computers are expected to communicate through high speed local networks, nationally over wide-area networks, and portably via infrared, ultrasonic, cellular, and other technologies. Data and computational services will be portably accessible from many if not most locations to which a user travels.
High-functionality systems. Systems can have large numbers of functions associated with them. There are so many systems that most users, technical or non-technical, do not have time to learn them in the traditional way (e.g., through thick manuals).
Mass availability of computer graphics. Computer graphics capabilities such as image processing, graphics transformations, rendering, and interactive animation are becoming widespread as inexpensive chips become available for inclusion in general workstations and mobile devices.
Mixed media. Commercial systems can handle images, voice, sounds, video, text, formatted data. These are exchangeable over communication links among users. The separate worlds of consumer electronics (e.g., stereo sets, VCRs, televisions) and computers are partially merging. Computer and print worlds are expected to cross-assimilate each other.
High-bandwidth interaction. The rate at which humans and machines interact is expected to increase substantially due to the changes in speed, computer graphics, new media, and new input/output devices. This can lead to some qualitatively different interfaces, such as virtual reality or computational video.
Large and thin displays. New display technologies are finally maturing, enabling very large displays and displays that are thin, lightweight, and low in power consumption. This is having large effects on portability and will likely enable the development of paper-like, pen-based computer interaction systems very different in feel from desktop workstations of the present.
Information utilities. Public information utilities (such as home banking and shopping) and specialized industry services (e.g., weather for pilots) are expected to proliferate. The rate of proliferation can accelerate with the introduction of high-bandwidth interaction and the improvement in quality of interfaces.
Scientific conferences
One of the main conferences for new research in human-computer interaction is the annually held ACM Conference on Human Factors in Computing Systems, usually referred to by its short name CHI (pronounced kai, or khai). CHI is organized by the ACM Special Interest Group on Computer–Human Interaction (SIGCHI). CHI is a large conference, with thousands of attendees, and is quite broad in scope. It is attended by academics, practitioners and industry people, with company sponsors such as Google, Microsoft, and PayPal.
There are also dozens of other smaller, regional or specialized HCI-related conferences held around the world each year, including:[24]
ASSETS: ACM International Conference on Computers and Accessibility
CSCW: ACM conference on Computer Supported Cooperative Work
CC: Aarhus decennial conference on Critical Computing
DIS: ACM conference on Designing Interactive Systems
ECSCW: European Conference on Computer-Supported Cooperative Work
GROUP: ACM conference on supporting group work
HRI: ACM/IEEE International Conference on Human–robot interaction
ICMI: International Conference on Multimodal Interfaces
ITS: ACM conference on Interactive Tabletops and Surfaces
MobileHCI: International Conference on Human–Computer Interaction with Mobile Devices and Services
NIME: International Conference on New Interfaces for Musical Expression
Ubicomp: International Conference on Ubiquitous computing
UIST: ACM Symposium on User Interface Software and Technology
i-USEr: International Conference on User Science and Engineering
INTERACT: IFIP TC13 Conference on Human-Computer Interaction
svg = d3.select('svg')
width = svg.node().getBoundingClientRect().width
height = svg.node().getBoundingClientRect().height
treemap = d3.layout.treemap()
  .size([width, height])
  .value((node) -> node.count)
  .sort((a,b) ->
    return +1 if a.name is 'a' or b.name is 'b'
    return -1 if a.name is 'b' or b.name is 'a'
    return a.count-b.count
  )
  .ratio(1/3)
  .padding((node) ->
    if node.depth is 0
      return [0,0,40,0] # make room for set labels
    else if node.depth is 1
      return 4
    else
      return 0
  )
  .round(false) # bugfix: d3 wrong ordering
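# scales used to adjust each label's viewBox: slightly widen it (small horizontal margin)
# and reduce its height (compensating for the extra vertical space in the text's bounding box)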
correct_x = d3.scale.linear()
  .domain([0, width])
  .range([0, width*1.05])
correct_y = d3.scale.linear()
  .domain([0, height])
  .range([0, height*3/4])

# define a stable color scale to differentiate words and sets
color = (txt, set) ->
  iset = {'a': 0, 'intersection': 1, 'b': 2}[set]
  Math.seedrandom(txt+'abcdef')
  noise = (W) -> Math.random()*W - W/2
  d3.hcl(iset*90+noise(90), 40, 50)

# translate the viewBox to have (0,0) at the center of the vis
svg
  .attr
    viewBox: "#{-width/2} #{-height/2} #{width} #{height}"

# append a group for zoomable content
zoomable_layer = svg.append('g')

# define a zoom behavior
zoom = d3.behavior.zoom()
  .scaleExtent([1,10]) # min-max zoom
  .on 'zoom', () ->
    # GEOMETRIC ZOOM
    zoomable_layer
      .attr
        transform: "translate(#{zoom.translate()})scale(#{zoom.scale()})"

# bind the zoom behavior to the main SVG
svg.call(zoom)

# group the visualization
vis = zoomable_layer.append('g')
  .attr
    transform: "translate(#{-width/2},#{-height/2})"

d3.csv 'english_stopwords_long.txt', (stopwords_array) ->
  # build an index of stopwords
  stopwords = {}
  stopwords_array.forEach (w) -> stopwords[w.word] = true

  d3.text 'infovis.txt', (infovis_txt) ->
    data_a = nlp.ngram(infovis_txt, {min_count: 1, max_size: 1})[0].filter (w) -> w.word not of stopwords
    index_a = {}
    data_a.forEach (d) ->
      index_a[d.word] = d

    d3.text 'hci.txt', (hci_txt) ->
      data_b = nlp.ngram(hci_txt, {min_count: 1, max_size: 1})[0].filter (w) -> w.word not of stopwords
      index_b = {}
      data_b.forEach (d) ->
        index_b[d.word] = d
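      # words appearing in only one document seed the difference sets;
      # words in common are split between intersection and differences below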
      diff_a = data_a.filter (a) -> a.word not of index_b
      diff_b = data_b.filter (b) -> b.word not of index_a
      intersection = []
      data_a.forEach (a) ->
        data_b.forEach (b) ->
          if a.word is b.word
            min = Math.min(a.count, b.count)
            intersection.push {word: a.word, count: min}
            if a.count-min > 0
              diff_a.push {word: a.word, count: a.count-min}
            if b.count-min > 0
              diff_b.push {word: b.word, count: b.count-min}

      a = {
        children: (diff_a.filter (d) -> d.count > 1),
        name: "a"
      }
      intersection = {
        children: (intersection.filter (d) -> d.count > 1),
        name: "intersection"
      }
      b = {
        children: (diff_b.filter (d) -> d.count > 1),
        name: "b"
      }
      tree = {
        children: [a,intersection,b],
        name: "root"
      }

      nodes_data = treemap.nodes(tree)

      labels = vis.selectAll('.label')
        .data(nodes_data.filter((node) -> node.depth is 2))

      enter_labels = labels.enter().append('svg')
        .attr
          class: 'label'

      enter_labels.append('text')
        .text((node) -> node.word.toUpperCase())
        .attr
          dy: '0.35em'
          fill: (node) -> color(node.word, node.parent.name)
        .each (node) ->
          bbox = this.getBBox()
          bbox_aspect = bbox.width / bbox.height
          node_bbox = {width: node.dx, height: node.dy}
          node_bbox_aspect = node_bbox.width / node_bbox.height
          rotate = bbox_aspect >= 1 and node_bbox_aspect < 1 or bbox_aspect < 1 and node_bbox_aspect >= 1
          node.label_bbox = {
            x: bbox.x+(bbox.width-correct_x(bbox.width))/2,
            y: bbox.y+(bbox.height-correct_y(bbox.height))/2,
            width: correct_x(bbox.width),
            height: correct_y(bbox.height)
          }
          if rotate
            node.label_bbox = {
              x: node.label_bbox.y,
              y: node.label_bbox.x,
              width: node.label_bbox.height,
              height: node.label_bbox.width
            }
            d3.select(this).attr('transform', 'rotate(-90)')

      enter_labels
        .attr
          x: (node) -> node.x
          y: (node) -> node.y
          width: (node) -> node.dx
          height: (node) -> node.dy
          viewBox: (node) -> "#{node.label_bbox.x} #{node.label_bbox.y} #{node.label_bbox.width} #{node.label_bbox.height}"
          preserveAspectRatio: 'none'

      # draw set labels
      vis.append('text')
        .text('A ∖ B')
        .attr
          class: 'set_label'
          x: a.x + a.dx/2
          y: height - 22
          dy: '0.35em'
      vis.append('text')
        .text('A ∩ B')
        .attr
          class: 'set_label'
          x: intersection.x + intersection.dx/2
          y: height - 22
          dy: '0.35em'
      vis.append('text')
        .text('B ∖ A')
        .attr
          class: 'set_label'
          x: b.x + b.dx/2
          y: height - 22
          dy: '0.35em'
svg {
  background: white;
}

.node {
  shape-rendering: crispEdges;
  vector-effect: non-scaling-stroke;
  stroke: white;
  stroke-width: 2;
}

.label {
  pointer-events: none;
  text-anchor: middle;
  font-family: Impact;
}

.set_label {
  fill: #444;
  font-family: serif;
  font-size: 26px;
  text-anchor: middle;
  font-weight: bold;
}
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<script src="http://davidbau.com/encode/seedrandom-min.js"></script>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="nlp.js"></script>
<link rel="stylesheet" type="text/css" href="index.css">
<title>Word cloud intersection</title>
</head>
<body>
<svg width="960px" height="500px"></svg>
<script src="index.js"></script>
</body>
</html>
// Generated by CoffeeScript 1.4.0
(function() {
var color, correct_x, correct_y, height, svg, treemap, vis, width, zoom, zoomable_layer;
svg = d3.select('svg');
width = svg.node().getBoundingClientRect().width;
height = svg.node().getBoundingClientRect().height;
treemap = d3.layout.treemap().size([width, height]).value(function(node) {
return node.count;
}).sort(function(a, b) {
if (a.name === 'a' || b.name === 'b') {
return +1;
}
if (a.name === 'b' || b.name === 'a') {
return -1;
}
return a.count - b.count;
}).ratio(1 / 3).padding(function(node) {
if (node.depth === 0) {
return [0, 0, 40, 0];
} else if (node.depth === 1) {
return 4;
} else {
return 0;
}
}).round(false);
correct_x = d3.scale.linear().domain([0, width]).range([0, width * 1.05]);
correct_y = d3.scale.linear().domain([0, height]).range([0, height * 3 / 4]);
color = function(txt, set) {
var iset, noise;
iset = {
'a': 0,
'intersection': 1,
'b': 2
}[set];
Math.seedrandom(txt + 'abcdef');
noise = function(W) {
return Math.random() * W - W / 2;
};
return d3.hcl(iset * 90 + noise(90), 40, 50);
};
svg.attr({
viewBox: "" + (-width / 2) + " " + (-height / 2) + " " + width + " " + height
});
zoomable_layer = svg.append('g');
zoom = d3.behavior.zoom().scaleExtent([1, 10]).on('zoom', function() {
return zoomable_layer.attr({
transform: "translate(" + (zoom.translate()) + ")scale(" + (zoom.scale()) + ")"
});
});
svg.call(zoom);
vis = zoomable_layer.append('g').attr({
transform: "translate(" + (-width / 2) + "," + (-height / 2) + ")"
});
d3.csv('english_stopwords_long.txt', function(stopwords_array) {
var stopwords;
stopwords = {};
stopwords_array.forEach(function(w) {
return stopwords[w.word] = true;
});
return d3.text('infovis.txt', function(infovis_txt) {
var data_a, index_a;
data_a = nlp.ngram(infovis_txt, {
min_count: 1,
max_size: 1
})[0].filter(function(w) {
return !(w.word in stopwords);
});
index_a = {};
data_a.forEach(function(d) {
return index_a[d.word] = d;
});
return d3.text('hci.txt', function(hci_txt) {
var a, b, data_b, diff_a, diff_b, enter_labels, index_b, intersection, labels, nodes_data, tree;
data_b = nlp.ngram(hci_txt, {
min_count: 1,
max_size: 1
})[0].filter(function(w) {
return !(w.word in stopwords);
});
index_b = {};
data_b.forEach(function(d) {
return index_b[d.word] = d;
});
diff_a = data_a.filter(function(a) {
return !(a.word in index_b);
});
diff_b = data_b.filter(function(b) {
return !(b.word in index_a);
});
intersection = [];
data_a.forEach(function(a) {
return data_b.forEach(function(b) {
var min;
if (a.word === b.word) {
min = Math.min(a.count, b.count);
intersection.push({
word: a.word,
count: min
});
if (a.count - min > 0) {
diff_a.push({
word: a.word,
count: a.count - min
});
}
if (b.count - min > 0) {
return diff_b.push({
word: b.word,
count: b.count - min
});
}
}
});
});
a = {
children: diff_a.filter(function(d) {
return d.count > 1;
}),
name: "a"
};
intersection = {
children: intersection.filter(function(d) {
return d.count > 1;
}),
name: "intersection"
};
b = {
children: diff_b.filter(function(d) {
return d.count > 1;
}),
name: "b"
};
tree = {
children: [a, intersection, b],
name: "root"
};
nodes_data = treemap.nodes(tree);
labels = vis.selectAll('.label').data(nodes_data.filter(function(node) {
return node.depth === 2;
}));
enter_labels = labels.enter().append('svg').attr({
"class": 'label'
});
enter_labels.append('text').text(function(node) {
return node.word.toUpperCase();
}).attr({
dy: '0.35em',
fill: function(node) {
return color(node.word, node.parent.name);
}
}).each(function(node) {
var bbox, bbox_aspect, node_bbox, node_bbox_aspect, rotate;
bbox = this.getBBox();
bbox_aspect = bbox.width / bbox.height;
node_bbox = {
width: node.dx,
height: node.dy
};
node_bbox_aspect = node_bbox.width / node_bbox.height;
rotate = bbox_aspect >= 1 && node_bbox_aspect < 1 || bbox_aspect < 1 && node_bbox_aspect >= 1;
node.label_bbox = {
x: bbox.x + (bbox.width - correct_x(bbox.width)) / 2,
y: bbox.y + (bbox.height - correct_y(bbox.height)) / 2,
width: correct_x(bbox.width),
height: correct_y(bbox.height)
};
if (rotate) {
node.label_bbox = {
x: node.label_bbox.y,
y: node.label_bbox.x,
width: node.label_bbox.height,
height: node.label_bbox.width
};
return d3.select(this).attr('transform', 'rotate(-90)');
}
});
enter_labels.attr({
x: function(node) {
return node.x;
},
y: function(node) {
return node.y;
},
width: function(node) {
return node.dx;
},
height: function(node) {
return node.dy;
},
viewBox: function(node) {
return "" + node.label_bbox.x + " " + node.label_bbox.y + " " + node.label_bbox.width + " " + node.label_bbox.height;
},
preserveAspectRatio: 'none'
});
vis.append('text').text('A ∖ B').attr({
"class": 'set_label',
x: a.x + a.dx / 2,
y: height - 22,
dy: '0.35em'
});
vis.append('text').text('A ∩ B').attr({
"class": 'set_label',
x: intersection.x + intersection.dx / 2,
y: height - 22,
dy: '0.35em'
});
return vis.append('text').text('B ∖ A').attr({
"class": 'set_label',
x: b.x + b.dx / 2,
y: height - 22,
dy: '0.35em'
});
});
});
});
}).call(this);
Information visualization or information visualisation is the study of (interactive) visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information. However, information visualization differs from scientific visualization: "it’s infovis [information visualization] when the spatial representation is chosen, and it’s scivis [scientific visualization] when the spatial representation is given".[1]
Overview
Partial map of the Internet early 2005, each line represents two IP addresses, and some delay between those two nodes.
The field of information visualization has emerged "from research in human-computer interaction, computer science, graphics, visual design, psychology, and business methods. It is increasingly applied as a critical component in scientific research, digital libraries, data mining, financial data analysis, market studies, manufacturing production control, and drug discovery".[2]
Information visualization presumes that "visual representations and interaction techniques take advantage of the human eye’s broad bandwidth pathway into the mind to allow users to see, explore, and understand large amounts of information at once. Information visualization focused on the creation of approaches for conveying abstract information in intuitive ways."[3]
Data analysis is an indispensable part of all applied research and problem solving in industry. The most fundamental data analysis approaches are visualization (histograms, scatter plots, surface plots, tree maps, parallel coordinate plots, etc.), statistics (hypothesis test, regression, PCA, etc.), data mining (association mining, etc.), and machine learning methods (clustering, classification, decision trees, etc.). Among these approaches, information visualization, or visual data analysis, is the most reliant on the cognitive skills of human analysts, and allows the discovery of unstructured actionable insights that are limited only by human imagination and creativity. The analyst does not have to learn any sophisticated methods to be able to interpret the visualizations of the data. Information visualization is also a hypothesis generation scheme, which can be, and is typically followed by more analytical or formal analysis, such as statistical hypothesis testing.
History
The modern study of visualization started with computer graphics, which "has from its beginning been used to study scientific problems. However, in its early days the lack of graphics power often limited its usefulness. The recent emphasis on visualization started in 1987 with the special issue of Computer Graphics on Visualization in Scientific Computing. Since then there have been several conferences and workshops, co-sponsored by the IEEE Computer Society and ACM SIGGRAPH".[4] They have been devoted to the general topics of data visualisation, information visualization and scientific visualisation, and more specific areas such as volume visualization.
Product Space Localization, intended to show the Economic Complexity of a given economy
Tree Map of Benin Exports (2009) by product category. The Product Exports Treemaps are one of the most recent applications of this kind of visualization, developed by the Harvard-MIT Observatory of Economic Complexity
In 1786, William Playfair published the first presentation graphics.
Specific methods and techniques
Cladogram (phylogeny)
Dendrogram (classification)
Information visualization reference model
Graph drawing
Heatmap
HyperbolicTree
Multidimensional scaling
Parallel coordinates
Problem solving environment
Treemapping
Applications
Information visualization insights are being applied in areas such as:[2]
scientific research
digital libraries
data mining
information graphics
financial data analysis
market studies
manufacturing production control
crime mapping
Experts
Stuart K. Card
Stuart K. Card is an American researcher. He is a Senior Research Fellow at Xerox PARC and one of the pioneers of applying human factors in human–computer interaction. The 1983 book The Psychology of Human-Computer Interaction, which he co-wrote with Thomas P. Moran and Allen Newell, became a very influential book in the field, partly for introducing the Goals, Operators, Methods, and Selection rules (GOMS) framework. His current research is in the field of developing a supporting science of human–information interaction and visual-semantic prototypes to aid sensemaking.[5]
George W. Furnas
George Furnas is a professor and Associate Dean for Academic Strategy at the School of Information of the University of Michigan. Furnas has also worked with Bell Labs, where he earned the moniker "Fisheye Furnas" while working with fisheye visualizations. He is a pioneer of Latent semantic analysis, and is also considered a pioneer in the concept of Mosaic of Responsive Adaptive Systems (MoRAS).
James D. Hollan
James D. Hollan directs the Distributed Cognition and Human-Computer Interaction Laboratory at University of California, San Diego. His research explores the cognitive consequences of computationally based media. The goal is to understand the cognitive and computational characteristics of dynamic interactive representations as the basis for effective system design. His current work focuses on cognitive ethnography, computer-mediated communication, distributed cognition, human-computer interaction, information visualization, multiscale software, and tools for analysis of video data.
Aaron Koblin
Aaron Koblin is an American digital media artist best known for his innovative uses of data visualization and crowdsourcing. He is currently Creative Director of the Data Arts Team at Google in San Francisco, California.[6] Koblin's artworks are part of the permanent collections of the Victoria and Albert Museum (V&A) in London, the Museum of Modern Art (MoMA) in New York, and the Centre Georges Pompidou in Paris. He has presented at TED, and The World Economic Forum, and his work has been shown at international festivals including Ars Electronica, SIGGRAPH, and the Japan Media Arts Festival. In 2006, his Flight Patterns project received the National Science Foundation's first place award for science visualization.[7] In 2009, he was named to Creativity Magazine's Creativity 50,[8] in 2010 he was one of Esquire Magazine's Best and Brightest and Fast Company's Most Creative People in Business,[9] and in 2011 was one of Forbes magazine's 30 under 30. Koblin is a graduate of UCLA's Design | Media Arts MFA program, and sits on the board of the non-profit Gray Area Foundation For The Arts GAFFTA in San Francisco.
Manuel Lima
Manuel Lima is the founder of VisualComplexity.com and a Senior UX Design Lead at Microsoft. He is a Fellow of the Royal Society of Arts and was nominated by Creativity magazine as "one of the 50 most creative and influential minds of 2009". Lima is a leading voice on information visualization and a frequent speaker in conferences and schools around the world, including TED, Lift, OFFF, Reboot, VizThink, IxDA Interaction, Royal College of Art, NYU Tisch School of the Arts, ENSAD Paris, University of Amsterdam, MediaLab Prado Madrid.[10]
Edward Tufte
Edward Tufte is an American statistician and professor emeritus of political science, statistics, and computer science at Yale University.[11] He is noted for his writings on information design and as a pioneer in the field of data visualization.[12][13][14]
Fernanda Viegas and Martin Wattenberg
Fernanda Viegas and Martin Wattenberg are known for pioneering work in artistic and social data visualization. They lead Google's data visualization research group. They founded the field of Social data analysis and were the creators of "Many Eyes," the first cloud-based visualization service, and History Flow, a tool for visualizing Wikipedia edits. Their artwork has been shown in museums worldwide, and helped establish visualization as an artistic practice.[15][16][17]
More related scientists
George G. Robertson
Hans Rosling
Stephen Few
Pierre Rosenstiehl
Ben Shneiderman
John Stasko
Jean-Daniel Fekete
Sheelagh Carpendale
Catherine Plaisant
Organizations
International Symposium on Graph Drawing
Panopticon Software
Purdue Information Visualization Tools and Techniques (PIVOT Lab)
University of Maryland Human-Computer Interaction Lab
Vvi
Macrofocus
Mapjects Associative Visualization Software
Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It is not owned by any one field, but rather finds interpretation across many (e.g. it is viewed as a modern branch of descriptive statistics by some, but also as a grounded theory development tool by others). It involves the creation and study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".[1]
A primary goal of data visualization is to communicate information clearly and efficiently to users via the statistical graphics, plots, information graphics, tables, and charts selected. Effective visualization helps users in analyzing and reasoning about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measure of a variable, while charts of various types are used to show patterns or relationships in the data for one or more variables.
Data visualization is both an art and a science. The rate at which data is generated has increased, driven by an increasingly information-based economy. Data created by internet activity and an expanding number of sensors in the environment, such as satellites and traffic cameras, are referred to as "Big Data". Processing, analyzing and communicating this data present a variety of ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists have emerged to help address this challenge.[2]
Overview
Data visualization is one of the steps in analyzing data and presenting it to users.
Data visualization refers to the techniques used to communicate data or information by encoding it as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to communicate information clearly and efficiently to users. It is one of the steps in data analysis or data science. According to Friedman (2008) the "main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way. Yet designers often fail to achieve a balance between form and function, creating gorgeous data visualizations which fail to serve their main purpose — to communicate information".[3]
Indeed, Fernanda Viegas and Martin M. Wattenberg have suggested that an ideal visualization should not only communicate clearly, but stimulate viewer engagement and attention.[4]
Well-crafted data visualization helps uncover trends, realize insights, explore sources, and tell stories.[5]
Data visualization is closely related to information graphics, information visualization, scientific visualization, exploratory data analysis and statistical graphics. In the new millennium, data visualization has become an active area of research, teaching and development. According to Post et al. (2002), it has united scientific and information visualization.[6]
Characteristics of effective graphical displays
Charles Joseph Minard's 1869 diagram of Napoleon's March - an early example of an information graphic.
The greatest value of a picture is when it forces us to notice what we never expected to see.
John Tukey[7]
Professor Edward Tufte explained that users of information displays are executing particular analytical tasks such as making comparisons or determining causality. The design principle of the information graphic should support the analytical task, showing the comparison or causality.[8]
In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines 'graphical displays' and principles for effective graphical display in the following passage: "Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency. Graphical displays should:
show the data
induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else
avoid distorting what the data have to say
present many numbers in a small space
make large data sets coherent
encourage the eye to compare different pieces of data
reveal the data at several levels of detail, from a broad overview to the fine structure
serve a reasonably clear purpose: description, exploration, tabulation or decoration
be closely integrated with the statistical and verbal descriptions of a data set.
Graphics reveal data. Indeed graphics can be more precise and revealing than conventional statistical computations."[9]
For example, the Minard diagram shows the losses suffered by Napoleon's army in the 1812-1813 period. Six variables are plotted: the size of the army, its location on a two-dimensional surface (x and y), time, direction of movement, and temperature. The line width illustrates a comparison (size of the army at points in time) while the temperature axis suggests a cause of the change in army size. This multivariate display on a two dimensional surface tells a story that can be grasped immediately while identifying the source data to build credibility. Tufte wrote in 1983 that: "It may well be the best statistical graphic ever drawn."[9]
Not applying these principles may result in misleading graphs, which distort the message or support an erroneous conclusion. According to Tufte, chartjunk refers to extraneous interior decoration of the graphic that does not enhance the message, or gratuitous three dimensional or perspective effects. Needlessly separating the explanatory key from the image itself, requiring the eye to travel back and forth from the image to the key, is a form of "administrative debris." The ratio of "data to ink" should be maximized, erasing non-data ink where feasible.[9]
The Congressional Budget Office summarized several best practices for graphical displays in a June 2014 presentation. These included: a) Knowing your audience; b) Designing graphics that can stand alone outside the context of the report; and c) Designing graphics that communicate the key messages in the report.[10]
Quantitative messages
A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.
A scatterplot illustrating negative correlation between two variables (inflation and unemployment) measured at points in time.
Author Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message:
Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show comparison of the actual versus the reference amount.
Frequency distribution: Shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return falls within ranges such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis.
Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.[11][12]
Analysts reviewing a set of data may consider whether some or all of the messages and graphic types above are applicable to their task and audience. The process of trial and error to identify meaningful relationships and messages in the data is part of exploratory data analysis.
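The pairings above lend themselves to a small lookup table. The following is a minimal, illustrative JavaScript sketch (the key names are ours, and Few's guidance is richer than a strict one-to-one mapping):

var chartForMessage = {
  "time-series": "line chart",
  "ranking": "bar chart",
  "part-to-whole": "pie or bar chart",
  "deviation": "bar chart",
  "frequency distribution": "histogram",
  "correlation": "scatter plot",
  "nominal comparison": "bar chart",
  "geographic": "cartogram"
};
// e.g. chartForMessage["correlation"] === "scatter plot"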
Visual perception and data visualization
A human can distinguish differences in line length, shape orientation, and color (hue) readily without significant processing effort; these are referred to as "pre-attentive attributes." For example, it may require significant time and effort ("attentive processing") to identify the number of times the digit "5" appears in a series of numbers; but if that digit is different in size, orientation, or color, instances of the digit can be noted quickly through pre-attentive processing.[13]
Effective graphics take advantage of pre-attentive processing and attributes and the relative strength of these attributes. For example, since humans can more easily process differences in line length than surface area, it may be more effective to use a bar chart (which takes advantage of line length to show comparison) rather than pie charts (which use surface area to show comparison).[13]
Terminology
Data visualization involves specific terminology, some of which is derived from statistics. For example, author Stephen Few defines two types of data, which are used in combination to support a meaningful analysis or visualization:
Categorical: Text labels describing the nature of the data, such as "Name" or "Age". This term also covers qualitative (non-numerical) data.
Quantitative: Numerical measures, such as "25" to represent the age in years.
Two primary types of information displays are tables and graphs.
A table contains quantitative data organized into rows and columns with categorical labels. It is primarily used to look up specific values. In the example above, the table might have categorical column labels representing the name (a qualitative variable) and age (a quantitative variable), with each row of data representing one person (the sampled experimental unit or category subdivision).
A graph is primarily used to show relationships among data and portrays values encoded as visual objects (e.g., lines, bars, or points). Numerical values are displayed within an area delineated by one or more axes. These axes provide scales (quantitative and categorical) used to label and assign values to the visual objects. Many graphs are also referred to as charts.[14]
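As a minimal JavaScript sketch (with made-up names and values, purely for illustration), the same records can back both kinds of display:

// "name" is a categorical variable, "age" a quantitative one
var rows = [
  { name: "Alice", age: 25 },
  { name: "Bob", age: 31 }
];
// a table keeps the raw values for look-up; a graph encodes the quantitative
// value as a visual property, e.g. bar length along an axis
var bars = rows.map(function (row) {
  return { label: row.name, barLength: row.age };
});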
KPI Library has developed the “Periodic Table of Visualization Methods,” an interactive chart displaying various data visualization methods. It includes six types of data visualization methods: data, information, concept, strategy, metaphor and compound.[15]
Examples of diagrams used for data visualization
Each entry gives the chart name followed by its visual dimensions:
Bar Chart (e.g. bar chart of tips by day of week): length/count, category, (color)
Histogram (e.g. histogram of housing prices): bin limits, count/length, (color)
Scatterplot (e.g. basic scatterplot of two variables): x position, y position, (symbol/glyph), (color), (size)
Network: nodes size, nodes color, ties thickness, ties color, spatialization
Streamgraph: width, color, time (flow)
Treemap: size, color
Gantt Chart: color, time (flow)
Scatter Plot (3D): position x, position y, position z, color
Heat Map: row, column, cluster, color
Other perspectives
There are different approaches to the scope of data visualization. One common focus is on information presentation, as Friedman (2008) presents it. In this vein, Friendly (2008) presumes two main parts of data visualization: statistical graphics and thematic cartography.[1] Along the same lines, the "Data Visualization: Modern Approaches" (2007) article gives an overview of seven subjects of data visualization:[16]
Articles & resources
Displaying connections
Displaying data
Displaying news
Displaying websites
Mindmaps
Tools and services
All these subjects are closely related to graphic design and information representation.
On the other hand, from a computer science perspective, Frits H. Post (2002) categorized the field into a number of sub-fields:[6]
Information visualization
Interaction techniques and architectures
Modelling techniques
Multiresolution methods
Visualization algorithms and techniques
Volume visualization
Data presentation architecture
A data visualization from social media
Data presentation architecture (DPA) is a skill-set that seeks to identify, locate, manipulate, format and present data in such a way as to optimally communicate meaning and proffer knowledge.
Historically, the term data presentation architecture is attributed to Kelly Lautt:[17] "Data Presentation Architecture (DPA) is a rarely applied skill set critical for the success and value of Business Intelligence. Data presentation architecture weds the science of numbers, data and statistics in discovering valuable information from data and making it usable, relevant and actionable with the arts of data visualization, communications, organizational psychology and change management in order to provide business intelligence solutions with the data scope, delivery timing, format and visualizations that will most effectively support and drive operational, tactical and strategic behaviour toward understood business (or organizational) goals. DPA is neither an IT nor a business skill set but exists as a separate field of expertise. Often confused with data visualization, data presentation architecture is a much broader skill set that includes determining what data on what schedule and in what exact format is to be presented, not just the best way to present data that has already been chosen (which is data visualization). Data visualization skills are one element of DPA."
Objectives
DPA has two main objectives:
To use data to provide knowledge in the most efficient manner possible (minimize noise, complexity, and unnecessary data or detail given each audience's needs and roles)
To use data to provide knowledge in the most effective manner possible (provide relevant, timely and complete data to each audience member in a clear and understandable manner that conveys important meaning, is actionable and can affect understanding, behavior and decisions)
Scope
With the above objectives in mind, the actual work of data presentation architecture consists of:
Creating effective delivery mechanisms for each audience member depending on their role, tasks, locations and access to technology
Defining important meaning (relevant knowledge) that is needed by each audience member in each context
Determining the required periodicity of data updates (the currency of the data)
Determining the right timing for data presentation (when and how often the user needs to see the data)
Finding the right data (subject area, historical reach, breadth, level of detail, etc.)
Utilizing appropriate analysis, grouping, visualization, and other presentation formats
Related fields
DPA work shares commonalities with several other fields, including:
Business analysis in determining business goals, collecting requirements, mapping processes.
Business process improvement in that its goal is to improve and streamline actions and decisions in furtherance of business goals
Data visualization in that it uses well-established theories of visualization to add or highlight meaning or importance in data presentation.
Graphic or user design: As the term DPA is used, it falls just short of design in that it does not consider such details as colour palettes, styling, branding and other aesthetic concerns, unless these design elements are specifically required or beneficial for communication of meaning, impact, severity or other information of business value. For example:
choosing locations for various data presentation elements on a presentation page (such as in a company portal, in a report or on a web page) in order to convey hierarchy, priority, importance or a rational progression for the user is part of the DPA skill-set.
choosing to provide a specific colour in graphical elements that represent data of specific meaning or concern is part of the DPA skill-set
Information architecture, but information architecture's focus is on unstructured data and therefore excludes both analysis (in the statistical/data sense) and direct transformation of the actual content (data, for DPA) into new entities and combinations.
Solution architecture in determining the optimal detailed solution, including the scope of data to include, given the business goals
Statistical analysis or data analysis in that it creates information and knowledge out of data
(function e(t,n,r){function s(o,u){if(!n[o]){if(!t[o]){var a=typeof require=="function"&&require;if(!u&&a)return a(o,!0);if(i)return i(o,!0);var f=new Error("Cannot find module '"+o+"'");throw f.code="MODULE_NOT_FOUND",f}var l=n[o]={exports:{}};t[o][0].call(l.exports,function(e){var n=t[o][1][e];return s(n?n:e)},l,l.exports,e,t,n,r)}return n[o].exports}var i=typeof require=="function"&&require;for(var o=0;o<r.length;o++)s(r[o]);return s})({1:[function(require,module,exports){
// nlp_compromise by @spencermountain in 2014
// most files are self-contained modules that optionally export for nodejs
// this file loads them all together
// if we're server-side, grab files, otherwise assume they're prepended already
// console.time('nlp_boot')
var parents = require("./src/parents/parents")
var sentence_parser = require('./src/methods/tokenization/sentence');
var tokenize = require('./src/methods/tokenization/tokenize');
var ngram = require('./src/methods/tokenization/ngram');
//tokenize
var normalize = require('./src/methods/transliteration/unicode_normalisation')
var syllables = require('./src/methods/syllables/syllable');
//localization
var americanize = require('./src/methods/localization/americanize')
var britishize = require('./src/methods/localization/britishize')
//part of speech tagging
var pos = require('./src/pos');
//named_entity_recognition
var spot = require('./src/spot');
///
// define the api
var nlp = {
noun: parents.noun,
adjective: parents.adjective,
verb: parents.verb,
adverb: parents.adverb,
value: parents.value,
sentences: sentence_parser,
ngram: ngram,
tokenize: tokenize,
americanize: americanize,
britishize: britishize,
syllables: syllables,
normalize: normalize.normalize,
denormalize: normalize.denormalize,
pos: pos,
spot: spot
}
//export it for client-side
if (typeof window!=="undefined") { //is this right?
window.nlp = nlp
}
//export it for server-side
module.exports = nlp;
// console.timeEnd('nlp_boot')
// console.log( nlp.pos('she sells seashells by the seashore').sentences[0].negate().text() )
// console.log( nlp.pos('i will slouch'));
// console.log( nlp.pos('Sally Davidson sells seashells by the seashore. Joe Biden said so.').people() )
// console.log(nlp.pos("Tony Danza is great. He works in the bank.").sentences[1].tokens[0].analysis.reference_to())
// console.log(nlp.pos("the FBI was hacked. He took their drugs.").sentences[1].tokens[2].analysis.reference_to())
},{"./src/methods/localization/americanize":17,"./src/methods/localization/britishize":18,"./src/methods/syllables/syllable":19,"./src/methods/tokenization/ngram":20,"./src/methods/tokenization/sentence":21,"./src/methods/tokenization/tokenize":22,"./src/methods/transliteration/unicode_normalisation":23,"./src/parents/parents":35,"./src/pos":45,"./src/spot":48}],2:[function(require,module,exports){
//the lexicon is a large hash of words and their predicted part-of-speech.
// it plays a bootstrap-role in pos tagging in this library.
// to save space, most of the list is derived from conjugation methods,
// and other forms are stored in a compact way
var multiples = require("./lexicon/multiples")
var values = require("./lexicon/values")
var demonyms = require("./lexicon/demonyms")
var abbreviations = require("./lexicon/abbreviations")
var honourifics = require("./lexicon/honourifics")
var uncountables = require("./lexicon/uncountables")
var firstnames = require("./lexicon/firstnames")
var irregular_nouns = require("./lexicon/irregular_nouns")
//verbs
var verbs = require("./lexicon/verbs")
var verb_conjugate = require("../parents/verb/conjugate/conjugate")
var verb_irregulars = require("../parents/verb/conjugate/verb_irregulars")
var phrasal_verbs = require("./lexicon/phrasal_verbs")
var adjectives = require("./lexicon/adjectives")
var adj_to_adv = require("../parents/adjective/conjugate/to_adverb")
var to_superlative = require("../parents/adjective/conjugate/to_superlative")
var to_comparative = require("../parents/adjective/conjugate/to_comparative")
var convertables = require("../parents/adjective/conjugate/convertables")
var main = {
"etc": "FW", //foreign words
"ie": "FW",
"there": "EX",
"better": "JJR",
"earlier": "JJR",
"has": "VB",
"more": "RBR",
"sounds": "VBZ"
}
var compact = {
//conjunctions
"CC": [
"yet",
"therefore",
"or",
"while",
"nor",
"whether",
"though",
"because",
"but",
"for",
"and",
"if",
"however",
"before",
"although",
"how",
"plus",
"versus",
"not"
],
"VBD": [
"where'd",
"when'd",
"how'd",
"what'd",
"said",
"had",
"been",
"began",
"came",
"did",
"meant",
"went"
],
"VBN": [
"given",
"known",
"shown",
"seen",
"born",
],
"VBG": [
"going",
"being",
"according",
"resulting",
"developing",
"staining"
],
//copula
"CP": [
"is",
"will be",
"are",
"was",
"were",
"am",
"isn't",
"ain't",
"aren't"
],
//determiners
"DT": [
"this",
"any",
"enough",
"each",
"whatever",
"every",
"which",
"these",
"another",
"plenty",
"whichever",
"neither",
"an",
"a",
"least",
"own",
"few",
"both",
"those",
"the",
"that",
"various",
"what",
"either",
"much",
"some",
"else",
"no",
//some other languages (what could go wrong?)
"la",
"le",
"les",
"des",
"de",
"du",
"el"
],
//prepositions
"IN": [
"until",
"onto",
"of",
"into",
"out",
"except",
"across",
"by",
"between",
"at",
"down",
"as",
"from",
"around",
"with",
"among",
"upon",
"amid",
"to",
"along",
"since",
"about",
"off",
"on",
"within",
"in",
"during",
"per",
"without",
"throughout",
"through",
"than",
"via",
"up",
"unlike",
"despite",
"below",
"unless",
"towards",
"besides",
"after",
"whereas",
"'o",
"amidst",
"amongst",
"apropos",
"atop",
"barring",
"chez",
"circa",
"mid",
"midst",
"notwithstanding",
"qua",
"sans",
"vis-a-vis",
"thru",
"till",
"versus",
"without",
"w/o",
"o'",
"a'",
],
//modal verbs
"MD": [
"can",
"may",
"could",
"might",
"will",
"ought to",
"would",
"must",
"shall",
"should",
"ought",
"shouldn't",
"wouldn't",
"couldn't",
"mustn't",
"shan't",
"shant",
"lets", //arguable
"who'd",
"let's"
],
//posessive pronouns
"PP": [
"mine",
"something",
"none",
"anything",
"anyone",
"theirs",
"himself",
"ours",
"his",
"my",
"their",
"yours",
"your",
"our",
"its",
"nothing",
"herself",
"hers",
"themselves",
"everything",
"myself",
"itself",
"her", //this one is pretty ambiguous
"who",
"whom",
"whose"
],
//personal pronouns (nouns)
"PRP": [
"it",
"they",
"i",
"them",
"you",
"she",
"me",
"he",
"him",
"ourselves",
"us",
"we",
"thou",
"il",
"elle",
"yourself",
"'em"
],
//some manual adverbs (the rest are generated)
"RB": [
"now",
"again",
"already",
"soon",
"directly",
"toward",
"forever",
"apart",
"instead",
"yes",
"alone",
"ago",
"indeed",
"ever",
"quite",
"perhaps",
"where",
"then",
"here",
"thus",
"very",
"often",
"once",
"never",
"why",
"when",
"away",
"always",
"sometimes",
"also",
"maybe",
"so",
"just",
"well",
"several",
"such",
"randomly",
"too",
"rather",
"abroad",
"almost",
"anyway",
"twice",
"aside",
"moreover",
"anymore",
"newly",
"damn",
"somewhat",
"somehow",
"meanwhile",
"hence",
"further",
"furthermore"
],
//interjections
"UH": [
"uhh",
"uh-oh",
"ugh",
"sheesh",
"eww",
"pff",
"voila",
"oy",
"eep",
"hurrah",
"yuck",
"ow",
"duh",
"oh",
"hmm",
"yeah",
"whoa",
"ooh",
"whee",
"ah",
"bah",
"gah",
"yaa",
"phew",
"gee",
"ahem",
"eek",
"meh",
"yahoo",
"oops",
"d'oh",
"psst",
"argh",
"grr",
"nah",
"shhh",
"whew",
"mmm",
"yay",
"uh-huh",
"boo",
"wow",
"nope"
],
//nouns that shouldn't be seen as a verb
"NN": [
"president",
"dollar",
"student",
"patent",
"funding",
"morning",
"banking",
"ceiling",
"energy",
"secretary",
"purpose",
"friends",
"event"
]
}
//unpack the compact terms into the main lexicon..
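// e.g. compact["CC"] contains "yet", so the loop below sets main["yet"] = "CC"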
var i, i2, arr;
var keys = Object.keys(compact)
var l = keys.length
for (i = 0; i < l; i++) {
arr = compact[keys[i]]
for (i2 = 0; i2 < arr.length; i2++) {
main[arr[i2]] = keys[i];
}
}
//add values
keys = Object.keys(values)
l = keys.length
for (i = 0; i < l; i++) {
main[keys[i]] = "CD"
}
//add demonyms
l = demonyms.length
for (i = 0; i < l; i++) {
main[demonyms[i]] = "JJ"
}
//add abbreviations
l = abbreviations.length
for (i = 0; i < l; i++) {
main[abbreviations[i]] = "NNAB"
}
//add honourifics
l = honourifics.length
for (i = 0; i < l; i++) {
main[honourifics[i]] = "NNAB"
}
//add uncountable nouns
l = uncountables.length
for (i = 0; i < l; i++) {
main[uncountables[i]] = "NN"
}
//add irregular nouns
l = irregular_nouns.length
for (i = 0; i < l; i++) {
main[irregular_nouns[i][0]] = "NN"
main[irregular_nouns[i][1]] = "NNS"
}
//add firstnames
Object.keys(firstnames).forEach(function (k) {
main[k] = "NNP"
})
//add multiple-word terms
Object.keys(multiples).forEach(function (k) {
main[k] = multiples[k]
})
//add phrasal verbs
Object.keys(phrasal_verbs).forEach(function (k) {
main[k] = phrasal_verbs[k]
})
//add verbs
//conjugate all verbs. takes ~8ms. triples the lexicon size.
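// e.g. the seed verb "answer" contributes answer (VBP), answered (VBD), answering (VBG) and answers (VBZ)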
var c;
l = verbs.length;
for (i = 0; i < l; i++) {
//add conjugations
c = verb_conjugate(verbs[i])
main[c.infinitive] = main[c.infinitive] || "VBP"
main[c.past] = main[c.past] || "VBD"
main[c.gerund] = main[c.gerund] || "VBG"
main[c.present] = main[c.present] || "VBZ"
if (c.doer) {
main[c.doer] = main[c.doer] || "NNA"
}
if (c.participle) {
main[c.participle] = main[c.participle] || "VBN"
}
}
//add irregular verbs
l = verb_irregulars.length;
for (i = 0; i < l; i++) {
c = verb_irregulars[i]
main[c.infinitive] = main[c.infinitive] || "VBP"
main[c.gerund] = main[c.gerund] || "VBG"
main[c.past] = main[c.past] || "VBD"
main[c.present] = main[c.present] || "VBZ"
if (c.doer) {
main[c.doer] = main[c.doer] || "NNA"
}
if (c.participle) {
main[c.future] = main[c.future] || "VB"
}
}
//add adjectives
//conjugate all of these adjectives to their adverbs. (13ms)
var tmp, j;
l = adjectives.length;
for (i = 0; i < l; i++) {
main[adjectives[i]] = "JJ"
}
keys = Object.keys(convertables)
l = keys.length;
for (i = 0; i < l; i++) {
j = keys[i]
main[j] = "JJ"
//add adverb form
tmp = adj_to_adv(j)
if (tmp && tmp !== j && !main[tmp]) {
main[tmp] = main[tmp] || "RB"
}
//add comparative form
tmp = to_comparative(j)
if (tmp && !tmp.match(/^more ./) && tmp !== j && !main[tmp]) {
main[tmp] = main[tmp] || "JJR"
}
//add superlative form
tmp = to_superlative(j)
if (tmp && !tmp.match(/^most ./) && tmp !== j && !main[tmp]) {
main[tmp] = main[tmp] || "JJS"
}
}
module.exports = main;
// console.log(lexicon['once again']=="RB")
// console.log(lexicon['seven']=="CD")
// console.log(lexicon['sleep']=="VBP")
// console.log(lexicon['slept']=="VBD")
// console.log(lexicon['sleeping']=="VBG")
// // console.log(lexicon['completely'])
// console.log(lexicon['pretty']=="JJ")
// console.log(lexicon['canadian']=="JJ")
// console.log(lexicon['july']=="CD")
// console.log(lexicon[null]===undefined)
// console.log(lexicon["dr"]==="NNAB")
// console.log(lexicon["hope"]==="NN")
// console.log(lexicon["higher"]==="JJR")
// console.log(lexicon["earlier"]==="JJR")
// console.log(lexicon["larger"]==="JJR")
// console.log(lexicon["says"]==="VBZ")
// console.log(lexicon["sounds"]==="VBZ")
// console.log(lexicon["means"]==="VBZ")
// console.log(lexicon["look after"]==="VBP")
// console.log(Object.keys(lexicon).length)
// console.log(lexicon['prettier']=="JJR")
// console.log(lexicon['prettiest']=="JJS")
// console.log(lexicon['tony']=="NNP")
// console.log(lexicon['loaf']=="NN")
// console.log(lexicon['loaves']=="NNS")
// console.log(lexicon['he']=="PRP")
},{"../parents/adjective/conjugate/convertables":24,"../parents/adjective/conjugate/to_adverb":25,"../parents/adjective/conjugate/to_comparative":26,"../parents/adjective/conjugate/to_superlative":28,"../parents/verb/conjugate/conjugate":39,"../parents/verb/conjugate/verb_irregulars":42,"./lexicon/abbreviations":3,"./lexicon/adjectives":4,"./lexicon/demonyms":5,"./lexicon/firstnames":6,"./lexicon/honourifics":7,"./lexicon/irregular_nouns":8,"./lexicon/multiples":9,"./lexicon/phrasal_verbs":10,"./lexicon/uncountables":11,"./lexicon/values":12,"./lexicon/verbs":13}],3:[function(require,module,exports){
//these are common word shortenings used in the lexicon and sentence segmentation methods
//these are all nouns, or at the least, belong beside one.
var honourifics = require("./honourifics") //stored separately, for 'noun.is_person()'
var main = [
//common abbreviations
"arc", "al", "ave", "blvd", "cl", "ct", "cres", "exp", "rd", "st", "dist", "mt", "ft", "fy", "hwy", "la", "pd", "pl", "plz", "tce", "vs", "etc", "esp", "llb", "md", "bl", "ma", "ba", "lit", "fl", "ex", "eg", "ie",
//place main
"ala", "ariz", "ark", "cal", "calif", "col", "colo", "conn", "del", "fed", "fla", "ga", "ida", "ind", "ia", "kan", "kans", "ken", "ky", "la", "md", "mich", "minn", "mont", "neb", "nebr", "nev", "okla", "penna", "penn", "pa", "dak", "tenn", "tex", "ut", "vt", "va", "wash", "wis", "wisc", "wy", "wyo", "usafa", "alta", "ont", "que", "sask", "yuk", "bc",
//org main
"dept", "univ", "assn", "bros", "inc", "ltd", "co", "corp",
//proper nouns with exclamation marks
"yahoo", "joomla", "jeopardy"
]
//person titles like 'jr', (stored separately)
main = main.concat(honourifics)
module.exports = main;
},{"./honourifics":7}],4:[function(require,module,exports){
//adjectives that either aren't covered by rules, or have superlative/comparative forms
//this list is the seed, from which various forms are conjugated
module.exports= [
"colonial",
"moody",
"literal",
"actual",
"probable",
"apparent",
"usual",
"aberrant",
"ablaze",
"able",
"absolute",
"aboard",
"abrupt",
"absent",
"absorbing",
"abundant",
"accurate",
"adult",
"afraid",
"agonizing",
"ahead",
"aloof",
"amazing",
"arbitrary",
"arrogant",
"asleep",
"astonishing",
"average",
"awake",
"aware",
"awkward",
"back",
"bad",
"bankrupt",
"bawdy",
"beneficial",
"bent",
"best",
"better",
"bizarre",
"bloody",
"bouncy",
"brilliant",
"broken",
"burly",
"busy",
"cagey",
"careful",
"caring",
"certain",
"chief",
"chilly",
"civil",
"clever",
"closed",
"cloudy",
"colossal",
"commercial",
"common",
"complete",
"complex",
"concerned",
"concrete",
"congruent",
"constant",
"cooing",
"correct",
"cowardly",
"craven",
"cuddly",
"daily",
"damaged",
"damaging",
"dapper",
"dashing",
"deadpan",
"deeply",
"defiant",
"degenerate",
"delicate",
"delightful",
"desperate",
"determined",
"didactic",
"difficult",
"discreet",
"done",
"double",
"doubtful",
"downtown",
"dreary",
"east",
"eastern",
"elderly",
"elegant",
"elfin",
"elite",
"eminent",
"encouraging",
"entire",
"erect",
"ethereal",
"exact",
"expert",
"extra",
"exuberant",
"exultant",
"false",
"fancy",
"faulty",
"female",
"fertile",
"fierce ",
"financial",
"first",
"fit",
"fixed",
"flagrant",
"foamy",
"foolish",
"foregoing",
"foreign",
"former",
"fortunate",
"frantic",
"freezing",
"frequent",
"fretful",
"friendly",
"fun",
"furry",
"future",
"gainful",
"gaudy",
"giant",
"giddy",
"gigantic",
"gleaming",
"global",
"gold",
"gone",
"good",
"goofy",
"graceful",
"grateful",
"gratis",
"gray",
"grey",
"groovy",
"gross",
"guarded",
"half",
"handy",
"hanging",
"hateful",
"heady",
"heavenly",
"hellish",
"helpful",
"hesitant",
"highfalutin",
"homely",
"honest",
"huge",
"humdrum",
"hurried",
"hurt",
"icy",
"ignorant",
"ill",
"illegal",
"immediate",
"immense",
"imminent",
"impartial",
"imperfect",
"imported",
"initial",
"innate",
"inner",
"inside",
"irate",
"jolly",
"juicy",
"junior",
"juvenile",
"kaput",
"kindly",
"knowing",
"labored",
"languid",
"latter",
"learned",
"left",
"legal",
"lethal",
"level",
"lewd",
"likely",
"literate",
"lively",
"living",
"lonely",
"longing",
"loutish",
"lovely",
"loving",
"lowly",
"luxuriant",
"lying",
"macabre",
"madly",
"magenta",
"main",
"major",
"makeshift",
"male",
"mammoth",
"measly",
"meaty",
"medium",
"mere",
"middle",
"miniature",
"minor",
"miscreant",
"mobile",
"moldy",
"mute",
"naive",
"nearby",
"necessary",
"neighborly",
"next",
"nimble",
"nonchalant",
"nondescript",
"nonstop",
"north",
"nosy",
"obeisant",
"obese",
"obscene",
"observant",
"obsolete",
"offbeat",
"official",
"ok",
"open",
"opposite",
"organic",
"outdoor",
"outer",
"outgoing",
"oval",
"over",
"overall",
"overt",
"overweight",
"overwrought",
"painful",
"past",
"peaceful",
"perfect",
"petite",
"picayune",
"placid",
"plant",
"pleasant",
"polite",
"potential",
"pregnant",
"premium",
"present",
"pricey",
"prickly",
"primary",
"prior",
"private",
"profuse",
"proper",
"public",
"pumped",
"puny",
"quack",
"quaint",
"quickest",
"rabid",
"racial",
"ready",
"real",
"rebel",
"recondite",
"redundant",
"relevant",
"remote",
"resolute",
"resonant",
"right",
"rightful",
"ritzy",
"robust",
"romantic",
"roomy",
"rough",
"royal",
"salty",
"same",
"scary",
"scientific",
"screeching",
"second",
"secret",
"secure",
"sedate",
"seemly",
"selfish",
"senior",
"separate",
"severe",
"shiny",
"shocking",
"shut",
"shy",
"sick",
"significant",
"silly",
"sincere",
"single",
"skinny",
"slight",
"slimy",
"smelly",
"snobbish",
"social",
"somber",
"sordid",
"sorry",
"southern",
"spare",
"special",
"specific",
"spicy",
"splendid",
"squeamish",
"standard",
"standing",
"steadfast",
"steady",
"stereotyped",
"still",
"striped",
"stupid",
"sturdy",
"subdued",
"subsequent",
"substantial",
"sudden",
"super",
"superb",
"superficial",
"supreme",
"sure",
"taboo",
"tan",
"tasteful",
"tawdry",
"telling",
"temporary",
"terrific",
"tested",
"thoughtful",
"tidy",
"tiny",
"top",
"torpid",
"tranquil",
"trite",
"ugly",
"ultra",
"unbecoming",
"understood",
"uneven",
"unfair",
"unlikely",
"unruly",
"unsightly",
"untidy",
"unwritten",
"upbeat",
"upper",
"uppity",
"upset",
"upstairs",
"uptight",
"used",
"useful",
"utter",
"uttermost",
"vagabond",
"vanilla",
"various",
"vengeful",
"verdant",
"violet",
"volatile",
"wanting",
"wary",
"wasteful",
"weary",
"weekly",
"welcome",
"western",
"whole",
"wholesale",
"wiry",
"wistful",
"womanly",
"wooden",
"woozy",
"wound",
"wrong",
"wry",
"zany",
"sacred",
"unknown",
"detailed",
"ongoing",
"prominent",
"permanent",
"diverse",
"partial",
"moderate",
"contemporary",
"intense",
"widespread",
"ultimate",
"ideal",
"adequate",
"sophisticated",
"naked",
"dominant",
"precise",
"intact",
"adverse",
"genuine",
"subtle",
"universal",
"resistant",
"routine",
"distant",
"unexpected",
"soviet",
"blind",
"artificial",
"mild",
"legitimate",
"unpublished",
"superior",
"intermediate",
"everyday",
"dumb",
"excess",
"sexy",
"fake",
"monthly",
"premature",
"sheer",
"generic",
"insane",
"contrary",
"twin",
"upcoming",
"bottom",
"costly",
"indirect",
"sole",
"unrelated",
"hispanic",
"improper",
"underground",
"legendary",
"reluctant",
"beloved",
"inappropriate",
"corrupt",
"irrelevant",
"justified",
"obscure",
"profound",
"hostile",
"influential",
"inadequate",
"abstract",
"timely",
"authentic",
"bold",
"intimate",
"straightforward",
"rival",
"right-wing",
"racist",
"symbolic",
"unprecedented",
"loyal",
"talented",
"troubled",
"noble",
"instant",
"incorrect",
"dense",
"blond",
"deliberate",
"blank",
"rear",
"feminine",
"apt",
"stark",
"alcoholic",
"teenage",
"vibrant",
"humble",
"vain",
"covert",
"bland",
"trendy",
"foul",
"populist",
"alarming",
"hooked",
"wicked",
"deaf",
"left-wing",
"lousy",
"malignant",
"stylish",
"upscale",
"hourly",
"refreshing",
"cozy",
"slick",
"dire",
"yearly",
"inbred",
"part-time",
"finite",
"backwards",
"nightly",
"unauthorized",
"cheesy",
"indoor",
"surreal",
"bald",
"masculine",
"shady",
"spirited",
"eerie",
"horrific",
"smug",
"stern",
"hefty",
"savvy",
"bogus",
"elaborate",
"gloomy",
"pristine",
"extravagant",
"serene",
"advanced",
"perverse",
"devout",
"crisp",
"rosy",
"slender",
"melancholy",
"faux",
"phony",
"danish",
"lofty",
"nuanced",
"lax",
"adept",
"barren",
"shameful",
"sleek",
"solemn",
"vacant",
"dishonest",
"brisk",
"fluent",
"insecure",
"humid",
"menacing",
"moot",
"soothing",
"self-loathing",
"far-reaching",
"harrowing",
"scathing",
"perplexing",
"calming",
"unconvincing",
"unsuspecting",
"unassuming",
"surprising",
"unappealing",
"vexing",
"unending",
"easygoing",
"appetizing",
"disgruntled",
"retarded",
"undecided",
"unregulated",
"unsupervised",
"unrecognized",
"crazed",
"distressed",
"jagged",
"paralleled",
"cramped",
"warped",
"antiquated",
"fabled",
"deranged",
"diseased",
"ragged",
"intoxicated",
"hallowed",
"crowded",
"ghastly",
"disorderly",
"saintly",
"wily",
"sly",
"sprightly",
"ghostly",
"oily",
"hilly",
"grisly",
"earthly",
"friendly",
"unwieldy",
"many",
"most",
"last",
"expected",
"far",
"due",
"divine",
"all",
"together",
"only",
"outside",
"multiple",
"appropriate",
"evil",
"favorite",
"limited",
"random",
"republican",
"okay",
"essential",
"secondary",
"gay",
"south",
"pro",
"northern",
"urban",
"acute",
"prime",
"arab",
"overnight",
"mixed",
"crucial",
"behind",
"above",
"beyond",
"against",
"under",
"other",
"less"
]
},{}],5:[function(require,module,exports){
//adjectival forms of place names, as adjectives.
module.exports= [
"afghan",
"albanian",
"algerian",
"argentine",
"armenian",
"australian",
"aussie",
"austrian",
"bangladeshi",
"belgian",
"bolivian",
"bosnian",
"brazilian",
"bulgarian",
"cambodian",
"canadian",
"chilean",
"chinese",
"colombian",
"croat",
"cuban",
"czech",
"dominican",
"egyptian",
"british",
"estonian",
"ethiopian",
"finnish",
"french",
"gambian",
"georgian",
"german",
"greek",
"haitian",
"hungarian",
"indian",
"indonesian",
"iranian",
"iraqi",
"irish",
"israeli",
"italian",
"jamaican",
"japanese",
"jordanian",
"kenyan",
"korean",
"kuwaiti",
"latvian",
"lebanese",
"liberian",
"libyan",
"lithuanian",
"macedonian",
"malaysian",
"mexican",
"mongolian",
"moroccan",
"dutch",
"nicaraguan",
"nigerian",
"norwegian",
"omani",
"pakistani",
"palestinian",
"filipino",
"polish",
"portuguese",
"qatari",
"romanian",
"russian",
"rwandan",
"samoan",
"saudi",
"scottish",
"senegalese",
"serbian",
"singaporean",
"slovak",
"somali",
"sudanese",
"swedish",
"swiss",
"syrian",
"taiwanese",
"thai",
"tunisian",
"ugandan",
"ukrainian",
"american",
"hindi",
"spanish",
"venezuelan",
"vietnamese",
"welsh",
"african",
"european",
"asian",
"californian",
]
},{}],6:[function(require,module,exports){
// common first-names in compressed form.
//from http://www.ssa.gov/oact/babynames/limits.html and http://www.servicealberta.gov.ab.ca/pdf/vs/2001_Boys.pdf
//not sure what regional/cultural/demographic bias this has. Probably a lot.
// 73% of people are represented in the top 1000 names
//used to reduce redundant named-entities in longer text. (don't spot the same person twice.)
//used to identify gender for coreference resolution
var main = []
//an ad-hoc prefix encoding for names. 2ms decompression of names
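// e.g. "will": "iam,ie,ard,..." expands to william, willie, willard, ... (the key is prefixed to each comma-separated suffix)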
var male_names = {
"will": "iam,ie,ard,is,iams",
"fred": ",erick,die,rick,dy",
"marc": "us,,o,os,el",
"darr": "ell,yl,en,el,in",
"fran": "k,cis,cisco,klin,kie",
"terr": "y,ance,ence,ell",
"rand": "y,all,olph,al",
"brad": "ley,,ford,y",
"jeff": "rey,,ery,ry",
"john": ",ny,nie,athan",
"greg": "ory,,g,orio",
"mar": "k,tin,vin,io,shall,ty,lon,lin",
"car": "l,los,lton,roll,y,ey",
"ken": "neth,,t,ny,dall,drick",
"har": "old,ry,vey,ley,lan,rison",
"ste": "ven,phen,ve,wart,phan,rling",
"jer": "ry,emy,ome,emiah,maine,ald",
"mic": "hael,heal,ah,key,hel",
"dar": "yl,in,nell,win,ius",
"dan": "iel,ny,,e",
"wil": "bur,son,bert,fred,fredo",
"ric": "hard,ky,ardo,k,key",
"cli": "fford,nton,fton,nt,ff",
"cla": "rence,ude,yton,rk,y",
"ben": "jamin,,nie,ny,ito",
"rod": "ney,erick,olfo,ger,",
"rob": "ert,erto,bie,",
"gar": "y,ry,rett,land",
"sam": "uel,,my,mie",
"and": "rew,re,y,res",
"jos": "eph,e,hua,h",
"joe": ",l,y,sph",
"leo": "nard,n,,nardo",
"tom": ",my,as,mie",
"bry": "an,ant,ce,on",
"ant": "hony,onio,oine,on",
"jac": "k,ob,kson",
"cha": "rles,d,rlie,se",
"sha": "wn,ne,un",
"bre": "nt,tt,ndan,t",
"jes": "se,us,s",
"al": "bert,an,len,fred,exander,ex,vin,lan,fredo,berto,ejandro,fonso,ton,,onzo,i,varo",
"ro": "nald,ger,y,nnie,land,n,ss,osevelt,gelio,lando,man,cky,yce,scoe,ry",
"de": "nnis,rek,an,rrick,lbert,vin,wey,xter,wayne,metrius,nis,smond",
"ja": "mes,son,y,red,vier,ke,sper,mal,rrod",
"el": "mer,lis,bert,ias,ijah,don,i,ton,liot,liott,vin,wood",
"ma": "tthew,nuel,urice,thew,x,tt,lcolm,ck,son",
"do": "nald,uglas,n,nnie,ug,minic,yle,mingo,minick",
"er": "ic,nest,ik,nesto,ick,vin,nie,win",
"ra": "ymond,lph,y,mon,fael,ul,miro,phael",
"ed": "ward,win,die,gar,uardo,,mund,mond",
"co": "rey,ry,dy,lin,nrad,rnelius",
"le": "roy,wis,ster,land,vi",
"lo": "uis,nnie,renzo,ren,well,uie,u,gan",
"da": "vid,le,ve,mon,llas,mian,mien",
"jo": "nathan,n,rge,rdan,nathon,aquin",
"ru": "ssell,ben,dolph,dy,fus,ssel,sty",
"ke": "vin,ith,lvin,rmit",
"ar": "thur,nold,mando,turo,chie,mand",
"re": "ginald,x,ynaldo,uben,ggie",
"ge": "orge,rald,ne,rard,offrey,rardo",
"la": "rry,wrence,nce,urence,mar,mont",
"mo": "rris,ses,nte,ises,nty",
"ju": "an,stin,lio,lian,lius,nior",
"pe": "ter,dro,rry,te,rcy",
"tr": "avis,oy,evor,ent",
"he": "nry,rbert,rman,ctor,ath",
"no": "rman,el,ah,lan,rbert",
"em": "anuel,il,ilio,mett,manuel",
"wa": "lter,yne,rren,llace,de",
"mi": "ke,guel,lton,tchell,les",
"sa": "lvador,lvatore,ntiago,ul,ntos",
"ch": "ristopher,ris,ester,ristian,uck",
"pa": "ul,trick,blo,t",
"st": "anley,uart,an",
"hu": "gh,bert,go,mberto",
"br": "ian,uce,andon,ain",
"vi": "ctor,ncent,rgil,cente",
"ca": "lvin,meron,leb",
"gu": "y,illermo,stavo",
"lu": "is,ther,ke,cas",
"gr": "ant,ady,over,aham",
"ne": "il,lson,al,d",
"t": "homas,imothy,odd,ony,heodore,im,yler,ed,yrone,aylor,erence,immy,oby,eddy,yson",
"s": "cott,ean,idney,ergio,eth,pencer,herman,ylvester,imon,heldon,cotty,olomon",
"r": "yan",
"n": "icholas,athan,athaniel,ick,icolas",
"a": "dam,aron,drian,ustin,ngelo,braham,mos,bel,gustin,ugust,dolfo",
"b": "illy,obby,arry,ernard,ill,ob,yron,lake,ert,oyd,illie,laine,art,uddy,urton",
"e": "ugene,arl,verett,nrique,van,arnest,frain,than,steban",
"h": "oward,omer,orace,ans,al",
"p": "hillip,hilip,reston,hil,ierre",
"c": "raig,urtis,lyde,ecil,esar,edric,leveland,urt",
"j": "immy,im,immie",
"g": "lenn,ordon,len,ilbert,abriel,ilberto",
"m": "elvin,yron,erle,urray",
"k": "yle,arl,urt,irk,ristopher",
"o": "scar,tis,liver,rlando,mar,wen,rville,tto",
"l": "loyd,yle,ionel",
"f": "loyd,ernando,elix,elipe,orrest,abian,idel",
"w": "esley,endell,m,oodrow,inston",
"d": "ustin,uane,wayne,wight,rew,ylan",
"z": "achary",
"v": "ernon,an,ance",
"i": "an,van,saac,ra,rving,smael,gnacio,rvin",
"q": "uentin,uinton",
"x": "avier"
}
var female_names = {
"mari": "a,e,lyn,an,anne,na,ssa,bel,sa,sol,tza",
"kris": "ten,tin,tina,ti,tine,ty,ta,tie",
"jean": "ette,ne,nette,nie,ine,nine",
"chri": "stine,stina,sty,stie,sta,sti",
"marg": "aret,ie,arita,uerite,ret,o",
"ange": "la,lica,lina,lia,line",
"fran": "ces,cine,cisca",
"kath": "leen,erine,y,ryn,arine",
"sher": "ry,ri,yl,i,rie",
"caro": "l,lyn,line,le,lina",
"dian": "e,a,ne,na",
"jenn": "ifer,ie,y,a",
"luci": "lle,a,nda,le",
"kell": "y,i,ey,ie",
"rosa": ",lie,lind",
"jani": "ce,e,s,ne",
"stac": "y,ey,ie,i",
"shel": "ly,ley,ia",
"laur": "a,en,ie,el",
"trac": "y,ey,i,ie",
"jane": "t,,lle,tte",
"bett": "y,ie,e,ye",
"rose": "mary,marie,tta",
"joan": ",ne,n,na",
"mar": "y,tha,jorie,cia,lene,sha,yann,cella,ta,la,cy,tina",
"lor": "i,raine,etta,a,ena,ene,na,ie",
"sha": "ron,nnon,ri,wna,nna,na,una",
"dor": "othy,is,a,een,thy,othea",
"cla": "ra,udia,ire,rice,udette",
"eli": "zabeth,sa,sabeth,se,za",
"kar": "en,la,a,i,in",
"tam": "my,ara,i,mie,ika",
"ann": "a,,e,ie,ette",
"car": "men,rie,la,a,mela",
"mel": "issa,anie,inda",
"ali": "ce,cia,son,sha,sa",
"bri": "ttany,dget,ttney,dgette",
"lyn": "n,da,ne,ette",
"del": "ores,la,ia,oris",
"ter": "esa,ri,i",
"son": "ia,ya,ja,dra",
"deb": "orah,ra,bie,ora",
"jac": "queline,kie,quelyn,lyn",
"lat": "oya,asha,onya,isha",
"che": "ryl,lsea,ri,rie",
"vic": "toria,ki,kie,ky",
"sus": "an,ie,anne,ana",
"rob": "erta,yn",
"est": "her,elle,ella,er",
"lea": "h,,nne,nn",
"lil": "lian,lie,a,y",
"ma": "ureen,ttie,xine,bel,e,deline,ggie,mie,ble,ndy,ude,yra,nuela,vis,gdalena,tilda",
"jo": "yce,sephine,,di,dy,hanna,sefina,sie,celyn,lene,ni,die",
"be": "verly,rtha,atrice,rnice,th,ssie,cky,linda,ulah,rnadette,thany,tsy,atriz",
"ca": "therine,thy,ssandra,ndace,ndice,mille,itlin,ssie,thleen,llie",
"le": "slie,na,ona,ticia,igh,la,nora,ola,sley,ila",
"el": "aine,len,eanor,sie,la,ena,oise,vira,sa,va,ma",
"sa": "ndra,rah,ra,lly,mantha,brina,ndy,die,llie",
"mi": "chelle,ldred,chele,nnie,riam,sty,ndy,randa,llie",
"co": "nnie,lleen,nstance,urtney,ra,rinne,nsuelo,rnelia",
"ju": "lie,dith,dy,lia,anita,ana,stine",
"da": "wn,nielle,rlene,na,isy,rla,phne",
"re": "becca,nee,na,bekah,ba",
"al": "ma,lison,berta,exandra,yssa,ta",
"ra": "chel,mona,chael,quel,chelle",
"an": "drea,ita,a,gie,toinette,tonia",
"ge": "raldine,rtrude,orgia,nevieve,orgina",
"de": "nise,anna,siree,na,ana,e",
"ja": "smine,na,yne",
"lu": "cy,z,la,pe,ella,isa",
"je": "ssica,nifer,well,ri",
"ad": "a,rienne,die,ele,riana,eline",
"pa": "tricia,mela,ula,uline,tsy,m,tty,ulette,tti,trice,trica,ige",
"ke": "ndra,rri,isha,ri",
"mo": "nica,lly,nique,na,llie",
"lo": "uise,is,la",
"he": "len,ather,idi,nrietta,lene,lena",
"me": "gan,rcedes,redith,ghan,agan",
"wi": "lma,lla,nnie",
"ga": "il,yle,briela,brielle,le",
"er": "in,ica,ika,ma,nestine",
"ce": "cilia,lia,celia,leste,cile",
"ka": "tie,y,trina,yla,te",
"ol": "ga,ivia,lie,a",
"li": "nda,sa,ndsay,ndsey,zzie",
"na": "ncy,talie,omi,tasha,dine",
"la": "verne,na,donna,ra",
"vi": "rginia,vian,ola",
"ha": "rriet,nnah",
"pe": "ggy,arl,nny,tra",
"br": "enda,andi,ooke",
"ki": "mberly,m,mberley,rsten",
"au": "drey,tumn,dra",
"bo": "nnie,bbie,nita,bbi",
"do": "nna,lores,lly,minique",
"gl": "oria,adys,enda,enna",
"tr": "icia,ina,isha,udy",
"ta": "ra,nya,sha,bitha",
"ro": "sie,xanne,chelle,nda",
"am": "y,anda,ber,elia",
"fa": "ye,nnie,y",
"ni": "cole,na,chole,kki",
"ve": "ronica,ra,lma,rna",
"gr": "ace,etchen,aciela,acie",
"b": "arbara,lanca,arbra,ianca",
"r": "uth,ita,honda",
"s": "hirley,tephanie,ylvia,heila,uzanne,ue,tella,ophia,ilvia,ophie,tefanie,heena,ummer,elma,ocorro,ybil,imone",
"c": "ynthia,rystal,indy,harlene,ristina,leo",
"e": "velyn,mily,dna,dith,thel,mma,va,ileen,unice,ula,ssie,ffie,tta,ugenia",
"a": "shley,pril,gnes,rlene,imee,bigail,ida,bby,ileen",
"t": "heresa,ina,iffany,helma,onya,oni,herese,onia",
"i": "rene,da,rma,sabel,nez,ngrid,va,mogene,sabelle",
"w": "anda,endy,hitney",
"p": "hyllis,riscilla,olly",
"n": "orma,ellie,ora,ettie,ell",
"f": "lorence,elicia,lora,reda,ern,rieda",
"v": "alerie,anessa",
"j": "ill,illian",
"y": "vonne,olanda,vette",
"g": "ina,wendolyn,wen,oldie",
"l": "ydia",
"m": "yrtle,yra,uriel,yrna",
"h": "ilda",
"o": "pal,ra,felia",
"k": "rystal",
"d": "ixie,ina",
"u": "rsula"
}
var ambiguous = [
"casey",
"jamie",
"lee",
"jaime",
"jessie",
"morgan",
"rene",
"robin",
"devon",
"kerry",
"alexis",
"guadalupe",
"blair",
"kasey",
"jean",
"marion",
"aubrey",
"shelby",
"jan",
"shea",
"jade",
"kenyatta",
"kelsey",
"shay",
"lashawn",
"trinity",
"regan",
"jammie",
"cassidy",
"cheyenne",
"reagan",
"shiloh",
"marlo",
"andra",
"devan",
"rosario",
"lee"
]
var i, arr, i2, l, keys;
//add data into the main obj
//males
keys = Object.keys(male_names)
l = keys.length
for (i = 0; i < l; i++) {
arr = male_names[keys[i]].split(',')
for (i2 = 0; i2 < arr.length; i2++) {
main[keys[i] + arr[i2]] = "m"
}
}
//females
keys = Object.keys(female_names)
l = keys.length
for (i = 0; i < l; i++) {
arr = female_names[keys[i]].split(',')
for (i2 = 0; i2 < arr.length; i2++) {
main[keys[i] + arr[i2]] = "f"
}
}
//unisex names
l = ambiguous.length
for (i = 0; i < l; i += 1) {
main[ambiguous[i]] = "a"
}
module.exports = main;
// console.log(firstnames['spencer'])
// console.log(firstnames['jill'])
// console.log(firstnames['sue'])
// console.log(firstnames['jan'])
// console.log(JSON.stringify(Object.keys(firstnames).length, null, 2));
},{}],7:[function(require,module,exports){
//these are common person titles used in the lexicon and sentence segmentation methods
//they are also used to identify that a noun is a person
var main = [
//honourifics
"jr",
"mr",
"mrs",
"ms",
"dr",
"prof",
"sr",
"sen",
"corp",
"rep",
"gov",
"atty",
"supt",
"det",
"rev",
"col",
"gen",
"lt",
"cmdr",
"adm",
"capt",
"sgt",
"cpl",
"maj",
"miss",
"misses",
"mister",
"sir",
"esq",
"mstr",
"phd",
"adj",
"adv",
"asst",
"bldg",
"brig",
"comdr",
"hon",
"messrs",
"mlle",
"mme",
"op",
"ord",
"pvt",
"reps",
"res",
"sens",
"sfc",
"surg",
]
module.exports = main;
},{}],8:[function(require,module,exports){
//nouns with irregular plural/singular forms
//used in noun.inflect, and also in the lexicon.
//compressed with '_' to reduce some redundancy.
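// e.g. ["quiz", "_zes"] expands to ["quiz", "quizzes"] once '_' is swapped for the singular form below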
var main=[
["child", "_ren"],
["person", "people"],
["leaf", "leaves"],
["database", "_s"],
["quiz", "_zes"],
["child", "_ren"],
["stomach", "_s"],
["sex", "_es"],
["move", "_s"],
["shoe", "_s"],
["goose", "geese"],
["phenomenon", "phenomena"],
["barracks", "_"],
["deer", "_"],
["syllabus", "syllabi"],
["index", "indices"],
["appendix", "appendices"],
["criterion", "criteria"],
["man", "men"],
["sex", "_es"],
["rodeo", "_s"],
["epoch", "_s"],
["zero", "_s"],
["avocado", "_s"],
["halo", "_s"],
["tornado", "_s"],
["tuxedo", "_s"],
["sombrero", "_s"],
["addendum", "addenda"],
["alga", "_e"],
["alumna", "_e"],
["alumnus", "alumni"],
["bacillus", "bacilli"],
["cactus", "cacti"],
["beau", "_x"],
["château", "_x"],
["chateau", "_x"],
["tableau", "_x"],
["corpus", "corpora"],
["curriculum", "curricula"],
["echo", "_es"],
["embargo", "_es"],
["foot", "feet"],
["genus", "genera"],
["hippopotamus", "hippopotami"],
["larva", "_e"],
["libretto", "libretti"],
["loaf", "loaves"],
["matrix", "matrices"],
["memorandum", "memoranda"],
["mosquito", "_es"],
["opus", "opera"],
["ovum", "ova"],
["ox", "_en"],
["radius", "radii"],
["referendum", "referenda"],
["thief", "thieves"],
["tooth", "teeth"]
]
main = main.map(function (a) {
a[1] = a[1].replace('_', a[0])
return a
})
module.exports = main;
},{}],9:[function(require,module,exports){
//common terms that are multi-word, but one part-of-speech
//these should not include phrasal verbs, like 'looked out'. These are handled elsewhere.
module.exports = {
"of course": "RB",
"at least": "RB",
"no longer": "RB",
"sort of": "RB",
"at first": "RB",
"once again": "RB",
"once more": "RB",
"up to": "RB",
"by now": "RB",
"all but": "RB",
"just about": "RB",
"on board": "JJ",
"a lot": "RB",
"by far": "RB",
"at best": "RB",
"at large": "RB",
"for good": "RB",
"vice versa": "JJ",
"en route": "JJ",
"for sure": "RB",
"upside down": "JJ",
"at most": "RB",
"per se": "RB",
"at worst": "RB",
"upwards of": "RB",
"en masse": "RB",
"point blank": "RB",
"up front": "JJ",
"in situ": "JJ",
"in vitro": "JJ",
"ad hoc": "JJ",
"de facto": "JJ",
"ad infinitum": "JJ",
"ad nauseam": "RB",
"for keeps": "JJ",
"a priori": "FW",
"et cetera": "FW",
"off guard": "JJ",
"spot on": "JJ",
"ipso facto": "JJ",
"not withstanding": "RB",
"de jure": "RB",
"a la": "IN",
"ad hominem": "NN",
"par excellence": "RB",
"de trop": "RB",
"a posteriori": "RB",
"fed up": "JJ",
"brand new": "JJ",
"old fashioned": "JJ",
"bona fide": "JJ",
"well off": "JJ",
"far off": "JJ",
"straight forward": "JJ",
"hard up": "JJ",
"sui generis": "JJ",
"en suite": "JJ",
"avant garde": "JJ",
"sans serif": "JJ",
"gung ho": "JJ",
"super duper": "JJ",
"new york":"NN",
"new england":"NN",
"new hampshire":"NN",
"new delhi":"NN",
"new jersey":"NN",
"new mexico":"NN",
"united states":"NN",
"united kingdom":"NN",
"great britain":"NN",
"head start":"NN"
}
},{}],10:[function(require,module,exports){
//phrasal verbs are two words that really mean one verb.
//'beef up' is one verb, and not some direction of beefing.
//by @spencermountain, 2015 mit
//many credits to http://www.allmyphrasalverbs.com/
var verb_conjugate = require("../../parents/verb/conjugate/conjugate")
//start the list with some randoms
var main = [
"be onto",
"fall behind",
"fall through",
"fool with",
"get across",
"get along",
"get at",
"give way",
"hear from",
"hear of",
"lash into",
"make do",
"run across",
"set upon",
"take aback",
"keep from"
]
//if there's a phrasal verb "keep on", there's often a "keep off"
var opposites = {
"away": "back",
"in": "out",
"on": "off",
"over": "under",
"together": "apart",
"up": "down"
}
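// e.g. "keep" appears under "away" below, so both "keep away" and its opposite "keep back" are added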
//forms that have in/out symmetry
var symmetric = {
"away": "blow,bounce,bring,call,come,cut,drop,fire,get,give,go,keep,pass,put,run,send,shoot,switch,take,tie,throw",
"in": "bang,barge,bash,beat,block,book,box,break,bring,burn,butt,carve,cash,check,come,cross,drop,fall,fence,fill,give,grow,hand,hang,head,jack,keep,leave,let,lock,log,move,opt,pack,peel,pull,put,rain,reach,ring,rub,send,set,settle,shut,sign,smash,snow,strike,take,try,turn,type,warm,wave,wean,wear,wheel",
"on": "add,call,carry,catch,count,feed,get,give,go,grind,head,hold,keep,lay,log,pass,pop,power,put,send,show,snap,switch,take,tell,try,turn,wait",
"over": "come,go,look,read,run,talk",
"together": "come,pull,put",
"up": "add,back,beat,bend,blow,boil,bottle,break,bring,buckle,bundle,call,carve,clean,cut,dress,fill,flag,fold,get,give,grind,grow,hang,hold,keep,let,load,lock,look,man,mark,melt,move,pack,pin,pipe,plump,pop,power,pull,put,rub,scale,scrape,send,set,settle,shake,show,sit,slow,smash,square,stand,strike,take,tear,tie,turn,use,wash,wind",
}
Object.keys(symmetric).forEach(function (k) {
symmetric[k].split(',').forEach(function (s) {
//add the given form
main.push(s + " " + k)
//add its opposite form
main.push(s + " " + opposites[k])
})
})
//forms that don't have in/out symmetry
var asymmetric = {
"about": "bring,fool,gad,go,root",
"after": "go,look,take",
"ahead": "get,go,press",
"along": "bring,move",
"apart": "fall,take",
"around": "ask,boss,bring,call,come,fool,get,horse,joke,lie,mess,play",
"away": "back,carry,file,frighten,hide,wash",
"back": "fall,fight,hit,hold,look,pay,stand,think",
"by": "drop,get,go,stop,swear,swing,tick,zip",
"down": "bog,calm,fall,hand,hunker,jot,knock,lie,narrow,note,pat,pour,run,tone,trickle,wear",
"for": "fend,file,gun,hanker,root,shoot",
"forth": "bring,come",
"forward": "come,look",
"in": "cave,chip,hone,jump,key,pencil,plug,rein,shade,sleep,stop,suck,tie,trade,tuck,usher,weigh,zero",
"into": "look,run",
"it": "go,have",
"off": "auction,be,beat,blast,block,brush,burn,buzz,cast,cool,drop,end,face,fall,fend,frighten,goof,jack,kick,knock,laugh,level,live,make,mouth,nod,pair,pay,peel,read,reel,ring,rip,round,sail,shave,shoot,sleep,slice,split,square,stave,stop,storm,strike,tear,tee,tick,tip,top,walk,work,write",
"on": "bank,bargain,egg,frown,hit,latch,pile,prattle,press,spring,spur,tack,urge,yammer",
"out": "act,ask,back,bail,bear,black,blank,bleed,blow,blurt,branch,buy,cancel,cut,eat,edge,farm,figure,find,fill,find,fish,fizzle,flake,flame,flare,flesh,flip,geek,get,help,hide,hold,iron,knock,lash,level,listen,lose,luck,make,max,miss,nerd,pan,pass,pick,pig,point,print,psych,rat,read,rent,root,rule,run,scout,see,sell,shout,single,sit,smoke,sort,spell,splash,stamp,start,storm,straighten,suss,time,tire,top,trip,trot,wash,watch,weird,whip,wimp,wipe,work,zone,zonk",
"over": "bend,bubble,do,fall,get,gloss,hold,keel,mull,pore,sleep,spill,think,tide,tip",
"round": "get,go",
"through": "go,run",
"to": "keep,see",
"up": "act,beef,board,bone,boot,brighten,build,buy,catch,cheer,cook,end,eye,face,fatten,feel,fess,finish,fire,firm,flame,flare,free,freeze,freshen,fry,fuel,gang,gear,goof,hack,ham,heat,hit,hole,hush,jazz,juice,lap,light,lighten,line,link,listen,live,loosen,make,mash,measure,mess,mix,mock,mop,muddle,open,own,pair,patch,pick,prop,psych,read,rough,rustle,save,shack,sign,size,slice,slip,snap,sober,spark,split,spruce,stack,start,stay,stir,stitch,straighten,string,suck,suit,sum,team,tee,think,tidy,tighten,toss,trade,trip,type,vacuum,wait,wake,warm,weigh,whip,wire,wise,word,write,zip",
}
Object.keys(asymmetric).forEach(function (k) {
asymmetric[k].split(',').forEach(function (s) {
main.push(s + " " + k)
})
})
//at this point all verbs are infinitive. let's make this explicit.
main = main.reduce(function (h, s) {
h[s] = "VBP"
return h
}, {})
//conjugate every phrasal verb. takes ~30ms
var tags = {
present: "VB",
past: "VBD",
future: "VBF",
gerund: "VBG",
infinitive: "VBP",
}
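// e.g. "beef up" also gets "beefed up" tagged VBD and "beefing up" tagged VBG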
var cache = {} //cache individual verbs to speed it up
var split, verb, particle, phrasal;
Object.keys(main).forEach(function (s) {
split = s.split(' ')
verb = split[0]
particle = split[1]
if (cache[verb] === undefined) {
cache[verb] = verb_conjugate(verb)
}
Object.keys(cache[verb]).forEach(function (k) {
phrasal = cache[verb][k] + " " + particle
main[phrasal] = tags[k]
})
})
module.exports = main;
// console.log(JSON.stringify(phrasal_verbs, null, 2))
},{"../../parents/verb/conjugate/conjugate":39}],11:[function(require,module,exports){
//common nouns that have no plural form. These are surprisingly rare
//used in noun.inflect(), and added as nouns in lexicon
module.exports=[
"aircraft",
"bass",
"bison",
"fowl",
"halibut",
"moose",
"salmon",
"spacecraft",
"tuna",
"trout",
"advice",
"information",
"knowledge",
"trouble",
"enjoyment",
"fun",
"recreation",
"relaxation",
"meat",
"rice",
"bread",
"cake",
"coffee",
"ice",
"water",
"oil",
"grass",
"hair",
"fruit",
"wildlife",
"equipment",
"machinery",
"furniture",
"mail",
"luggage",
"jewelry",
"clothing",
"money",
"mathematics",
"economics",
"physics",
"civics",
"ethics",
"gymnastics",
"mumps",
"measles",
"news",
"tennis",
"baggage",
"currency",
"soap",
"toothpaste",
"food",
"sugar",
"butter",
"flour",
"research",
"leather",
"wool",
"wood",
"coal",
"weather",
"homework",
"cotton",
"silk",
"patience",
"impatience",
"vinegar",
"art",
"beef",
"blood",
"cash",
"chaos",
"cheese",
"chewing",
"conduct",
"confusion",
"education",
"electricity",
"entertainment",
"fiction",
"forgiveness",
"gold",
"gossip",
"ground",
"happiness",
"history",
"honey",
"hospitality",
"importance",
"justice",
"laughter",
"leisure",
"lightning",
"literature",
"luck",
"melancholy",
"milk",
"mist",
"music",
"noise",
"oxygen",
"paper",
"pay",
"peace",
"peanut",
"pepper",
"petrol",
"plastic",
"pork",
"power",
"pressure",
"rain",
"recognition",
"sadness",
"safety",
"salt",
"sand",
"scenery",
"shopping",
"silver",
"snow",
"softness",
"space",
"speed",
"steam",
"sunshine",
"tea",
"thunder",
"time",
"traffic",
"trousers",
"violence",
"warmth",
"wine",
"steel",
"soccer",
"hockey",
"golf",
"fish",
"gum",
"liquid",
"series",
"sheep",
"species",
"fahrenheit",
"celcius",
"kelvin",
"hertz"
]
},{}],12:[function(require,module,exports){
//terms that are "CD", a 'value' term
module.exports = [
//numbers
'zero',
'one',
'two',
'three',
'four',
'five',
'six',
'seven',
'eight',
'nine',
'ten',
'eleven',
'twelve',
'thirteen',
'fourteen',
'fifteen',
'sixteen',
'seventeen',
'eighteen',
'nineteen',
'twenty',
'thirty',
'forty',
'fifty',
'sixty',
'seventy',
'eighty',
'ninety',
'hundred',
'thousand',
'million',
'billion',
'trillion',
'quadrillion',
'quintillion',
'sextillion',
'septillion',
'octillion',
'nonillion',
'decillion',
//months
"january",
"february",
// "march",
"april",
// "may",
"june",
"july",
"august",
"september",
"october",
"november",
"december",
"jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec", "sept", "sep",
//days
"monday",
"tuesday",
"wednesday",
"thursday",
"friday",
"saturday",
"sunday"
].reduce(function (h, s) {
h[s] = "CD"
return h
}, {})
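// the resulting hash maps each value-term to its tag, e.g.:
// module.exports["seven"] === "CD" && module.exports["monday"] === "CD"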
},{}],13:[function(require,module,exports){
//most-frequent non-irregular verbs, to be conjugated for the lexicon
//this list is the seed, from which various forms are conjugated
module.exports = [
"collapse",
"stake",
"forsee",
"suck",
"answer",
"argue",
"tend",
"examine",
"depend",
"form",
"figure",
"mind",
"surround",
"suspect",
"reflect",
"wonder",
"hope",
"end",
"thank",
"file",
"regard",
"report",
"imagine",
"consider",
"ensure",
"cause",
"work",
"enter",
"stop",
"defeat",
"surge",
"launch",
"turn",
"like",
"control",
"relate",
"remember",
"join",
"listen",
"train",
"spring",
"enjoy",
"fail",
"recognize",
"obtain",
"learn",
"fill",
"announce",
"prevent",
"achieve",
"realize",
"involve",
"remove",
"aid",
"visit",
"test",
"prepare",
"ask",
"carry",
"suppose",
"determine",
"raise",
"love",
"use",
"pull",
"improve",
"contain",
"offer",
"talk",
"pick",
"care",
"express",
"remain",
"operate",
"close",
"add",
"mention",
"support",
"decide",
"walk",
"vary",
"demand",
"describe",
"agree",
"happen",
"allow",
"suffer",
"study",
"press",
"watch",
"seem",
"occur",
"contribute",
"claim",
"compare",
"apply",
"direct",
"discuss",
"indicate",
"require",
"change",
"fix",
"reach",
"prove",
"expect",
"exist",
"play",
"permit",
"kill",
"charge",
"increase",
"believe",
"create",
"continue",
"live",
"help",
"represent",
"edit",
"serve",
"appear",
"cover",
"maintain",
"start",
"stay",
"move",
"extend",
"design",
"supply",
"suggest",
"want",
"approach",
"call",
"include",
"try",
"receive",
"save",
"discover",
"marry",
"need",
"establish",
"keep",
"assume",
"attend",
"unite",
"explain",
"publish",
"accept",
"settle",
"reduce",
"do",
"look",
"interact",
"concern",
"labor",
"return",
"select",
"die",
"provide",
"seek",
"wish",
"finish",
"follow",
"disagree",
"produce",
"attack",
"attempt",
"brake",
"brush",
"burn",
"bang",
"bomb",
"budget",
"comfort",
"cook",
"copy",
"cough",
"crush",
"cry",
"check",
"claw",
"clip",
"combine",
"damage",
"desire",
"doubt",
"drain",
"dance",
"decrease",
"defect",
"deposit",
"drift",
"dip",
"dive",
"divorce",
"dream",
"exchange",
"envy",
"exert",
"exercise",
"export",
"fold",
"flood",
"focus",
"forecast",
"fracture",
"grip",
"guide",
"guard",
"guarantee",
"guess",
"hate",
"heat",
"handle",
"hire",
"host",
"hunt",
"hurry",
"import",
"judge",
"jump",
"jam",
"kick",
"kiss",
"knock",
"laugh",
"lift",
"lock",
"lecture",
"link",
"load",
"loan",
"lump",
"melt",
"message",
"murder",
"neglect",
"overlap",
"overtake",
"overuse",
"print",
"protest",
"pump",
"push",
"post",
"progress",
"promise",
"purchase",
"regret",
"request",
"reward",
"roll",
"rub",
"rent",
"repair",
"sail",
"scale",
"screw",
"shock",
"sleep",
"slip",
"smash",
"smell",
"smoke",
"sneeze",
"snow",
"surprise",
"scratch",
"search",
"share",
"shave",
"spit",
"splash",
"stain",
"stress",
"switch",
"taste",
"touch",
"trade",
"trick",
"twist",
"trap",
"travel",
"tune",
"undergo",
"undo",
"uplift",
"vote",
"wash",
"wave",
"whistle",
"wreck",
"yawn",
"betray",
"restrict",
"perform",
"worry",
"point",
"activate",
"fear",
"plan",
"note",
"face",
"predict",
"differ",
"deserve",
"torture",
"recall",
"count",
"admit",
"insist",
"lack",
"pass",
"belong",
"complain",
"constitute",
"rely",
"refuse",
"range",
"cite",
"flash",
"arrive",
"reveal",
"consist",
"observe",
"notice",
"trust",
"display",
"view",
"stare",
"acknowledge",
"owe",
"gaze",
"treat",
"account",
"gather",
"address",
"confirm",
"estimate",
"manage",
"participate",
"sneak",
"drop",
"mirror",
"experience",
"strive",
"arch",
"dislike",
"favor",
"earn",
"emphasize",
"match",
"question",
"emerge",
"encourage",
"matter",
"name",
"head",
"line",
"slam",
"list",
"warn",
"ignore",
"resemble",
"feature",
"place",
"reverse",
"accuse",
"spoil",
"retain",
"survive",
"praise",
"function",
"please",
"date",
"remind",
"deliver",
"echo",
"engage",
"deny",
"yield",
"center",
"gain",
"anticipate",
"reason",
"side",
"thrive",
"defy",
"dodge",
"enable",
"applaud",
"bear",
"persist",
"pose",
"reject",
"attract",
"await",
"inhibit",
"declare",
"process",
"risk",
"urge",
"value",
"block",
"confront",
"credit",
"cross",
"amuse",
"dare",
"resent",
"smile",
"gloss",
"threaten",
"collect",
"depict",
"dismiss",
"submit",
"benefit",
"step",
"deem",
"limit",
"sense",
"issue",
"embody",
"force",
"govern",
"replace",
"bother",
"cater",
"adopt",
"empower",
"outweigh",
"alter",
"enrich",
"influence",
"prohibit",
"pursue",
"warrant",
"convey",
"approve",
"reserve",
"rest",
"strain",
"wander",
"adjust",
"dress",
"market",
"mingle",
"disapprove",
"evaluate",
"flow",
"inhabit",
"pop",
"rule",
"depart",
"roam",
"assert",
"disappear",
"envision",
"pause",
"afford",
"challenge",
"grab",
"grumble",
"house",
"portray",
"revel",
"base",
"conduct",
"review",
"stem",
"crave",
"mark",
"store",
"target",
"unlock",
"weigh",
"resist",
"drag",
"pour",
"reckon",
"assign",
"cling",
"rank",
"attach",
"decline",
"destroy",
"interfere",
"paint",
"skip",
"sprinkle",
"wither",
"allege",
"retire",
"score",
"monitor",
"expand",
"honor",
"pack",
"assist",
"float",
"appeal",
"stretch",
"undermine",
"assemble",
"boast",
"bounce",
"grasp",
"install",
"borrow",
"crack",
"elect",
"shout",
"contrast",
"overcome",
"relax",
"relent",
"strengthen",
"conform",
"dump",
"pile",
"scare",
"relive",
"resort",
"rush",
"boost",
"cease",
"command",
"excel",
"plug",
"plunge",
"proclaim",
"discourage",
"endure",
"ruin",
"stumble",
"abandon",
"cheat",
"convince",
"merge",
"convert",
"harm",
"multiply",
"overwhelm",
"chew",
"invent",
"bury",
"wipe",
"added",
"took",
"define",
"goes",
"measure",
"enhance",
"distinguish",
"avoid",
//contractions
"don't",
"won't",
"what's" //somewhat ambiguous (what does|what are)
]
},{}],14:[function(require,module,exports){
//the parts of speech used by this library. mostly standard, but some changes.
module.exports = {
//verbs
"VB": {
"name": "verb, generic",
"parent": "verb",
"tag": "VB"
},
"VBD": {
"name": "past-tense verb",
"parent": "verb",
"tense": "past",
"tag": "VBD"
},
"VBN": {
"name": "past-participle verb",
"parent": "verb",
"tense": "past",
"tag": "VBN"
},
"VBP": {
"name": "infinitive verb",
"parent": "verb",
"tense": "present",
"tag": "VBP"
},
"VBF": {
"name": "future-tense verb",
"parent": "verb",
"tense": "future",
"tag": "VBF"
},
"VBZ": {
"name": "present-tense verb",
"tense": "present",
"parent": "verb",
"tag": "VBZ"
},
"CP": {
"name": "copula",
"parent": "verb",
"tag": "CP"
},
"VBG": {
"name": "gerund verb",
"parent": "verb",
"tag": "VBG"
},
//adjectives
"JJ": {
"name": "adjective, generic",
"parent": "adjective",
"tag": "JJ"
},
"JJR": {
"name": "comparative adjective",
"parent": "adjective",
"tag": "JJR"
},
"JJS": {
"name": "superlative adjective",
"parent": "adjective",
"tag": "JJS"
},
//adverbs
"RB": {
"name": "adverb",
"parent": "adverb",
"tag": "RB"
},
"RBR": {
"name": "comparative adverb",
"parent": "adverb",
"tag": "RBR"
},
"RBS": {
"name": "superlative adverb",
"parent": "adverb",
"tag": "RBS"
},
//nouns
"NN": {
"name": "noun, generic",
"parent": "noun",
"tag": "NN"
},
"NNP": {
"name": "singular proper noun",
"parent": "noun",
"tag": "NNP"
},
"NNA": {
"name": "noun, active",
"parent": "noun",
"tag": "NNA"
},
"NNPA": {
"name": "noun, acronym",
"parent": "noun",
"tag": "NNPA"
},
"NNPS": {
"name": "plural proper noun",
"parent": "noun",
"tag": "NNPS"
},
"NNAB": {
"name": "noun, abbreviation",
"parent": "noun",
"tag": "NNAB"
},
"NNS": {
"name": "plural noun",
"parent": "noun",
"tag": "NNS"
},
"NNO": {
"name": "possessive noun",
"parent": "noun",
"tag": "NNO"
},
"NNG": {
"name": "gerund noun",
"parent": "noun",
"tag": "VBG"
},
"PP": {
"name": "possessive pronoun",
"parent": "noun",
"tag": "PP"
},
//glue
"FW": {
"name": "foreign word",
"parent": "glue",
"tag": "FW"
},
"CD": {
"name": "cardinal value, generic",
"parent": "value",
"tag": "CD"
},
"DA": {
"name": "date",
"parent": "value",
"tag": "DA"
},
"NU": {
"name": "number",
"parent": "value",
"tag": "NU"
},
"IN": {
"name": "preposition",
"parent": "glue",
"tag": "IN"
},
"MD": {
"name": "modal verb",
"parent": "verb", //dunno
"tag": "MD"
},
"CC": {
"name": "co-ordating conjunction",
"parent": "glue",
"tag": "CC"
},
"PRP": {
"name": "personal pronoun",
"parent": "noun",
"tag": "PRP"
},
"DT": {
"name": "determiner",
"parent": "glue",
"tag": "DT"
},
"UH": {
"name": "interjection",
"parent": "glue",
"tag": "UH"
},
"EX": {
"name": "existential there",
"parent": "glue",
"tag": "EX"
}
}
},{}],15:[function(require,module,exports){
// word suffixes with a high pos signal, generated with wordnet
//by spencer kelly spencermountain@gmail.com 2014
var data = {
"NN": [
"ceae",
"inae",
"idae",
"leaf",
"rgan",
"eman",
"sman",
"star",
"boat",
"tube",
"rica",
"tica",
"nica",
"auce",
"tics",
"ency",
"ancy",
"poda",
"tude",
"xide",
"body",
"weed",
"tree",
"rrel",
"stem",
"cher",
"icer",
"erer",
"ader",
"ncer",
"izer",
"ayer",
"nner",
"ates",
"ales",
"ides",
"rmes",
"etes",
"llet",
"uage",
"ings",
"aphy",
"chid",
"tein",
"vein",
"hair",
"tris",
"unit",
"cake",
"nake",
"illa",
"ella",
"icle",
"ille",
"etle",
"scle",
"cell",
"bell",
"bill",
"palm",
"toma",
"game",
"lamp",
"bone",
"mann",
"ment",
"wood",
"book",
"nson",
"agon",
"odon",
"dron",
"iron",
"tion",
"itor",
"ator",
"root",
"cope",
"tera",
"hora",
"lora",
"bird",
"worm",
"fern",
"horn",
"wort",
"ourt",
"stry",
"etry",
"bush",
"ness",
"gist",
"rata",
"lata",
"tata",
"moth",
"lity",
"nity",
"sity",
"rity",
"city",
"dity",
"vity",
"drug",
"dium",
"llum",
"trum",
"inum",
"lium",
"tium",
"atum",
"rium",
"icum",
"anum",
"nium",
"orum",
"icus",
"opus",
"chus",
"ngus",
"thus",
"rius",
"rpus"
],
"JJ": [
"liac",
"siac",
"clad",
"deaf",
"xial",
"hial",
"chal",
"rpal",
"asal",
"rial",
"teal",
"oeal",
"vial",
"phal",
"sial",
"heal",
"rbal",
"neal",
"geal",
"dial",
"eval",
"bial",
"ugal",
"kian",
"izan",
"rtan",
"odan",
"llan",
"zian",
"eian",
"eyan",
"ndan",
"eban",
"near",
"unar",
"lear",
"liar",
"-day",
"-way",
"tech",
"sick",
"tuck",
"inct",
"unct",
"wide",
"endo",
"uddy",
"eedy",
"uted",
"aled",
"rred",
"oned",
"rted",
"obed",
"oped",
"ched",
"dded",
"cted",
"tied",
"eked",
"ayed",
"rked",
"teed",
"mmed",
"tred",
"awed",
"rbed",
"bbed",
"axed",
"bred",
"pied",
"cked",
"rced",
"ened",
"fied",
"lved",
"mned",
"kled",
"hted",
"lied",
"eted",
"rded",
"lued",
"rved",
"azed",
"oked",
"ghed",
"sked",
"emed",
"aded",
"ived",
"mbed",
"pted",
"zled",
"ored",
"pled",
"wned",
"afed",
"nied",
"aked",
"gued",
"oded",
"oved",
"oled",
"ymed",
"lled",
"bled",
"cled",
"eded",
"toed",
"ited",
"oyed",
"eyed",
"ured",
"omed",
"ixed",
"pped",
"ined",
"lted",
"iced",
"exed",
"nded",
"amed",
"owed",
"dged",
"nted",
"eged",
"nned",
"used",
"ibed",
"nced",
"umed",
"dled",
"died",
"rged",
"aped",
"oted",
"uled",
"ided",
"nked",
"aved",
"rled",
"rned",
"aned",
"rmed",
"lmed",
"aged",
"ized",
"eved",
"ofed",
"thed",
"ered",
"ared",
"ated",
"eled",
"sted",
"ewed",
"nsed",
"nged",
"lded",
"gged",
"osed",
"fled",
"shed",
"aced",
"ffed",
"tted",
"uced",
"iled",
"uded",
"ired",
"yzed",
"-fed",
"mped",
"iked",
"fted",
"imed",
"hree",
"llel",
"aten",
"lden",
"nken",
"apen",
"ozen",
"ober",
"-set",
"nvex",
"osey",
"laid",
"paid",
"xvii",
"xxii",
"-air",
"tair",
"icit",
"knit",
"nlit",
"xxiv",
"-six",
"-old",
"held",
"cile",
"ible",
"able",
"gile",
"full",
"-ply",
"bbly",
"ggly",
"zzly",
"-one",
"mane",
"mune",
"rung",
"uing",
"mant",
"yant",
"uant",
"pant",
"urnt",
"awny",
"eeny",
"ainy",
"orny",
"siny",
"tood",
"shod",
"-toe",
"d-on",
"-top",
"-for",
"odox",
"wept",
"eepy",
"oopy",
"hird",
"dern",
"worn",
"mart",
"ltry",
"oury",
"ngry",
"arse",
"bose",
"cose",
"mose",
"iose",
"gish",
"kish",
"pish",
"wish",
"vish",
"yish",
"owsy",
"ensy",
"easy",
"ifth",
"edth",
"urth",
"ixth",
"00th",
"ghth",
"ilty",
"orty",
"ifty",
"inty",
"ghty",
"kety",
"afty",
"irty",
"roud",
"true",
"wful",
"dful",
"rful",
"mful",
"gful",
"lful",
"hful",
"kful",
"iful",
"yful",
"sful",
"tive",
"cave",
"sive",
"five",
"cive",
"xxvi",
"urvy",
"nown",
"hewn",
"lown",
"-two",
"lowy",
"ctyl"
],
"VB": [
"wrap",
"hear",
"draw",
"rlay",
"away",
"elay",
"duce",
"esce",
"elch",
"ooch",
"pick",
"huck",
"back",
"hack",
"ruct",
"lict",
"nect",
"vict",
"eact",
"tect",
"vade",
"lude",
"vide",
"rude",
"cede",
"ceed",
"ivel",
"hten",
"rken",
"shen",
"open",
"quer",
"over",
"efer",
"eset",
"uiet",
"pret",
"ulge",
"lign",
"pugn",
"othe",
"rbid",
"raid",
"veil",
"vail",
"roil",
"join",
"dain",
"feit",
"mmit",
"erit",
"voke",
"make",
"weld",
"uild",
"idle",
"rgle",
"otle",
"rble",
"self",
"fill",
"till",
"eels",
"sult",
"pply",
"sume",
"dime",
"lame",
"lump",
"rump",
"vene",
"cook",
"look",
"from",
"elop",
"grow",
"adow",
"ploy",
"sorb",
"pare",
"uire",
"jure",
"lore",
"surf",
"narl",
"earn",
"ourn",
"hirr",
"tort",
"-fry",
"uise",
"lyse",
"sise",
"hise",
"tise",
"nise",
"lise",
"rise",
"anse",
"gise",
"owse",
"oosh",
"resh",
"cuss",
"uess",
"sess",
"vest",
"inst",
"gest",
"fest",
"xist",
"into",
"ccur",
"ieve",
"eive",
"olve",
"down",
"-dye",
"laze",
"lyze",
"raze",
"ooze"
],
"RB": [
"that",
"oubt",
"much",
"diem",
"high",
"atim",
"sely",
"nely",
"ibly",
"lely",
"dely",
"ally",
"gely",
"imly",
"tely",
"ully",
"ably",
"owly",
"vely",
"cely",
"mely",
"mply",
"ngly",
"exly",
"ffly",
"rmly",
"rely",
"uely",
"time",
"iori",
"oors",
"wise",
"orst",
"east",
"ways"
]
}
//convert it to an easier format
module.exports = Object.keys(data).reduce(function (h, k) {
data[k].forEach(function (w) {
h[w] = k
})
return h
}, {})
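// the converted hash maps each suffix directly to its part-of-speech signal, e.g.:
// module.exports["ness"] === "NN" && module.exports["wful"] === "JJ"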
},{}],16:[function(require,module,exports){
//regex patterns and parts of speech
module.exports= [
[".[cts]hy$", "JJ"],
[".[st]ty$", "JJ"],
[".[lnr]ize$", "VB"],
[".[gk]y$", "JJ"],
[".fies$", "VB"],
[".some$", "JJ"],
[".[nrtumcd]al$", "JJ"],
[".que$", "JJ"],
[".[tnl]ary$", "JJ"],
[".[di]est$", "JJS"],
["^(un|de|re)\\-[a-z]..", "VB"],
[".lar$", "JJ"],
["[bszmp]{2}y", "JJ"],
[".zes$", "VB"],
[".[icldtgrv]ent$", "JJ"],
[".[rln]ates$", "VBZ"],
[".[oe]ry$", "JJ"],
["[rdntkdhs]ly$", "RB"],
[".[lsrnpb]ian$", "JJ"],
[".[^aeiou]ial$", "JJ"],
[".[^aeiou]eal$", "JJ"],
[".[vrl]id$", "JJ"],
[".[ilk]er$", "JJR"],
[".ike$", "JJ"],
[".ends$", "VB"],
[".wards$", "RB"],
[".rmy$", "JJ"],
[".rol$", "NN"],
[".tors$", "NN"],
[".azy$", "JJ"],
[".where$", "RB"],
[".ify$", "VB"],
[".bound$", "JJ"],
[".ens$", "VB"],
[".oid$", "JJ"],
[".vice$", "NN"],
[".rough$", "JJ"],
[".mum$", "JJ"],
[".teen(th)?$", "CD"],
[".oses$", "VB"],
[".ishes$", "VB"],
[".ects$", "VB"],
[".tieth$", "CD"],
[".ices$", "NN"],
[".bles$", "VB"],
[".pose$", "VB"],
[".ions$", "NN"],
[".ean$", "JJ"],
[".[ia]sed$", "JJ"],
[".tized$", "VB"],
[".llen$", "JJ"],
[".fore$", "RB"],
[".ances$", "NN"],
[".gate$", "VB"],
[".nes$", "VB"],
[".less$", "RB"],
[".ried$", "JJ"],
[".gone$", "JJ"],
[".made$", "JJ"],
[".[pdltrkvyns]ing$", "JJ"],
[".tions$", "NN"],
[".tures$", "NN"],
[".ous$", "JJ"],
[".ports$", "NN"],
[". so$", "RB"],
[".ints$", "NN"],
[".[gt]led$", "JJ"],
["[aeiou].*ist$", "JJ"],
[".lked$", "VB"],
[".fully$", "RB"],
[".*ould$", "MD"],
["^-?[0-9]+(.[0-9]+)?$", "CD"],
["[a-z]*\\-[a-z]*\\-", "JJ"],
["[a-z]'s$", "NNO"],
[".'n$", "VB"],
[".'re$", "CP"],
[".'ll$", "MD"],
[".'t$", "VB"],
[".tches$", "VB"],
["^https?\:?\/\/[a-z0-9]", "CD"],//the colon is removed in normalisation
["^www\.[a-z0-9]", "CD"],
[".ize$", "VB"],
[".[^aeiou]ise$", "VB"],
[".[aeiou]te$", "VB"],
[".ea$", "NN"],
["[aeiou][pns]er$", "NN"],
[".ia$", "NN"],
[".sis$", "NN"],
[".[aeiou]na$", "NN"],
[".[^aeiou]ity$", "NN"],
[".[^aeiou]ium$", "NN"],
[".[^aeiou][ei]al$", "JJ"],
[".ffy$", "JJ"],
[".[^aeiou]ic$", "JJ"],
[".(gg|bb|zz)ly$", "JJ"],
[".[aeiou]my$", "JJ"],
[".[aeiou]ble$", "JJ"],
[".[^aeiou]ful$", "JJ"],
[".[^aeiou]ish$", "JJ"],
[".[^aeiou]ica$", "NN"],
["[aeiou][^aeiou]is$", "NN"],
["[^aeiou]ard$", "NN"],
["[^aeiou]ism$", "NN"],
[".[^aeiou]ity$", "NN"],
[".[^aeiou]ium$", "NN"],
[".[lstrn]us$", "NN"],
["..ic$", "JJ"],
["[aeiou][^aeiou]id$", "JJ"],
[".[^aeiou]ish$", "JJ"],
[".[^aeiou]ive$", "JJ"],
["[ea]{2}zy$", "JJ"],
].map(function(a) {
return {
reg: new RegExp(a[0], "i"),
pos: a[1]
}
})
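// each entry becomes {reg, pos}, e.g. the first one is {reg: /.[cts]hy$/i, pos: "JJ"},
// so a word like "flashy" signals an adjective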
},{}],17:[function(require,module,exports){
// convert british spellings into american ones
// built with patterns+exceptions from https://en.wikipedia.org/wiki/British_spelling
module.exports = function (str) {
var patterns = [
// ise -> ize
{
reg: /([^aeiou][iy])s(e|ed|es|ing)?$/,
repl: '$1z$2'
},
// our -> or
{
reg: /(..)our(ly|y|ite)?$/,
repl: '$1or$2'
},
// re -> er
{
reg: /([^cdnv])re(s)?$/,
repl: '$1er$2'
},
// xion -> tion
{
reg: /([aeiou])xion([ed])?$/,
repl: '$1tion$2'
},
//logue -> log
{
reg: /logue$/,
repl: 'log'
},
// ae -> e
{
reg: /([o|a])e/,
repl: 'e'
},
//eing -> ing
{
reg: /e(ing|able)$/,
repl: '$1'
},
// illful -> ilful
{
reg: /([aeiou]+[^aeiou]+[aeiou]+)ll(ful|ment|est|ing|or|er|ed)$/, //must be second-syllable
repl: '$1l$2'
}
]
for (var i = 0; i < patterns.length; i++) {
if (str.match(patterns[i].reg)) {
return str.replace(patterns[i].reg, patterns[i].repl)
}
}
return str
}
// console.log(americanize("synthesise")=="synthesize")
// console.log(americanize("synthesised")=="synthesized")
},{}],18:[function(require,module,exports){
// convert american spellings into british ones
// built with patterns+exceptions from https://en.wikipedia.org/wiki/British_spelling
// (some patterns are only safe to do in one direction)
module.exports = function (str) {
var patterns = [
// ise -> ize
{
reg: /([^aeiou][iy])z(e|ed|es|ing)?$/,
repl: '$1s$2'
},
// our -> or
// {
// reg: /(..)our(ly|y|ite)?$/,
// repl: '$1or$2',
// exceptions: []
// },
// re -> er
// {
// reg: /([^cdnv])re(s)?$/,
// repl: '$1er$2',
// exceptions: []
// },
// xion -> tion
// {
// reg: /([aeiou])xion([ed])?$/,
// repl: '$1tion$2',
// exceptions: []
// },
//logue -> log
// {
// reg: /logue$/,
// repl: 'log',
// exceptions: []
// },
// ae -> e
// {
// reg: /([o|a])e/,
// repl: 'e',
// exceptions: []
// },
//eing -> ing
// {
// reg: /e(ing|able)$/,
// repl: '$1',
// exceptions: []
// },
// illful -> ilful
{
reg: /([aeiou]+[^aeiou]+[aeiou]+)l(ful|ment|est|ing|or|er|ed)$/, //must be second-syllable
repl: '$1ll$2',
exceptions: []
}
]
for (var i = 0; i < patterns.length; i++) {
if (str.match(patterns[i].reg)) {
return str.replace(patterns[i].reg, patterns[i].repl)
}
}
return str
}
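// spot-check, assuming this export is required as 'britishize' (mirroring the americanize tests above):
// console.log(britishize("synthesize") === "synthesise")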
},{}],19:[function(require,module,exports){
//chop a string into pronounced syllables
module.exports = function (str) {
var all = []
//suffix fixes
var postprocess = function (arr) {
//trim whitespace
arr = arr.map(function (w) {
w = w.replace(/^ */, '')
w = w.replace(/ *$/, '')
return w
})
if (arr.length > 2) {
return arr
}
var ones = [
/^[^aeiou]?ion/,
/^[^aeiou]?ised/,
/^[^aeiou]?iled/
]
var l = arr.length
if (l > 1) {
var suffix = arr[l - 2] + arr[l - 1];
for (var i = 0; i < ones.length; i++) {
if (suffix.match(ones[i])) {
arr[l - 2] = arr[l - 2] + arr[l - 1];
arr.pop();
}
}
}
return arr
}
var doer = function (str) {
var vow = /[aeiouy]$/
if (!str) {
return
}
var chars = str.split('')
var before = "";
var after = "";
var current = "";
for (var i = 0; i < chars.length; i++) {
before = chars.slice(0, i).join('')
current = chars[i]
after = chars.slice(i + 1, chars.length).join('')
var candidate = before + chars[i]
//rules for syllables-
//it's a consonant that comes after a vowel
if (before.match(vow) && !current.match(vow)) {
if (after.match(/^e[sm]/)) {
candidate += "e"
after = after.replace(/^e/, '')
}
all.push(candidate)
return doer(after)
}
//unblended vowels ('noisy' vowel combinations)
if (candidate.match(/(eo|eu|ia|oa|ua|ui)$/i)) { //'io' is noisy, not in 'ion'
all.push(before)
all.push(current)
return doer(after)
}
}
//if still running, end last syllable
if (str.match(/[aiouy]/) || str.match(/ee$/)) { //allow silent trailing e
all.push(str)
} else {
all[all.length - 1] = (all[all.length - 1] || '') + str; //append it to the last one
}
}
str.split(/\s\-/).forEach(function (s) {
doer(s)
})
all = postprocess(all)
//for words like 'tree' and 'free'
if (all.length === 0) {
all = [str]
}
return all
}
// console.log(syllables("suddenly").length === 3)
// console.log(syllables("tree"))
//broken
// console.log(syllables("birchtree"))
},{}],20:[function(require,module,exports){
//split a string into its n-grams (word sequences of length 1..max_size), with counts
module.exports = function (text, options) {
options = options || {}
var min_count = options.min_count || 1; // minimum hit-count
var max_size = options.max_size || 5; // maximum gram count
var REallowedChars = /[^a-zA-Z'\-]+/g; //Invalid characters are replaced with a whitespace
var i, j, k, textlen, s;
var keys = [null];
var results = [];
//max_size++;
for (i = 1; i <= max_size; i++) {
keys.push({});
}
// clean the text
text = text.replace(REallowedChars, " ").replace(/^\s+/, "").replace(/\s+$/, "");
text = text.toLowerCase()
// Create a hash
text = text.split(/\s+/);
for (i = 0, textlen = text.length; i < textlen; i++) {
s = text[i];
keys[1][s] = (keys[1][s] || 0) + 1;
for (j = 2; j <= max_size; j++) {
if (i + j <= textlen) {
s += " " + text[i + j - 1];
keys[j][s] = (keys[j][s] || 0) + 1;
} else {
break
}
}
}
// map to array
i = undefined;
for (k = 1; k <= max_size; k++) {
results[k] = [];
var key = keys[k];
for (i in key) {
if (key.hasOwnProperty(i) && key[i] >= min_count) {
results[k].push({
"word": i,
"count": key[i],
"size": k
})
}
}
}
results = results.filter(function (s) {
return s !== null
})
results = results.map(function (r) {
r = r.sort(function (a, b) {
return b.count - a.count
})
return r;
});
return results
}
// s = ngram("i really think that we all really think it's all good")
// console.log(s)
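// each entry of the nested result looks like {word, count, size}, sorted by descending count;
// for the string above, the unigrams include {word: "really", count: 2, size: 1}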
},{}],21:[function(require,module,exports){
//(Rule-based sentence boundary segmentation) - chop given text into its proper sentences.
// Ignore periods/questions/exclamations used in acronyms/abbreviations/numbers, etc.
// @spencermountain 2015 MIT
module.exports = function(text) {
var abbreviations = require("../../data/lexicon/abbreviations")
var sentences = [];
//first do a greedy-split..
var chunks = text.split(/(\S.+?[.\?!])(?=\s+|$|")/g);
//date abbrevs.
//these are added separately because they are not nouns
abbreviations = abbreviations.concat(["jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec", "sept", "sep"]);
//detection of non-sentence chunks
var abbrev_reg = new RegExp("\\b(" + abbreviations.join("|") + ")[.!?] ?$", "i");
var acronym_reg= new RegExp("[ |\.][A-Z]\.?$", "i")
var elipses_reg= new RegExp("\\.\\.\\.*$")
//loop through these chunks, and join the non-sentence chunks back together..
var chunks_length = chunks.length;
for (var i = 0; i < chunks_length; i++) {
if (chunks[i]) {
//trim whitespace
chunks[i] = chunks[i].replace(/^\s+|\s+$/g, "");
//should this chunk be combined with the next one?
if (chunks[i+1] && chunks[i].match(abbrev_reg) || chunks[i].match(acronym_reg) || chunks[i].match(elipses_reg) ) {
chunks[i + 1] = ((chunks[i]||'') + " " + (chunks[i + 1]||'')).replace(/ +/g, " ");
} else if(chunks[i] && chunks[i].length>0){ //this chunk is a proper sentence..
sentences.push(chunks[i]);
chunks[i] = "";
}
}
}
//if we never got a sentence, return the given text
if (sentences.length === 0) {
return [text]
}
return sentences;
}
// console.log(sentence_parser('Tony is nice. He lives in Japan.').length === 2)
// console.log(sentence_parser('I like that Color').length === 1)
// console.log(sentence_parser("She was dead. He was ill.").length === 2)
// console.log(sentence_parser("i think it is good ... or else.").length == 1)
},{"../../data/lexicon/abbreviations":3}],22:[function(require,module,exports){
//split a string into 'words' - as intended to be most helpful for this library.
var sentence_parser = require("./sentence")
var multiples = require("../../data/lexicon/multiples")
//these expressions ought to be one token, not two, because they are a distinct POS together
var multi_words = Object.keys(multiples).map(function (m) {
return m.split(' ')
})
var normalise = function (str) {
if (!str) {
return ""
}
str = str.toLowerCase()
str = str.replace(/[,\.!:;\?\(\)]/, '')
str = str.replace(/’/g, "'")
str = str.replace(/"/g, "")
if (!str.match(/[a-z0-9]/i)) {
return ''
}
return str
}
var sentence_type = function (sentence) {
if (sentence.match(/\?$/)) {
return "interrogative";
} else if (sentence.match(/\!$/)) {
return "exclamative";
} else {
return "declarative";
}
}
//some multi-word tokens should be combined here
var combine_multiples = function (arr) {
var better = []
var normalised = arr.map(function (a) {
return normalise(a)
}) //cached results
for (var i = 0; i < arr.length; i++) {
for (var o = 0; o < multi_words.length; o++) {
if (arr[i + 1] && normalised[i] === multi_words[o][0] && normalised[i + 1] === multi_words[o][1]) { //
//we have a match
arr[i] = arr[i] + ' ' + arr[i + 1]
arr[i + 1] = null
break
}
}
better.push(arr[i])
}
return better.filter(function (w) {
return w
})
}
var tokenize = function (str) {
var sentences = sentence_parser(str)
return sentences.map(function (sentence) {
var arr = sentence.split(' ');
arr = combine_multiples(arr)
var tokens = arr.map(function (w, i) {
return {
text: w,
normalised: normalise(w),
title_case: (w.match(/^[A-Z][a-z]/) !== null), //use for merge-tokens
noun_capital: i > 0 && (w.match(/^[A-Z][a-z]/) !== null), //use for noun signal
punctuated: (w.match(/[,;:\(\)"]/) !== null) || undefined,
end: (i === (arr.length - 1)) || undefined,
start: (i === 0) || undefined
}
})
return {
sentence: sentence,
tokens: tokens,
type: sentence_type(sentence)
}
})
}
module.exports = tokenize
// console.log(tokenize("i live in new york")[0].tokens.length==4)
// console.log(tokenize("I speak optimistically of course.")[0].tokens.length==4)
// console.log(tokenize("Joe is 9")[0].tokens.length==3)
// console.log(tokenize("Joe in Toronto")[0].tokens.length==3)
// console.log(tokenize("I am mega-rich")[0].tokens.length==3)
},{"../../data/lexicon/multiples":9,"./sentence":21}],23:[function(require,module,exports){
// a hugely-ignorant, and widely subjective transliteration of latin, cyrillic, and greek unicode characters to english ascii.
//http://en.wikipedia.org/wiki/List_of_Unicode_characters
//https://docs.google.com/spreadsheet/ccc?key=0Ah46z755j7cVdFRDM1A2YVpwa1ZYWlpJM2pQZ003M0E
//approximate visual (not semantic) relationship between unicode and ascii characters
var compact = {
"2": "²ƻ",
"3": "³ƷƸƹƺǮǯЗҘҙӞӟӠӡȜȝ",
"5": "Ƽƽ",
"8": "Ȣȣ",
"!": "¡",
"?": "¿Ɂɂ",
"a": "ªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺΆΑΔΛάαλАДадѦѧӐӑӒӓƛɅ",
"b": "ßþƀƁƂƃƄƅɃΒβϐϦБВЪЬбвъьѢѣҌҍҔҕƥƾ",
"c": "¢©ÇçĆćĈĉĊċČčƆƇƈȻȼͻͼͽϲϹϽϾϿЄСсєҀҁҪҫ",
"d": "ÐĎďĐđƉƊȡƋƌǷ",
"e": "ÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏƐǝȄȅȆȇȨȩɆɇΈΕΞΣέεξϱϵ϶ЀЁЕЭеѐёҼҽҾҿӖӗӘәӚӛӬӭ",
"f": "ƑƒϜϝӺӻ",
"g": "ĜĝĞğĠġĢģƓǤǥǦǧǴǵ",
"h": "ĤĥĦħƕǶȞȟΉΗЂЊЋНнђћҢңҤҥҺһӉӊ",
"I": "ÌÍÎÏ",
"i": "ìíîïĨĩĪīĬĭĮįİıƖƗȈȉȊȋΊΐΪίιϊІЇії",
"j": "ĴĵǰȷɈɉϳЈј",
"k": "ĶķĸƘƙǨǩΚκЌЖКжкќҚқҜҝҞҟҠҡ",
"l": "ĹĺĻļĽľĿŀŁłƚƪǀǏǐȴȽΙӀӏ",
"m": "ΜϺϻМмӍӎ",
"n": "ÑñŃńŅņŇňʼnŊŋƝƞǸǹȠȵΝΠήηϞЍИЙЛПийлпѝҊҋӅӆӢӣӤӥπ",
"o": "ÒÓÔÕÖØðòóôõöøŌōŎŏŐőƟƠơǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱΌΘΟΦΩδθοσόϕϘϙϬϭϴОФоѲѳѺѻѼѽӦӧӨөӪӫ¤ƍΏ",
"p": "ƤƿΡρϷϸϼРрҎҏÞ",
"q": "Ɋɋ",
"r": "ŔŕŖŗŘřƦȐȑȒȓɌɍЃГЯгяѓҐґҒғӶӷſ",
"s": "ŚśŜŝŞşŠšƧƨȘșȿςϚϛϟϨϩЅѕ",
"t": "ŢţŤťŦŧƫƬƭƮȚțȶȾΓΤτϮϯТт҂Ҭҭ",
"u": "µÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱƲǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄΰμυϋύϑЏЦЧцџҴҵҶҷҸҹӋӌӇӈ",
"v": "ƔνѴѵѶѷ",
"w": "ŴŵƜωώϖϢϣШЩшщѡѿ",
"x": "×ΧχϗϰХхҲҳӼӽӾӿ",
"y": "¥ÝýÿŶŷŸƳƴȲȳɎɏΎΥΨΫγψϒϓϔЎУучўѰѱҮүҰұӮӯӰӱӲӳ",
"z": "ŹźŻżŽžƩƵƶȤȥɀΖζ"
}
//decompress data into an array
var data = []
Object.keys(compact).forEach(function (k) {
compact[k].split('').forEach(function (s) {
data.push([s, k])
})
})
//convert array to two hashes
var normaler = {}
var greek = {}
data.forEach(function (arr) {
normaler[arr[0]] = arr[1]
greek[arr[1]] = arr[0]
})
var normalize = function (str, options) {
options = options || {}
options.percentage = options.percentage || 50
var arr = str.split('').map(function (s) {
var r = Math.random() * 100
if (normaler[s] && r < options.percentage) {
return normaler[s] || s
} else {
return s
}
})
return arr.join('')
}
var denormalize = function (str, options) {
options = options || {}
options.percentage = options.percentage || 50
var arr = str.split('').map(function (s) {
var r = Math.random() * 100
if (greek[s] && r < options.percentage) {
return greek[s] || s
} else {
return s
}
})
return arr.join('')
}
module.exports = {
normalize: normalize,
denormalize: denormalize
}
// s = "ӳžŽżźŹźӳžŽżźŹźӳžŽżźŹźӳžŽżźŹźӳžŽżźŹź"
// s = "Björk"
// console.log(normalize.normalize(s, {
// percentage: 100
// }))
// s = "The quick brown fox jumps over the lazy dog"
// console.log(normalize.denormalize(s, {
// percentage: 100
// }))
},{}],24:[function(require,module,exports){
//these are adjectives that can become comparative + superlative without "most/more"
//it's a whitelist for conjugation
//this data is shared between comparative/superlative methods
module.exports= [
"absurd",
"aggressive",
"alert",
"alive",
"awesome",
"beautiful",
"big",
"bitter",
"black",
"blue",
"bored",
"boring",
"brash",
"brave",
"brief",
"bright",
"broad",
"brown",
"calm",
"charming",
"cheap",
"clean",
"cold",
"cool",
"cruel",
"cute",
"damp",
"deep",
"dear",
"dead",
"dark",
"dirty",
"drunk",
"dull",
"eager",
"efficient",
"even",
"faint",
"fair",
"fanc",
"fast",
"fat",
"feeble",
"few",
"fierce",
"fine",
"flat",
"forgetful",
"frail",
"full",
"gentle",
"glib",
"great",
"green",
"gruesome",
"handsome",
"hard",
"harsh",
"high",
"hollow",
"hot",
"impolite",
"innocent",
"keen",
"kind",
"lame",
"lean",
"light",
"little",
"loose",
"long",
"loud",
"low",
"lush",
"macho",
"mean",
"meek",
"mellow",
"mundane",
"near",
"neat",
"new",
"nice",
"normal",
"odd",
"old",
"pale",
"pink",
"plain",
"poor",
"proud",
"purple",
"quick",
"rare",
"rapid",
"red",
"rich",
"ripe",
"rotten",
"round",
"rude",
"sad",
"safe",
"scarce",
"scared",
"shallow",
"sharp",
"short",
"shrill",
"simple",
"slim",
"slow",
"small",
"smart",
"smooth",
"soft",
"sore",
"sour",
"square",
"stale",
"steep",
"stiff",
"straight",
"strange",
"strong",
"sweet",
"swift",
"tall",
"tame",
"tart",
"tender",
"tense",
"thick",
"thin",
"tight",
"tough",
"vague",
"vast",
"vulgar",
"warm",
"weak",
"wet",
"white",
"wide",
"wild",
"wise",
"young",
"yellow",
"easy",
"narrow",
"late",
"early",
"soon",
"close",
"empty",
"dry",
"windy",
"noisy",
"thirsty",
"hungry",
"fresh",
"quiet",
"clear",
"heavy",
"happy",
"funny",
"lucky",
"pretty",
"important",
"interesting",
"attractive",
"dangerous",
"intellegent",
"pure",
"orange",
"large",
"firm",
"grand",
"formal",
"raw",
"weird",
"glad",
"mad",
"strict",
"tired",
"solid",
"extreme",
"mature",
"true",
"free",
"curly",
"angry"
].reduce(function(h,s){
h[s]=true
return h
},{})
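// the resulting hash is a simple membership test, e.g.:
// module.exports["quick"] === true  (so 'quick' may take -er/-est directly)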
},{}],25:[function(require,module,exports){
//turn 'quick' into 'quickly'
var main = function (str) {
var irregulars = {
"idle": "idly",
"public": "publicly",
"vague": "vaguely",
"day": "daily",
"icy": "icily",
"single": "singly",
"female": "womanly",
"male": "manly",
"simple": "simply",
"whole": "wholly",
"special": "especially",
"straight": "straight",
"wrong": "wrong",
"fast": "fast",
"hard": "hard",
"late": "late",
"early": "early",
"well": "well",
"best": "best",
"latter": "latter",
"bad": "badly"
}
var dont = {
"foreign": 1,
"black": 1,
"modern": 1,
"next": 1,
"difficult": 1,
"degenerate": 1,
"young": 1,
"awake": 1,
"back": 1,
"blue": 1,
"brown": 1,
"orange": 1,
"complex": 1,
"cool": 1,
"dirty": 1,
"done": 1,
"empty": 1,
"fat": 1,
"fertile": 1,
"frozen": 1,
"gold": 1,
"grey": 1,
"gray": 1,
"green": 1,
"medium": 1,
"parallel": 1,
"outdoor": 1,
"unknown": 1,
"undersized": 1,
"used": 1,
"welcome": 1,
"yellow": 1,
"white": 1,
"fixed": 1,
"mixed": 1,
"super": 1,
"guilty": 1,
"tiny": 1,
"able": 1,
"unable": 1,
"same": 1,
"adult": 1
}
var transforms = [{
reg: /al$/i,
repl: 'ally'
}, {
reg: /ly$/i,
repl: 'ly'
}, {
reg: /(.{3})y$/i,
repl: '$1ily'
}, {
reg: /que$/i,
repl: 'quely'
}, {
reg: /ue$/i,
repl: 'uly'
}, {
reg: /ic$/i,
repl: 'ically'
}, {
reg: /ble$/i,
repl: 'bly'
}, {
reg: /l$/i,
repl: 'ly'
}]
var not_matches = [
/airs$/,
/ll$/,
/ee.$/,
/ile$/
]
if (dont[str]) {
return null
}
if (irregulars[str]) {
return irregulars[str]
}
if (str.length <= 3) {
return null
}
var i;
for (i = 0; i < not_matches.length; i++) {
if (str.match(not_matches[i])) {
return null
}
}
for (i = 0; i < transforms.length; i++) {
if (str.match(transforms[i].reg)) {
return str.replace(transforms[i].reg, transforms[i].repl)
}
}
return str + 'ly'
}
module.exports = main;
// console.log(adj_to_adv('direct'))
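// expected output of the default '+ly' rule:
// console.log(adj_to_adv('quick') === 'quickly')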
},{}],26:[function(require,module,exports){
//turn 'quick' into 'quicker'
var convertables = require("./convertables")
var main = function (str) {
var irregulars = {
"grey": "greyer",
"gray": "grayer",
"green": "greener",
"yellow": "yellower",
"red": "redder",
"good": "better",
"well": "better",
"bad": "worse",
"sad": "sadder"
}
var dont = {
"overweight": 1,
"main": 1,
"nearby": 1,
"asleep": 1,
"weekly": 1,
"secret": 1,
"certain": 1
}
var transforms = [{
reg: /y$/i,
repl: 'ier'
}, {
reg: /([aeiou])t$/i,
repl: '$1tter'
}, {
reg: /([aeou])de$/i,
repl: '$1der'
}, {
reg: /nge$/i,
repl: 'nger'
}]
var matches = [
/ght$/,
/nge$/,
/ough$/,
/ain$/,
/uel$/,
/[au]ll$/,
/ow$/,
/old$/,
/oud$/,
/e[ae]p$/
]
var not_matches = [
/ary$/,
/ous$/
]
if (dont.hasOwnProperty(str)) {
return null
}
for (i = 0; i < transforms.length; i++) {
if (str.match(transforms[i].reg)) {
return str.replace(transforms[i].reg, transforms[i].repl)
}
}
if (convertables.hasOwnProperty(str)) {
if (str.match(/e$/)) {
return str + "r"
} else {
return str + "er"
}
}
if (irregulars.hasOwnProperty(str)) {
return irregulars[str]
}
var i;
for (i = 0; i < not_matches.length; i++) {
if (str.match(not_matches[i])) {
return "more " + str
}
}
for (i = 0; i < matches.length; i++) {
if (str.match(matches[i])) {
return str + "er"
}
}
return "more " + str
}
module.exports = main;
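// expected outputs for the /y$/ transform and the 'more ' fallback:
// console.log(to_comparative('happy') === 'happier')
// console.log(to_comparative('famous') === 'more famous')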
},{"./convertables":24}],27:[function(require,module,exports){
//convert cute to cuteness
module.exports = function (w) {
var irregulars = {
"clean": "cleanliness",
"naivety": "naivety"
};
if (!w) {
return "";
}
if (irregulars.hasOwnProperty(w)) {
return irregulars[w];
}
if (w.match(" ")) {
return w;
}
if (w.match(/w$/)) {
return w;
}
var transforms = [{
"reg": /y$/,
"repl": 'iness'
}, {
"reg": /le$/,
"repl": 'ility'
}, {
"reg": /ial$/,
"repl": 'y'
}, {
"reg": /al$/,
"repl": 'ality'
}, {
"reg": /ting$/,
"repl": 'ting'
}, {
"reg": /ring$/,
"repl": 'ring'
}, {
"reg": /bing$/,
"repl": 'bingness'
}, {
"reg": /sing$/,
"repl": 'se'
}, {
"reg": /ing$/,
"repl": 'ment'
}, {
"reg": /ess$/,
"repl": 'essness'
}, {
"reg": /ous$/,
"repl": 'ousness'
}, ]
for (var i = 0; i < transforms.length; i++) {
if (w.match(transforms[i].reg)) {
return w.replace(transforms[i].reg, transforms[i].repl);
}
}
if (w.match(/s$/)) {
return w;
}
return w + "ness";
};
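// expected output of the default '+ness' fallback:
// console.log(to_noun('cute') === 'cuteness')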
},{}],28:[function(require,module,exports){
//turn 'quick' into 'quickest'
var convertables = require("./convertables")
module.exports = function (str) {
var irregulars = {
"nice": "nicest",
"late": "latest",
"hard": "hardest",
"inner": "innermost",
"outer": "outermost",
"far": "furthest",
"worse": "worst",
"bad": "worst",
"good": "best"
}
var dont = {
"overweight": 1,
"ready": 1
}
var transforms = [{
"reg": /y$/i,
"repl": 'iest'
}, {
"reg": /([aeiou])t$/i,
"repl": '$1ttest'
}, {
"reg": /([aeou])de$/i,
"repl": '$1dest'
}, {
"reg": /nge$/i,
"repl": 'ngest'
}]
var matches = [
/ght$/,
/nge$/,
/ough$/,
/ain$/,
/uel$/,
/[au]ll$/,
/ow$/,
/oud$/,
/...p$/
]
var not_matches = [
/ary$/
]
var generic_transformation = function (str) {
if (str.match(/e$/)) {
return str + "st"
} else {
return str + "est"
}
}
for (i = 0; i < transforms.length; i++) {
if (str.match(transforms[i].reg)) {
return str.replace(transforms[i].reg, transforms[i].repl)
}
}
if (convertables.hasOwnProperty(str)) {
return generic_transformation(str)
}
if (dont.hasOwnProperty(str)) {
return "most " + str
}
if (irregulars.hasOwnProperty(str)) {
return irregulars[str]
}
var i;
for (i = 0; i < not_matches.length; i++) {
if (str.match(not_matches[i])) {
return "most " + str
}
}
for (i = 0; i < matches.length; i++) {
if (str.match(matches[i])) {
return generic_transformation(str)
}
}
return "most " + str
}
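// expected output of the /y$/ transform:
// console.log(to_superlative('happy') === 'happiest')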
},{"./convertables":24}],29:[function(require,module,exports){
//wrapper for Adjective's methods
var Adjective = function (str, sentence, word_i) {
var the = this
the.word = str || '';
var to_comparative = require("./conjugate/to_comparative")
var to_superlative = require("./conjugate/to_superlative")
var adj_to_adv = require("./conjugate/to_adverb")
var adj_to_noun = require("./conjugate/to_noun")
var parts_of_speech = require("../../data/parts_of_speech")
the.conjugate = function () {
return {
comparative: to_comparative(the.word),
superlative: to_superlative(the.word),
adverb: adj_to_adv(the.word),
noun: adj_to_noun(the.word)
}
}
the.which = (function () {
if (the.word.match(/..est$/)) {
return parts_of_speech['JJS']
}
if (the.word.match(/..er$/)) {
return parts_of_speech['JJR']
}
return parts_of_speech['JJ']
})()
return the;
};
module.exports = Adjective;
// console.log(new Adjective("crazy"))
},{"../../data/parts_of_speech":14,"./conjugate/to_adverb":25,"./conjugate/to_comparative":26,"./conjugate/to_noun":27,"./conjugate/to_superlative":28}],30:[function(require,module,exports){
//turns 'quickly' into 'quick'
module.exports = function (str) {
var irregulars = {
"idly": "idle",
"sporadically": "sporadic",
"basically": "basic",
"grammatically": "grammatical",
"alphabetically": "alphabetical",
"economically": "economical",
"conically": "conical",
"politically": "political",
"vertically": "vertical",
"practically": "practical",
"theoretically": "theoretical",
"critically": "critical",
"fantastically": "fantastic",
"mystically": "mystical",
"pornographically": "pornographic",
"fully": "full",
"jolly": "jolly",
"wholly": "whole"
}
var transforms = [{
"reg": /bly$/i,
"repl": 'ble'
}, {
"reg": /gically$/i,
"repl": 'gical'
}, {
"reg": /([rsdh])ically$/i,
"repl": '$1ical'
}, {
"reg": /ically$/i,
"repl": 'ic'
}, {
"reg": /uly$/i,
"repl": 'ue'
}, {
"reg": /ily$/i,
"repl": 'y'
}, {
"reg": /(.{3})ly$/i,
"repl": '$1'
}]
if (irregulars.hasOwnProperty(str)) {
return irregulars[str]
}
for (var i = 0; i < transforms.length; i++) {
if (str.match(transforms[i].reg)) {
return str.replace(transforms[i].reg, transforms[i].repl)
}
}
return str
}
// console.log(to_adjective('quickly') === 'quick')
// console.log(to_adjective('marvelously') === 'marvelous')
},{}],31:[function(require,module,exports){
//wrapper for Adverb's methods
var Adverb = function (str, sentence, word_i) {
var the = this
the.word = str || '';
var to_adjective = require("./conjugate/to_adjective")
var parts_of_speech = require("../../data/parts_of_speech")
the.conjugate = function () {
return {
adjective: to_adjective(the.word)
}
}
the.which = (function () {
if (the.word.match(/..est$/)) {
return parts_of_speech['RBS']
}
if (the.word.match(/..er$/)) {
return parts_of_speech['RBR']
}
return parts_of_speech['RB']
})()
return the;
}
module.exports = Adverb;
// console.log(new Adverb("suddenly").conjugate())
// console.log(adverbs.conjugate('powerfully'))
},{"../../data/parts_of_speech":14,"./conjugate/to_adjective":30}],32:[function(require,module,exports){
//converts nouns between plural and singular, and vice versa
//some regex borrowed from pksunkara/inflect
//https://github.com/pksunkara/inflect/blob/master/lib/defaults.js
var uncountables = require("../../../data/lexicon/uncountables")
var irregular_nouns = require("../../../data/lexicon/irregular_nouns")
var i;
//words that shouldn't ever inflect, for metaphysical reasons
var uncountable_nouns = uncountables.reduce(function (h, a) {
h[a] = true
return h
}, {})
var titlecase = function (str) {
if (!str) {
return ''
}
return str.charAt(0).toUpperCase() + str.slice(1)
}
//these aren't nouns, but let's inflect them anyway
var irregulars = [
["he", "they"],
["she", "they"],
["this", "these"],
["that", "these"],
["mine", "ours"],
["hers", "theirs"],
["his", "theirs"],
["i", "we"],
["move", "_s"],
["myself", "ourselves"],
["yourself", "yourselves"],
["himself", "themselves"],
["herself", "themselves"],
["themself", "themselves"],
["its", "theirs"],
["theirs", "_"]
]
irregulars = irregulars.concat(irregular_nouns)
var pluralize_rules = [
[/(ax|test)is$/i, '$1es'],
[/(octop|vir|radi|nucle|fung|cact|stimul)us$/i, '$1i'],
[/(octop|vir)i$/i, '$1i'],
[/([rl])f$/i, '$1ves'],
[/(alias|status)$/i, '$1es'],
[/(bu)s$/i, '$1ses'],
[/(al|ad|at|er|et|ed|ad)o$/i, '$1oes'],
[/([ti])um$/i, '$1a'],
[/([ti])a$/i, '$1a'],
[/sis$/i, 'ses'],
[/(?:([^f])fe|([lr])f)$/i, '$1ves'],
[/(hive)$/i, '$1s'],
[/([^aeiouy]|qu)y$/i, '$1ies'],
[/(x|ch|ss|sh|s|z)$/i, '$1es'],
[/(matr|vert|ind|cort)(ix|ex)$/i, '$1ices'],
[/([m|l])ouse$/i, '$1ice'],
[/([m|l])ice$/i, '$1ice'],
[/^(ox)$/i, '$1en'],
[/^(oxen)$/i, '$1'],
[/(quiz)$/i, '$1zes'],
[/(antenn|formul|nebul|vertebr|vit)a$/i, '$1ae'],
[/(sis)$/i, 'ses'],
[/^(?!talis|.*hu)(.*)man$/i, '$1men'],
[/(.*)/i, '$1s']
].map(function (a) {
return {
reg: a[0],
repl: a[1]
}
})
var pluralize = function (str) {
var low = str.toLowerCase()
//uncountable
if (uncountable_nouns[low]) {
return str
}
//is it already plural?
if (is_plural(low) === true) {
return str
}
//irregular
var found = irregulars.filter(function (r) {
return r[0] === low
})
if (found[0]) {
if (titlecase(low) === str) { //handle capitalisation properly
return titlecase(found[0][1])
} else {
return found[0][1]
}
}
//inflect first word of preposition-phrase
if (str.match(/([a-z]*) (of|in|by|for) [a-z]/)) {
var first = (str.match(/^([a-z]*) (of|in|by|for) [a-z]/) || [])[1]
if (first) {
var better_first = pluralize(first)
return better_first + str.replace(first, '')
}
}
//regular
for (i = 0; i < pluralize_rules.length; i++) {
if (str.match(pluralize_rules[i].reg)) {
return str.replace(pluralize_rules[i].reg, pluralize_rules[i].repl)
}
}
}
var singularize_rules = [
[/([^v])ies$/i, '$1y'],
[/ises$/i, 'isis'],
[/ives$/i, 'ife'],
[/(antenn|formul|nebul|vertebr|vit)ae$/i, '$1a'],
[/(octop|vir|radi|nucle|fung|cact|stimul)(i)$/i, '$1us'],
[/(buffal|tomat|tornad)(oes)$/i, '$1o'],
[/((a)naly|(b)a|(d)iagno|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$/i, '$1sis'],
[/(vert|ind|cort)(ices)$/i, '$1ex'],
[/(matr|append)(ices)$/i, '$1ix'],
[/(x|ch|ss|sh|s|z|o)es$/i, '$1'],
[/men$/i, 'man'],
[/(n)ews$/i, '$1ews'],
[/([ti])a$/i, '$1um'],
[/([^f])ves$/i, '$1fe'],
[/([lr])ves$/i, '$1f'],
[/([^aeiouy]|qu)ies$/i, '$1y'],
[/(s)eries$/i, '$1eries'],
[/(m)ovies$/i, '$1ovie'],
[/([m|l])ice$/i, '$1ouse'],
[/(cris|ax|test)es$/i, '$1is'],
[/(alias|status)es$/i, '$1'],
[/(ss)$/i, '$1'],
[/(ics)$/i, "$1"],
[/s$/i, '']
].map(function (a) {
return {
reg: a[0],
repl: a[1]
}
})
var singularize = function (str) {
var low = str.toLowerCase()
//uncountable
if (uncountable_nouns[low]) {
return str
}
//is it already singular?
if (is_plural(low) === false) {
return str
}
//irregular
var found = irregulars.filter(function (r) {
return r[1] === low
})
if (found[0]) {
if (titlecase(low) === str) { //handle capitalisation properly
return titlecase(found[0][0])
} else {
return found[0][0]
}
}
//inflect first word of preposition-phrase
if (str.match(/([a-z]*) (of|in|by|for) [a-z]/)) {
var first = str.match(/^([a-z]*) (of|in|by|for) [a-z]/)
if (first && first[1]) {
var better_first = singularize(first[1])
return better_first + str.replace(first[1], '')
}
}
//regular
for (i = 0; i < singularize_rules.length; i++) {
if (str.match(singularize_rules[i].reg)) {
return str.replace(singularize_rules[i].reg, singularize_rules[i].repl)
}
}
return str
}
var is_plural = function (str) {
str = (str || '').toLowerCase()
//handle 'mayors of chicago'
var preposition = str.match(/([a-z]*) (of|in|by|for) [a-z]/)
if (preposition && preposition[1]) {
str = preposition[1]
}
// if it's a known irregular case
for (i = 0; i < irregulars.length; i++) {
if (irregulars[i][1] === str) {
return true
}
if (irregulars[i][0] === str) {
return false
}
}
//similar to plural/singularize rules, but not the same
var plural_indicators = [
/(^v)ies$/i,
/ises$/i,
/ives$/i,
/(antenn|formul|nebul|vertebr|vit)ae$/i,
/(octop|vir|radi|nucle|fung|cact|stimul)i$/i,
/(buffal|tomat|tornad)oes$/i,
/(analy|ba|diagno|parenthe|progno|synop|the)ses$/i,
/(vert|ind|cort)ices$/i,
/(matr|append)ices$/i,
/(x|ch|ss|sh|s|z|o)es$/i,
/men$/i,
/news$/i,
/.tia$/i,
/(^f)ves$/i,
/(lr)ves$/i,
/(^aeiouy|qu)ies$/i,
/(m|l)ice$/i,
/(cris|ax|test)es$/i,
/(alias|status)es$/i,
/ics$/i
]
for (i = 0; i < plural_indicators.length; i++) {
if (str.match(plural_indicators[i])) {
return true
}
}
//similar to plural/singularize rules, but not the same
var singular_indicators = [
/(ax|test)is$/i,
/(octop|vir|radi|nucle|fung|cact|stimul)us$/i,
/(octop|vir)i$/i,
/(rl)f$/i,
/(alias|status)$/i,
/(bu)s$/i,
/(al|ad|at|er|et|ed|ad)o$/i,
/(ti)um$/i,
/(ti)a$/i,
/sis$/i,
/(?:(^f)fe|(lr)f)$/i,
/hive$/i,
/(^aeiouy|qu)y$/i,
/(x|ch|ss|sh|z)$/i,
/(matr|vert|ind|cort)(ix|ex)$/i,
/(m|l)ouse$/i,
/(m|l)ice$/i,
/(antenn|formul|nebul|vertebr|vit)a$/i,
/.sis$/i,
/^(?!talis|.*hu)(.*)man$/i
]
for (i = 0; i < singular_indicators.length; i++) {
if (str.match(singular_indicators[i])) {
return false
}
}
// 'looks pretty plural' rules
if (str.match(/s$/) && !str.match(/ss$/) && str.length > 3) { //needs some lovin'
return true
}
return false
}
var inflect = function (str) {
if (uncountable_nouns[str]) { //uncountables shouldn't ever inflect
return {
plural: str,
singular: str
}
}
if (is_plural(str)) {
return {
plural: str,
singular: singularize(str)
}
} else {
return {
singular: str,
plural: pluralize(str)
}
}
}
module.exports = {
inflect: inflect,
is_plural: is_plural,
singularize: singularize,
pluralize: pluralize
}
// console.log(inflect.singularize('kisses')=="kiss")
// console.log(inflect.singularize('kiss')=="kiss")
// console.log(inflect.singularize('children')=="child")
// console.log(inflect.singularize('child')=="child")
// console.log(inflect.pluralize('gas')=="gases")
// console.log(inflect.pluralize('narrative')=="narratives")
// console.log(inflect.singularize('gases')=="gas")
// console.log(inflect.pluralize('video')=="videos")
// console.log(inflect.pluralize('photo')=="photos")
// console.log(inflect.pluralize('stomach')=="stomachs")
// console.log(inflect.pluralize('database')=="databases")
// console.log(inflect.pluralize('kiss')=="kisses")
// console.log(inflect.pluralize('towns')=="towns")
// console.log(inflect.pluralize('mayor of chicago')=="mayors of chicago")
// console.log(inflect.inflect('Index').plural=='Indices')
// console.log(inflect.is_plural('octopus')==false)
// console.log(inflect.is_plural('octopi')==true)
// console.log(inflect.is_plural('eyebrow')==false)
// console.log(inflect.is_plural('eyebrows')==true)
// console.log(inflect.is_plural('child')==false)
// console.log(inflect.is_plural('children')==true)
// console.log(inflect.singularize('mayors of chicago')=="mayor of chicago")
},{"../../../data/lexicon/irregular_nouns":8,"../../../data/lexicon/uncountables":11}],33:[function(require,module,exports){
//chooses an indefinite article 'a/an' for a word
module.exports = function (str) {
if (!str) {
return null
}
var irregulars = {
"hour": "an",
"heir": "an",
"heirloom": "an",
"honest": "an",
"honour": "an",
"honor": "an",
"uber": "an" //german u
}
var is_acronym = function (s) {
//no periods
if (s.length <= 5 && s.match(/^[A-Z]*$/)) {
return true
}
//with periods
if (s.length >= 4 && s.match(/^([A-Z]\.)*$/)) {
return true
}
return false
}
//pronounced letters of acronyms that get an 'an'
var an_acronyms = {
A: true,
E: true,
F: true,
H: true,
I: true,
L: true,
M: true,
N: true,
O: true,
R: true,
S: true,
X: true
}
//'a' regexes
var a_regexs = [
/^onc?e/i, //'wu' sound of 'o'
/^u[bcfhjkqrstn][aeiou]/i, // 'yu' sound for hard 'u'
/^eul/i
];
//begin business time
////////////////////
//explicit irregular forms
if (irregulars.hasOwnProperty(str)) {
return irregulars[str]
}
//spelled-out acronyms
if (is_acronym(str) && an_acronyms.hasOwnProperty(str.substr(0, 1))) {
return "an"
}
//'a' regexes
for (var i = 0; i < a_regexs.length; i++) {
if (str.match(a_regexs[i])) {
return "a"
}
}
//basic vowel-startings
if (str.match(/^[aeiou]/i)) {
return "an"
}
return "a"
}
// console.log(indefinite_article("wolf") === "a")
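// the irregular list handles silent-h words:
// console.log(indefinite_article("hour") === "an")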
},{}],34:[function(require,module,exports){
//wrapper for noun's methods
var Noun = function (str, sentence, word_i) {
var the = this
var token, next;
if (sentence !== undefined && word_i !== undefined) {
token = sentence.tokens[word_i]
next = sentence.tokens[word_i + 1]
}
the.word = str || '';
var parts_of_speech = require("../../data/parts_of_speech")
var firstnames = require("../../data/lexicon/firstnames")
var honourifics = require("../../data/lexicon/honourifics")
var inflect = require("./conjugate/inflect")
var indefinite_article = require("./indefinite_article")
//personal pronouns
var prps = {
"it": "PRP",
"they": "PRP",
"i": "PRP",
"them": "PRP",
"you": "PRP",
"she": "PRP",
"me": "PRP",
"he": "PRP",
"him": "PRP",
"her": "PRP",
"us": "PRP",
"we": "PRP",
"thou": "PRP"
}
var blacklist = {
"itself": 1,
"west": 1,
"western": 1,
"east": 1,
"eastern": 1,
"north": 1,
"northern": 1,
"south": 1,
"southern": 1,
"the": 1,
"one": 1,
"your": 1,
"my": 1,
"today": 1,
"yesterday": 1,
"tomorrow": 1,
"era": 1,
"century": 1,
"it": 1
}
//for resolution of obama -> he -> his
var posessives = {
"his": "he",
"her": "she",
"hers": "she",
"their": "they",
"them": "they",
"its": "it"
}
the.is_acronym = function () {
var s = the.word
//no periods
if (s.length <= 5 && s.match(/^[A-Z]*$/)) {
return true
}
//with periods
if (s.length >= 4 && s.match(/^([A-Z]\.)*$/)) {
return true
}
return false
}
the.is_entity = function () {
if (!token) {
return false
}
if (token.normalised.length < 3 || !token.normalised.match(/[a-z]/i)) {
return false
}
//personal pronouns
if (prps[token.normalised]) {
return false
}
//blacklist
if (blacklist[token.normalised]) {
return false
}
//discredit specific noun forms
if (token.pos) {
if (token.pos.tag == "NNA") { //eg. 'singer'
return false
}
if (token.pos.tag == "NNO") { //eg. "spencer's"
return false
}
if (token.pos.tag == "NNG") { //eg. 'walking'
return false
}
if (token.pos.tag == "NNP") { //yes! eg. 'Edinburough'
return true
}
}
//distinct capital is very good signal
if (token.noun_capital) {
return true
}
//multiple-word nouns are very good signal
if (token.normalised.match(/ /)) {
return true
}
//if it has an acronym/abbreviation, like 'business ltd.'
if (token.normalised.match(/\./)) {
return true
}
//a short all-caps token looks like an acronym (and not just caps-lock)
if (token.normalised.length < 5 && token.text.match(/^[A-Z]*$/)) {
return true
}
//acronyms are a-ok
if (the.is_acronym()) {
return true
}
//else, be conservative
return false
}
the.conjugate = function () {
return inflect.inflect(the.word)
}
the.is_plural = function () {
return inflect.is_plural(the.word)
}
the.article = function () {
if (the.is_plural()) {
return "the"
} else {
return indefinite_article(the.word)
}
}
the.pluralize = function () {
return inflect.pluralize(the.word)
}
the.singularize = function () {
return inflect.singularize(the.word)
}
//uses common first-name list + honourifics to guess if this noun is the name of a person
the.is_person = function () {
var i, l;
//remove things that are often named after people
var blacklist = [
"center",
"centre",
"memorial",
"school",
"bridge",
"university",
"house",
"college",
"square",
"park",
"foundation",
"institute",
"club",
"museum",
"arena",
"stadium",
"ss",
"of",
"the",
"for",
"and",
"&",
"co",
"sons"
]
l = blacklist.length
for (i = 0; i < l; i++) {
if (the.word.match(new RegExp("\\b" + blacklist[i] + "\\b", "i"))) {
return false
}
}
//see if noun has an honourific, like 'jr.'
l = honourifics.length;
for (i = 0; i < l; i++) {
if (the.word.match(new RegExp("\\b" + honourifics[i] + "\\.?\\b", 'i'))) {
return true
}
}
//see if noun has a known first-name
var names = the.word.split(' ').map(function (a) {
return a.toLowerCase()
})
if (firstnames[names[0]]) {
return true
}
//(test middle name too, if there's one)
if (names.length > 2 && firstnames[names[1]]) {
return true
}
//if it has an initial between two words
if (the.word.match(/[a-z]{3,20} [a-z]\.? [a-z]{3,20}/i)) {
return true
}
return false
}
//decides if it deserves a he, she, they, or it
the.pronoun = function () {
//if it's a person try to classify male/female
if (the.is_person()) {
var names = the.word.split(' ').map(function (a) {
return a.toLowerCase()
})
if (firstnames[names[0]] === "m" || firstnames[names[1]] == "m") {
return "he"
}
if (firstnames[names[0]] === "f" || firstnames[names[1]] == "f") {
return "she"
}
//test some honourifics
if (the.word.match(/^(mrs|miss|ms|misses|mme|mlle)\.? /, 'i')) {
return "she"
}
if (the.word.match(/\b(mr|mister|sr|jr)\b/, 'i')) {
return "he"
}
//if it's a known unisex name, don't try to guess it. be safe.
if (firstnames[names[0]] === "a" || firstnames[names[1]] == "a") {
return "they"
}
//if we think it's a person, but still don't know the gender, do a little guessing
if (names[0].match(/[aeiy]$/)) { //if it ends in a 'ee or ah', female
return "she"
}
if (names[0].match(/[ou]$/)) { //if it ends in a 'oh or uh', male
return "he"
}
if (names[0].match(/(nn|ll|tt)/)) { //if it has double-consonants, female
return "she"
}
//fallback to 'singular-they'
return "they"
}
//not a person
if (the.is_plural()) {
return "they"
}
return "it"
}
//list of pronouns that refer to this named noun. "[obama] is cool, [he] is nice."
the.referenced_by = function () {
//if it's named-noun, look forward for the pronouns pointing to it -> '... he'
if (token && token.pos.tag !== "PRP" && token.pos.tag !== "PP") {
var prp = the.pronoun()
//look at rest of sentence
var interested = sentence.tokens.slice(word_i + 1, sentence.tokens.length)
//add next sentence too, could go further..
if (sentence.next) {
interested = interested.concat(sentence.next.tokens)
}
//find the matching pronouns, and break if another noun overwrites it
var matches = []
for (var i = 0; i < interested.length; i++) {
if (interested[i].pos.tag === "PRP" && (interested[i].normalised === prp || posessives[interested[i].normalised] === prp)) {
//this pronoun points at our noun
matches.push(interested[i])
} else if (interested[i].pos.tag === "PP" && posessives[interested[i].normalised] === prp) {
//this posessive pronoun ('his/her') points at our noun
matches.push(interested[i])
} else if (interested[i].pos.parent === "noun" && interested[i].analysis.pronoun() === prp) {
//this noun stops our further pursuit
break
}
}
return matches
}
return []
}
// a pronoun that points at a noun mentioned previously '[he] is nice'
the.reference_to = function () {
//if it's a pronoun, look backwards for the first mention '[obama]... <-.. [he]'
if (token && (token.pos.tag === "PRP" || token.pos.tag === "PP")) {
var prp = token.normalised
var possessives={
"his":"he",
"her":"she",
"their":"they"
}
if(possessives[prp]!==undefined){//support possessives
prp=possessives[prp]
}
//look at starting of this sentence
var interested = sentence.tokens.slice(0, word_i)
//add previous sentence, if applicable
if (sentence.last) {
interested = sentence.last.tokens.concat(interested)
}
//reverse the terms to loop through backward..
interested = interested.reverse()
for (var i = 0; i < interested.length; i++) {
//it's a match
if (interested[i].pos.parent === "noun" && interested[i].pos.tag !== "PRP" && interested[i].analysis.pronoun() === prp) {
return interested[i]
}
}
}
}
//specifically which pos it is
the.which = (function () {
//posessive
if (the.word.match(/'s$/)) {
return parts_of_speech['NNO']
}
//plural
// if (the.is_plural) {
// return parts_of_speech['NNS']
// }
//generic
return parts_of_speech['NN']
})()
return the;
}
module.exports = Noun;
// console.log(new Noun('farmhouse').is_entity())
// console.log(new Noun("FBI").is_acronym())
// console.log(new Noun("Tony Danza").is_person())
// console.log(new Noun("Tony Danza").pronoun()=="he")
// console.log(new Noun("Tanya Danza").pronoun()=="she")
// console.log(new Noun("mrs. Taya Danza").pronoun()=="she")
// console.log(new Noun("Gool Tanya Danza").pronoun()=="she")
// console.log(new Noun("illi G. Danza").pronoun()=="she")
// console.log(new Noun("horses").pronoun()=="they")
},{"../../data/lexicon/firstnames":6,"../../data/lexicon/honourifics":7,"../../data/parts_of_speech":14,"./conjugate/inflect":32,"./indefinite_article":33}],35:[function(require,module,exports){
//Parents are classes for each main part of speech, with appropriate methods
//load files if server-side, otherwise assume these are prepended already
var Adjective = require("./adjective/index");
var Noun = require("./noun/index");
var Adverb = require("./adverb/index");
var Verb = require("./verb/index");
var Value = require("./value/index");
var parents = {
adjective: function(str, next, last, token) {
return new Adjective(str, next, last, token)
},
noun: function(str, next, last, token) {
return new Noun(str, next, last, token)
},
adverb: function(str, next, last, token) {
return new Adverb(str, next, last, token)
},
verb: function(str, next, last, token) {
return new Verb(str, next, last, token)
},
value: function(str, next, last, token) {
return new Value(str, next, last, token)
},
glue: function(str, next, last, token) {
return {}
}
}
module.exports = parents;
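//for example, parents.noun("dog") returns a Noun wrapper with the noun methods above,
//while parents.glue("and") returns a plain empty object (glue words get no extra analysis).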
},{"./adjective/index":29,"./adverb/index":31,"./noun/index":34,"./value/index":37,"./verb/index":44}],36:[function(require,module,exports){
// #generates properly-formatted dates from free-text date forms
// #by spencer kelly 2014
var months = "(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|aug|sept|oct|nov|dec),?";
var days = "([0-9]{1,2}),?";
var years = "([12][0-9]{3})";
var to_obj = function (arr, places) {
return Object.keys(places).reduce(function (h, k) {
h[k] = arr[places[k]];
return h;
}, {});
}
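//for example, to_obj(["march 7-11 1987", "march", "7", "11", "1987"], {month: 1, day: 2, to_day: 3, year: 4})
//returns {month: "march", day: "7", to_day: "11", year: "1987"} - each named place picks out one capture group.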
var regexes = [{
reg: String(months) + " " + String(days) + "-" + String(days) + " " + String(years),
example: "March 7th-11th 1987",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 1,
day: 2,
to_day: 3,
year: 4
};
return to_obj(arr, places);
}
}, {
reg: String(days) + " of " + String(months) + " to " + String(days) + " of " + String(months) + " " + String(years),
example: "28th of September to 5th of October 2008",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
day: 1,
month: 2,
to_day: 3,
to_month: 4,
to_year: 5
};
return to_obj(arr, places);
}
}, {
reg: String(months) + " " + String(days) + " to " + String(months) + " " + String(days) + " " + String(years),
example: "March 7th to june 11th 1987",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 1,
day: 2,
to_month: 3,
to_day: 4,
year: 5,
to_year: 5
};
return to_obj(arr, places);
}
}, {
reg: "between " + String(days) + " " + String(months) + " and " + String(days) + " " + String(months) + " " + String(years),
example: "between 13 February and 15 February 1945",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
day: 1,
month: 2,
to_day: 3,
to_month: 4,
year: 5,
to_year: 5
};
return to_obj(arr, places);
}
}, {
reg: "between " + String(months) + " " + String(days) + " and " + String(months) + " " + String(days) + " " + String(years),
example: "between March 7th and june 11th 1987",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 1,
day: 2,
to_month: 3,
to_day: 4,
year: 5,
to_year: 5
};
return to_obj(arr, places);
}
}, {
reg: String(months) + " " + String(days) + " " + String(years),
example: "March 1st 1987",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 1,
day: 2,
year: 3
};
return to_obj(arr, places);
}
}, {
reg: String(days) + " - " + String(days) + " of " + String(months) + " " + String(years),
example: "3rd - 5th of March 1969",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
day: 1,
to_day: 2,
month: 3,
year: 4
};
return to_obj(arr, places);
}
}, {
reg: String(days) + " of " + String(months) + " " + String(years),
example: "3rd of March 1969",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
day: 1,
month: 2,
year: 3
};
return to_obj(arr, places);
}
}, {
reg: String(months) + " " + years + ",? to " + String(months) + " " + String(years),
example: "September 1939 to April 1945",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 1,
year: 2,
to_month: 3,
to_year: 4
};
return to_obj(arr, places);
}
}, {
reg: String(months) + " " + String(years),
example: "March 1969",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 1,
year: 2
};
return to_obj(arr, places);
}
}, {
reg: String(months) + " " + days,
example: "March 18th",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 1,
day: 2
};
return to_obj(arr, places);
}
}, {
reg: String(days) + " of " + months,
example: "18th of March",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
month: 2,
day: 1
};
return to_obj(arr, places);
}
}, {
reg: years + " ?- ?" + String(years),
example: "1997-1998",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
year: 1,
to_year: 2
};
return to_obj(arr, places);
}
}, {
reg: years,
example: "1998",
process: function (arr) {
if (!arr) {
arr = [];
}
var places = {
year: 1
};
return to_obj(arr, places);
}
}].map(function (o) {
o.reg = new RegExp(o.reg, "g");
return o;
});
//0 based months, 1 based days...
var months_obj = {
january: 0,
february: 1,
march: 2,
april: 3,
may: 4,
june: 5,
july: 6,
august: 7,
september: 8,
october: 9,
november: 10,
december: 11,
jan: 0,
feb: 1,
mar: 2,
apr: 3,
aug: 7,
sept: 8,
oct: 9,
nov: 10,
dec: 11
};
//thirty days hath september...
var last_dates = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
var preprocess = function (str) {
str = str.toLowerCase();
str = str.replace(/([0-9])(th|rd|st|nd)/g, '$1'); //strip ordinal suffixes ('7th', '3rd', '2nd', '1st')
return str;
};
var postprocess = function (obj, options) {
var d;
d = new Date();
options = options || {};
obj.year = parseInt(obj.year, 10) || undefined;
obj.day = parseInt(obj.day, 10) || undefined;
obj.to_day = parseInt(obj.to_day, 10) || undefined;
obj.to_year = parseInt(obj.to_year, 10) || undefined;
obj.month = months_obj[obj.month];
obj.to_month = months_obj[obj.to_month];
//fill in a missing month from to_month (and vice versa)
if (obj.to_month !== undefined && obj.month === undefined) {
obj.month = obj.to_month;
}
if (obj.to_month === undefined && obj.month !== undefined) {
obj.to_month = obj.month;
}
//fill in a missing year from to_year (and vice versa)
if (obj.to_year && !obj.year) {
obj.year = obj.to_year;
}
if (!obj.to_year && obj.year && obj.to_month !== undefined) {
obj.to_year = obj.year;
}
if (options.assume_year && !obj.year) {
obj.year = d.getFullYear();
}
//make sure date is in that month..
if (obj.day !== undefined && (obj.day > 31 || (obj.month !== undefined && obj.day > last_dates[obj.month]))) {
obj.day = undefined;
}
//make sure the 'to' date is after the 'from' date. fail everything if it isn't...
//todo: do this smarter
if (obj.to_month !== undefined && obj.to_month < obj.month) {
return {}
}
if (obj.to_year && obj.to_year < obj.year) {
obj.year = undefined;
obj.to_year = undefined;
}
//make sure date is in reasonable range (very opinionated)
if (obj.year > 2090 || obj.year < 1200) {
obj.year = undefined;
obj.to_year = undefined;
}
//format result better
obj = {
day: obj.day,
month: obj.month,
year: obj.year,
to: {
day: obj.to_day,
month: obj.to_month,
year: obj.to_year
}
};
//add javascript date objects, if you can
if (obj.year && obj.day && obj.month !== undefined) {
obj.date_object = new Date();
obj.date_object.setYear(obj.year);
obj.date_object.setMonth(obj.month);
obj.date_object.setDate(obj.day);
}
if (obj.to.year && obj.to.day && obj.to.month !== undefined) {
obj.to.date_object = new Date();
obj.to.date_object.setYear(obj.to.year);
obj.to.date_object.setMonth(obj.to.month);
obj.to.date_object.setDate(obj.to.day);
}
//if we have enough data to return a result..
if (obj.year || obj.month !== undefined) {
return obj;
}
return {};
};
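//for example, "3rd of March 1969" comes out as
//{day: 3, month: 2, year: 1969, to: {day: undefined, month: 2, year: 1969}, date_object: Date}
//(month is 2 because months are 0-based, and the 'to' side is back-filled from the 'from' side)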
//pass through the sequence of regexes until a template is matched..
module.exports = function (str, options) {
options = options || {};
str = preprocess(str)
var arr, good, clone_reg, obj;
var l = regexes.length;
for (var i = 0; i < l; i += 1) {
obj = regexes[i]
if (str.match(obj.reg)) {
clone_reg = new RegExp(obj.reg.source, "i"); //this avoids a memory-leak
arr = clone_reg.exec(str);
good = obj.process(arr);
return postprocess(good, options);
}
}
};
// console.log(date_extractor("1998"))
// console.log(date_extractor("1999"))
},{}],37:[function(require,module,exports){
//wrapper for value's methods
var Value = function (str, sentence, word_i) {
var the = this
the.word = str || '';
var to_number = require("./to_number")
var date_extractor = require("./date_extractor")
var parts_of_speech = require("../../data/parts_of_speech")
the.date = function (options) {
options = options || {}
return date_extractor(the.word, options)
}
the.is_date = function () {
var months = /(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|aug|sept|oct|nov|dec)/i
var times = /1?[0-9]:[0-9]{2}/
var days = /\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday|mon|tues|wed|thurs|fri|sat|sun)\b/i
if (the.word.match(months) || the.word.match(times) || the.word.match(days)) {
return true
}
return false
}
the.number = function () {
if (the.is_date()) {
return null
}
return to_number(the.word)
}
the.which = (function () {
if (the.date()) {
return parts_of_speech['DA']
}
if (the.number()) {
return parts_of_speech['NU']
}
return parts_of_speech['CD']
})()
return the;
};
module.exports = Value;
// console.log(new Value("fifty five").number())
// console.log(new Value("june 5th 1998").date())
},{"../../data/parts_of_speech":14,"./date_extractor":36,"./to_number":38}],38:[function(require,module,exports){
// converts spoken numbers into integers "fifty seven point eight" -> 57.8
//
// Spoken numbers take the following format
// [sixty five] (thousand) [sixty five] (hundred) [sixty five]
// aka: [one/teen/ten] (multiple) [one/teen/ten] (multiple) ...
// combine the [one/teen/ten]s as 'current_sum', then multiply it by its following multiple
// a given multiple must not repeat
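// e.g. "sixty five thousand and two": 'sixty'+'five' build current_sum=65, 'thousand' multiplies it
// into the total (65000), and the trailing 'two' is added at the end, giving 65002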
"use strict";
//these sets of numbers each have different rules
//[tenth, hundredth, thousandth..] are ambiguous because they could be ordinal like fifth, or decimal like one-one-hundredth, so they are ignored
var ones = {
'a': 1,
'zero': 0,
'one': 1,
'two': 2,
'three': 3,
'four': 4,
'five': 5,
'six': 6,
'seven': 7,
'eight': 8,
'nine': 9,
"first": 1,
"second": 2,
"third": 3,
"fourth": 4,
"fifth": 5,
"sixth": 6,
"seventh": 7,
"eighth": 8,
"ninth": 9
}
var teens = {
'ten': 10,
'eleven': 11,
'twelve': 12,
'thirteen': 13,
'fourteen': 14,
'fifteen': 15,
'sixteen': 16,
'seventeen': 17,
'eighteen': 18,
'nineteen': 19,
"eleventh": 11,
"twelfth": 12,
"thirteenth": 13,
"fourteenth": 14,
"fifteenth": 15,
"sixteenth": 16,
"seventeenth": 17,
"eighteenth": 18,
"nineteenth": 19
}
var tens = {
'twenty': 20,
'thirty': 30,
'forty': 40,
'fifty': 50,
'sixty': 60,
'seventy': 70,
'eighty': 80,
'ninety': 90,
"twentieth": 20,
"thirtieth": 30,
"fourtieth": 40,
"fiftieth": 50,
"sixtieth": 60,
"seventieth": 70,
"eightieth": 80,
"ninetieth": 90
}
var multiple = {
'hundred': 100,
'grand': 1000,
'thousand': 1000,
'million': 1000000,
'billion': 1000000000,
'trillion': 1000000000000,
'quadrillion': 1000000000000000,
'quintillion': 1000000000000000000,
'sextillion': 1000000000000000000000,
'septillion': 1000000000000000000000000,
'octillion': 1000000000000000000000000000,
'nonillion': 1000000000000000000000000000000,
'decillion': 1000000000000000000000000000000000
}
// var decimal_multiple={'tenth':0.1, 'hundredth':0.01, 'thousandth':0.001, 'millionth':0.000001,'billionth':0.000000001};
var main = function (s) {
//remember these concerns for possible errors
var ones_done = false
var teens_done = false
var tens_done = false
var multiple_done = {}
var total = 0
var global_multiplier = 1
//pretty-printed numbers
s = s.replace(/, ?/g, '')
//parse-out currency
s = s.replace(/[$£€]/, '')
//try to finish-fast
if (s.match(/[0-9]\.[0-9]/) && parseFloat(s) == s) {
return parseFloat(s)
}
if (parseInt(s, 10) == s) {
return parseInt(s, 10)
}
//try to die fast. (phone numbers or times)
if (s.match(/[0-9][\-:][0-9]/)) {
return null
}
//support global multipliers, like 'half-million' by doing 'million' then multiplying by 0.5
var mults = [{
reg: /^(minus|negative)[\s\-]/i,
mult: -1
}, {
reg: /^(a\s)?half[\s\-](of\s)?/i,
mult: 0.5
}, {
reg: /^(a\s)?quarter[\s\-]/i,
mult: 0.25
}]
for (var i = 0; i < mults.length; i++) { //declare i locally (required under "use strict")
if (s.match(mults[i].reg)) {
global_multiplier = mults[i].mult
s = s.replace(mults[i].reg, '')
break;
}
}
//do each word in turn..
var words = s.toString().split(/[\s\-]+/);
var w, x;
var current_sum = 0;
var local_multiplier = 1
var decimal_mode = false
for (var i = 0; i < words.length; i++) {
w = words[i]
//skip 'and' eg. five hundred and twelve
if (w == "and") {
continue;
}
//..we're doing decimals now
if (w == "point" || w == "decimal") {
if (decimal_mode) {
return null
} //two point one point six
decimal_mode = true
total += current_sum
current_sum = 0
ones_done = false
local_multiplier = 0.1
continue;
}
//handle special rules following a decimal
if (decimal_mode) {
x = null
//allow consecutive ones in decimals eg. 'two point zero five nine'
if (ones[w] !== undefined) {
x = ones[w]
}
if (teens[w] !== undefined) {
x = teens[w]
}
if (parseInt(w, 10) == w) {
x = parseInt(w, 10)
}
if (!x) {
return null
}
if (x < 10) {
total += x * local_multiplier
local_multiplier = local_multiplier * 0.1 // next number is next decimal place
current_sum = 0
continue;
}
//two-digit decimals eg. 'two point sixteen'
if (x < 100) {
total += x * (local_multiplier * 0.1)
local_multiplier = local_multiplier * 0.01 // next number is next decimal place
current_sum = 0
continue;
}
}
//if it's already an actual number
if (w.match(/^[0-9]\.[0-9]$/)) {
current_sum += parseFloat(w)
continue;
}
if (parseInt(w, 10) == w) {
current_sum += parseInt(w, 10)
continue;
}
//ones rules
if (ones[w] !== undefined) {
if (ones_done) {
return null
} // eg. five seven
if (teens_done) {
return null
} // eg. five seventeen
ones_done = true
current_sum += ones[w]
continue;
}
//teens rules
if (teens[w]) {
if (ones_done) {
return null
} // eg. five seventeen
if (teens_done) {
return null
} // eg. fifteen seventeen
if (tens_done) {
return null
} // eg. sixty fifteen
teens_done = true
current_sum += teens[w]
continue;
}
//tens rules
if (tens[w]) {
if (ones_done) {
return null
} // eg. five seventy
if (teens_done) {
return null
} // eg. fifteen seventy
if (tens_done) {
return null
} // eg. twenty seventy
tens_done = true
current_sum += tens[w]
continue;
}
//multiple rules
if (multiple[w]) {
if (multiple_done[w]) {
return null
} // eg. five hundred six hundred
multiple_done[w] = true
//reset our concerns. allow 'five hundred five'
ones_done = false
teens_done = false
tens_done = false
//case of 'hundred million', (2 consecutive multipliers)
if (current_sum === 0) {
total = total || 1 //dont ever multiply by 0
total *= multiple[w]
} else {
current_sum *= multiple[w]
total += current_sum
}
current_sum = 0
continue;
}
//if word is not a known thing now, die
return null
}
if (current_sum) {
total += (current_sum || 1) * local_multiplier
}
//combine with global multiplier, like 'minus' or 'half'
total = total * global_multiplier
return total
}
//kick it into module
module.exports = main;
// console.log(to_number("sixteen hundred"))
// console.log(to_number("a hundred"))
// console.log(to_number("four point seven seven"))
},{}],39:[function(require,module,exports){
//turn a verb into its other grammatical forms.
var verb_to_doer = require("./to_doer")
var verb_irregulars = require("./verb_irregulars")
var verb_rules = require("./verb_rules")
var suffix_rules = require("./suffix_rules")
//this method is the slowest in the whole library - TODO: speed it up
var predict = function (w) {
var endsWith = function (str, suffix) {
return str.indexOf(suffix, str.length - suffix.length) !== -1;
}
var arr = Object.keys(suffix_rules);
for (var i = 0; i < arr.length; i++) { //declare i locally to avoid leaking a global
if (endsWith(w, arr[i])) {
return suffix_rules[arr[i]]
}
}
return "infinitive"
}
//fallback to this transformation if it has an unknown prefix
var fallback = function (w) {
var infinitive;
if (w.length > 4) {
infinitive = w.replace(/ed$/, '');
} else {
infinitive = w.replace(/d$/, '');
}
var present, past, gerund, doer;
if (w.match(/[^aeiou]$/)) {
gerund = w + "ing"
past = w + "ed"
if (w.match(/ss$/)) {
present = w + "es" //'passes'
} else {
present = w + "s"
}
doer = verb_to_doer(infinitive)
} else {
gerund = w.replace(/[aeiou]$/, 'ing')
past = w.replace(/[aeiou]$/, 'ed')
present = w.replace(/[aeiou]$/, 'es')
doer = verb_to_doer(infinitive)
}
return {
infinitive: infinitive,
present: present,
past: past,
gerund: gerund,
doer: doer,
future: "will " + infinitive
}
}
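//for example, fallback("talk") yields {infinitive: "talk", present: "talks", past: "talked",
//gerund: "talking", doer: "talker", future: "will talk"} - a purely regular guess at the conjugation.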
//make sure object has all forms
var fufill = function (obj, prefix) {
if (!obj.infinitive) {
return obj
}
if (!obj.gerund) {
obj.gerund = obj.infinitive + 'ing'
}
if (!obj.doer) {
obj.doer = verb_to_doer(obj.infinitive)
}
if (!obj.present) {
obj.present = obj.infinitive + 's'
}
if (!obj.past) {
obj.past = obj.infinitive + 'ed'
}
//add the prefix to all forms, if it exists
if (prefix) {
Object.keys(obj).forEach(function (k) {
obj[k] = prefix + obj[k]
})
}
//future is 'will'+infinitive
if (!obj.future) {
obj.future = "will " + obj.infinitive
}
//perfect is 'have'+past-tense
if (!obj.perfect) {
obj.perfect = "have " + obj.past
}
//pluperfect is 'had'+past-tense
if (!obj.pluperfect) {
obj.pluperfect = "had " + obj.past
}
//future perfect is 'will have'+past-tense
if (!obj.future_perfect) {
obj.future_perfect = "will have " + obj.past
}
return obj
}
var main = function (w) {
if (w === undefined) {
return {}
}
//for phrasal verbs ('look out'), conjugate look, then append 'out'
var phrasal_reg = new RegExp("^(.*?) (in|out|on|off|behind|way|with|of|do|away|across|ahead|back|over|under|together|apart|up|upon|aback|down|about|before|after|around|to|forth|round|through|along|onto)$", 'i')
if (w.match(' ') && w.match(phrasal_reg)) {
var split = w.match(phrasal_reg, '')
var phrasal_verb = split[1]
var particle = split[2]
var result = main(phrasal_verb) //recursive
delete result["doer"]
Object.keys(result).forEach(function (k) {
if (result[k]) {
result[k] += " " + particle
}
})
return result
}
//for pluperfect ('had tried') remove 'had' and call it past-tense
if (w.match(/^had [a-z]/i)) {
w = w.replace(/^had /i, '')
}
//for perfect ('have tried') remove 'have' and call it past-tense
if (w.match(/^have [a-z]/i)) {
w = w.replace(/^have /i, '')
}
//for future perfect ('will have tried') remove 'will have' and call it past-tense
if (w.match(/^will have [a-z]/i)) {
w = w.replace(/^will have /i, '')
}
//chop it if it's future-tense
w = w.replace(/^will /i, '')
//un-prefix the verb, and add it in later
var prefix = (w.match(/^(over|under|re|anti|full)\-?/i) || [])[0]
var verb = w.replace(/^(over|under|re|anti|full)\-?/i, '')
//check irregulars
var obj = {};
var l = verb_irregulars.length
var x, i;
for (i = 0; i < l; i++) {
x = verb_irregulars[i]
if (verb === x.present || verb === x.gerund || verb === x.past || verb === x.infinitive) {
obj = JSON.parse(JSON.stringify(verb_irregulars[i])); // object 'clone' hack, to avoid mem leak
return fufill(obj, prefix)
}
}
//guess the tense, so we know which transformation to make
var predicted = predict(w) || 'infinitive'
//check against suffix rules
l = verb_rules[predicted].length
var r, keys;
for (i = 0; i < l; i++) {
r = verb_rules[predicted][i];
if (w.match(r.reg)) {
obj[predicted] = w;
keys= Object.keys(r.repl)
for(var o=0; o<keys.length; o++){
if (keys[o] === predicted) {
obj[keys[o]] = w
} else {
obj[keys[o]] = w.replace(r.reg, r.repl[keys[o]])
}
}
return fufill(obj);
}
}
//produce a generic transformation
return fallback(w)
};
module.exports = main;
// console.log(module.exports("walking"))
// console.log(module.exports("overtook"))
// console.log(module.exports("watch out"))
// console.log(module.exports("watch"))
// console.log(module.exports("smash"))
// console.log(module.exports("word"))
// // broken
// console.log(module.exports("read"))
// console.log(module.exports("free"))
// console.log(module.exports("flesh"))
// console.log(module.exports("branch"))
// console.log(module.exports("spred"))
// console.log(module.exports("bog"))
// console.log(module.exports("nod"))
// console.log(module.exports("had tried"))
// console.log(module.exports("have tried"))
},{"./suffix_rules":40,"./to_doer":41,"./verb_irregulars":42,"./verb_rules":43}],40:[function(require,module,exports){
//generated from test data
var compact = {
"gerund": [
"ing"
],
"infinitive": [
"ate",
"ize",
"tion",
"rify",
"ress",
"ify",
"age",
"nce",
"ect",
"ise",
"ine",
"ish",
"ace",
"ash",
"ure",
"tch",
"end",
"ack",
"and",
"ute",
"ade",
"ock",
"ite",
"ase",
"ose",
"use",
"ive",
"int",
"nge",
"lay",
"est",
"ain",
"ant",
"eed",
"er",
"le"
],
"past": [
"ed",
"lt",
"nt",
"pt",
"ew",
"ld"
],
"present": [
"rks",
"cks",
"nks",
"ngs",
"mps",
"tes",
"zes",
"ers",
"les",
"acks",
"ends",
"ands",
"ocks",
"lays",
"eads",
"lls",
"els",
"ils",
"ows",
"nds",
"ays",
"ams",
"ars",
"ops",
"ffs",
"als",
"urs",
"lds",
"ews",
"ips",
"es",
"ts",
"ns",
"s"
]
}
var suffix_rules = {}
var keys = Object.keys(compact)
var l = keys.length;
var l2, i;
for (i = 0; i < l; i++) {
l2 = compact[keys[i]].length
for (var o = 0; o < l2; o++) {
suffix_rules[compact[keys[i]][o]] = keys[i]
}
}
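//the inverted table maps each suffix to the form it signals, e.g.
//suffix_rules["ing"] === "gerund", suffix_rules["ed"] === "past", suffix_rules["ize"] === "infinitive"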
module.exports = suffix_rules;
},{}],41:[function(require,module,exports){
//someone who does this present-tense verb
//turn 'walk' into 'walker'
module.exports = function (str) {
str = str || ''
var irregulars = {
"tie": "tier",
"dream": "dreamer",
"sail": "sailer",
"run": "runner",
"rub": "rubber",
"begin": "beginner",
"win": "winner",
"claim": "claimant",
"deal": "dealer",
"spin": "spinner"
}
var dont = {
"aid": 1,
"fail": 1,
"appear": 1,
"happen": 1,
"seem": 1,
"try": 1,
"say": 1,
"marry": 1,
"be": 1,
"forbid": 1,
"understand": 1,
"bet": 1
}
var transforms = [{
"reg": /e$/i,
"repl": 'er'
}, {
"reg": /([aeiou])([mlgp])$/i,
"repl": '$1$2$2er'
}, {
"reg": /([rlf])y$/i,
"repl": '$1ier'
}, {
"reg": /^(.?.[aeiou])t$/i,
"repl": '$1tter'
}]
if (dont.hasOwnProperty(str)) {
return null
}
if (irregulars.hasOwnProperty(str)) {
return irregulars[str]
}
for (var i = 0; i < transforms.length; i++) {
if (str.match(transforms[i].reg)) {
return str.replace(transforms[i].reg, transforms[i].repl)
}
}
return str + "er"
}
// console.log(verb_to_doer('set'))
// console.log(verb_to_doer('sweep'))
// console.log(verb_to_doer('watch'))
},{}],42:[function(require,module,exports){
var types = [
'infinitive',
'gerund',
'past',
'present',
'doer',
'future'
]
//list of irregular verb forms, compacted to save space. ('_' -> infinitive)
var compact = [
[
"arise",
"arising",
"arose",
"_s",
"_r"
],
[
"babysit",
"_ting",
"babysat",
"_s",
"_ter"
],
[
"be",
"_ing",
"was",
"is",
""
],
[
"beat",
"_ing",
"_",
"_s",
"_er"
],
[
"become",
"becoming",
"became",
"_s",
"_r"
],
[
"bend",
"_ing",
"bent",
"_s",
"_er"
],
[
"begin",
"_ning",
"began",
"_s",
"_ner"
],
[
"bet",
"_ting",
"_",
"_s",
"_ter"
],
[
"bind",
"_ing",
"bound",
"_s",
"_er"
],
[
"bite",
"biting",
"bit",
"_s",
"_r"
],
[
"bleed",
"_ing",
"bled",
"_s",
"_er"
],
[
"blow",
"_ing",
"blew",
"_s",
"_er"
],
[
"break",
"_ing",
"broke",
"_s",
"_er"
],
[
"breed",
"_ing",
"bred",
"_s",
"_er"
],
[
"bring",
"_ing",
"brought",
"_s",
"_er"
],
[
"broadcast",
"_ing",
"_",
"_s",
"_er"
],
[
"build",
"_ing",
"built",
"_s",
"_er"
],
[
"buy",
"_ing",
"bought",
"_s",
"_er"
],
[
"catch",
"_ing",
"caught",
"_es",
"_er"
],
[
"choose",
"choosing",
"chose",
"_s",
"_r"
],
[
"come",
"coming",
"came",
"_s",
"_r"
],
[
"cost",
"_ing",
"_",
"_s",
"_er"
],
[
"cut",
"_ting",
"_",
"_s",
"_ter"
],
[
"deal",
"_ing",
"_t",
"_s",
"_er"
],
[
"dig",
"_ging",
"dug",
"_s",
"_ger"
],
[
"do",
"_ing",
"did",
"_es",
"_er"
],
[
"draw",
"_ing",
"drew",
"_s",
"_er"
],
[
"drink",
"_ing",
"drank",
"_s",
"_er"
],
[
"drive",
"driving",
"drove",
"_s",
"_r"
],
[
"eat",
"_ing",
"ate",
"_s",
"_er"
],
[
"fall",
"_ing",
"fell",
"_s",
"_er"
],
[
"feed",
"_ing",
"fed",
"_s",
"_er"
],
[
"feel",
"_ing",
"felt",
"_s",
"_er"
],
[
"fight",
"_ing",
"fought",
"_s",
"_er"
],
[
"find",
"_ing",
"found",
"_s",
"_er"
],
[
"fly",
"_ing",
"flew",
"_s",
"flier"
],
[
"forbid",
"_ing",
"forbade",
"_s",
],
[
"forget",
"_ing",
"forgot",
"_s",
"_er"
],
[
"forgive",
"forgiving",
"forgave",
"_s",
"_r"
],
[
"freeze",
"freezing",
"froze",
"_s",
"_r"
],
[
"get",
"_ting",
"got",
"_s",
"_ter"
],
[
"give",
"giving",
"gave",
"_s",
"_r"
],
[
"go",
"_ing",
"went",
"_es",
"_er"
],
[
"grow",
"_ing",
"grew",
"_s",
"_er"
],
[
"hang",
"_ing",
"hung",
"_s",
"_er"
],
[
"have",
"having",
"had",
"has",
],
[
"hear",
"_ing",
"_d",
"_s",
"_er"
],
[
"hide",
"hiding",
"hid",
"_s",
"_r"
],
[
"hit",
"_ting",
"_",
"_s",
"_ter"
],
[
"hold",
"_ing",
"held",
"_s",
"_er"
],
[
"hurt",
"_ing",
"_",
"_s",
"_er"
],
[
"know",
"_ing",
"knew",
"_s",
"_er"
],
[
"relay",
"_ing",
"_ed",
"_s",
"_er"
],
[
"lay",
"_ing",
"laid",
"_s",
"_er"
],
[
"lead",
"_ing",
"led",
"_s",
"_er"
],
[
"leave",
"leaving",
"left",
"_s",
"_r"
],
[
"lend",
"_ing",
"lent",
"_s",
"_er"
],
[
"let",
"_ting",
"_",
"_s",
"_ter"
],
[
"lie",
"lying",
"lay",
"_s",
"_r"
],
[
"light",
"_ing",
"lit",
"_s",
"_er"
],
[
"lose",
"losing",
"lost",
"_s",
"_r"
],
[
"make",
"making",
"made",
"_s",
"_r"
],
[
"mean",
"_ing",
"_t",
"_s",
"_er"
],
[
"meet",
"_ing",
"met",
"_s",
"_er"
],
[
"pay",
"_ing",
"paid",
"_s",
"_er"
],
[
"put",
"_ting",
"_",
"_s",
"_ter"
],
[
"quit",
"_ting",
"_",
"_s",
"_ter"
],
[
"read",
"_ing",
"_",
"_s",
"_er"
],
[
"ride",
"riding",
"rode",
"_s",
"_r"
],
[
"ring",
"_ing",
"rang",
"_s",
"_er"
],
[
"rise",
"rising",
"rose",
"_s",
"_r"
],
[
"run",
"_ning",
"ran",
"_s",
"_ner"
],
[
"say",
"_ing",
"said",
"_s",
],
[
"see",
"_ing",
"saw",
"_s",
"_r"
],
[
"sell",
"_ing",
"sold",
"_s",
"_er"
],
[
"send",
"_ing",
"sent",
"_s",
"_er"
],
[
"set",
"_ting",
"_",
"_s",
"_ter"
],
[
"shake",
"shaking",
"shook",
"_s",
"_r"
],
[
"shine",
"shining",
"shone",
"_s",
"_r"
],
[
"shoot",
"_ing",
"shot",
"_s",
"_er"
],
[
"show",
"_ing",
"_ed",
"_s",
"_er"
],
[
"shut",
"_ting",
"_",
"_s",
"_ter"
],
[
"sing",
"_ing",
"sang",
"_s",
"_er"
],
[
"sink",
"_ing",
"sank",
"_s",
"_er"
],
[
"sit",
"_ting",
"sat",
"_s",
"_ter"
],
[
"slide",
"sliding",
"slid",
"_s",
"_r"
],
[
"speak",
"_ing",
"spoke",
"_s",
"_er"
],
[
"spend",
"_ing",
"spent",
"_s",
"_er"
],
[
"spin",
"_ning",
"spun",
"_s",
"_ner"
],
[
"spread",
"_ing",
"_",
"_s",
"_er"
],
[
"stand",
"_ing",
"stood",
"_s",
"_er"
],
[
"steal",
"_ing",
"stole",
"_s",
"_er"
],
[
"stick",
"_ing",
"stuck",
"_s",
"_er"
],
[
"sting",
"_ing",
"stung",
"_s",
"_er"
],
[
"strike",
"striking",
"struck",
"_s",
"_r"
],
[
"swear",
"_ing",
"swore",
"_s",
"_er"
],
[
"swim",
"_ing",
"swam",
"_s",
"_mer"
],
[
"swing",
"_ing",
"swung",
"_s",
"_er"
],
[
"take",
"taking",
"took",
"_s",
"_r"
],
[
"teach",
"_ing",
"taught",
"_s",
"_er"
],
[
"tear",
"_ing",
"tore",
"_s",
"_er"
],
[
"tell",
"_ing",
"told",
"_s",
"_er"
],
[
"think",
"_ing",
"thought",
"_s",
"_er"
],
[
"throw",
"_ing",
"threw",
"_s",
"_er"
],
[
"understand",
"_ing",
"understood",
"_s",
],
[
"wake",
"waking",
"woke",
"_s",
"_r"
],
[
"wear",
"_ing",
"wore",
"_s",
"_er"
],
[
"win",
"_ning",
"won",
"_s",
"_ner"
],
[
"withdraw",
"_ing",
"withdrew",
"_s",
"_er"
],
[
"write",
"writing",
"wrote",
"_s",
"_r"
],
[
"tie",
"tying",
"_d",
"_s",
"_r"
],
[
"obey",
"_ing",
"_ed",
"_s",
"_er"
],
[
"ski",
"_ing",
"_ied",
"_s",
"_er"
],
[
"boil",
"_ing",
"_ed",
"_s",
"_er"
],
[
"miss",
"_ing",
"_ed",
"_",
"_er"
],
[
"act",
"_ing",
"_ed",
"_s",
"_or"
],
[
"compete",
"competing",
"_d",
"_s",
"competitor"
],
[
"being",
"are",
"were",
"are",
],
[
"imply",
"_ing",
"implied",
"implies",
"implier"
],
[
"ice",
"icing",
"_d",
"_s",
"_r"
],
[
"develop",
"_ing",
"_",
"_s",
"_er"
],
[
"wait",
"_ing",
"_ed",
"_s",
"_er"
],
[
"aim",
"_ing",
"_ed",
"_s",
"_er"
],
[
"spill",
"_ing",
"spilt",
"_s",
"_er"
],
[
"drop",
"_ping",
"_ped",
"_s",
"_per"
],
[
"head",
"_ing",
"_ed",
"_s",
"_er"
],
[
"log",
"_ging",
"_ged",
"_s",
"_ger"
],
[
"rub",
"_bing",
"_bed",
"_s",
"_ber"
],
[
"smash",
"_ing",
"_ed",
"_es",
"_er"
],
[
"add",
"_ing",
"_ed",
"_s",
"_er"
],
[
"word",
"_ing",
"_ed",
"_s",
"_er"
],
[
"suit",
"_ing",
"_ed",
"_s",
"_er"
],
[
"be",
"am",
"was",
"am",
""
]
]
//expand compact version out
module.exports = compact.map(function (arr) {
var obj = {}
for (var i = 0; i < arr.length; i++) {
obj[types[i]] = arr[i].replace(/_/, arr[0])
}
return obj
})
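//for example, ["begin", "_ning", "began", "_s", "_ner"] expands to
//{infinitive: "begin", gerund: "beginning", past: "began", present: "begins", doer: "beginner"}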
// console.log(JSON.stringify(verb_irregulars, null, 2));
},{}],43:[function(require,module,exports){
// regex rules for each part of speech that convert it to all other parts of speech.
// used in combination with the generic 'fallback' method
var verb_rules = {
"infinitive": [
[
"(eed)$",
{
"pr": "$1s",
"g": "$1ing",
"pa": "$1ed",
"do": "$1er"
}
],
[
"(e)(ep)$",
{
"pr": "$1$2s",
"g": "$1$2ing",
"pa": "$1pt",
"do": "$1$2er"
}
],
[
"(a[tg]|i[zn]|ur|nc|gl|is)e$",
{
"pr": "$1es",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"([i|f|rr])y$",
{
"pr": "$1ies",
"g": "$1ying",
"pa": "$1ied"
}
],
[
"([td]er)$",
{
"pr": "$1s",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"([bd]l)e$",
{
"pr": "$1es",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(ish|tch|ess)$",
{
"pr": "$1es",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(ion|end|e[nc]t)$",
{
"pr": "$1s",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(om)e$",
{
"pr": "$1es",
"g": "$1ing",
"pa": "ame"
}
],
[
"([aeiu])([pt])$",
{
"pr": "$1$2s",
"g": "$1$2$2ing",
"pa": "$1$2"
}
],
[
"(er)$",
{
"pr": "$1s",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(en)$",
{
"pr": "$1s",
"g": "$1ing",
"pa": "$1ed"
}
]
],
"present": [
[
"(ies)$",
{
"in": "y",
"g": "ying",
"pa": "ied"
}
],
[
"(tch|sh)es$",
{
"in": "$1",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(ss)es$",
{
"in": "$1",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"([tzlshicgrvdnkmu])es$",
{
"in": "$1e",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(n[dtk]|c[kt]|[eo]n|i[nl]|er|a[ytrl])s$",
{
"in": "$1",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(ow)s$",
{
"in": "$1",
"g": "$1ing",
"pa": "ew"
}
],
[
"(op)s$",
{
"in": "$1",
"g": "$1ping",
"pa": "$1ped"
}
],
[
"([eirs])ts$",
{
"in": "$1t",
"g": "$1tting",
"pa": "$1tted"
}
],
[
"(ll)s$",
{
"in": "$1",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"(el)s$",
{
"in": "$1",
"g": "$1ling",
"pa": "$1led"
}
],
[
"(ip)es$",
{
"in": "$1e",
"g": "$1ing",
"pa": "$1ed"
}
],
[
"ss$",
{
"in": "ss",
"g": "ssing",
"pa": "ssed"
}
],
[
"s$",
{
"in": "",
"g": "ing",
"pa": "ed"
}
]
],
"gerund": [
[
"pping$",
{
"in": "p",
"pr": "ps",
"pa": "pped"
}
],
[
"lling$",
{
"in": "ll",
"pr": "lls",
"pa": "lled"
}
],
[
"tting$",
{
"in": "t",
"pr": "ts",
"pa": "t"
}
],
[
"ssing$",
{
"in": "ss",
"pr": "sses",
"pa": "ssed"
}
],
[
"gging$",
{
"in": "g",
"pr": "gs",
"pa": "gged"
}
],
[
"([^aeiou])ying$",
{
"in": "$1y",
"pr": "$1ies",
"pa": "$1ied",
"do": "$1ier"
}
],
[
"(i.)ing$",
{
"in": "$1e",
"pr": "$1es",
"pa": "$1ed"
}
],
[
"(u[rtcb]|[bdtpkg]l|n[cg]|a[gdkvtc]|[ua]s|[dr]g|yz|o[rlsp]|cre)ing$",
{
"in": "$1e",
"pr": "$1es",
"pa": "$1ed"
}
],
[
"(ch|sh)ing$",
{
"in": "$1",
"pr": "$1es",
"pa": "$1ed"
}
],
[
"(..)ing$",
{
"in": "$1",
"pr": "$1s",
"pa": "$1ed"
}
]
],
"past": [
[
"(ued)$",
{
"pr": "ues",
"g": "uing",
"pa": "ued",
"do": "uer"
}
],
[
"(e|i)lled$",
{
"pr": "$1lls",
"g": "$1lling",
"pa": "$1lled",
"do": "$1ller"
}
],
[
"(sh|ch)ed$",
{
"in": "$1",
"pr": "$1es",
"g": "$1ing",
"do": "$1er"
}
],
[
"(tl|gl)ed$",
{
"in": "$1e",
"pr": "$1es",
"g": "$1ing",
"do": "$1er"
}
],
[
"(ss)ed$",
{
"in": "$1",
"pr": "$1es",
"g": "$1ing",
"do": "$1er"
}
],
[
"pped$",
{
"in": "p",
"pr": "ps",
"g": "pping",
"do": "pper"
}
],
[
"tted$",
{
"in": "t",
"pr": "ts",
"g": "tting",
"do": "tter"
}
],
[
"gged$",
{
"in": "g",
"pr": "gs",
"g": "gging",
"do": "gger"
}
],
[
"(h|ion|n[dt]|ai.|[cs]t|pp|all|ss|tt|int|ail|ld|en|oo.|er|k|pp|w|ou.|rt|ght|rm)ed$",
{
"in": "$1",
"pr": "$1s",
"g": "$1ing",
"do": "$1er"
}
],
[
"(..[^aeiou])ed$",
{
"in": "$1e",
"pr": "$1es",
"g": "$1ing",
"do": "$1er"
}
],
[
"ied$",
{
"in": "y",
"pr": "ies",
"g": "ying",
"do": "ier"
}
],
[
"(.o)ed$",
{
"in": "$1o",
"pr": "$1os",
"g": "$1oing",
"do": "$1oer"
}
],
[
"(.i)ed$",
{
"in": "$1",
"pr": "$1s",
"g": "$1ing",
"do": "$1er"
}
],
[
"([rl])ew$",
{
"in": "$1ow",
"pr": "$1ows",
"g": "$1owing"
}
],
[
"([pl])t$",
{
"in": "$1t",
"pr": "$1ts",
"g": "$1ting"
}
]
]
}
//unpack compressed form
verb_rules = Object.keys(verb_rules).reduce(function (h, k) {
h[k] = verb_rules[k].map(function (a) {
var obj = {
reg: new RegExp(a[0], "i"),
repl: {
infinitive: a[1]["in"],
present: a[1]["pr"],
past: a[1]["pa"],
gerund: a[1]["g"]
}
}
if (a[1]["do"]) {
obj.repl.doer = a[1]["do"]
}
return obj
})
return h
}, {})
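//for example, the compressed infinitive rule ["(eed)$", {"pr": "$1s", "g": "$1ing", "pa": "$1ed", "do": "$1er"}]
//unpacks to {reg: /(eed)$/i, repl: {present: "$1s", past: "$1ed", gerund: "$1ing", doer: "$1er"}}
//(forms a rule doesn't define, like the infinitive here, are simply left undefined)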
module.exports = verb_rules;
// console.log(JSON.stringify(verb_rules, null, 2));
},{}],44:[function(require,module,exports){
//wrapper for verb's methods
var Verb = function (str, sentence, word_i) {
var the = this
var token, next;
if (sentence !== undefined && word_i !== undefined) {
token = sentence.tokens[word_i]
next = sentence.tokens[word_i + 1]
}
the.word = str || '';
var verb_conjugate = require("./conjugate/conjugate")
var parts_of_speech = require("../../data/parts_of_speech")
var copulas = {
"is": "CP",
"will be": "CP",
"will": "CP",
"are": "CP",
"was": "CP",
"were": "CP"
}
var modals = {
"can": "MD",
"may": "MD",
"could": "MD",
"might": "MD",
"will": "MD",
"ought to": "MD",
"would": "MD",
"must": "MD",
"shall": "MD",
"should": "MD"
}
var tenses = {
past: "VBD",
participle: "VBN",
infinitive: "VBP",
present: "VBZ",
gerund: "VBG"
}
the.conjugate = function () {
return verb_conjugate(the.word)
}
the.to_past = function () {
if (the.form === "gerund") {
return the.word
}
return verb_conjugate(the.word).past
}
the.to_present = function () {
return verb_conjugate(the.word).present
}
the.to_future = function () {
return "will " + verb_conjugate(the.word).infinitive
}
//which conjugation
the.form = (function () {
//don't choose infinitive if infinitive==present
var order = [
"past",
"present",
"gerund",
"infinitive"
]
var forms = verb_conjugate(the.word)
for (var i = 0; i < order.length; i++) {
if (forms[order[i]] === the.word) {
return order[i]
}
}
})()
//past/present/future tense
the.tense = (function () {
if (the.word.match(/\bwill\b/)) {
return "future"
}
if (the.form === "present") {
return "present"
}
if (the.form === "past") {
return "past"
}
return "present"
})()
//the most accurate part_of_speech
the.which = (function () {
if (copulas[the.word]) {
return parts_of_speech['CP']
}
if (the.word.match(/([aeiou][^aeiouwyrlm])ing$/)) {
return parts_of_speech['VBG']
}
var form = the.form
return parts_of_speech[tenses[form]]
})()
//is this verb negative already?
the.negative = function () {
if (the.word.match(/n't$/)) {
return true
}
if ((modals[the.word] || copulas[the.word]) && next && next.normalised === "not") {
return true
}
return false
}
return the;
}
module.exports = Verb;
// console.log(new Verb("will"))
// console.log(new Verb("stalking").tense)
},{"../../data/parts_of_speech":14,"./conjugate/conjugate":39}],45:[function(require,module,exports){
var lexicon = require("./data/lexicon")
var values = require("./data/lexicon/values")
var tokenize = require("./methods/tokenization/tokenize");
var parts_of_speech = require("./data/parts_of_speech")
var word_rules = require("./data/word_rules")
var wordnet_suffixes = require("./data/unambiguous_suffixes")
var Sentence = require("./sentence")
var Section = require("./section")
var parents = require("./parents/parents")
//possible 2nd part in a phrasal verb
var particles = ["in", "out", "on", "off", "behind", "way", "with", "of", "do", "away", "across", "ahead", "back", "over", "under", "together", "apart", "up", "upon", "aback", "down", "about", "before", "after", "around", "to", "forth", "round", "through", "along", "onto"]
particles = particles.reduce(function (h, s) {
h[s] = true
return h
}, {})
var merge_tokens = function (a, b) {
a.text += " " + b.text
a.normalised += " " + b.normalised
a.pos_reason += "|" + b.pos_reason
a.start = a.start || b.start
a.noun_capital = (a.noun_capital && b.noun_capital)
a.punctuated = a.punctuated || b.punctuated
a.end = a.end || b.end
return a
}
//combine adjacent neighbours, and special cases
var combine_tags = function (sentence) {
var arr = sentence.tokens || []
for (var i = 0; i <= arr.length; i++) {
var next = arr[i + 1]
if (arr[i] && next) {
var tag = arr[i].pos.tag
//'joe smith' are both NN, for example
if (tag === next.pos.tag && arr[i].punctuated !== true && arr[i].noun_capital == next.noun_capital) {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
}
//merge NNP and NN, like firstname, lastname
else if ((tag === "NNP" && next.pos.tag === "NN") || (tag === "NN" && next.pos.tag === "NNP")) {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
arr[i + 1].pos = parts_of_speech['NNP']
}
//merge dates manually, which often have punctuation
else if (tag === "CD" && next.pos.tag === "CD") {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
}
//merge abbreviations with nouns manually, eg. "Joe jr."
else if ((tag === "NNAB" && next.pos.parent === "noun") || (arr[i].pos.parent === "noun" && next.pos.tag === "NNAB")) {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
}
//'will walk' -> future-tense verb
else if (arr[i].normalised === "will" && next.pos.parent === "verb") {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
}
//'hundred and fifty', 'march the 5th'
else if (tag === "CD" && (next.normalised === "and" || next.normalised === "the") && arr[i + 2] && arr[i + 2].pos.tag === "CD") {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
}
//capitals surrounding a preposition 'United States of America'
else if (tag == "NN" && arr[i].noun_capital && (next.normalised == "of" || next.normalised == "and") && arr[i + 2] && arr[i + 2].noun_capital) {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
arr[i + 2] = merge_tokens(arr[i + 1], arr[i + 2])
arr[i + 1] = null
}
//capitals surrounding two prepositions 'Phantom of the Opera'
else if (arr[i].noun_capital && next.normalised == "of" && arr[i + 2] && arr[i + 2].pos.tag == "DT" && arr[i + 3] && arr[i + 3].noun_capital) {
arr[i + 1] = merge_tokens(arr[i], arr[i + 1])
arr[i] = null
arr[i + 2] = merge_tokens(arr[i + 1], arr[i + 2])
arr[i + 1] = null
arr[i + 3] = merge_tokens(arr[i + 2], arr[i + 3])
arr[i + 2] = null
}
}
}
sentence.tokens = arr.filter(function (r) {
return r
})
return sentence
}
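//for example, adjacent NN tokens like "Joe" + "Smith" become one "Joe Smith" token,
//and consecutive CD tokens in a date like "June 5th 1998" are merged as well.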
//some prepositions are clumped onto the back of a verb "looked for", "looks at"
//they should be combined with the verb, sometimes.
//does not handle separated phrasal verbs ('take the coat off' -> 'take off')
var combine_phrasal_verbs = function (sentence) {
var arr = sentence.tokens || []
for (var i = 1; i < arr.length; i++) {
if (particles[arr[i].normalised]) {
//it matches a known phrasal-verb
if (lexicon[arr[i - 1].normalised + " " + arr[i].normalised]) {
// console.log(arr[i-1].normalised + " " + arr[i].normalised)
arr[i] = merge_tokens(arr[i - 1], arr[i])
arr[i - 1] = null
}
}
}
sentence.tokens = arr.filter(function (r) {
return r
})
return sentence
}
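//for example, in "look after a kid", the tokens "look" and "after" are merged into a single
//verb token, provided "look after" exists as a phrasal verb in the lexicon.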
var lexicon_pass = function (w) {
if (lexicon.hasOwnProperty(w)) {
return parts_of_speech[lexicon[w]]
}
//try to match it without a prefix - eg. outworked -> worked
if (w.match(/^(over|under|out|-|un|re|en).{4}/)) {
var attempt = w.replace(/^(over|under|out|.*?-|un|re|en)/, '')
return parts_of_speech[lexicon[attempt]]
}
}
var rules_pass = function (w) {
for (var i = 0; i < word_rules.length; i++) {
if (w.length > 4 && w.match(word_rules[i].reg)) {
return parts_of_speech[word_rules[i].pos]
}
}
}
var fourth_pass = function (token, i, sentence) {
var last = sentence.tokens[i - 1]
var next = sentence.tokens[i + 1]
var strong_determiners = {
"the": 1,
"a": 1,
"an": 1
}
//resolve ambiguous 'march','april','may' with dates
if ((token.normalised == "march" || token.normalised == "april" || token.normalised == "may") && ((next && next.pos.tag == "CD") || (last && last.pos.tag == "CD"))) {
token.pos = parts_of_speech['CD']
token.pos_reason = "may_is_date"
}
//if it's before a modal verb, it's a noun -> lkjsdf would
if (next && token.pos.parent !== "noun" && token.pos.parent !== "glue" && next.pos.tag === "MD") {
token.pos = parts_of_speech['NN']
token.pos_reason = "before_modal"
}
//if it's after the word 'will' its probably a verb/adverb
if (last && last.normalised == "will" && !last.punctuated && token.pos.parent == "noun" && token.pos.tag !== "PRP" && token.pos.tag !== "PP") {
token.pos = parts_of_speech['VB']
token.pos_reason = "after_will"
}
//if it's after the word 'i' its probably a verb/adverb
if (last && last.normalised == "i" && !last.punctuated && token.pos.parent == "noun") {
token.pos = parts_of_speech['VB']
token.pos_reason = "after_i"
}
//if it's after an adverb, it's not a noun -> quickly acked
//support form 'atleast he is..'
if (last && token.pos.parent === "noun" && token.pos.tag !== "PRP" && token.pos.tag !== "PP" && last.pos.tag === "RB" && !last.start) {
token.pos = parts_of_speech['VB']
token.pos_reason = "after_adverb"
}
//no consecutive, unpunctuated adjectives -> real good
if (next && token.pos.parent === "adjective" && next.pos.parent === "adjective" && !token.punctuated) {
token.pos = parts_of_speech['RB']
token.pos_reason = "consecutive_adjectives"
}
//if it's after a determiner, it's not a verb -> the walk
if (last && token.pos.parent === "verb" && strong_determiners[last.pos.normalised] && token.pos.tag != "CP") {
token.pos = parts_of_speech['NN']
token.pos_reason = "determiner-verb"
}
//copulas are followed by a determiner ("are a .."), or an adjective ("are good")
if (last && last.pos.tag === "CP" && token.pos.tag !== "DT" && token.pos.tag !== "RB" && token.pos.tag !== "PRP" && token.pos.parent !== "adjective" && token.pos.parent !== "value") {
token.pos = parts_of_speech['JJ']
token.pos_reason = "copula-adjective"
}
//copula, adverb, verb -> copula adverb adjective -> is very lkjsdf
if (last && next && last.pos.tag === "CP" && token.pos.tag === "RB" && next.pos.parent === "verb") {
sentence.tokens[i + 1].pos = parts_of_speech['JJ']
sentence.tokens[i + 1].pos_reason = "copula-adverb-adjective"
}
// the city [verb] him.
if (next && next.pos.tag == "PRP" && token.pos.tag !== "PP" && token.pos.parent == "noun" && !token.punctuated) {
token.pos = parts_of_speech['VB']
token.pos_reason = "before_[him|her|it]"
}
//the misled worker -> misled is an adjective, not vb
if (last && next && last.pos.tag === "DT" && next.pos.parent === "noun" && token.pos.parent === "verb") {
token.pos = parts_of_speech['JJ']
token.pos_reason = "determiner-adjective-noun"
}
//where's he gone -> gone=VB, not JJ
if (last && last.pos.tag === "PRP" && token.pos.tag === "JJ") {
token.pos = parts_of_speech['VB']
token.pos_reason = "adjective-after-pronoun"
}
return token
}
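//for example, per the rules above: in "the walk", a verb-tagged "walk" after a strong determiner
//is re-tagged NN ("determiner-verb"), and in "where's he gone", an adjective-tagged "gone" after a
//pronoun is re-tagged VB ("adjective-after-pronoun").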
//add a 'quiet' token for contractions so we can represent their grammar
var handle_contractions = function (arr) {
var contractions = {
"i'd": ["i", "would"],
"she'd": ["she", "would"],
"he'd": ["he", "would"],
"they'd": ["they", "would"],
"we'd": ["we", "would"],
"i'll": ["i", "will"],
"she'll": ["she", "will"],
"he'll": ["he", "will"],
"they'll": ["they", "will"],
"we'll": ["we", "will"],
"i've": ["i", "have"],
"they've": ["they", "have"],
"we've": ["we", "have"],
"should've": ["should", "have"],
"would've": ["would", "have"],
"could've": ["could", "have"],
"must've": ["must", "have"],
"i'm": ["i", "am"],
"we're": ["we", "are"],
"they're": ["they", "are"],
"cannot": ["can", "not"]
}
var before, after, fix;
for (var i = 0; i < arr.length; i++) {
if (contractions.hasOwnProperty(arr[i].normalised)) {
before = arr.slice(0, i)
after = arr.slice(i + 1, arr.length)
fix = [{
text: arr[i].text,
normalised: contractions[arr[i].normalised][0],
start: arr[i].start
}, {
text: "",
normalised: contractions[arr[i].normalised][1],
start: undefined
}]
arr = before.concat(fix)
arr = arr.concat(after)
return handle_contractions(arr) //recursive
}
}
return arr
}
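//for example, the token "i'll" is split into two tokens: one keeping the original text with
//normalised "i", and a second "quiet" token with empty text and normalised "will".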
//these contractions require (some) grammatical knowledge to disambiguate properly (e.g. "he's" => ['he is', 'he has'])
var handle_ambiguous_contractions = function (arr) {
var ambiguous_contractions = {
"he's": "he",
"she's": "she",
"it's": "it",
"who's": "who",
"what's": "what",
"where's": "where",
"when's": "when",
"why's": "why",
"how's": "how"
}
var before, after, fix;
for (var i = 0; i < arr.length; i++) {
if (ambiguous_contractions.hasOwnProperty(arr[i].normalised)) {
before = arr.slice(0, i)
after = arr.slice(i + 1, arr.length)
//choose which verb this contraction should have..
var chosen = "is"
//look for the next verb, and if it's past-tense (he's walked -> he has walked)
for (var o = i + 1; o < arr.length; o++) {
if (arr[o] && arr[o].pos && arr[o].pos.tag == "VBD") { //past tense
chosen = "has"
break
}
}
fix = [{
text: arr[i].text,
normalised: ambiguous_contractions[arr[i].normalised], //the "he" part
start: arr[i].start,
pos: parts_of_speech[lexicon[ambiguous_contractions[arr[i].normalised]]],
pos_reason: "ambiguous_contraction"
}, {
text: "",
normalised: chosen, //is,was,or have
start: undefined,
pos: parts_of_speech[lexicon[chosen]],
pos_reason: "silent_contraction"
}]
arr = before.concat(fix)
arr = arr.concat(after)
return handle_ambiguous_contractions(arr) //recursive
}
}
return arr
}
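//for example, "he's walked" becomes "he" + "has" (the next verb is past-tense),
//while "he's fun" becomes "he" + "is" (the default).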
////////////////
///party-time//
var main = function (text, options) {
options = options || {}
if (!text || !text.match(/[a-z0-9]/i)) {
return new Section([])
}
var sentences = tokenize(text);
sentences.forEach(function (sentence) {
//first, let's handle the capitalisation-of-the-first-word issue
var first = sentence.tokens[0]
if (first) {
//if the second word is noun-capitalised and the first word is unknown, treat the first word's capital as meaningful too
if (sentence.tokens[1] && sentence.tokens[1].noun_capital && !lexicon_pass(first.normalised)) {
sentence.tokens[0].noun_capital = true;
}
}
//smart handling of contractions
sentence.tokens = handle_contractions(sentence.tokens)
//first pass, word-level clues
sentence.tokens = sentence.tokens.map(function (token) {
//it has a capital and isn't a month, etc.
if (token.noun_capital && !values[token.normalised]) {
token.pos = parts_of_speech['NN']
token.pos_reason = "noun_capitalised"
return token
}
//known words list
var lex = lexicon_pass(token.normalised)
if (lex) {
token.pos = lex;
token.pos_reason = "lexicon"
//if it's an abbreviation, forgive the punctuation (eg. 'dr.')
if (token.pos.tag === "NNAB") {
token.punctuated = false
}
return token
}
//handle punctuation like ' -- '
if (!token.normalised) {
token.pos = parts_of_speech['UH']
token.pos_reason = "wordless_string"
return token
}
// suffix pos signals from wordnet
var len = token.normalised.length
if (len > 4) {
var suffix = token.normalised.substr(len - 4, len - 1)
if (wordnet_suffixes.hasOwnProperty(suffix)) {
token.pos = parts_of_speech[wordnet_suffixes[suffix]]
token.pos_reason = "wordnet suffix"
return token
}
}
// suffix regexes for words
var r = rules_pass(token.normalised);
if (r) {
token.pos = r;
token.pos_reason = "regex suffix"
return token
}
//see if it's a number
if (parseFloat(token.normalised)) {
token.pos = parts_of_speech['CD']
token.pos_reason = "parsefloat"
return token
}
return token
})
//second pass, wrangle results a bit
sentence.tokens = sentence.tokens.map(function (token, i) {
//set ambiguous 'ed' endings as either verb/adjective
if (token.pos_reason !== "lexicon" && token.normalised.match(/.ed$/)) {
token.pos = parts_of_speech['VB']
token.pos_reason = "ed"
}
return token
})
//split out more difficult contractions, like "he's" -> ["he is", "he has"]
// (now that we have enough pos data to do this)
sentence.tokens = handle_ambiguous_contractions(sentence.tokens)
//third pass, seek verb or noun phrases after their signals
var need = null
var reason = ''
sentence.tokens = sentence.tokens.map(function (token, i) {
var next = sentence.tokens[i + 1]
if (token.pos) {
//suggest noun after some determiners (a|the), possessive pronouns (her|my|its)
if (token.normalised == "the" || token.normalised == "a" || token.normalised == "an" || token.pos.tag === "PP") {
need = 'noun'
reason = token.pos.name
return token //proceed
}
//suggest verb after personal pronouns (he|she|they), modal verbs (would|could|should)
if (token.pos.tag === "PRP" && token.pos.tag !== "PP" || token.pos.tag === "MD") {
need = 'verb'
reason = token.pos.name
return token //proceed
}
}
//satisfy need on a conflict, and fix a likely error
if (token.pos) {
if (need == "verb" && token.pos.parent == "noun" && (!next || (next.pos && next.pos.parent != "noun"))) {
if (!next || !next.pos || next.pos.parent != need) { //ensure need not satisfied on the next one
token.pos = parts_of_speech['VB']
token.pos_reason = "signal from " + reason
need = null
}
}
if (need == "noun" && token.pos.parent == "verb" && (!next || (next.pos && next.pos.parent != "verb"))) {
if (!next || !next.pos || next.pos.parent != need) { //ensure need not satisfied on the next one
token.pos = parts_of_speech["NN"]
token.pos_reason = "signal from " + reason
need = null
}
}
}
//satisfy need with an unknown pos
if (need && !token.pos) {
if (!next || !next.pos || next.pos.parent != need) { //ensure need not satisfied on the next one
token.pos = parts_of_speech[need]
token.pos_reason = "signal from " + reason
need = null
}
}
//set them back as satisfied..
if (need === 'verb' && token.pos && token.pos.parent === 'verb') {
need = null
}
if (need === 'noun' && token.pos && token.pos.parent === 'noun') {
need = null
}
return token
})
//third pass, identify missing clauses, fallback to noun
var has = {}
sentence.tokens.forEach(function (token) {
if (token.pos) {
has[token.pos.parent] = true
}
})
sentence.tokens = sentence.tokens.map(function (token, i) {
if (!token.pos) {
//if there is no verb in the sentence, and there needs to be.
if (has['adjective'] && has['noun'] && !has['verb']) {
token.pos = parts_of_speech['VB']
token.pos_reason = "need one verb"
has['verb'] = true
return token
}
//fallback to a noun
token.pos = parts_of_speech['NN']
token.pos_reason = "noun fallback"
}
return token
})
//fourth pass, error correction
sentence.tokens = sentence.tokens.map(function (token, i) {
return fourth_pass(token, i, sentence)
})
//run the fourth-pass again!
sentence.tokens = sentence.tokens.map(function (token, i) {
return fourth_pass(token, i, sentence)
})
})
//combine neighbours
if (!options.dont_combine) {
sentences = sentences.map(function (s) {
return combine_tags(s)
})
sentences = sentences.map(function (s) {
return combine_phrasal_verbs(s)
})
}
//make them Sentence objects
sentences = sentences.map(function (s) {
var sentence = new Sentence(s.tokens)
sentence.type = s.type
return sentence
})
//add analysis on each token
sentences = sentences.map(function (s) {
s.tokens = s.tokens.map(function (token, i) {
token.analysis = parents[token.pos.parent](token.normalised, s, i)
return token
})
return s
})
//add next-last references
sentences = sentences.map(function (sentence, i) {
sentence.last = sentences[i - 1]
sentence.next = sentences[i + 1]
return sentence
})
//return a Section object, with its methods
return new Section(sentences)
}
module.exports = main;
// console.log( pos("Geroge Clooney walked, quietly into a bank. It was cold.") )
// console.log( pos("it is a three-hundred and one").tags() )
// console.log( pos("funny funny funny funny").sentences[0].tokens )
// pos("In March 2009, while Secretary of State for Energy and Climate Change, Miliband attended the UK premiere of climate-change film The Age of Stupid, where he was ambushed").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text)})
// pos("the Energy and Climate Change, Miliband").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text)})
// console.log(pos("Energy and Climate Change, Miliband").sentences[0].tokens)
// console.log(pos("http://google.com").sentences[0].tokens)
// console.log(pos("may live").tags())
// console.log(pos("may 7th live").tags())
// console.log(pos("She and Marc Emery married on July 23, 2006.").tags())
// console.log(pos("Toronto is fun. Spencer and heather quickly walked. it was cool").sentences[0].referables())
// console.log(pos("a hundred").sentences[0].tokens)
// console.log(pos("Tony Reagan skates").sentences[0].tokens)
// console.log(pos("She and Marc Emery married on July 23, 2006").sentences[0].tokens)
// console.log(pos("Tony Hawk walked quickly to the store.").sentences[0].tokens)
// console.log(pos("jahn j. jacobheimer").sentences[0].tokens[0].analysis.is_person())
// pos("Dr. Conrad Murray recieved a guilty verdict").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text)})
// pos("the Phantom of the Opera").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text)})
// pos("Tony Hawk is nice").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text)})
// pos("tony hawk is nice").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text)})
// console.log(pos("look after a kid").sentences[0].tags())
// pos("Sather tried to stop the deal, but when he found out that Gretzky").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text+" "+t.pos_reason)})
// pos("Gretzky had tried skating").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text+" "+t.pos_reason)})
// pos("Sally and Tom fight a lot. She thinks he is her friend.").sentences[0].tokens.map(function(t){console.log(t.pos.tag + " "+t.text+" "+t.pos_reason)})
// console.log(pos("i think Tony Danza is cool. He rocks and he is golden.").sentences[0].tokens[2].analysis.referenced_by())
// console.log(pos("i think Tony Danza is cool and he is golden.").sentences[0].tokens[6].analysis.reference_to())
// console.log(pos("Tina grabbed her shoes. She is lovely.").sentences[0].tokens[0].analysis.referenced_by())
// console.log(pos("Sally and Tom fight a lot. She thinks he is her friend.").sentences[0].tokens[0].analysis.referenced_by())
// console.log(pos("it's gotten the best features").sentences[0].tokens[1].normalised=="has") //bug
// console.log(pos("he's fun").sentences[0].tokens[1].normalised=="is")
},{"./data/lexicon":2,"./data/lexicon/values":12,"./data/parts_of_speech":14,"./data/unambiguous_suffixes":15,"./data/word_rules":16,"./methods/tokenization/tokenize":22,"./parents/parents":35,"./section":46,"./sentence":47}],46:[function(require,module,exports){
//a section is a block of text, with an arbitrary number of sentences
//these methods are just wrappers around the ones in sentence.js
var Section = function(sentences) {
var the = this
the.sentences = sentences || [];
the.text = function() {
return the.sentences.map(function(s) {
return s.text()
}).join(' ')
}
the.tense = function() {
return the.sentences.map(function(s) {
return s.tense()
})
}
//pluck out wanted data from sentences
the.nouns = function() {
return the.sentences.map(function(s) {
return s.nouns()
}).reduce(function(arr, a) {
return arr.concat(a)
}, [])
}
the.entities = function(options) {
return the.sentences.map(function(s) {
return s.entities(options)
}).reduce(function(arr, a) {
return arr.concat(a)
}, [])
}
the.people = function() {
return the.sentences.map(function(s) {
return s.people()
}).reduce(function(arr, a) {
return arr.concat(a)
}, [])
}
the.adjectives = function() {
return the.sentences.map(function(s) {
return s.adjectives()
}).reduce(function(arr, a) {
return arr.concat(a)
}, [])
}
the.verbs = function() {
return the.sentences.map(function(s) {
return s.verbs()
}).reduce(function(arr, a) {
return arr.concat(a)
}, [])
}
the.adverbs = function() {
return the.sentences.map(function(s) {
return s.adverbs()
}).reduce(function(arr, a) {
return arr.concat(a)
}, [])
}
the.values = function() {
return the.sentences.map(function(s) {
return s.values()
}).reduce(function(arr, a) {
return arr.concat(a)
}, [])
}
the.tags = function() {
return the.sentences.map(function(s) {
return s.tags()
})
}
//transform the sentences
the.negate = function() {
the.sentences = the.sentences.map(function(s) {
return s.negate()
})
return the
}
the.to_past = function() {
the.sentences = the.sentences.map(function(s) {
return s.to_past()
})
return the
}
the.to_present = function() {
the.sentences = the.sentences.map(function(s) {
return s.to_present()
})
return the
}
the.to_future = function() {
the.sentences = the.sentences.map(function(s) {
return s.to_future()
})
return the
}
}
module.exports = Section;
},{}],47:[function(require,module,exports){
// methods that hang on a parsed set of words
// accepts parsed tokens
var Sentence = function(tokens) {
var the = this
the.tokens = tokens || [];
var capitalise = function(s) {
return s.charAt(0).toUpperCase() + s.slice(1);
}
the.tense = function() {
var verbs = the.tokens.filter(function(token) {
return token.pos.parent === "verb"
})
return verbs.map(function(v) {
return v.analysis.tense
})
}
the.to_past = function() {
the.tokens = the.tokens.map(function(token) {
if (token.pos.parent === "verb") {
token.text = token.analysis.to_past()
token.normalised = token.text
}
return token
})
return the
}
the.to_present = function() {
the.tokens = the.tokens.map(function(token) {
if (token.pos.parent === "verb") {
token.text = token.analysis.to_present()
token.normalised = token.text
}
return token
})
return the
}
the.to_future = function() {
the.tokens = the.tokens.map(function(token) {
if (token.pos.parent === "verb") {
token.text = token.analysis.to_future()
token.normalised = token.text
}
return token
})
return the
}
the.insert = function(token, i) {
//insert a token at index i (treat 0 as a valid index)
if (typeof i === "number" && token) {
the.tokens.splice(i, 0, token);
}
}
//negate makes the sentence mean the opposite thing.
the.negate = function() {
//these are cheap ways to negate the meaning
// ('none' is avoided, since it is ambiguous: it could correspond to 'all' or to 'some')
var logic_negate = {
//some logical ones work
"everyone": "no one",
"everybody": "nobody",
"someone": "no one",
"somebody": "nobody",
// everything:"nothing",
"always": "never",
//copulas
"is": "isn't",
"are": "aren't",
"was": "wasn't",
"will": "won't",
//modals
"didn't": "did",
"wouldn't": "would",
"couldn't": "could",
"shouldn't": "should",
"can't": "can",
"won't": "will",
"mustn't": "must",
"shan't": "shall",
"shant": "shall",
"did": "didn't",
"would": "wouldn't",
"could": "couldn't",
"should": "shouldn't",
"can": "can't",
"must": "mustn't"
}
//loop through each term..
for (var i = 0; i < the.tokens.length; i++) {
var tok = the.tokens[i]
//turn 'is' into 'isn't', etc - making sure 'is' isn't already followed by a 'not'
if (logic_negate[tok.normalised] && (!the.tokens[i + 1] || the.tokens[i + 1].normalised != "not")) {
tok.text = logic_negate[tok.normalised]
tok.normalised = logic_negate[tok.normalised]
if (tok.capitalised) {
tok.text = capitalise(tok.text)
}
return the
}
// find the first verb..
if (tok.pos.parent == "verb") {
// if verb is already negative, make it not negative
if (tok.analysis.negative()) {
if (the.tokens[i + 1] && the.tokens[i + 1].normalised == "not") {
the.tokens.splice(i + 1, 1)
}
return the
}
//turn future-tense 'will go' into "won't go"
if (tok.normalised.match(/^will /i)) {
tok.text = tok.text.replace(/^will /i, "won't ")
tok.normalised = tok.text
if (tok.capitalised) {
tok.text = capitalise(tok.text)
}
return the
}
// - INFINITIVE-
// 'i walk' -> "i don't walk"
if (tok.analysis.form == "infinitive" && tok.analysis.tense != "future") { //future forms are handled below
tok.text = "don't " + (tok.analysis.conjugate().infinitive || tok.text)
tok.normalised = tok.text.toLowerCase()
return the
}
// - GERUND-
// if verb is gerund, 'walking' -> "not walking"
if (tok.analysis.form == "gerund") {
tok.text = "not " + tok.text
tok.normalised = tok.text.toLowerCase()
return the
}
// - PAST-
// if verb is past-tense, 'he walked' -> "he didn't walk"
if (tok.analysis.tense == "past") {
tok.text = "didn't " + (tok.analysis.conjugate().infinitive || tok.text)
tok.normalised = tok.text.toLowerCase()
return the
}
// - PRESENT-
// if verb is present-tense, 'he walks' -> "he doesn't walk"
if (tok.analysis.tense == "present") {
tok.text = "doesn't " + (tok.analysis.conjugate().infinitive || tok.text)
tok.normalised = tok.text.toLowerCase()
return the
}
// - FUTURE-
// if verb is future-tense, 'will go' -> won't go. easy-peasy
if (tok.analysis.tense == "future") {
if (tok.normalised == "will") {
tok.normalised = "won't"
tok.text = "won't"
} else {
tok.text = tok.text.replace(/^will /i, "won't ")
tok.normalised = tok.normalised.replace(/^will /i, "won't ")
}
if (tok.capitalised) {
tok.text = capitalise(tok.text);
}
return the
}
return the
}
}
return the
}
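// hedged illustrations of what negate() aims for (actual output depends on the tagger/conjugator):
//   "everyone is happy"  ->  "no one is happy"     (logic_negate lookup on the first match)
//   "he walked"          ->  "he didn't walk"      (past-tense branch)
//   "she will go"        ->  "she won't go"        (future-tense branch)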
the.entities = function(options) {
var spots = []
options = options || {}
the.tokens.forEach(function(token) {
if (token.pos.parent === "noun" && token.analysis.is_entity()) {
spots.push(token)
}
})
if (options.ignore_gerund) {
spots = spots.filter(function(t) {
return t.pos.tag !== "VBG"
})
}
return spots
}
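// hedged example: sentence.entities({ignore_gerund: true}) additionally drops tokens the tagger marked VBG
// (e.g. "swimming"), per the filter above.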
//noun-entities that look like person names..
the.people = function() {
return the.entities({}).filter(function(o) {
return o.analysis.is_person()
})
}
the.text = function() {
return the.tokens.map(function(s) {
return s.text
}).join(' ')
}
//sugar 'grab' methods
the.verbs = function() {
return the.tokens.filter(function(t) {
return t.pos.parent == "verb"
})
}
the.adverbs = function() {
return the.tokens.filter(function(t) {
return t.pos.parent == "adverb"
})
}
the.nouns = function() {
return the.tokens.filter(function(t) {
return t.pos.parent == "noun"
})
}
the.adjectives = function() {
return the.tokens.filter(function(t) {
return t.pos.parent == "adjective"
})
}
the.values = function() {
return the.tokens.filter(function(t) {
return t.pos.parent == "value"
})
}
the.tags = function() {
return the.tokens.map(function(t) {
return t.pos.tag
})
}
//find the 'it', 'he', 'she', and 'they' of this sentence
//these are the words that get 'exported' to be used in other sentences
the.referables = function() {
var pronouns = {
he: undefined,
she: undefined,
they: undefined,
it: undefined
}
the.tokens.forEach(function(t) {
if (t.pos.parent == "noun" && t.pos.tag != "PRP") {
pronouns[t.analysis.pronoun()] = t
}
})
return pronouns
}
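// hedged example: for "Tina sleeps.", referables() would presumably return {she: <Tina token>, he: undefined,
// they: undefined, it: undefined}, assuming analysis.pronoun() guesses "she" for the name Tina.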
return the
}
module.exports = Sentence;
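// A hedged usage sketch (not part of the library): a Sentence is normally built by pos() in module 45,
// so its grab/transform methods are reached through it --
// var s = pos("Tony Hawk walks quickly.").sentences[0]
// console.log(s.tags())                                               // one POS tag per token
// console.log(s.people().map(function(t) { return t.normalised }))    // presumably ["tony hawk"], if the tagger merges the name
// console.log(s.negate().text())                                      // roughly "Tony Hawk doesn't walk quickly."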
},{}],48:[function(require,module,exports){
//just a wrapper for text -> entities
//most of this logic is in ./parents/noun
var pos = require("./pos");
var main = function (text, options) {
options = options || {}
//collect 'entities' from all nouns
var sentences = pos(text, options).sentences
var arr = sentences.reduce(function (arr, s) {
return arr.concat(s.entities(options))
}, [])
//for people, remove later instances of 'george' and 'bush' once 'george bush' has been seen.
var ignore = {}
arr = arr.filter(function (o) {
//add tokens to blacklist
if (o.analysis.is_person()) {
o.normalised.split(' ').forEach(function (s) {
ignore[s] = true
})
}
if (ignore[o.normalised]) {
return false
}
return true
})
return arr
}
module.exports = main;
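// A hedged usage sketch (not part of the library): main() takes raw text and returns the surviving entity tokens,
// so a caller might pull out normalised names like this --
// console.log(main("Garry Kasparov played chess in Moscow.").map(function(o) { return o.normalised }))
// // prints some subset of the noun-entities, after the person-name de-duplication above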
// console.log(spot("Tony Hawk is cool. Tony eats all day.").map(function(s){return s}))
// console.log(spot("Tony eats all day. Tony Hawk is cool.").map(function(s){return s}))
// console.log(spot("My Hawk is cool").map(function(s){return s.normalised}))
},{"./pos":45}]},{},[1]);